Yesterday in the gap between Debug time, the product manager proposed an idea, hoping to perform fuzzy matching of text between two Excel files to convert a many-to-many relationship into a one-to-many result.
Thinking
This task can be split into two parts:
Read and write excel files.
Make the fuzzy matching progress.
Part one, excel operation
As we know, Excel is part of the office software from Microsoft. It is not open-source software. So we should choose a computer language that can easily operate excel, which means that its community is active and the library is mature. After hesitating between Python and Java, I finally chose Python.
We can use xlrd and xlwt libraries to achieve that.
xlrd is used for reading and xlwt is used for writing. Here are some examples.
There are many fuzzy string matching algorithms, the most famous one is Levenshtein distance computing algorithm, it can calculate the minimum number of single-character edits require one to other between two string. It looks like the algorithm works.
There are several variant algorithms, like Damerau-Levenshtein distance or Hamming distance, but both of them are not suitable for this scenario, because the sequence of characters in the data text is relatively standard, and the string lengths of raw and reference data are not equal.
# Get origin name list poi_of_we_have_list = read_poi_text(originExcelFilePath, 4, 0)
# Get reference data list wb = xlrd.open_workbook(filename=referenceExcelFilePath) sheet_necessary = wb.sheet_by_index(0) poi_of_reference_list_id = sheet_necessary.col_values(0) poi_of_reference_list_name = sheet_necessary.col_values(1) poi_of_reference_list_longitude = sheet_necessary.col_values(3) poi_of_reference_list_latitude = sheet_necessary.col_values(4)
# To assure the size of all list is the same print("id size is " + str(len(poi_of_reference_list_id))) print("name size is " + str(len(poi_of_reference_list_name))) print("longitude size is " + str(len(poi_of_reference_list_longitude))) print("latitude size is " + str(len(poi_of_reference_list_latitude)))
for index_of_origin_poi in range(1, len(poi_of_we_have_list)): poiName = poi_of_we_have_list[index_of_origin_poi] print("Matching for position " + str(index_of_origin_poi) + ", Poi name:" + poiName) result = process.extract(poiName, poi_of_reference_list_name, limit=choice_limit)
for index_of_match_size in range(0, choice_limit): matchName = result[index_of_match_size][0] matchValue = result[index_of_match_size][1] # print("matchName is "+str(matchName)) # print("matchValue is "+str(matchValue)) for name_index in range(0, len(poi_of_reference_list_name)): # This is a workaround to solve the problem that progress does not provide the index of matching result. if poi_of_reference_list_name[name_index] == matchName: # print(name_index) writeOriginName = poiName writeMatchValue = matchValue writeReferenceId = poi_of_reference_list_id[name_index] writeReferenceName = poi_of_reference_list_name[name_index] writeReferenceLongitude = poi_of_reference_list_longitude[name_index] writeReferenceLatitude = poi_of_reference_list_latitude[name_index]
It turned out to be acceptable but not enough. Using string segmentation to clean the data first, and then filter out the common suffixes seems to be more reliable.