Yesterday in the gap between Debug time, the product manager proposed an idea, hoping to perform fuzzy matching of text between two Excel files to convert a many-to-many relationship into a one-to-many result.

Thinking

This task can be split into two parts:

  1. Read and write excel files.
  2. Make the fuzzy matching progress.

Part one, excel operation

As we know, Excel is part of the office software from Microsoft. It is not open-source software. So we should choose a computer language that can easily operate excel, which means that its community is active and the library is mature. After hesitating between Python and Java, I finally chose Python.

We can use xlrd and xlwt libraries to achieve that.

xlrd is used for reading and xlwt is used for writing. Here are some examples.

1
2
3
4
5
6
import xlrd

file = xlrd.open_workbook('fileName.xls')
table = file.sheet_by_index(0)
print(table.col_values(0))
print(table.row_values(0))
1
2
3
4
5
import xlwt

workbook = xlwt.Workbook(encoding='utf-8')
table1 = workbook.add_sheet('table1')
table1.write(1,1,"value")

Part two, the fuzzy matching algorithm

There are many fuzzy string matching algorithms, the most famous one is Levenshtein distance computing algorithm, it can calculate the minimum number of single-character edits require one to other between two string. It looks like the algorithm works.

There are several variant algorithms, like Damerau-Levenshtein distance or Hamming distance, but both of them are not suitable for this scenario, because the sequence of characters in the data text is relatively standard, and the string lengths of raw and reference data are not equal.

We can use fuzzywuzzy library to achieve that.

Result

The final script is as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
#!/usr/bin/python
# -*- coding: UTF-8 -*-

"""
Date:
2020-01-17
Author:
symeonchen#gmail.com
Desc:
Use for fuzzy matching between excel files.
Environment:
Version:Python3
Usage:
Install necessary libraries:
pip3 install xlrd
pip3 install xlwt
pip3 install fuzzywuzzy
Install optional libraries:
pip3 install python-Levenshtein
Other:
Fuzzy matching algorithm principle:
Levenshtein distance algorithm.
Can think of it as a minimal change of string algorithm
"""

import xlrd
import xlwt

from fuzzywuzzy import process

# Only contain name
originExcelFilePath = "originPoi.xlsx"

# Contain detail message
referenceExcelFilePath = "referencePoi.xls"

# Where result saved
dataFile = "data.xls"

# Nums of fuzzy matching result
choice_limit = 5


def read_poi_text(file_name, index_of_origin_data_sheet_of_file, col_of_name_of_sheet):
_wb = xlrd.open_workbook(filename=file_name)
_sheet_necessary = _wb.sheet_by_index(index_of_origin_data_sheet_of_file)
return _sheet_necessary.col_values(col_of_name_of_sheet)


# Get origin name list
poi_of_we_have_list = read_poi_text(originExcelFilePath, 4, 0)

# Get reference data list
wb = xlrd.open_workbook(filename=referenceExcelFilePath)
sheet_necessary = wb.sheet_by_index(0)
poi_of_reference_list_id = sheet_necessary.col_values(0)
poi_of_reference_list_name = sheet_necessary.col_values(1)
poi_of_reference_list_longitude = sheet_necessary.col_values(3)
poi_of_reference_list_latitude = sheet_necessary.col_values(4)

# To assure the size of all list is the same
print("id size is " + str(len(poi_of_reference_list_id)))
print("name size is " + str(len(poi_of_reference_list_name)))
print("longitude size is " + str(len(poi_of_reference_list_longitude)))
print("latitude size is " + str(len(poi_of_reference_list_latitude)))

writePoi = xlwt.Workbook(encoding='utf-8')
sheetWrite = writePoi.add_sheet('data')

for index_of_origin_poi in range(1, len(poi_of_we_have_list)):
poiName = poi_of_we_have_list[index_of_origin_poi]
print("Matching for position " + str(index_of_origin_poi) + ", Poi name:" + poiName)
result = process.extract(poiName, poi_of_reference_list_name, limit=choice_limit)

for index_of_match_size in range(0, choice_limit):
matchName = result[index_of_match_size][0]
matchValue = result[index_of_match_size][1]
# print("matchName is "+str(matchName))
# print("matchValue is "+str(matchValue))
for name_index in range(0, len(poi_of_reference_list_name)):
# This is a workaround to solve the problem that progress does not provide the index of matching result.
if poi_of_reference_list_name[name_index] == matchName:
# print(name_index)
writeOriginName = poiName
writeMatchValue = matchValue
writeReferenceId = poi_of_reference_list_id[name_index]
writeReferenceName = poi_of_reference_list_name[name_index]
writeReferenceLongitude = poi_of_reference_list_longitude[name_index]
writeReferenceLatitude = poi_of_reference_list_latitude[name_index]

position_of_row = choice_limit * index_of_origin_poi + index_of_match_size
sheetWrite.write(position_of_row, 0, writeOriginName)
sheetWrite.write(position_of_row, 1, writeMatchValue)
sheetWrite.write(position_of_row, 2, writeReferenceId)
sheetWrite.write(position_of_row, 3, writeReferenceName)
sheetWrite.write(position_of_row, 4, writeReferenceLongitude)
sheetWrite.write(position_of_row, 5, writeReferenceLatitude)
break

writePoi.save(dataFile)
print("Data saved successfully! ")

More

It turned out to be acceptable but not enough. Using string segmentation to clean the data first, and then filter out the common suffixes seems to be more reliable.