使用编辑距离查找相似字符串
这篇文章最初发布在我的网站上。欢迎访问我的网站,了解更多精彩内容!
假设你正在编写一款移动应用,你的用户正在搜索单词kitten
。不幸的是,你预期他们输入的搜索词只有以下几种:
smitten
mitten
kitty
fitting
written
我们如何知道用户想要输入哪个单词?
编辑距离
编辑距离a
是两个序列和之间的距离b
。如果a
和b
是字符串,则编辑距离是将其中一个字符串更改为另一个字符串所需的最小字符编辑量。允许的编辑类型有三种:
- 插入:添加一个字符
a
- 删除:从中删除一个字符
b
a
替换:在或中替换一个字符b
例如,如果第一个字符串a = 'abc'
和第二个字符串分别为b = 'abc'
,则两个字符串之间的编辑距离为 ,0
因为a
和b
相等。如果a = 'abcd'
和b = 'a'
,则距离为3
。如果a = 'abcd'
和b = 'aacc'
,则距离为2
。
a
长度为 的字符串与长度为 的i
字符串的Levenshtein距离的定义是:b
j
此定义是一个递归函数。第一部分,max(i, j) if min(i, j) = 0
是第一个字符串或第二个字符串为空的基准情况。
1_(ai != bi)
第三个最小元素末尾的函数是代价函数。如果 a[i] != b[i],则代价为 1,否则代价为 0。第一个最小元素是从 中删除a
,第二个是插入,第三个是替换。
一个简单的实现
首先,让我们用 Swift 实现一个简单的实现。我们将创建一个名为 的函数levenshtein_distance
,并编写基本条件来检查字符串是否为空:
func levenshtein_distance(a: String, b: String) -> Int {
// If either array is empty, return the length of the other array
if (a.count == 0) {
return b.count
}
if (b.count == 0) {
return a.count
}
}
然后我们添加递归部分。我们计算替换的成本,然后找到三种不同的可能编辑(删除、插入或替换)之间的最小距离:
func levenshtein_distance(a: String, b: String) -> Int {
// ...
// Check whether the last items are the same before testing the other items
let cost = (a.last == b.last) ? 0 : 1
let a_dropped = String(a.dropLast())
let b_dropped = String(b.dropLast())
return min(
// Find the distance if an item in a is removed
levenshtein_distance(a: a_dropped, b: b) + 1,
// Find the distance if an item is removed from b (i.e. added to a)
levenshtein_distance(a: a, b: b_dropped) + 1,
// Find the distance if an item is removed from a and b (i.e. substituted)
levenshtein_distance(a: a_dropped, b: b_dropped) + cost
)
}
让我们用一个简单的测试用例来测试我们的距离函数:
print(opti_leven_distance(a: "123", b: "12"))
1
更多示例测试用例可以在最终文件中找到。现在我们可以比较单词与字符串的距离,kitten
以确定用户可能想要输入哪个单词:
// Print out the distances for our test case
let first_word = "kitten"
let test_words = ["smitten", "mitten", "kitty", "fitting", "written"]
for word in test_words {
let dist = opti_leven_distance(a: first_word, b: word)
print("Distance between \(first_word) and \(word): \(dist)")
}
Distance between kitten and smitten: 2
Distance between kitten and mitten: 1
Distance between kitten and kitty: 2
Distance between kitten and fitting: 3
Distance between kitten and written: 2
用户可能想输入 mitten 而不是 kitten!
改进的实现
上述编辑距离的递归实现对于较大的字符串来说扩展性不佳。如果我们需要计算一千个字符串之间的距离,每个字符串包含数百个字符,该怎么办?
计算编辑距离的一种改进方法是使用距离矩阵来“记住”之前计算的距离。首先,距离函数应该检查是否存在空字符串。然后,我们创建一个矩阵来保存距离计算结果:
func opti_leven_distance(a: String, b: String) -> Int {
// Check for empty strings first
if (a.count == 0) {
return b.count
}
if (b.count == 0) {
return a.count
}
// Create an empty distance matrix with dimensions len(a)+1 x len(b)+1
var dists = Array(repeating: Array(repeating: 0, count: b.count+1), count: a.count+1)
}
距离矩阵的第一列和第一行都为零,作为初始化步骤。下一列从 1 开始递增到 的长度,a
表示删除每个字符得到空字符串;下一行从 1 开始递增到 的长度,b
表示添加(或插入)每个字符得到 的值b
:
func opti_leven_distance(a: String, b: String) -> Int {
//...
// a's default distances are calculated by removing each character
for i in 1...(a.count) {
dists[i][0] = i
}
// b's default distances are calulated by adding each character
for j in 1...(b.count) {
dists[0][j] = j
}
}
与我们的简单实现类似,我们将检查距离矩阵中剩余的索引。不过,这次我们将使用矩阵中存储的先前值来计算最小距离,而不是递归调用距离函数。最终距离是距离矩阵中的最后一个元素(位于右下角):
func opti_leven_distance(a: String, b: String) -> Int {
//...
// Find the remaining distances using previous distances
for i in 1...(a.count) {
for j in 1...(b.count) {
// Calculate the substitution cost
let cost = (a[i-1] == b[j-1]) ? 0 : 1
dists[i][j] = min(
// Removing a character from a
dists[i-1][j] + 1,
// Adding a character to b
dists[i][j-1] + 1,
// Substituting a character from a to b
dists[i-1][j-1] + cost
)
}
}
return dists.last!.last!
}
我们可以再次使用我们的测试用例来验证我们改进的实现是正确的:
print(opti_leven_distance(a: "123", b: "12"))
// Print out the distances for our test case
let first_word = "kitten"
let test_words = ["smitten", "mitten", "kitty", "fitting", "written"]
for word in test_words {
let dist = opti_leven_distance(a: first_word, b: word)
print("Distance between \(first_word) and \(word): \(dist)")
}
1
Distance between kitten and smitten: 2
Distance between kitten and mitten: 1
Distance between kitten and kitty: 2
Distance between kitten and fitting: 3
Distance between kitten and written: 2
Swift 和 Python 实现
import Foundation
/// Calculates the Levenshtein distance between two strings
/// - Parameter a: The first string
/// - Parameter b: The second string
func levenshtein_distance(a: String, b: String) -> Int {
// If either array is empty, return the length of the other array
if (a.count == 0) {
return b.count
}
if (b.count == 0) {
return a.count
}
// Check whether the last items are the same before testing the other items
let cost = (a.last == b.last) ? 0 : 1
let a_dropped = String(a.dropLast())
let b_dropped = String(b.dropLast())
return min(
// Find the distance if an item in a is removed
levenshtein_distance(a: a_dropped, b: b) + 1,
// Find the distance if an item is removed from b (i.e. added to a)
levenshtein_distance(a: a, b: b_dropped) + 1,
// Find the distance if an item is removed from a and b (i.e. substituted)
levenshtein_distance(a: a_dropped, b: b_dropped) + cost
)
}
/// String extension to add substring by Int (such as a[i-1])
extension String {
subscript (i: Int) -> Character {
return self[index(startIndex, offsetBy: i)]
}
}
/// A more optimized version of the Levenshtein distance function using an array of previously calculated distances
/// - Parameter a: The first string
/// - Parameter b: The second string
func opti_leven_distance(a: String, b: String) -> Int {
// Check for empty strings first
if (a.count == 0) {
return b.count
}
if (b.count == 0) {
return a.count
}
// Create an empty distance matrix with dimensions len(a)+1 x len(b)+1
var dists = Array(repeating: Array(repeating: 0, count: b.count+1), count: a.count+1)
// a's default distances are calculated by removing each character
for i in 1...(a.count) {
dists[i][0] = i
}
// b's default distances are calulated by adding each character
for j in 1...(b.count) {
dists[0][j] = j
}
// Find the remaining distances using previous distances
for i in 1...(a.count) {
for j in 1...(b.count) {
// Calculate the substitution cost
let cost = (a[i-1] == b[j-1]) ? 0 : 1
dists[i][j] = min(
// Removing a character from a
dists[i-1][j] + 1,
// Adding a character to b
dists[i][j-1] + 1,
// Substituting a character from a to b
dists[i-1][j-1] + cost
)
}
}
return dists.last!.last!
}
/// Function to test whether the distance function is working correctly
/// - Parameter a: The first test string
/// - Parameter b: The second test string
/// - Parameter answer: The expected answer to be returned by the distance function
func test_distance(a: String, b: String, answer: Int) -> Bool {
let d = opti_leven_distance(a: a, b: b)
if (d != answer) {
print("a: \(a)")
print("b: \(b)")
print("expected: \(answer)")
print("distance: \(d)")
return false
} else {
return true
}
}
// Test the distance function with many different examples
test_distance(a: "", b: "", answer: 0)
test_distance(a: "1", b: "1", answer: 0)
test_distance(a: "1", b: "2", answer: 1)
test_distance(a: "12", b: "12", answer: 0)
test_distance(a: "123", b: "12", answer: 1)
test_distance(a: "1234", b: "1", answer: 3)
test_distance(a: "1234", b: "1233", answer: 1)
test_distance(a: "1248", b: "1349", answer: 2)
test_distance(a: "", b: "12345", answer: 5)
test_distance(a: "5677", b: "1234", answer: 4)
test_distance(a: "123456", b: "12345", answer: 1)
test_distance(a: "13579", b: "12345", answer: 4)
test_distance(a: "123", b: "", answer: 3)
test_distance(a: "kitten", b: "mittens", answer: 2)
print(opti_leven_distance(a: "123", b: "12"))
// Print out the distances for our test case
let first_word = "kitten"
let test_words = ["smitten", "mitten", "kitty", "fitting", "written"]
for word in test_words {
let dist = opti_leven_distance(a: first_word, b: word)
print("Distance between \(first_word) and \(word): \(dist)")
}
下面是上面 Swift 代码的 Python 实现,即distance.py。Python 版本也可以处理 anylist
以及 anystr
# Calculates the Levenshtein distance between two strings
def levenshtein_distance(a, b):
# If either array is empty, return the length of the other array
if not len(a):
return len(b)
if not len(b):
return len(a)
# Check whether the last items are the same before testing the other items
if a[-1] == b[-1]:
cost = 0
else:
cost = 1
return min(
# Find the distance if an item in a is removed
levenshtein_distance(a[:-1], b) + 1,
# Find the distance if an item is removed from b (i.e. added to a)
levenshtein_distance(a, b[:-1]) + 1,
# Find the distance if an item is removed from a and b (i.e. substituted)
levenshtein_distance(a[:-1], b[:-1]) + cost
)
# A more optimized version of the Levenshtein distance function using an array of previously calculated distances
def opti_leven_distance(a, b):
# Create an empty distance matrix with dimensions len(a)+1 x len(b)+1
dists = [ [0 for _ in range(len(b)+1)] for _ in range(len(a)+1) ]
# a's default distances are calculated by removing each character
for i in range(1, len(a)+1):
dists[i][0] = i
# b's default distances are calulated by adding each character
for j in range(1, len(b)+1):
dists[0][j] = j
# Find the remaining distances using previous distances
for i in range(1, len(a)+1):
for j in range(1, len(b)+1):
# Calculate the substitution cost
if a[i-1] == b[j-1]:
cost = 0
else:
cost = 1
dists[i][j] = min(
# Removing a character from a
dists[i-1][j] + 1,
# Adding a character to b
dists[i][j-1] + 1,
# Substituting a character from a to b
dists[i-1][j-1] + cost
)
return dists[-1][-1]
# Function to test whether the distance function is working correctly
def test_distance(a, b, answer):
dist = opti_leven_distance(a, b)
if dist != answer:
print('a:', a)
print('b:', b)
print('expected:', answer)
print('distance:', dist)
print()
if __name__ == '__main__':
# Test the distance function with many different examples
test_distance('', '', 0)
test_distance('1', '1', 0)
test_distance('1', '2', 1)
test_distance('12', '12', 0)
test_distance('123', '12', 1)
test_distance('1234', '1', 3)
test_distance('1234', '1233', 1)
test_distance([1, 2, 4, 8], [1, 3, 4, 16], 2)
test_distance('', '12345', 5)
test_distance([5, 6, 7, 7], [1, 2, 3, 4], 4)
test_distance([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5], 1)
test_distance([1, 3, 5, 7, 9], [1, 2, 3, 4, 5], 4)
test_distance([1, 2, 3], [], 3)
test_distance('kitten', 'mittens', 2)
first_word = 'kitten'
test_words = ['smitten', 'mitten', 'kitty', 'fitting', 'written']
for word in test_words:
dist = opti_leven_distance(first_word, word)
print(f'Distance between {first_word} and {word}: {dist}')