The Levenshtein (edit) distance is a string metric that measures the
difference between two strings/sequences s1 and s2.
It is defined as the minimum number of insertions, deletions and
substitutions required to transform s1 into s2.
This implementation supports custom weights for insertion, deletion and
substitution. The uniform Levenshtein distance refers to weights=(1,1,1),
the Indel distance refers to weights=(1,1,2), and all other weights are
referred to as generic Levenshtein distance.
Calculates the minimum number of insertions, deletions, and substitutions
required to change one sequence into the other, using custom costs for
insertion, deletion and substitution.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form
(insertion, deletion, substitution). Default is (1, 1, 1),
which gives all three operations a weight of 1.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (int, optional) – Maximum distance between s1 and s2 that is
considered as a result. If the distance is bigger than score_cutoff,
score_cutoff + 1 is returned instead. Default is None, which deactivates
this behaviour.
score_hint (int, optional) – Expected distance between s1 and s2. This is used to select a
faster implementation. Default is None, which deactivates this behaviour.
Returns:
distance – distance between s1 and s2
Return type:
int
Raises:
ValueError – If unsupported weights are provided.
Examples
Find the Levenshtein distance between two strings:
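The examples below are a doctest-style sketch; they assume this module is
rapidfuzz.distance.Levenshtein (the benchmark section below refers to RapidFuzz):

>>> from rapidfuzz.distance import Levenshtein
>>> Levenshtein.distance("lewenstein", "levenshtein")
2

Passing a weights tuple selects the weighted variant; with the Indel weights
(1, 1, 2) the substitution w -> v costs 2, so the distance becomes 3:

>>> Levenshtein.distance("lewenstein", "levenshtein", weights=(1, 1, 2))
3

With a score_cutoff, any distance larger than the cutoff is returned as
score_cutoff + 1:

>>> Levenshtein.distance("lewenstein", "levenshtein", score_cutoff=1)
2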
Calculates a normalized Levenshtein distance in the range [0, 1] using custom
costs for insertion, deletion and substitution.
This is calculated as distance/max, where max is the maximal possible
Levenshtein distance given the lengths of the sequences s1/s2 and the weights.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form
(insertion, deletion, substitution). Default is (1, 1, 1),
which gives all three operations a weight of 1.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0.
For norm_dist > score_cutoff, 1.0 is returned instead. Default is None,
which deactivates this behaviour.
score_hint (float, optional) – Expected normalized distance between s1 and s2. This is used to select a
faster implementation. Default is None, which deactivates this behaviour.
Returns:
norm_dist – normalized distance between s1 and s2 as a float between 0.0 and 1.0
Return type:
float
Raises:
ValueError – If unsupported weights are provided.
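As a worked instance of the distance/max formula (again assuming
rapidfuzz.distance.Levenshtein): with uniform weights the maximal possible
distance between a 10-character and an 11-character string is 11, and the
distance from the example above is 2, so the normalized distance is 2/11:

>>> from rapidfuzz.distance import Levenshtein
>>> round(Levenshtein.normalized_distance("lewenstein", "levenshtein"), 4)
0.1818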
Calculates the Levenshtein similarity in the range [0, max] using custom
costs for insertion, deletion and substitution.
This is calculated as max-distance, where max is the maximal possible
Levenshtein distance given the lengths of the sequences s1/s2 and the weights.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form
(insertion, deletion, substitution). Default is (1, 1, 1),
which gives all three operations a weight of 1.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (int, optional) – Minimum similarity between s1 and s2 that is
considered as a result. If the similarity is smaller than score_cutoff,
0 is returned instead. Default is None, which deactivates
this behaviour.
score_hint (int, optional) – Expected similarity between s1 and s2. This is used to select a
faster implementation. Default is None, which deactivates this behaviour.
Returns:
similarity – similarity between s1 and s2
Return type:
int
Raises:
ValueError – If unsupported weights are provided.
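A worked instance of the max - distance formula (assuming
rapidfuzz.distance.Levenshtein): the maximal possible distance is 11 and the
actual distance is 2, so the similarity is 9:

>>> from rapidfuzz.distance import Levenshtein
>>> Levenshtein.similarity("lewenstein", "levenshtein")
9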
Calculates a normalized Levenshtein similarity in the range [0, 1] using custom
costs for insertion, deletion and substitution.
This is calculated as 1 - normalized_distance.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form
(insertion, deletion, substitution). Default is (1, 1, 1),
which gives all three operations a weight of 1.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0.
For norm_sim < score_cutoff, 0 is returned instead. Default is None,
which deactivates this behaviour.
score_hint (float, optional) – Expected normalized similarity between s1 and s2. This is used to select a
faster implementation. Default is None, which deactivates this behaviour.
Returns:
norm_sim – normalized similarity between s1 and s2 as a float between 0 and 1.0
Return type:
float
Raises:
ValueError – If unsupported weights are provided.
Examples
Find the normalized Levenshtein similarity between two strings:
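The example below is a doctest-style sketch assuming
rapidfuzz.distance.Levenshtein; the value is 1 - 2/11:

>>> from rapidfuzz.distance import Levenshtein
>>> round(Levenshtein.normalized_similarity("lewenstein", "levenshtein"), 4)
0.8182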
Returns the edit operations (insertions, deletions, substitutions) required to turn s1 into s2.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_hint (int, optional) – Expected distance between s1 and s2. This is used to select a
faster implementation. Default is None, which deactivates this behaviour.
Returns:
editops – edit operations required to turn s1 into s2
Return type:
Editops
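For illustration (a sketch assuming rapidfuzz.distance.Levenshtein, where each
edit operation unpacks into a (tag, src_pos, dest_pos) triple):

>>> from rapidfuzz.distance import Levenshtein
>>> for tag, src_pos, dest_pos in Levenshtein.editops("qabxcd", "abycdf"):
...     print("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos))
 delete s1[0] s2[0]
replace s1[3] s2[2]
 insert s1[6] s2[5]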
Returns the edit operations as difflib-style opcodes describing how to turn s1 into s2.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_hint (int, optional) – Expected distance between s1 and s2. This is used to select a
faster implementation. Default is None, which deactivates this behaviour.
Returns:
opcodes – edit operations required to turn s1 into s2
Return type:
Opcodes
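For illustration (a sketch assuming rapidfuzz.distance.Levenshtein, where each
opcode unpacks into a difflib-style (tag, i1, i2, j1, j2) tuple of slice
boundaries):

>>> from rapidfuzz.distance import Levenshtein
>>> for tag, i1, i2, j1, j2 in Levenshtein.opcodes("qabxcd", "abycdf"):
...     print("%7s s1[%d:%d] s2[%d:%d]" % (tag, i1, i2, j1, j2))
 delete s1[0:1] s2[0:0]
  equal s1[1:3] s2[0:2]
replace s1[3:4] s2[2:3]
  equal s1[4:6] s2[3:5]
 insert s1[6:6] s2[5:6]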
Because the Levenshtein module uses different implementations depending on the
weights, the performance characteristics differ as well. The following sections
show the performance for the different possible weights.
The following image shows a benchmark of the uniform Levenshtein distance in
multiple Python libraries. All of them are implemented either in C/C++ or Cython.
The graph shows that python-Levenshtein is the only library with a time
complexity of O(NM), while all other libraries have a time complexity of
O(⌈N/64⌉M). Especially for long strings, RapidFuzz is a lot faster than
all the other tested libraries.
The following image shows a benchmark of the Indel distance in RapidFuzz
and python-Levenshtein. Similar to the uniform Levenshtein distance,
python-Levenshtein uses an implementation with a time complexity of O(NM),
while RapidFuzz has a time complexity of O(⌈N/64⌉M).
Depending on the input parameters, different optimized implementations are used
to improve the performance. These implementations are described in the following
sections.
The implementation for the uniform Levenshtein distance has a worst-case
performance of O(⌈N/64⌉M). It uses the following optimized implementations:
If score_cutoff is 0, the distance can be calculated using a direct comparison,
since no difference between the strings is allowed. The time complexity of
this algorithm is O(N).
A common prefix/suffix of the two compared strings does not affect
the Levenshtein distance, so the affix is removed before calculating the
distance.
If score_cutoff is ≤ 3 the mbleven algorithm is used. This algorithm
enumerates all edit-operation sequences that are possible under
the threshold score_cutoff. The time complexity of this algorithm is O(N).
If the length of the shorter string is ≤ 64 after removing the common affix,
Hyyrö's algorithm is used, which calculates the Levenshtein distance in
parallel (a Python sketch of this single-word variant follows this list).
The algorithm is described by Hyyrö [Hyyro03]. The time complexity of this
algorithm is O(N).
If the length of the shorter string is greater than 64 after removing the
common affix, a blockwise implementation of Hyyrö's algorithm is used, which
calculates the Levenshtein distance in parallel (64 characters at a time).
The algorithm is described by Hyyrö [Hyyro03]. The time complexity of this
algorithm is O(⌈N/64⌉M).
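To make the bit-parallel idea concrete, here is a minimal Python sketch of the
single-word variant for len(s1) ≤ 64. It is illustrative only, not the
library's implementation; the function name and the dict-based pattern-match
table are choices made for readability:

def levenshtein_hyyro(s1, s2):
    """Bit-parallel Levenshtein distance (Hyyrö [Hyyro03]) for len(s1) <= 64."""
    m = len(s1)
    if m == 0:
        return len(s2)
    assert m <= 64, "single-word sketch; longer strings need the blockwise variant"
    full = (1 << m) - 1                  # mask keeping the low m bits
    # PM[c]: bit i is set iff s1[i] == c (pattern match vectors)
    PM = {}
    for i, c in enumerate(s1):
        PM[c] = PM.get(c, 0) | (1 << i)
    VP, VN = full, 0                     # vertical positive/negative deltas
    dist = m
    high = 1 << (m - 1)                  # tracks the bottom row of the DP matrix
    for c in s2:
        X = PM.get(c, 0)
        D0 = (((X & VP) + VP) ^ VP) | X | VN
        HP = VN | (~(D0 | VP) & full)
        HN = D0 & VP
        if HP & high:                    # horizontal delta +1 in the bottom row
            dist += 1
        if HN & high:                    # horizontal delta -1 in the bottom row
            dist -= 1
        HP = ((HP << 1) | 1) & full
        HN = (HN << 1) & full
        VP = HN | (~(D0 | HP) & full)
        VN = D0 & HP
    return dist

>>> levenshtein_hyyro("kitten", "sitting")
3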
The implementation for other weights is based on Wagner-Fischer.
It has a time complexity of O(N*M) and a memory usage of O(N).
Further details can be found in Wagner and Fischer [WF74].
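To illustrate the scheme, here is an independent Python sketch of
Wagner-Fischer with custom costs and the O(N) two-row memory layout; the
function name and exact signature are assumptions, not the library's API:

def weighted_levenshtein(s1, s2, weights=(1, 1, 1)):
    """Wagner-Fischer [WF74] with (insertion, deletion, substitution) costs."""
    insert_cost, delete_cost, substitute_cost = weights
    # row[j] holds the cost of turning s1[:i] into s2[:j]; initially i = 0,
    # so s2[:j] is reached by j insertions.
    row = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        diag = row[0]                    # D[i-1][j-1]
        row[0] = i * delete_cost         # turning s1[:i] into the empty string
        for j, c2 in enumerate(s2, 1):
            up = row[j]                  # D[i-1][j]
            if c1 == c2:
                row[j] = diag            # match: carry the diagonal cost over
            else:
                row[j] = min(row[j - 1] + insert_cost,
                             up + delete_cost,
                             diag + substitute_cost)
            diag = up
    return row[-1]

>>> weighted_levenshtein("lewenstein", "levenshtein", weights=(1, 1, 2))
3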