Calculates the minimum number of insertions and deletions
required to change one sequence into the other. This is equivalent to the
Levenshtein distance with a substitution weight of 2.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (int, optional) – Maximum distance between s1 and s2 that is still
considered a result. If the distance is bigger than score_cutoff,
score_cutoff + 1 is returned instead. Default is None, which deactivates
this behaviour.
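A short doctest-style sketch of a call (assuming the usual import path rapidfuzz.distance.Indel; the example strings are chosen here, and the value 3 follows from the definition: the longest common subsequence of the two strings has length 9, so the distance is 10 + 11 - 2 * 9 = 3):

>>> from rapidfuzz.distance import Indel
>>> Indel.distance("lewenstein", "levenshtein")
3
>>> # a distance above score_cutoff is reported as score_cutoff + 1
>>> Indel.distance("lewenstein", "levenshtein", score_cutoff=1)
2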
Calculates a normalized Indel distance in the range [0, 1.0].
This is calculated as distance / (len1 + len2).
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0.
If norm_dist > score_cutoff, 1.0 is returned instead. Default is 1.0,
which deactivates this behaviour.
Returns:
norm_dist – normalized distance between s1 and s2 as a float between 0 and 1.0
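A corresponding sketch for the normalized distance (same illustrative strings as above; 3 / (10 + 11) ≈ 0.1429):

>>> from rapidfuzz.distance import Indel
>>> round(Indel.normalized_distance("lewenstein", "levenshtein"), 4)
0.1429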
Calculates the Indel similarity in the range [0, max].
This is calculated as (len1 + len2) - distance.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (int, optional) – Minimum similarity between s1 and s2 that is still
considered a result. If the similarity is smaller than score_cutoff,
0 is returned instead. Default is None, which deactivates
this behaviour.
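A sketch for the similarity (same illustrative strings; (10 + 11) - 3 = 18):

>>> from rapidfuzz.distance import Indel
>>> Indel.similarity("lewenstein", "levenshtein")
18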
Calculates a normalized Indel similarity in the range [0, 1].
This is calculated as 1 - normalized_distance.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0.
If norm_sim < score_cutoff, 0 is returned instead. Default is 0,
which deactivates this behaviour.
Returns:
norm_sim – normalized similarity between s1 and s2 as a float between 0 and 1.0
Return type:
float
Examples
Find the normalized Indel similarity between two strings:
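A sketch of such a call (the strings are chosen here as an illustration; the value follows from 1 - 3 / (10 + 11), since the Indel distance between them is 3):

>>> from rapidfuzz.distance import Indel
>>> round(Indel.normalized_similarity("lewenstein", "levenshtein"), 4)
0.8571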
Returns the edit operations required to turn s1 into s2.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
Returns:
editops – edit operations required to turn s1 into s2
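A brief sketch of how the returned operations might be inspected (the attribute names tag, src_pos and dest_pos are assumed from RapidFuzz's Editop type; for this metric the elementary operations are insertions and deletions):

from rapidfuzz.distance import Indel

for op in Indel.editops("lewenstein", "levenshtein"):
    # each op describes a single insertion or deletion and where it applies
    print(op.tag, op.src_pos, op.dest_pos)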
Returns the edit operations required to turn s1 into s2, grouped into opcodes.
Parameters:
s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before
comparing them. Default is None, which deactivates this behaviour.
Returns:
opcodes – edit operations required to turn s1 into s2
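A comparable sketch for opcodes (attribute names assumed from RapidFuzz's Opcode type, which mirrors difflib-style opcode tuples):

from rapidfuzz.distance import Indel

for op in Indel.opcodes("lewenstein", "levenshtein"):
    # each op covers a slice of s1 and the corresponding slice of s2
    print(op.tag, op.src_start, op.src_end, op.dest_start, op.dest_end)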
The following image shows a benchmark of the Indel distance in RapidFuzz
and python-Levenshtein. Similar to the normal Levenshtein distance,
python-Levenshtein uses an implementation with a time complexity of O(NM),
while RapidFuzz has a time complexity of O(⌈score_cutoff/64⌉ * M) (and never worse than O(⌈N/64⌉ * M)).
The following implementation is used, with a worst-case performance of O(⌈N/64⌉ * M).
If max is 0, the similarity can be calculated using a direct comparison,
since no difference between the strings is allowed. The time complexity of
this algorithm is O(N).
If max is 1 and the two strings have a similar length, the similarity can be
calculated using a direct comparison as well, since a substitution would cause
an edit distance higher than max. The time complexity of this algorithm
is O(N).
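A rough Python sketch of these two shortcuts (illustrative only, not the library's C++ implementation; score_cutoff stands in for max, and "similar length" is read as equal length here):

def try_small_cutoff(s1, s2, score_cutoff):
    """Return the Indel distance if one of the max <= 1 shortcuts applies,
    otherwise None (fall through to the general algorithms below)."""
    if score_cutoff == 0:
        # no difference allowed, so a direct comparison settles it
        return 0 if s1 == s2 else score_cutoff + 1
    if score_cutoff == 1 and len(s1) == len(s2):
        # with equal lengths any difference needs at least one deletion
        # plus one insertion, i.e. a distance of at least 2 > max
        return 0 if s1 == s2 else score_cutoff + 1
    return None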
A common prefix/suffix of the two compared strings does not affect
the Indel distance, so the affix is removed before calculating the
similarity.
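A sketch of the affix trimming (again illustrative; the library performs this step in C++):

def remove_common_affix(s1, s2):
    # strip the common prefix
    prefix = 0
    while prefix < min(len(s1), len(s2)) and s1[prefix] == s2[prefix]:
        prefix += 1
    s1, s2 = s1[prefix:], s2[prefix:]
    # strip the common suffix of what remains
    suffix = 0
    while suffix < min(len(s1), len(s2)) and s1[-1 - suffix] == s2[-1 - suffix]:
        suffix += 1
    if suffix:
        s1, s2 = s1[:-suffix], s2[:-suffix]
    return s1, s2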
If max is ≤ 4, the mbleven algorithm is used. This algorithm
checks all edit operations that are possible under
the threshold max. Unlike for the normal Levenshtein distance, this
algorithm can be used up to a threshold of 4 here, since the higher weight
of substitutions decreases the number of possible edit operations.
The time complexity of this algorithm is O(N).
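The following sketch only illustrates the counting argument behind this, not the actual mbleven matching: with insertions and deletions only, the number of each operation is fixed by the length difference and the total distance, which leaves few candidate operation patterns to verify for max ≤ 4.

def candidate_op_counts(len1, len2, max_dist=4):
    # yield (deletions, insertions) pairs that could realize an Indel
    # distance <= max_dist when turning a string of length len1 into
    # one of length len2
    for dist in range(abs(len1 - len2), max_dist + 1):
        twice_deletions = dist + len1 - len2
        if twice_deletions % 2:
            # dist must have the same parity as the length difference
            continue
        deletions = twice_deletions // 2
        yield deletions, dist - deletions

For len1 = 10 and len2 = 11 this yields only (0, 1) and (1, 2).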
If the length of the shorter string is ≤ 64 after removing the common affix,
Hyyrö's LCS algorithm is used, which calculates the Indel distance in
parallel. The algorithm is described by Hyyrö [Hyy04] and is extended with support
for UTF32 in this implementation. The time complexity of this
algorithm is O(N).
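A Python sketch of this single-word case (a simplified rendering of the bit-parallel LCS-length recurrence from [Hyy04], not the library's implementation; it assumes len(s1) ≤ 64 so one machine word suffices):

def indel_distance_bitparallel(s1, s2):
    m = len(s1)
    if m == 0:
        return len(s2)
    mask = (1 << m) - 1
    # bit i of PM[c] is set when s1[i] == c
    PM = {}
    for i, c in enumerate(s1):
        PM[c] = PM.get(c, 0) | (1 << i)
    V = mask  # all ones
    for c in s2:
        M = PM.get(c, 0)
        U = V & M
        # the LCS bit-vector update: (V + U) | (V - U)
        V = ((V + U) | (V - U)) & mask
    lcs_length = m - bin(V).count("1")
    return m + len(s2) - 2 * lcs_length

With the strings from the examples above, indel_distance_bitparallel("lewenstein", "levenshtein") evaluates to 3.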
If the length of the shorter string is greater than 64 after removing the common affix,
a blockwise implementation of Hyyrö's LCS algorithm is used, which calculates
the Indel distance in parallel (64 characters at a time).
The algorithm is described by Hyyrö [Hyy04]. The time complexity of this
algorithm is O(⌈score_cutoff/64⌉ * M).