Levenshtein

The Levenshtein (edit) distance is a string metric to measure the difference between two strings/sequences s1 and s2. It’s defined as the minimum number of insertions, deletions or substitutions required to transform s1 into s2.

This implementation supports the usage of different weights for Insertion/Deletion/Substitution. The uniform Levenshtein distance refers to weights=(1,1,1) and the Indel distance refers to weights=(1,1,2). All other weights are referred to as generic Levenshtein distance.

Functions

distance

rapidfuzz.distance.Levenshtein.distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None, score_hint=None)

Calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into the other according to Levenshtein with custom costs for insertion, deletion and substitution

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (int, optional) – Maximum distance between s1 and s2, that is considered as a result. If the distance is bigger than score_cutoff, score_cutoff + 1 is returned instead. Default is None, which deactivates this behaviour.

  • score_hint (int, optional) – Expected distance between s1 and s2. This is used to select a faster implementation. Default is None, which deactivates this behaviour.

Returns:

distance – distance between s1 and s2

Return type:

int

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

Examples

Find the Levenshtein distance between two strings:

>>> from rapidfuzz.distance import Levenshtein
>>> Levenshtein.distance("lewenstein", "levenshtein")
2

Setting a maximum distance allows the implementation to select a more efficient implementation:

>>> Levenshtein.distance("lewenstein", "levenshtein", score_cutoff=1)
2

It is possible to select different weights by passing a weight tuple.

>>> Levenshtein.distance("lewenstein", "levenshtein", weights=(1,1,2))
3

normalized_distance

rapidfuzz.distance.Levenshtein.normalized_distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None, score_hint=None)

Calculates a normalized levenshtein distance in the range [1, 0] using custom costs for insertion, deletion and substitution.

This is calculated as distance / max, where max is the maximal possible Levenshtein distance given the lengths of the sequences s1/s2 and the weights.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_dist > score_cutoff 1.0 is returned instead. Default is None, which deactivates this behaviour.

  • score_hint (float, optional) – Expected normalized distance between s1 and s2. This is used to select a faster implementation. Default is None, which deactivates this behaviour.

Returns:

norm_dist – normalized distance between s1 and s2 as a float between 1.0 and 0.0

Return type:

float

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

similarity

rapidfuzz.distance.Levenshtein.similarity(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None, score_hint=None)

Calculates the levenshtein similarity in the range [max, 0] using custom costs for insertion, deletion and substitution.

This is calculated as max - distance, where max is the maximal possible Levenshtein distance given the lengths of the sequences s1/s2 and the weights.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (int, optional) – Maximum distance between s1 and s2, that is considered as a result. If the similarity is smaller than score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

  • score_hint (int, optional) – Expected similarity between s1 and s2. This is used to select a faster implementation. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2

Return type:

int

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

normalized_similarity

rapidfuzz.distance.Levenshtein.normalized_similarity(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None, score_hint=None)

Calculates a normalized levenshtein similarity in the range [0, 1] using custom costs for insertion, deletion and substitution.

This is calculated as 1 - normalized_distance

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_sim < score_cutoff 0 is returned instead. Default is None, which deactivates this behaviour.

  • score_hint (int, optional) – Expected normalized similarity between s1 and s2. This is used to select a faster implementation. Default is None, which deactivates this behaviour.

Returns:

norm_sim – normalized similarity between s1 and s2 as a float between 0 and 1.0

Return type:

float

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

Examples

Find the normalized Levenshtein similarity between two strings:

>>> from rapidfuzz.distance import Levenshtein
>>> Levenshtein.normalized_similarity("lewenstein", "levenshtein")
0.81818181818181

Setting a score_cutoff allows the implementation to select a more efficient implementation:

>>> Levenshtein.normalized_similarity("lewenstein", "levenshtein", score_cutoff=0.85)
0.0

It is possible to select different weights by passing a weight tuple.

>>> Levenshtein.normalized_similarity("lewenstein", "levenshtein", weights=(1,1,2))
0.85714285714285

When a different processor is used s1 and s2 do not have to be strings

>>> Levenshtein.normalized_similarity(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])
0.81818181818181

editops

rapidfuzz.distance.Levenshtein.editops(s1, s2, *, processor=None, score_hint=None)

Return Editops describing how to turn s1 into s2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_hint (int, optional) – Expected distance between s1 and s2. This is used to select a faster implementation. Default is None, which deactivates this behaviour.

Returns:

editops – edit operations required to turn s1 into s2

Return type:

Editops

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described [8]. It has a time complexity and memory usage of O([N/64] * M).

References

Examples

>>> from rapidfuzz.distance import Levenshtein
>>> for tag, src_pos, dest_pos in Levenshtein.editops("qabxcd", "abycdf"):
...    print(("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos)))
 delete s1[1] s2[0]
replace s1[3] s2[2]
 insert s1[6] s2[5]

opcodes

rapidfuzz.distance.Levenshtein.opcodes(s1, s2, *, processor=None, score_hint=None)

Return Opcodes describing how to turn s1 into s2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_hint (int, optional) – Expected distance between s1 and s2. This is used to select a faster implementation. Default is None, which deactivates this behaviour.

Returns:

opcodes – edit operations required to turn s1 into s2

Return type:

Opcodes

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described [9]. It has a time complexity and memory usage of O([N/64] * M).

References

Examples

>>> from rapidfuzz.distance import Levenshtein
>>> a = "qabxcd"
>>> b = "abycdf"
>>> for tag, i1, i2, j1, j2 in Levenshtein.opcodes("qabxcd", "abycdf"):
...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
 delete a[0:1] (q) b[0:0] ()
  equal a[1:3] (ab) b[0:2] (ab)
replace a[3:4] (x) b[2:3] (y)
  equal a[4:6] (cd) b[3:5] (cd)
 insert a[6:6] () b[5:6] (f)

Performance

Since the Levenshtein module uses different implementations based on the weights used, this leads to different performance characteristics. The following sections show the performance for the different possible weights.

Uniform

The following image shows a benchmark of the uniform Levenshtein distance in multiple Python libraries. All of them are implemented either in C/C++ or Cython. The graph shows, that python-Levenshtein is the only library with a time complexity of O(NM), while all other libraries have a time complexity of O([N/64]M). Especially for long strings RapidFuzz is a lot faster than all the other tested libraries.

../../_images/uniform_levenshtein.svg

Indel

The following image shows a benchmark of the Indel distance in RapidFuzz and python-Levenshtein. Similar to the normal Levenshtein distance python-Levenshtein uses an implementation with a time complexity of O(NM), while RapidFuzz has a time complexity of O([N/64]M).

../../_images/indel_levenshtein.svg

Implementation Notes

Depending on the used input parameters, different optimized implementation are used to improve the performance. These implementations are described in the following sections.

Uniform

The implementation for the uniform Levenshtein distance has a worst-case performance of O([N/64]M). It uses the following optimized implementations:

  • if score_cutoff is 0 the similarity can be calculated using a direct comparison, since no difference between the strings is allowed. The time complexity of this algorithm is O(N).

  • A common prefix/suffix of the two compared strings does not affect the Levenshtein distance, so the affix is removed before calculating the similarity.

  • If score_cutoff is ≤ 3 the mbleven algorithm is used. This algorithm checks all possible edit operations that are possible under the threshold score_cutoff. The time complexity of this algorithm is O(N).

  • If the length of the shorter string is ≤ 64 after removing the common affix Hyyrös’ algorithm is used, which calculates the Levenshtein distance in parallel. The algorithm is described by Hyyrö [Hyyro03]. The time complexity of this algorithm is O(N).

  • If the length of the shorter string is ≥ 64 after removing the common affix a blockwise implementation of Hyyrös’ algorithm is used, which calculates the Levenshtein distance in parallel (64 characters at a time). The algorithm is described by Hyyrö [Hyyro03]. The time complexity of this algorithm is O([N/64]M).

Indel

The Indel distance is available as a stand alone implementation. Further details can be found in here.

Generic

The implementation for other weights is based on Wagner-Fischer. It has a performance of O(N * M) and has a memory usage of O(N). Further details can be found in Wagner and Fischer [WF74].