Levenshtein ----------- .. automodule:: rapidfuzz.distance.Levenshtein This implementation supports the usage of different weights for Insertion/Deletion/Substitution. The uniform Levenshtein distance refers to ``weights=(1,1,1)`` and the Indel distance refers to ``weights=(1,1,2)``. All other weights are referred to as generic Levenshtein distance. Functions ^^^^^^^^^ distance ~~~~~~~~ .. autofunction:: rapidfuzz.distance.Levenshtein.distance normalized_distance ~~~~~~~~~~~~~~~~~~~ .. autofunction:: rapidfuzz.distance.Levenshtein.normalized_distance similarity ~~~~~~~~~~ .. autofunction:: rapidfuzz.distance.Levenshtein.similarity normalized_similarity ~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: rapidfuzz.distance.Levenshtein.normalized_similarity editops ~~~~~~~ .. autofunction:: rapidfuzz.distance.Levenshtein.editops opcodes ~~~~~~~ .. autofunction:: rapidfuzz.distance.Levenshtein.opcodes Performance ^^^^^^^^^^^ Since the Levenshtein module uses different implementations based on the weights used, this leads to different performance characteristics. The following sections show the performance for the different possible weights. Uniform ~~~~~~~ The following image shows a benchmark of the uniform Levenshtein distance in multiple Python libraries. All of them are implemented either in C/C++ or Cython. The graph shows, that python-Levenshtein is the only library with a time complexity of ``O(NM)``, while all other libraries have a time complexity of ``O([N/64]M)``. Especially for long strings RapidFuzz is a lot faster than all the other tested libraries. .. image:: img/uniform_levenshtein.svg :align: center Indel ~~~~~ The following image shows a benchmark of the Indel distance in RapidFuzz and python-Levenshtein. Similar to the normal Levenshtein distance python-Levenshtein uses an implementation with a time complexity of ``O(NM)``, while RapidFuzz has a time complexity of ``O([N/64]M)``. .. image:: img/indel_levenshtein.svg :align: center Implementation Notes ^^^^^^^^^^^^^^^^^^^^ Depending on the used input parameters, different optimized implementation are used to improve the performance. These implementations are described in the following sections. Uniform ~~~~~~~ The implementation for the uniform Levenshtein distance has a worst-case performance of ``O([N/64]M)``. It uses the following optimized implementations: - if score_cutoff is 0 the similarity can be calculated using a direct comparison, since no difference between the strings is allowed. The time complexity of this algorithm is ``O(N)``. - A common prefix/suffix of the two compared strings does not affect the Levenshtein distance, so the affix is removed before calculating the similarity. - If score_cutoff is ≤ 3 the mbleven algorithm is used. This algorithm checks all possible edit operations that are possible under the threshold ``score_cutoff``. The time complexity of this algorithm is ``O(N)``. - If the length of the shorter string is ≤ 64 after removing the common affix Hyyrös' algorithm is used, which calculates the Levenshtein distance in parallel. The algorithm is described by :cite:t:`2003:hyrroe`. The time complexity of this algorithm is ``O(N)``. - If the length of the shorter string is ≥ 64 after removing the common affix a blockwise implementation of Hyyrös' algorithm is used, which calculates the Levenshtein distance in parallel (64 characters at a time). The algorithm is described by :cite:t:`2003:hyrroe`. The time complexity of this algorithm is ``O([N/64]M)``. Indel ~~~~~ The Indel distance is available as a stand alone implementation. Further details can be found in `here `__. Generic ~~~~~~~ The implementation for other weights is based on Wagner-Fischer. It has a performance of ``O(N * M)`` and has a memory usage of ``O(N)``. Further details can be found in :cite:t:`1974:WagFisch`.