Longest Common Subsequence

Functions

distance

rapidfuzz.distance.LCSseq.distance(s1, s2, *, processor=None, score_cutoff=None)

Calculates the LCS distance in the range [0, max], where max is max(len1, len2).

This is calculated as max(len1, len2) - similarity.

Parameters:

s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.
score_cutoff (int, optional) – Maximum distance between s1 and s2 that is considered as a result. If the distance is greater than score_cutoff, score_cutoff + 1 is returned instead. Default is None, which deactivates this behaviour.

Returns:

distance – distance between s1 and s2

Return type:

int

Examples

Find the LCS distance between two strings:

>>> from rapidfuzz.distance import LCSseq
>>> LCSseq.distance("lewenstein", "levenshtein")
2

Setting a maximum distance lets the implementation choose a faster algorithm. Here the actual distance (2) exceeds the cutoff, so score_cutoff + 1 is returned:

>>> LCSseq.distance("lewenstein", "levenshtein", score_cutoff=1)
2
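The relationship between distance and similarity described above can be sketched in pure Python. This is an illustrative reference only, not the library's implementation: the helper names `lcs_length` and `lcs_distance` are hypothetical, and the plain dynamic-programming table is far slower than what rapidfuzz uses internally.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence (classic O(N*M) table, one row kept)."""
    prev = [0] * (len(s2) + 1)
    for x in s1:
        cur = [0]
        for j, y in enumerate(s2, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(s1, s2, score_cutoff=None):
    """max(len1, len2) - similarity, with the documented score_cutoff rule."""
    dist = max(len(s1), len(s2)) - lcs_length(s1, s2)
    if score_cutoff is not None and dist > score_cutoff:
        return score_cutoff + 1
    return dist


print(lcs_distance("lewenstein", "levenshtein"))                   # 2
print(lcs_distance("lewenstein", "levenshtein", score_cutoff=0))   # 1
```

With a cutoff of 0 the true distance of 2 is never reported; the caller only learns that the distance is larger than the cutoff.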

normalized_distance

rapidfuzz.distance.LCSseq.normalized_distance(s1, s2, *, processor=None, score_cutoff=None)

Calculates a normalized LCS distance in the range [0, 1].

This is calculated as distance / max(len1, len2).

Parameters:

s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.
score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_dist > score_cutoff 1.0 is returned instead. Default is None, which deactivates this behaviour (equivalent to a cutoff of 1.0).

Returns:

norm_dist – normalized distance between s1 and s2 as a float between 0 and 1.0

Return type:

float
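As a rough illustration of the normalization and cutoff rule, the sketch below works from an already-computed integer distance and the two lengths. The function name and its signature are hypothetical, and returning 0.0 when both inputs are empty is an assumption made here to avoid dividing by zero.

```python
def normalized_distance(dist, len1, len2, score_cutoff=None):
    """distance / max(len1, len2), mapped to 1.0 when it exceeds score_cutoff."""
    maximum = max(len1, len2)
    # Assumption: two empty sequences are treated as identical (distance 0.0).
    norm_dist = dist / maximum if maximum else 0.0
    if score_cutoff is not None and norm_dist > score_cutoff:
        return 1.0
    return norm_dist


# "lewenstein" (10 items) vs "levenshtein" (11 items) has an LCS distance of 2.
print(normalized_distance(2, 10, 11))                     # 2 / 11
print(normalized_distance(2, 10, 11, score_cutoff=0.1))   # 1.0
```

The corresponding normalized similarity is `1 - normalized_distance(...)`, which for this pair is 9 / 11.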

similarity

rapidfuzz.distance.LCSseq.similarity(s1, s2, *, processor=None, score_cutoff=None)

Calculates the length of the longest common subsequence of s1 and s2.

Parameters:

s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.
score_cutoff (int, optional) – Minimum similarity between s1 and s2 that is considered as a result. If the similarity is smaller than score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2

Return type:

int

normalized_similarity

rapidfuzz.distance.LCSseq.normalized_similarity(s1, s2, *, processor=None, score_cutoff=None)

Calculates a normalized LCS similarity in the range [0, 1].

This is calculated as 1 - normalized_distance.

Parameters:

s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.
score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_sim < score_cutoff 0 is returned instead. Default is None, which deactivates this behaviour (equivalent to a cutoff of 0).

Returns:

norm_sim – normalized similarity between s1 and s2 as a float between 0 and 1.0

Return type:

float

Examples

Find the normalized LCS similarity between two strings:

>>> from rapidfuzz.distance import LCSseq
>>> LCSseq.normalized_similarity("lewenstein", "levenshtein")
0.8181818181818181

Setting a score_cutoff lets the implementation choose a faster algorithm. Here the similarity is below the cutoff, so 0 is returned:

>>> LCSseq.normalized_similarity("lewenstein", "levenshtein", score_cutoff=0.9)
0.0

When a processor is supplied, s1 and s2 do not have to be strings:

>>> LCSseq.normalized_similarity(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])
0.8181818181818181

editops

rapidfuzz.distance.LCSseq.editops(s1, s2, *, processor=None)

Return Editops describing how to turn s1 into s2.

Parameters:

s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

editops – edit operations required to turn s1 into s2

Return type:

Editops

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described in [6]. It has a time complexity and memory usage of O(⌈N/64⌉ * M).
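To give a feel for where the N/64 factor comes from, here is a minimal sketch of a bit-parallel LCS-length recurrence in the style attributed to Hyyrö. It is not the library's code and it computes only the length, not the alignment. The function name is hypothetical, and a single arbitrary-precision Python integer stands in for the array of 64-bit machine words a native implementation would use.

```python
def lcs_length_bitparallel(s1, s2):
    """LCS length using one bit per position of s1 (bit-vector state)."""
    n = len(s1)
    full = (1 << n) - 1  # keep the state truncated to n bits

    # For every distinct item of s1, a bitmask of the positions where it occurs.
    masks = {}
    for i, item in enumerate(s1):
        masks[item] = masks.get(item, 0) | (1 << i)

    state = full  # all bits set: no matches recorded yet
    for item in s2:
        u = state & masks.get(item, 0)
        state = ((state + u) | (state - u)) & full

    # Every cleared bit corresponds to one item of the common subsequence.
    return n - bin(state).count("1")


print(lcs_length_bitparallel("lewenstein", "levenshtein"))  # 9
```

Each item of s2 costs a constant number of operations on an n-bit vector, so with 64-bit words the work per item is proportional to the number of words needed to hold s1.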

References

Examples

>>> from rapidfuzz.distance import LCSseq
>>> for tag, src_pos, dest_pos in LCSseq.editops("qabxcd", "abycdf"):
...    print(("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos)))
 delete s1[0] s2[0]
 delete s1[3] s2[2]
 insert s1[4] s2[2]
 insert s1[6] s2[5]
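One way to read the output above is to replay it. The sketch below applies a list of (tag, src_pos, dest_pos) tuples, copied by hand from the printed result, to rebuild s2 from s1. The helper `apply_editops` is hypothetical and handles only the delete and insert tags that appear in this example.

```python
def apply_editops(ops, s1, s2):
    """Rebuild s2 from s1 by replaying (tag, src_pos, dest_pos) operations in order."""
    out = []
    src = 0  # next unread position in s1
    for tag, src_pos, dest_pos in ops:
        out.extend(s1[src:src_pos])  # copy the unchanged run before this operation
        src = src_pos
        if tag == "delete":
            src += 1                  # skip one item of s1
        elif tag == "insert":
            out.append(s2[dest_pos])  # take one item from s2
        else:
            raise ValueError(f"unexpected tag: {tag}")
    out.extend(s1[src:])              # copy whatever remains of s1
    return "".join(out)


# Operations transcribed from the example output above.
ops = [("delete", 0, 0), ("delete", 3, 2), ("insert", 4, 2), ("insert", 6, 5)]
print(apply_editops(ops, "qabxcd", "abycdf"))  # abycdf
```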

opcodes

rapidfuzz.distance.LCSseq.opcodes(s1, s2, *, processor=None)

Return Opcodes describing how to turn s1 into s2.

Parameters:

s1 (Sequence[Hashable]) – First string to compare.
s2 (Sequence[Hashable]) – Second string to compare.
processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

opcodes – edit operations required to turn s1 into s2

Return type:

Opcodes

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described in [7]. It has a time complexity and memory usage of O(⌈N/64⌉ * M).

References

Examples

>>> from rapidfuzz.distance import LCSseq

>>> a = "qabxcd"
>>> b = "abycdf"
>>> for tag, i1, i2, j1, j2 in LCSseq.opcodes(a, b):
...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
 delete a[0:1] (q) b[0:0] ()
  equal a[1:3] (ab) b[0:2] (ab)
 delete a[3:4] (x) b[2:2] ()
 insert a[4:4] () b[2:3] (y)
  equal a[4:6] (cd) b[3:5] (cd)
 insert a[6:6] () b[5:6] (f)
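The equal blocks in this output spell out a longest common subsequence. As a small sketch, the tuples below are transcribed by hand from the printed result rather than taken from a live Opcodes object, and the helper name `common_subsequence` is hypothetical.

```python
def common_subsequence(opcodes, a):
    """Concatenate the slices of `a` covered by 'equal' blocks."""
    return "".join(a[i1:i2] for tag, i1, i2, j1, j2 in opcodes if tag == "equal")


a = "qabxcd"
b = "abycdf"
# (tag, i1, i2, j1, j2) tuples transcribed from the example output above.
opcodes = [
    ("delete", 0, 1, 0, 0),
    ("equal", 1, 3, 0, 2),
    ("delete", 3, 4, 2, 2),
    ("insert", 4, 4, 2, 3),
    ("equal", 4, 6, 3, 5),
    ("insert", 6, 6, 5, 6),
]

lcs = common_subsequence(opcodes, a)
print(lcs, len(lcs))  # abcd 4
```

The same string results from joining the `b[j1:j2]` slices of the equal blocks, since those blocks cover identical content in both inputs.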