Alger logo
𝔸𝕝𝕘𝕖𝕣
About

Longest Common Subsequence

"""
LCS Problem Statement: Given two sequences, find the length of longest subsequence
present in both of them.  A subsequence is a sequence that appears in the same relative
order, but not necessarily continuous.
Example:"abc", "abg" are subsequences of "abcdefgh".
"""


def longest_common_subsequence(x: str, y: str):
    """
    Finds the longest common subsequence between two strings. Also returns the
    The subsequence found

    Parameters
    ----------

    x: str, one of the strings
    y: str, the other string

    Returns
    -------
    L[m][n]: int, the length of the longest subsequence. Also equal to len(seq)
    Seq: str, the subsequence found

    >>> longest_common_subsequence("programming", "gaming")
    (6, 'gaming')
    >>> longest_common_subsequence("physics", "smartphone")
    (2, 'ph')
    >>> longest_common_subsequence("computer", "food")
    (1, 'o')
    """
    # find the length of strings

    assert x is not None
    assert y is not None

    m = len(x)
    n = len(y)

    # declaring the array for storing the dp values
    L = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                match = 1
            else:
                match = 0

            L[i][j] = max(L[i - 1][j], L[i][j - 1], L[i - 1][j - 1] + match)

    seq = ""
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            match = 1
        else:
            match = 0

        if L[i][j] == L[i - 1][j - 1] + match:
            if match == 1:
                seq = x[i - 1] + seq
            i -= 1
            j -= 1
        elif L[i][j] == L[i - 1][j]:
            i -= 1
        else:
            j -= 1

    return L[m][n], seq


if __name__ == "__main__":
    a = "AGGTAB"
    b = "GXTXAYB"
    expected_ln = 4
    expected_subseq = "GTAB"

    ln, subseq = longest_common_subsequence(a, b)
    print("len =", ln, ", sub-sequence =", subseq)
    import doctest

    doctest.testmod()
About this Algorithm

Problem Statement

Given two strings S and T, find the length of the longest common subsequence (LCS).

Approach

Let the dp[i][j] be the length of the longest common subsequence of prefixes S[1..i] and T[1..j]. Our answer (the length of LCS) is dp[|S|][|T|] since the prefix of the length of string is the string itself.

Both dp[0][i] and dp[i][0] are 0 for any i since the LCS of empty prefix and anything else is an empty string.

Now let's try to calculate dp[i][j] for any i, j. Let's say S[1..i] = *A and T[1..j] = *B where * stands for any sequence of letters (could be different for S and T), A stands for any letter and B stands for any letter different from A. Since A != B, our LCS doesn't include A or B as a last character. So we could try to throw away A or B character. If we throw A, our LCS length will be dp[i - 1][j] (since we have prefixes S[1..i - 1] and T[1..j]). If we try to throw B character, we will have prefixes S[1..i] and T[1..j - 1] so the length of LCS will be dp[i][j - 1]. As we are looking for the Longest common subsequence, we will pick the maximum value from dp[i][j - 1] and dp[i - 1][j].

But what if S[1..i] = *A and T[1..j] = *A? We could say that the LCS of our prefixes is LCS of prefixes S[1..i - 1] and T[1..j - 1] plus the letter A. So dp[i][j] = dp[i - 1][j - 1] + 1 if S[i] = T[j].

We could see that we can fill our dp table row by row, column by column. So our algorithm will be like:

  • Let's say that we have strings S of the length N and T of the length M (numbered from 1). Let's create the table dp of size (N + 1) x (M + 1) numbered from 0.
  • Let's fill the 0th row and the 0th column of dp with 0.
  • Then, we follow the algorithm:
for i in range(1..N):
    for j in range(1..M):
        if(S[i] == T[j])
            dp[i][j] = dp[i - 1][j - 1] + 1
        else
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

Time Complexity

O(N * M) In any case

Space Complexity

O(N * M) - simple implementation O(min {N, M}) - two-layers implementation (as dp[i][j] depends on only i-th and i-th layers, we coudld store only two layers).

Example

Let's say we have strings ABCB and BBCB. We will build such a table:

# # A B C B
# 0 0 0 0 0
B 0 ? ? ? ?
B 0 ? ? ? ?
C 0 ? ? ? ?
B 0 ? ? ? ?

Now we will start to fill our table from 1st row. Since S[1] = A and T[1] = B, the dp[1][1] will be the maximal value of dp[0][1] = 0 and dp[1][0] = 0. So dp[1][1] = 0. But now S[2] = B = T[1], so dp[1][2] = dp[0][1] + 1 = 1. dp[1][3] is 1 since A != C and we pick max{dp[1][2], dp[0][3]}. And dp[1][4] = dp[0][3] + 1 = 1.

# # A B C B
# 0 0 0 0 0
B 0 0 1 1 1
B 0 ? ? ? ?
C 0 ? ? ? ?
B 0 ? ? ? ?

Now let's fill the other part of the table:

# # A B C B
# 0 0 0 0 0
B 0 0 1 1 1
B 0 0 1 1 2
C 0 0 1 2 2
B 0 0 1 2 3

So the length of LCS is dp[4][4] = 3.

Code Implementation Links

Video Explanation

Video explanation by Tushar Roy

Algerlogo

© Alger 2022

About us

We are a group of programmers helping each other build new things, whether it be writing complex encryption programs, or simple ciphers. Our goal is to work together to document and model beautiful, helpful and interesting algorithms using code. We are an open-source community - anyone can contribute. We check each other's work, communicate and collaborate to solve problems. We strive to be welcoming, respectful, yet make sure that our code follows the latest programming guidelines.