# Matching with allowed mismatch (Java)

## Contents |

## [edit] Overview

An implementation of string matching with a defined number of at most *k* mismatches based on *longest common extension* (*lce*) computation as described in Gusfield 1999:200. This is a form of inexact string matching that allows no insertions or deletions but only matches and mismatches. When used with *lce* computation with suffix trees it has a runtime complexity of *O(km)*, where m is the length of the text. This implementation uses a simple computation of the *lce*.

## [edit] Implementation

The approach is very similar to matching with wildcards: For every position in the text, up to *k* lce queries are executed, and if the lce reaches the end of the pattern, then more than k mismatches are needed to match. As at most *k* lce computations are required the runtime complexity is *O(km)*, where k is the length of the text (Gusfield 1999:200). The method takes three parameters: the text, the pattern and the number of allowed mismatches. It returns a collection of strings, the matching substrings. The algorithm, as described in Gusfield 1999:200, consists of four steps:

- Step 1: Set j to 1 and h to i and count to 0.

<<init>>=intj = 0;inth = i;intcount = 0;intn = p.length();

- Step 2: Compute the length L of the longest common extension starting at positions j of P and h of T:

<<compute_lce>>=intL = SimpleLongestCommonExtension.longestCommonExtension(p, j, t, h);

- Step 3: If j + L = n + 1, then a k-mismatch of P occurs in T starting at i (in fact, only count mismatches occur); stop.

<<collect>>=if(j + 1 + L == n + 1){result.add(t.substring(i, i + n));break;}

- Step 4: If count >= k, then increment count by one, set j to j + L + 1, set h to h + L + 1 and go to step 2. If count = k + 1, then a k-mismatch of P does not occur stating at i; stop.

<<match>>=elseif(count < k){count++; j = j + L + 1; h = h + L + 1;}elseif(count == k){break;}

## [edit] Usage

- A JUnit 4 unit test to demonstrate the usage:

<<test>>=Collection<String> results = getMatches("abentbananaend", "bend", 2); assertEquals(Arrays.asList("bent", "bana", "aend"), results);

- The complete program:

<<MatchingWithAllowedMismatch.java>>=import static org.junit.Assert.assertEquals;import org.junit.Test;import java.util.ArrayList;import java.util.Arrays;import java.util.Collection;publicclassMatchingWithAllowedMismatch{publicstaticCollection<String> getMatches(String t, String p,intk){Collection<String> result =newArrayList<String>();for(inti = 0; i < t.length(); i++){initwhile(true){compute_lce collect match}}returnresult;}@TestpublicvoidtestGetMismatches(){test}}

- The required simple implementation of
*longest common extension*computation:

<<SimpleLongestCommonExtension.java>>=Longest common extension (Java)#8162#SimpleLongestCommonExtension.java

## [edit] References

- Gusfield, Dan (1999),
*Algorithms on Strings, Sequences and Trees*. Cambridge: University Press.

Download code |