Matching with allowed mismatch (Java)

From LiteratePrograms
Jump to: navigation, search

Contents

[edit] Overview

An implementation of string matching with a defined number of at most k mismatches based on longest common extension (lce) computation as described in Gusfield 1999:200. This is a form of inexact string matching that allows no insertions or deletions but only matches and mismatches. When used with lce computation with suffix trees it has a runtime complexity of O(km), where m is the length of the text. This implementation uses a simple computation of the lce.

[edit] Implementation

The approach is very similar to matching with wildcards: For every position in the text, up to k lce queries are executed, and if the lce reaches the end of the pattern, then more than k mismatches are needed to match. As at most k lce computations are required the runtime complexity is O(km), where k is the length of the text (Gusfield 1999:200). The method takes three parameters: the text, the pattern and the number of allowed mismatches. It returns a collection of strings, the matching substrings. The algorithm, as described in Gusfield 1999:200, consists of four steps:

  • Step 1: Set j to 1 and h to i and count to 0.
<<init>>=

int j = 0;
int h = i;
int count = 0;
int n = p.length();
  • Step 2: Compute the length L of the longest common extension starting at positions j of P and h of T:
<<compute_lce>>=

int L = SimpleLongestCommonExtension.longestCommonExtension(p, j, t, h);
  • Step 3: If j + L = n + 1, then a k-mismatch of P occurs in T starting at i (in fact, only count mismatches occur); stop.
<<collect>>=

if (j + 1 + L == n + 1) {
    result.add(t.substring(i, i + n));
    break;
}
  • Step 4: If count >= k, then increment count by one, set j to j + L + 1, set h to h + L + 1 and go to step 2. If count = k + 1, then a k-mismatch of P does not occur stating at i; stop.
<<match>>=

else if (count < k) {
    count++;
    j = j + L + 1;
    h = h + L + 1;
} else if (count == k) {
    break;
}

[edit] Usage

  • A JUnit 4 unit test to demonstrate the usage:
<<test>>=

Collection<String> results = getMatches("abentbananaend", "bend", 2);
assertEquals(Arrays.asList("bent", "bana", "aend"), results);
  • The complete program:
<<MatchingWithAllowedMismatch.java>>=

import static org.junit.Assert.assertEquals;
import org.junit.Test;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

public class MatchingWithAllowedMismatch {
	public static Collection<String> getMatches(String t, String p, int k) {
		Collection<String> result = new ArrayList<String>();
		for (int i = 0; i < t.length(); i++) {
			init
			while (true) {
				compute_lce
				collect
				match
			}
		}
		return result;
	}
	@Test
	public void testGetMismatches() {
		test
	}
}
  • The required simple implementation of longest common extension computation:
<<SimpleLongestCommonExtension.java>>=
Longest common extension (Java)#8162#SimpleLongestCommonExtension.java

[edit] References

  • Gusfield, Dan (1999), Algorithms on Strings, Sequences and Trees. Cambridge: University Press.
Download code
hijacker
hijacker
hijacker
hijacker