Finite automaton string search algorithm (Java)

From LiteratePrograms
Jump to: navigation, search
Other implementations: C++ | Java
Error creating thumbnail: /bin/bash: rsvg: command not found
DFA for searching for the string "MOMMY". Each time the double-circled state is reached, a match has been found. All unspecified transitions return to the initial state in the upper left.

The finite automaton string search algorithm works by constructing a simple state machine associated with the string being searched for. As it consumes the string being searched, the current state indicates a set of possible positions in the string being searched for. It can be generalized to search for arbitrary regular expressions. For example, to the right is the state machine associated with the search word "MOMMY". The procedure is a special case of the powerset construction procedure used to transform a nondeterministic finite automaton (NFA) into a deterministic finite automaton, since it is easy to create an NFA that recognizes a particular substring.


We will use generics so that we can operate on a list of any data type, not just strings of characters.

Contents

[edit] Generating the state machine

The first step, also the most complicated and expensive step, is generating the state machine we will use for matching.

We will have a State class to represent states.

<<State class>>=
class State<E>
{
    private static int nextStateNum = 0;
    private final int num = nextStateNum++;
    public final BitSet positions;
    additional State members

    public State(BitSet bs) { positions = bs; }
    public String toString() { return Integer.toString(num); }
}
<<imports>>=
import java.util.BitSet;

The State class contains a BitSet storing the positions in the string that the state includes. The initial position is always set. If the final position (after the last character) is set, that indicate a match. Each State also has a unique integer ID associated with it, in order of creation, starting from 0. The initial state should have ID 0. This ID is not important and is just printed for debugging purposes.

We create a simple method that returns an instance of State with the specified position set or, if no such state exists, creates it. To do this we keep track of a mapping from sets of positions to states. We also keep track of a list of all the states in order of creation.

<<DFAStringSearch fields>>=
// maps the set of string positions a state represents to the state
private final Map<BitSet, State<E>> stateMap = new HashMap<BitSet, State<E>>();
// list of states in order of creation
private final List<State<E>> states = new ArrayList<State<E>>();

<<get state method>>=
public State<E> getState(BitSet s)
{
    if (stateMap.containsKey(s))
        return stateMap.get(s);
    else
    {
        State<E> st = new State<E>(s);
        stateMap.put(s, st);
        states.add(st);
        return st;
    }
}
<<imports>>=
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;

Each state will have a transitions map which determines, given the current state and the next character in the input, which state we go to next.

<<additional State members>>=
public final Map<E,State<E>> transitions = new HashMap<E,State<E>>();

To determine how to fill in this table, we look at the characters immediately following the "cursors" in the items listed in the current state. For example, if the item "M|OMMY" is in the current state, and we see a character "O", we go to a state containing the item "MO|MMY". If we see any character other than "O", we "throw away" the item "M|OMMY" instead of advancing the position. (In our case, the transition will not be defined for any character other than "O"; and if we encounter that we will simply go back to the initial state.) In the diagram above, there is a state containing the items "|MOMMY", "MOMM|Y", and "M|OMMY"; depending on whether we see M, Y, or O next, the next state with contain just one of "M|OMMY", "MOMMY|", or "MO|MMY" (they will also contain the initial state, which all states contain).

If there are multiple items with the same next character, we keep and advance all of them when we see that character. For example, one state above contains both "|MOMMY" and "MO|MMY"; if we see M, we progress to a state containing both "M|OMMY" and "MOM|MY".

To perform this in code, we visit states in the order we discover them. As transitions maps get filled in, more states may be added; we will also iterate through those new states as the states.size() increases. When the loop is done, all states will have their transition maps filled. We could not use an Iterator here because it would complain about concurrent modifications.

For each state, we examine each position in that state and the next character at that position (if we have not already examined it at another position):

<<DFAStringSearch constructor>>=
public DFAStringSearch(E[] pattern)
{
    create initial state
    for (int i = 0; i < states.size(); i++)
    {
        State<E> s = states.get(i);
        for (int j = s.positions.nextSetBit(0); j >= 0; j = s.positions.nextSetBit(j+1))
        {
            deal with final position
            E cNext = pattern[j];
            if (!s.transitions.containsKey(cNext))
                fillTransitionTableEntry(pattern, s, cNext);
        }
    }
}

The initial state again contains only the initial position, which we create by calling getState():

<<DFAStringSearch fields>>=
public State<E> initialState;
<<create initial state>>=
BitSet initialPos = new BitSet();
initialPos.set(0);
initialState = getState(initialPos);

The one tricky position to deal with is the final position, where there are no more characters. We mark these states as "final" (a term from automata theory meaning a match is complete when this is state is reached). (It cannot be named "final" because that is a keyword in Java.)

<<additional State members>>=
public boolean finale;

<<deal with final position>>=
if (j == pattern.length)
{
    s.finale = true;
    break;
}

In fillTransitionTableEntry, we construct the set of positions for the next state and update the table entry to point to it. We do this by just taking the current state positions that have the correct next character, advancing them by 1, and throwing in the initial position:

<<fill in transition table entry>>=
private void fillTransitionTableEntry(E[] pattern, State<E> s, E cNext)
{
    BitSet newPositions = new BitSet();
    newPositions.set(0);
    for (int i = s.positions.nextSetBit(0); i >= 0 && i < pattern.length; i = s.positions.nextSetBit(i+1))
    {
        if (pattern[i].equals(cNext))
            newPositions.set(i + 1);
    }
    s.transitions.put(cNext, getState(newPositions));
    print trace message describing new entry
}

The trace message, which helps us to test our implementation, will be printed to standard err and gives the source state, character, and destination state:

<<print trace message describing new entry>>=
System.err.println("Adding edge " + s + " -" + cNext + "-> " + s.transitions.get(cNext));

[edit] Performing the search

Once we have our machine built, the actual string matching step is very simple. We feed the string being searched into the machine, following the transition maps to move between states as instructed, until we reach a final state or use up the string:

<<perform DFA-based search>>=
public int search(E[] searchFor, E[] searchIn)
{
    State<E> curState = initialState;
    int curPos;
    for (curPos = 0; curPos < searchIn.length && !curState.finale; curPos++)
    {
        curState = curState.transitions.get(searchIn[curPos]);
        if (curState == null)
            curState = initialState;
    }
    if (curState.finale)
        return curPos - searchFor.length;
    else
        return -1;
}

[edit] Files

For now, we'll just create a single source module containing the above data and code. Ideally, each match would have an associated object to enable multiply concurrent uses of the search, and this interface would be described in a header file.

<<DFAStringSearch.java>>=
imports

State class

public class DFAStringSearch<E>
{
    DFAStringSearch fields

    DFAStringSearch constructor
    get state method
    fill in transition table entry
    perform DFA-based search
    main method
}

[edit] Test driver

By adding a small test driver to the file we can run it from the command line on test inputs:

Error creating thumbnail: /bin/bash: rsvg: command not found
DFA for searching for the string "MOMMY". Each time the double-circled state is reached, a match has been found. All unspecified transitions return to the initial state in the upper left.
<<main method>>=
private static Character[] str2charArray(String str) {
    Character[] result = new Character[str.length()];
    for (int i = 0; i < str.length(); i++)
        result[i] = str.charAt(i);
    return result;
}

public static void main(String[] args)
{
    Character[] a = str2charArray(args[0]), b = str2charArray(args[1]);
    DFAStringSearch<Character> foo = new DFAStringSearch<Character>(a);
    int result = foo.search(a, b);
    if (result == -1)
        System.out.println("No match found.");
    else
    {
        System.out.println("Matched at position " + result + ":");
        System.out.println(args[1].substring(0, result) + "|" + args[1].substring(result));
    }
}

If we compile and run the following command line:

java DFAStringSearch MOMMY MMOMOMMOMMY

We get the following output:

Adding edge 0 -M-> 1
Adding edge 1 -M-> 1
Adding edge 1 -O-> 2
Adding edge 2 -M-> 3
Adding edge 3 -M-> 4
Adding edge 3 -O-> 2
Adding edge 4 -M-> 1
Adding edge 4 -O-> 2
Adding edge 4 -Y-> 5
Adding edge 5 -M-> 1
Matched at position 6:
MMOMOM|MOMMY

The machine generated precisely matches the machine in the above diagram, and the result is correct.

Download code
hijacker
hijacker
hijacker
hijacker