Finite automaton string search algorithm (C Plus Plus)

From LiteratePrograms
Jump to: navigation, search
Other implementations: C++ | Java
Error creating thumbnail: /bin/bash: rsvg: command not found
DFA for searching for the string "MOMMY". Each time the double-circled state is reached, a match has been found. All unspecified transitions return to the initial state in the upper left.

The finite automaton string search algorithm works by constructing a simple state machine associated with the string being searched for. As it consumes the string being searched, the current state indicates a set of possible positions in the string being searched for. It can be generalized to search for arbitrary regular expressions. For example, to the right is the state machine associated with the search word "MOMMY". The procedure is a special case of the powerset construction procedure used to transform a nondeterministic finite automaton (NFA) into a deterministic finite automaton, since it is easy to create an NFA that recognizes a particular substring.


Contents

[edit] Generating the state machine

The first step, also the most complicated and expensive step, is generating the state machine we will use for matching. We will explicitly represent the following aspects of the machine.

Each state represents a set of positions in the string. Every state contains the initial position. States containing the final position (after the last character) indicate a match. We will keep track of a two-way mapping from sets of positions to states:

<<type definitions>>=
typedef int string_pos;
typedef int state_num;

<<state machine generation data structures>>=
// Maps the set of string positions a state represents to the state number
map< set<string_pos>, state_num > state_map;
// Maps a state number to its set of string positions
vector< set<string_pos> > states;

<<header files>>=
#include <set>
#include <map>
#include <vector>

<<using statements>>=
using std::set;
using std::map;
using std::vector;

We create a simple accessor for the map that retrieves the number of the state with the specified position set or, if no such state exists, creates it and gives it a new ("fresh") state number:

<<get state function>>=
state_num get_state(set<string_pos>& s)
{
    static int next_state_num = 0;
    map< set<string_pos>, state_num >::iterator iter =
        state_map.find(s);
    if (iter == state_map.end())
    {
        state_map[s] = next_state_num;
        states.push_back(s);
        additional state initialization
        next_state_num++;
        return next_state_num - 1;
    }
    else
    {
        return iter->second;
    }
        
}

Our output will be a table called the transition table. This table determines, given the current state and the next character in the input, which state we go to next. Since it'll grow during construction, we use a vector of vectors:

<<constants>>=
const int NUM_CHARS = 256;

<<state machine generation data structures>>=
vector< vector<state_num> > transition_table;

Now we just have to figure out what to fill in this table with. By default we fill it with zeros so that all transitions go to the initial state (the state containing only the initial position):

<<additional state initialization>>=
transition_table.push_back(vector<state_num>(NUM_CHARS));

To determine how to fill in this table, we look at the characters immediately following the "cursors" in the items listed in the current state. For example, if the item "M|OMMY" is in the current state, and we see a character "O", we go to a state containing the item "MO|MMY". If we see any character other than "O", we "throw away" the item "M|OMMY" instead of advancing the position. In the diagram above, there is a state containing the items "|MOMMY", "MOMM|Y", and "M|OMMY"; depending on whether we see M, Y, or O next, the next state with contain just one of "M|OMMY", "MOMMY|", or "MO|MMY" (they will also contain the initial state, which all states contain).

If there are multiple items with the same next character, we keep and advance all of them when we see that character. For example, one state above contains both "|MOMMY" and "MO|MMY"; if we see M, we progress to a state containing both "M|OMMY" and "MOM|MY".

To perform this in code, we visit states in the order we discover them. For each one, we examine each position in that state and the next character at that position (if we have not already examined it at another position):

<<fill in transition table>>=
void fill_transition_table(string pattern)
{
    create initial state
    for (unsigned int state=0; state < states.size(); state++)
    {
    repeat_inner_loop:
        set<string_pos>::iterator iter = states[state].begin();
        for( ; iter != states[state].end(); iter++)
        {
            deal with final position
            char c_next = pattern[*iter];
            if (transition_table[state][c_next] == 0) {
                fill_transition_table_entry(pattern, state, c_next);
                goto repeat_inner_loop;
            }
        }
    }
    clean up unneeded data
}

<<header files>>=
#include <string>

<<using statements>>=
using std::string;

Why do we repeat the inner loop if we update the table? Because fill_transition_table_entry() may modify states, invalidating our iterator iter.

The initial state again contains only the initial position, which we create by calling get_state():

<<create initial state>>=
set<string_pos> initial_pos;
initial_pos.insert(0);
get_state(initial_pos);

The one tricky position to deal with is the final position, where there are no more characters. We mark these states as "final" (a term from automata theory meaning a match is complete when this is state is reached). We store the final bits in an auxilary vector, for which we also add initialization:

<<state machine generation data structures>>=
vector<bool> final;

<<additional state initialization>>=
final.push_back(false);

<<deal with final position>>=
if (*iter == (string_pos)pattern.length())
{
    final[state] = true;
    continue;
}

In fill_transition_table_entry, we construct the set of positions for the next state and update the table entry to point to it. We do this by just taking the current state positions that have the correct next character, advancing them by 1, and throwing in the initial position:

<<fill in transition table entry>>=
void fill_transition_table_entry(string pattern, int state, char c_next)
{
    set<string_pos> new_state;
    new_state.insert(0);
    set<string_pos>::iterator iter = states[state].begin();
    for( ; iter != states[state].end(); iter++)
    {
        if (pattern[*iter] == c_next)
        {
            new_state.insert(*iter + 1);
        }
    }
    state_num temp = get_state(new_state); // fix concurrent modification bug
    transition_table[state][c_next] = temp;
    print trace message describing new entry
}

<<header files>>=
#include <string>

The trace message, which helps us to test our implementation, will be shown only if we define TRACE and gives the source state, character, and destination state:

<<print trace message describing new entry>>=
#if TRACE
    cout << "Adding edge " << state << " -" << c_next << "-> "
         << transition_table[state][c_next] << endl;
#endif

At the very end of generation, we throw away the two-way association of states with sets of positions; we only need the table for matching.

<<clean up unneeded data>>=
state_map.clear();
states.clear();

[edit] Performing the search

Once we have our machine built, the actual string matching step is very simple. We feed the string being searched into the machine, following the transition table to move between states as instructed, until we reach a final state or use up the string:

<<perform DFA-based search>>=
int dfa_string_search(string search_for, string search_in)
{
    state_num cur_state = 0;
    string_pos cur_pos = 0;
    while (cur_pos < (string_pos)search_in.length() && !final[cur_state])
    {
        cur_state = transition_table[cur_state][search_in[cur_pos]];
        cur_pos++;
    }
    if (final[cur_state])
        return cur_pos - search_for.length();
    else
        return -1;
}

[edit] Files

For now, we'll just create a single source module containing the above data and code. Ideally, each match would have an associated object to enable multiply concurrent uses of the search, and this interface would be described in a header file.

<<dfa_string_search.cpp>>=
header files
using statements

type definitions
constants

state machine generation data structures

get state function
fill in transition table entry
fill in transition table
perform DFA-based search

[edit] Test driver

By appending a small test driver to the file we can run it from the command line on test inputs:

Error creating thumbnail: /bin/bash: rsvg: command not found
DFA for searching for the string "MOMMY". Each time the double-circled state is reached, a match has been found. All unspecified transitions return to the initial state in the upper left.
<<dfa_string_search.cpp>>=
int main(int argc, char* argv[])
{
    fill_transition_table(argv[1]);
    int result = dfa_string_search(argv[1], argv[2]);
    if (result == -1)
        cout << "No match found." << endl;
    else
        cout << "Matched at position " << result << ":" << endl
             << string(argv[2]).substr(0, result) << "|"
             << string(argv[2]).substr(result) << endl;
    return 0;
}

<<header files>>=
#include <iostream>

<<using statements>>=
using std::cout;
using std::endl;

If we compile with TRACE and run the following command line:

./dfa_string_search MOMMY MMOMOMMOMMY

We get the following output:

Adding edge 0 -M-> 1
Adding edge 1 -M-> 1
Adding edge 1 -O-> 2
Adding edge 2 -M-> 3
Adding edge 3 -M-> 4
Adding edge 3 -O-> 2
Adding edge 4 -M-> 1
Adding edge 4 -O-> 2
Adding edge 4 -Y-> 5
Adding edge 5 -M-> 1
Matched at position 6:
MMOMOM|MOMMY

The machine generated precisely matches the machine in the above diagram, and the result is correct.

Download code
hijacker
hijacker
hijacker
hijacker