# Suffix tree (Java)

A simple, intuitive implementation of a suffix tree as an object-oriented, recursive data structure in Java (requires version 5). The suffix tree is constructed by first constructing a simple suffix tree, which is then transformed into a suffix tree, as described in Böckenhauer & Bongartz (2003). The implementation has a runtime and memory complexity of *O*(*n*^{2}). This implementation is of educational and illustrative purpose, not for high-volume string processing, which is possible with suffix trees, if constructed differently. For construction of suffix trees with linear runtime and memory complexity consider the algorithm by Ukkonen.

## Contents |

# [edit] Usage

Sample usage for the suffix tree: Create a compact suffix tree for an input text and export it as a dot string, using a few attributes to style the generated tree. The resulting text can be rendered as an image with GraphViz.

<<sample_usage>>=CompactSuffixTree tree =newCompactSuffixTree(newSimpleSuffixTree("bananas")); String properties = "rankdir=LR; node[shape=box fillcolor=gray95 style=filled]\n"; System.out.println("digraph {\n" + properties + tree.root + "}");

# [edit] Implementation

The implementation consists of five classes: an abstract suffix tree class, two concrete suffix tree classes (one simple and one compact) and classes for nodes (which recursively contain a collection of themselves, the immediate children) and edges.

## [edit] Classes

### [edit] Abstract

The abstract superclass for the simple suffix tree and the compact suffix tree. It has a root node, which recursively contains children nodes. The constructor for the AbstractSuffixTree has one param, text: the text to be represented by this tree. A terminating "$" is appended if not already present.

<<constructor_abstract>>=AbstractSuffixTree(String text){if(text.length()> 0 && text.charAt(text.length()- 1)== '$'){this.text = text;}else{this.text = text + "$";}}

The attributes of the AbstractSuffixTree: the text represented by this tree, the root node of this tree and the size of the input alphabet.

<<attributes_abstract>>=String text =null; SuffixTreeNode root =null;intinputAlphabetSize = -1;

### [edit] Simple

Both concrete trees represent the same text, but the CompactSuffixTree has compact nodes, but no compact labels, for illustrative reasons. Actually, a suffix tree should contain indices as labels. As this is a non-optimized implementation for illustrative and educational purpose, the edges contain full labels. A simple suffix trie ist constructed with one param, text: the text to be represented by the suffix tree, a terminating "$" is appended if not already present.

<<constructor_simple>>=publicSimpleSuffixTree(String text){super(text); constructTree();}

Create the root node and insert all suffixes into this tree, counting the paths.

<<construct_tree>>=privatevoidconstructTree(){super.root =newSuffixTreeNode();char[]s =super.text.toCharArray();for(inti = 0; i < s.length; i++){List<String> suffixList =newArrayList<String>();for(intk = i; k < s.length; k++){suffixList.add(s[k]+ "");}super.root.addSuffix(suffixList, i+1);}}

### [edit] Node

The addSuffix method of the node takes two parameters:

- suffix: The suffix to insert into the suffix tree, will be inserted behind the maximum prefix of the suffix found in the tree.
- pathIndex: The path index for labeling the leaf at the end of the path of the suffix added.

Adding a suffix consists of two steps:

- Recursivley find the node to insert at.
- Insert new nodes for the suffix to insert below the node found.

<<add_suffix>>=publicvoidaddSuffix(List<String> suffix,intpathIndex){SuffixTreeNode insertAt =this; insertAt = search(this, suffix); insert(insertAt, suffix, pathIndex);}

The recursive search method takes two parameters (see below). It returns the node under which to enter the suffix, that is the node in which the path to the maximum prefix of the new suffix ends. When entering the method, the suffix size should never be 0, as a terminating "$" is always appended to the text, but if that wasn't the case, and the text would be something like "dacdac", and right where the Exception is thrown we would return the startNode, we would have an invalid suffix tree, where one path would end in an inner node.

- startNode: The node in which to start the search.
- suffix: The suffix that is intended to be inserted into the tree.

<<search>>=privateSuffixTreeNode search(SuffixTreeNode startNode, List<String> suffix){if(suffix.isEmpty()){thrownewIllegalArgumentException("Empty suffix. Probably no valid simple suffix tree exists for the input.");}Collection<SuffixTreeNode> children = startNode.children;for(SuffixTreeNode child : children){if(child.incomingEdge.label.equals(suffix.get(0))){suffix.remove(0);if(suffix.isEmpty()){returnchild;}returnsearch(child, suffix);}}returnstartNode;}

The insertion method takes three arguments:

- insertAt: The node into which the suffix should be entered.
- suffix: The suffix to enter into the node. Not all will be used, skips the maximum prefix that has already been found. For every remaining character, a node is added into the tree.
- pathIndex: the path index (for labels of leafs, which isnt implemented for illustrative purpose, but is common for suffix trees).

<<insert>>=privatevoidinsert(SuffixTreeNode insertAt, List<String> suffix,intpathIndex){for(String x : suffix){SuffixTreeNode child =newSuffixTreeNode(insertAt, x, insertAt.nodeDepth + 1, pathIndex, id); insertAt.children.add(child); insertAt = child;}}

The attributes for a node:

- The incoming edge of this node. Every node in a tree has one incoming edge, except for the root, which has none.
- The depth of this node, that is, how many edges are on a path from the root node to this node (not the string depth).
- The label of this node, for leafs only.
- The collection of nodes, the immediate children of this node.
- The parent node of this node.
- The string depth of this node.
- Attributes for saving the tree as dot.

<<attributes_node>>=SuffixTreeEdge incomingEdge =null;intnodeDepth = -1;intlabel = -1; Collection<SuffixTreeNode> children =null; SuffixTreeNode parent =null;intstringDepth;intid = 0;publicstaticintc;

Constructor for a node with a parent, that is, for any node except the root node. The constructor takes five arguments (see below). For creating the root node, another constructor without parameters is used.

- parent: The parent node of this node.
- incomingLabel: The label for the incoming edge of this node.
- depth: The node depth to be assigned to this node.
- label: The label for this node (Nodes are currently labeled with leaf numbers).
- id: An id for the node. The nodes are numbered depth-first when exported as dot.

<<constructor_node>>=publicSuffixTreeNode(SuffixTreeNode parent, String incomingLabel,intdepth,intlabel,intid){children =newArrayList<SuffixTreeNode>(); incomingEdge =newSuffixTreeEdge(incomingLabel, label); nodeDepth = depth;this.label = label;this.parent = parent; stringDepth = parent.stringDepth + incomingLabel.length();this.id = id;}publicSuffixTreeNode(){children =newArrayList<SuffixTreeNode>(); nodeDepth = 0; label = 0;}

We override toString() to recursively exports the node and all children to dot, using indentation to visualize node depth in the text file. Returns the tree as the body of a graph description in the dot language (www.graphviz.org, see sample usage above).

<<dot_export>>=publicString toString(){StringBuilder result =newStringBuilder(); String incomingLabel =this.isRoot()? "" :this.incomingEdge.label;for(inti = 1; i <=this.nodeDepth; i++)result.append("\t");if(this.isRoot()){c = 1;this.id = 1;}else{this.id = c; result.append(this.parent.id + " -> "); result.append(this.id + "[label=\"" + incomingLabel + "\"];\n");}for(SuffixTreeNode child : children){c++; child.id = c; result.append(child.toString());}returnresult.toString();}

Two helper methods, one that returns true if this node is the root node (the root node has no parent) and one that returns true if this node is a leaf (a leaf node has no children).

<<helper_methods>>=publicbooleanisRoot(){returnthis.parent ==null;}publicbooleanisLeaf(){returnchildren.size()== 0;}

### [edit] Compact

The constructor of the compact suffix tree takes one parameter, simpleSuffixTree: the simple suffix tree that should be made compact.

<<constructor_compact>>=publicCompactSuffixTree(SimpleSuffixTree simpleSuffixTree){super(simpleSuffixTree.text);super.root = compactNodes(simpleSuffixTree.root, 0);}

After we are done constructing the simple suffix tree or suffix trie, we make it compact by removing inner nodes with exactly one child node, which makes the structure a suffix tree. The method takes two parameters and returns the root node of the compact suffix tree. The method takes two parameters:

- node: The root node of the simple suffix tree to make compact.
- nodeDepth: The current node depth.

The method consists of the following steps:

- Adjust the node depth while making the tree compact.
- Remove all inner nodes with exactly one child node.
- Set the new longer label.
- Set the new string depth.
- Skip the grandchild by setting the grandchild's children as the child's children.

- For the others, continue.

<<compact_nodes>>=privateSuffixTreeNode compactNodes(SuffixTreeNode node,intnodeDepth){node.nodeDepth = nodeDepth;for(SuffixTreeNode child : node.children){while(child.children.size()== 1){SuffixTreeNode grandchild = child.children.iterator().next(); child.incomingEdge.label += ", " + grandchild.incomingEdge.label; child.stringDepth += grandchild.incomingEdge.label.length(); child.children = grandchild.children;for(SuffixTreeNode grandchild : child.children)grandchild.parent = node;}child = compactNodes(child, nodeDepth + 1);}returnnode;}

### [edit] Edge

Attributes of the edge class: The label of this edge and the index where the branch this label belongs to starts in the text represented by the tree. That is, the number that the leaf at the end of this branch will be labeled with (the index in the immediate children of the tree's root node). The constructor of the edge class takes two parameters:

- label: The label for this edge.
- branchIndex: The index where the branch this label belongs to starts in the text represented by the tree (not actually used, see above).

<<edge>>=String label =null; @SuppressWarnings("unused")privateintbranchIndex = -1;publicSuffixTreeEdge(String label,intbranchIndex){this.label = label;this.branchIndex = branchIndex;}

## [edit] Program

To complete the program, we need to import a few classes and finally put it all together, by adding the sample usage and the five classes:

<<SuffixTree.java>>=import java.util.ArrayList;import java.util.Collection;import java.util.List;import org.junit.Test;publicclassSuffixTree{@TestpublicvoidsampleUsage(){sample_usage}}abstractclassAbstractSuffixTree{attributes_abstract constructor_abstract}classSimpleSuffixTreeextendsAbstractSuffixTree{constructor_simple construct_tree}classCompactSuffixTreeextendsAbstractSuffixTree{constructor_compact compact_nodes}classSuffixTreeNode{attributes_node constructor_node add_suffix search insert dot_export helper_methods}classSuffixTreeEdge{edge}

# [edit] References

- Böckenhauer, Hans-Joachim & Dirk Bongartz (2003),
*Algorithmische Grundlagen der Bioinformatik*, Teubner.

Download code |