Huffman coding (Java)

From LiteratePrograms
Jump to: navigation, search
Other implementations: C++ | Java

Given an arbitrary set of symbols (the english alphabet is the example that will be used here), Huffman coding is a way of creating the most efficient (smallest) binary code for that set of symbols. It exploits the fact that in most data some symbols occur more frequently than others and by shortening the code for that symbol, space can be saved. There are other interesting properties of Huffman codes as well as some "gotchas" with respect to their use, so the above link is worth reading.

The algorithm itself is fairly simple; it's a special method of building a binary tree. Initially, one starts with a set of nodes, each consisting of a symbol and its associated frequency. When the tree is finished, each of these nodes will be a leaf in the tree. The algorithm then selects the two least frequent nodes, and creates a parent node with a frequency equal to the sum of the children's frequencies, and replaces the children with the parent (the parent can be thought of as representing the probabilities of either child occurring). This is done until there is only one node remaining in the set, and at that point the binary tree is complete.

Contents

[edit] The Hufftree class

The following class template implements Huffman coding and decoding. It is given a list of symbols and associated frequencies, from which it builds the Huffman tree and a map which stores the resulting encoding.

The class has a generic type parameter, E, the type of symbol (typical choice will be Character). It must implement Comparable. The Comparable is used by the implementation of Hufftree, where the stronger demand of total order is needed (the reason will be explained at the point where it is used).

The frequencies are stored as doubles.

Thus the class looks as follows:

<<Hufftree.java>>=
includes

public class Hufftree<E extends Comparable<? super E>>
{
  Hufftree public members
  Hufftree private members
}
Node class
Leaf class
Internal class

[edit] The data structures

The central data structure is, of course, the Huffman tree itself. This tree consists of nodes, which are either leaf nodes holding a frequency and a symbol, or internal nodes, holding a frequency and two child nodes.

<<Hufftree private members>>=
private Node<E> tree;

<<Node class>>=
abstract class Node<E extends Comparable<? super E>> implements Comparable<Node<E>>
{
  protected final double frequency;
  protected Node(double f) { frequency = f; }

  Node member functions
}

Leaf nodes are constructed from a data type and a frequency.

<<Leaf class>>=
class Leaf<E extends Comparable<? super E>> extends Node<E>
{
  public final E data;
  public Leaf(double f, E d) {
    super(f);
    data = d;
  }

  Leaf member functions
}

Internal nodes are constructed from their left and right child nodes.

<<Internal class>>=
class Internal<E extends Comparable<? super E>> extends Node<E>
{
  public final Node<E> leftChild, rightChild;
  public Internal(Node<E> l, Node<E> r) {
    super(l.frequency + r.frequency);
    leftChild = l;
    rightChild = r;
  }

  Internal member functions
}

The Hufftree class also holds a Map which directly maps symbols to their Huffman encoding. Since the Huffman encoding is a bit sequence of variable length, we need some kind of list of booleans to store it. Since all we will be doing is adding and removing bits from the end of the sequence as we traverse the tree, we will use a Stack<Boolean>. When we need to copy the sequence into the mapping, we will use the toArray function to copy it into an array of Booleans. This map is filled through recursing through the nodes, adding false for the left child node, and true for the right child node. That is done using the following node member function.

<<Node member functions>>=
public abstract void fill(Map<E,Boolean[]> encoding,
                          Stack<Boolean> prefix);
<<Internal member functions>>=
public void fill(Map<E,Boolean[]> encoding,
                 Stack<Boolean> prefix)
{
    prefix.push(false);
    leftChild.fill(encoding, prefix);
    prefix.pop();
    prefix.push(true);
    rightChild.fill(encoding, prefix);
    prefix.pop();
}
<<Leaf member functions>>=
public void fill(Map<E,Boolean[]> encoding,
                 Stack<Boolean> prefix)
{
    encoding.put(data, prefix.toArray(new Boolean[0]));
}

Of course, the map must be stored in the Hufftree object:

<<Hufftree private members>>=
private final Map<E,Boolean[]> encoding = new HashMap<E,Boolean[]>();

For this, we need to import some standard classes.

<<includes>>=
import java.util.Stack;
import java.util.Map;
import java.util.HashMap;

[edit] Building the Huffman tree

The constructor of the Huffman class template builds the Huffman tree and encoding map from symbols and frequencies given through a Map from the type E to Double.

<<Hufftree public members>>=
public Hufftree(Map<E,Double> frequencies)
{
  Hufftree constructor code
}

To build the Huffman tree, we have to repeatedly receive the nodes with the least frequencies. This is most easily achieved by using a PriorityQueue.

<<Hufftree constructor code>>=
PriorityQueue<Node<E>> pqueue = new PriorityQueue<Node<E>>();

For this we need to import PriorityQueue.

<<includes>>=
import java.util.PriorityQueue;

In order to use the priority queue, we made our Node class Comparable to itself so we can order them. Therefore, we need to implement a compareTo method that performs the comparison.

<<Node member functions>>=
public int compareTo(Node<E> b)
{
  Node comparison code
}

We want to return a negative, zero, or positive number if this node is less than, equal to, or greater than the other node, respectively.

<<Node comparison code>>=
if (frequency != b.frequency)
  return Double.compare(frequency, b.frequency);

Now, if we have to elements with exact same frequency, for reliable encoding it's vitally important that we have anyway a well-defined order on nodes. To do that, a first additional criterion is introduced that in that case, leaf nodes always come before internal nodes.

<<Node comparison code>>=
else if (this instanceof Leaf && b instanceof Internal)
  return -1;
else if (this instanceof Internal && b instanceof Leaf)
  return 1;

If that didn't clear the situation, we have two nodes of the same type. If both are internal nodes, lexicographically compare first the left, and then the right nodes.

<<Node comparison code>>=
else if (this instanceof Internal && b instanceof Internal)
{
  Internal<E> a = (Internal<E>)this,
              z = (Internal<E>)b;
  if (!a.leftChild.equals(z.leftChild))
    return a.leftChild.compareTo(z.leftChild);
  else
    return a.rightChild.compareTo(z.rightChild);
}

If the code flow passes up to here, both nodes are leaf nodes. Since we have demanded that the data type E is Comparable, it can used here.

<<Node comparison code>>=
else
  return ((Leaf<E>)this).data.compareTo(((Leaf<E>)b).data);

Now that we have defined the priority queue and the order used, we can actually build the tree. At first, we build the leaf nodes from the input, and put them into the priority queue.

<<Hufftree constructor code>>=
assert frequencies.size() > 1; // will not work with 0 or 1 symbols

for (Map.Entry<E,Double> e : frequencies.entrySet())
{
  pqueue.offer(new Leaf<E>(e.getValue(), e.getKey()));
}

Next, we continuously draw the two least-frequency nodes from the priority queue and replace them with a new internal node with those nodes as child nodes. If we emptied the queue, the last node drawn was the root node of the newly created Huffman tree, so we store it in the tree member.

<<Hufftree constructor code>>=
while (true)
{
  Node<E> top = pqueue.poll();
  if (pqueue.isEmpty())
  {
    tree = top;
    break;
  }
  else
  {
    Node<E> top2 = pqueue.poll();
    pqueue.offer(new Internal<E>(top, top2));
  }
}

Finally, the encoding map is built.

<<Hufftree constructor code>>=
Stack<Boolean> bitvec = new Stack<Boolean>();
tree.fill(encoding, bitvec);

[edit] Encoding data

The encodings are just looked up from the map. For convenience, there's a simple function encoding just one symbol.

<<Hufftree public members>>=
public List<Boolean> encode(E value)
{
  return Arrays.asList(encoding.get(value));
}

The encoding of data is done by concatenating the encodings of the symbols of the message. The message is given through an Iterable<E>.

<<Hufftree public members>>=
public List<Boolean> encode(Iterable<E> c)
{
  List<Boolean> result = new ArrayList<Boolean>();
  for (E x : c)
  {
    result.addAll(encode(x));
  }
  return result;
}

We import some more classes.

<<includes>>=
import java.util.Arrays;

[edit] Decoding data

To decode the data, the Huffman tree is walked as directed by the input bit sequence. Whenever a leaf node is reached, the corresponding symbol is output, and the decoding starts again at the root node. Output of the decoded message goes through an output iterator which must accept assignment from E.

<<Hufftree public members>>=
public Collection<E> decode(List<Boolean> v)
{
  Collection<E> result = new ArrayList<E>();
  Node<E> node = tree;
  for (Boolean b : v)
  {
    Internal<E> i = (Internal<E>)node;
    node = b ? i.rightChild : i.leftChild;
    if (node instanceof Leaf)
    {
      result.add(((Leaf<E>)node).data);
      node = tree;
    }
  }

  assert node == tree;
  return result;
}

We import some more classes.

<<includes>>=
import java.util.List;
import java.util.ArrayList;
import java.util.Collection;

[edit] The test program

The test program creates a Huffman code for the English letter frequencies, outputs the code, encodes a simple string, and finally decodes the just-encoded string.

<<Hufftree public members>>=
public static void main(String[] args)
{
  main code
}

The letter frequencies are conveniently stored in a map. For the frequencies, the type double is used. While this is not necessarily a good idea for a real compression program, it's fine for the demonstration. The actual numbers are taken literally from Huffman coding (Python).

<<main code>>=
Map<Character,Double> frequencies = new HashMap<Character,Double>();
frequencies.put('e', 0.124167);
frequencies.put('a', 0.0820011);
frequencies.put('t', 0.0969225);
frequencies.put('i', 0.0768052);
frequencies.put('n', 0.0764055);
frequencies.put('o', 0.0714095);
frequencies.put('s', 0.0706768);
frequencies.put('r', 0.0668132);
frequencies.put('l', 0.0448308);
frequencies.put('d', 0.0363709);
frequencies.put('h', 0.0350386);
frequencies.put('c', 0.0344391);
frequencies.put('u', 0.028777);
frequencies.put('m', 0.0281775);
frequencies.put('f', 0.0235145);
frequencies.put('p', 0.0203171);
frequencies.put('y', 0.0189182);
frequencies.put('g', 0.0181188);
frequencies.put('w', 0.0135225);
frequencies.put('v', 0.0124567);
frequencies.put('b', 0.0106581);
frequencies.put('k', 0.00393019);
frequencies.put('x', 0.00219824);
frequencies.put('j', 0.0019984);
frequencies.put('q', 0.0009325);
frequencies.put('z', 0.000599);

After the data has been set up, the Huffman tree can be built.

<<main code>>=
Hufftree<Character> hufftree = new Hufftree<Character>(frequencies);

After that, the resulting Huffman code is output.

<<main code>>=
for (char ch = 'a'; ch <= 'z'; ++ch)
{
  System.out.println(ch + ": " + hufftree.encode(ch));
}

After we've output the code, we also want to see if it works. For that we define a simple string to be encoded and then decoded again.

<<main code>>=
String source = "literateprogramming";

Now we first output the original string, encode it, output the encoded bit stream, decode that bit stream, and finally output the decoded string.

<<main code>>=
System.out.println(source);

List<Character> sourceArray = new ArrayList<Character>();
for (int i = 0; i < source.length(); i++)
  sourceArray.add(source.charAt(i));

List<Boolean> encoded = hufftree.encode(sourceArray);
System.out.println(encoded);

Collection<Character> destination = hufftree.decode(encoded);
System.out.println(destination);

The program gives the following output:

a: [true, true, true, true]
b: [false, false, false, false, false, true]
c: [true, false, false, true, false]
d: [true, true, false, false, false]
e: [false, true, true]
f: [false, true, false, false, false]
g: [true, true, false, false, true, false]
h: [true, false, false, true, true]
i: [true, true, true, false]
j: [false, false, false, false, false, false, true, true, true]
k: [false, false, false, false, false, false, false]
l: [false, false, false, true]
m: [false, true, false, true, false]
n: [true, true, false, true]
o: [true, false, true, true]
p: [false, false, false, false, true]
q: [false, false, false, false, false, false, true, true, false, true]
r: [true, false, false, false]
s: [true, false, true, false]
t: [false, false, true]
u: [false, true, false, true, true]
v: [false, true, false, false, true, false]
w: [false, true, false, false, true, true]
x: [false, false, false, false, false, false, true, false]
y: [true, true, false, false, true, true]
z: [false, false, false, false, false, false, true, true, false, false]
literateprogramming
[false, false, false, true, true, true, true, false, false, false, true, false, true, true, true, false, false, false, true, true, true, true, false, false, true, false, true, true, false, false, false, false, true, true, false, false, false, true, false, true, true, true, true, false, false, true, false, true, false, false, false, true, true, true, true, false, true, false, true, false, false, true, false, true, false, true, true, true, false, true, true, false, true, true, true, false, false, true, false]
[l, i, t, e, r, a, t, e, p, r, o, g, r, a, m, m, i, n, g]
Download code
hijacker
hijacker
hijacker
hijacker