Huffman coding (C Plus Plus)

From LiteratePrograms
Jump to: navigation, search
Other implementations: C++ | Java

Given an arbitrary set of symbols (the english alphabet is the example that will be used here), Huffman coding is a way of creating the most efficient (smallest) binary code for that set of symbols. It exploits the fact that in most data some symbols occur more frequently than others and by shortening the code for that symbol, space can be saved. There are other interesting properties of Huffman codes as well as some "gotchas" with respect to their use, so the above link is worth reading.

The algorithm itself is fairly simple; it's a special method of building a binary tree. Initially, one starts with a set of nodes, each consisting of a symbol and its associated frequency. When the tree is finished, each of these nodes will be a leaf in the tree. The algorithm then selects the two least frequent nodes, and creates a parent node with a frequency equal to the sum of the children's frequencies, and replaces the children with the parent (the parent can be thought of as representing the probabilities of either child occurring). This is done until there is only one node remaining in the set, and at that point the binary tree is complete.

The following code was successfully tested on Linux/x86 (OpenSUSE 10.0) with g++ 4.0.2.

Contents

[edit] The Hufftree class

The following class template implements Huffman coding and decoding. It is given a list of symbols and associated frequencies, from which it builds the huffman tree and a map which stores the resulting encoding. Since the class is templated, it completely lives in the header.

The header has the following structure:

<<huffman.h>>=
#ifndef HUFFMAN_H_INC
#define HUFFMAN_H_INC

includes

class Hufftree definition

class Hufftree member definitions

#endif

The class has two template parameters:

The first parameter, DataType, is type of symbol (typical choices will be char, unsigned char, or wchar_t). It must be copy-constructible and less-than-comparable giving a total order. The less-than comparability is required by the standard map holding the encoding, but it's also used by the implementation of Hufftree itself, where the stronger demand of total order is needed (the reason will be explained at the point where it is used).

The second template paramter, Frequency, is the type used for the symbol frequencies. This is templated because while float or double would be an obvious choice, for actual compression programs, one probably doesn't want to rely on subtle differences in floating point operations between platforms, where in the worst case a slightly different rounding might change the resulting Huffman code. So templating this argument allows to store the frequencies in a more portable way (e.g. as percentages in ints). The type used here must be copy-constructible, less-than-comparable and addable (i.e. if a and b are of type Frequency, then a + b must be defined, and result in a type which can be used to initialize a Frequency typed object).

Thus the class template looks as follows:

<<class Hufftree definition>>=
template<typename DataType, typename Frequency> class Hufftree
{
public:
  Hufftree public members
private:
  Hufftree private members
};

[edit] The data structures

The central data structure is, of course, the Huffman tree itself. This tree consists of nodes, which are either leaf nodes holding a frequency and a symbol, or internal nodes, holding a frequency and two child nodes. Leaf nodes can be identified by that they don't have any children; thus space can be saved by reusing the space held by the rightChild pointer to also hold the data pointer; the leftChild pointer can then be used to determine which of them is valid.

Since the nodes are part of the private Huffman tree implementation, the type is a private member of Hufftree:

<<Hufftree private members>>=
class Node;
Node* tree;

<<class Hufftree member definitions>>=
template<typename DataType, typename Frequency>
struct Hufftree<DataType, Frequency>::Node
{
  Frequency frequency;
  Node* leftChild;
  union
  {
    Node* rightChild; // if leftChild != 0
    DataType* data;  // if leftChild == 0
  };

  Node member functions
};

Leaf nodes are constructed from a data type and a frequency.

<<Node member functions>>=
Node(Frequency f, DataType d):
  frequency(f),
  leftChild(0),
  data(new DataType(d))
{
}

Internal nodes are constructed from their left and right child nodes.

<<Node member functions>>=
Node(Node* left, Node* right):
  frequency(left->frequency + right->frequency),
  leftChild(left),
  rightChild(right)
{
}

Nodes own their child nodes, that is, they delete them on destruction. The same is true for the copy of the symbol held by leaf nodes.

<<Node member functions>>=
~Node()
{
  if (leftChild)
  {
    delete leftChild;
    delete rightChild;
  }
  else
  {
    delete data;
  }
}

The Huffclass also holds a map which directly maps symbols to their Huffman encoding. Since the huffman encoding is a bit sequence of variable length, a std::vector<bool> is used to store it. This map is filled through recursing through the nodes, adding a 0 for the left child node, and a 1 for the right child node. That is done using the following node member function.

<<Node member functions>>=
void fill(std::map<DataType, std::vector<bool> >& encoding,
          std::vector<bool>& prefix)
{
  if (leftChild)
  {
    prefix.push_back(0);
    leftChild->fill(encoding, prefix);
    prefix.back() = 1;
    rightChild->fill(encoding, prefix);
    prefix.pop_back();
  }
  else
    encoding[*data] = prefix;
}

For this, we need the standard headers vector and map.

<<includes>>=
#include <vector>
#include <map>

Of course, the map must be stored in the Hufftree object:

<<Hufftree private members>>=
typedef std::map<DataType, std::vector<bool> > encodemap;
encodemap encoding;

[edit] Building the Huffman tree

The constructor of the Huffman class template builds the huffman tree and encoding map from symbols and frequencies given thrugh an STL-type input range. Thus it takes a pair of input iterators whose value_type must have the members first of type [const] DataType, and second of type [const]Frequency, like e.g. std::pair<DataType, Frequency>. Note that the iterators of std::map<DataType, Frequency> fulfill this requirement.

<<Hufftree public members>>=
  template<typename InputIterator>
   Hufftree(InputIterator begin, InputIterator end);

<<class Hufftree member definitions>>=
template<typename DataType, typename Frequency>
 template<typename InputIterator>
 Hufftree<DataType, Frequency>::Hufftree(InputIterator begin,
                                         InputIterator end):
   tree(0)
{
  Hufftree constructor code
}

To build the Huffman tree, we have to repeatedly receive the nodes with the least frequencies. This is most easily achieved by using a piority queue.

<<Hufftree constructor code>>=
std::priority_queue<Node*, std::vector<Node*>, NodeOrder> pqueue;

For this we of course need the standard header queue (and vector, but we already have included that)

<<includes>>=
#include <queue>

NodeOrder is a functor which tells the relative order of the nodes to the priority queue.

<<Hufftree private members>>=
class NodeOrder;

<<class Hufftree member definitions>>=
template<typename DataType, typename Frequency>
struct Hufftree<DataType, Frequency>::NodeOrder
{
  bool operator()(Node* a, Node* b)
  {
    Node comparison code
  }
};

The priority queue normally returns the largest Elements first. Since we need the element with the least frequency first, we essentially have to implement a greater than operation on the frequencies.

<<Node comparison code>>=
if (b->frequency < a->frequency)
  return true;
if (a->frequency < b->frequency)
  return false;

Now, if we have to elements with exact same frequency, for reliable encoding it's vitally important that we have anyway a well-defined order on nodes. To do that, a first additional criterion is introduced that in that case, leaf nodes always come before internal nodes.

<<Node comparison code>>=
if (!a->leftChild && b->leftChild)
  return true;
if (a->leftChild && !b->leftChild)
  return false;

If that didn't clear the situation, we have two nodes of the same type. If both are internal nodes, lexicographically compare first the left, and then the right nodes.

<<Node comparison code>>=
if (a->leftChild && b->leftChild)
{
  if ((*this)(a->leftChild, b->leftChild))
    return true;
  if ((*this)(b->leftChild, a->leftChild))
    return false;
  return (*this)(a->rightChild, b->rightChild);
}

If the code flow passes up to here, both nodes are leaf nodes. Since we have demanded that there's a total order on DataType, it can used here.

<<Node comparison code>>=
return *(a->data) < *(b->data);

Now that we have defined the priority queue and the order used, we can actually build the tree. At first, we build the leaf nodes from the input, and put them into the priority queue.

<<Hufftree constructor code>>=
while (begin != end)
{
  Node* dataNode = new Node(begin->second, begin->first);
  pqueue.push(dataNode);
  ++begin;
}

Next, we continuously draw the two least-frequency nodes from the priority queue and replace them with a new internal node with those nodes as child nodes. If we emptied the queue, the last node drawn was the root node of the newwly created Huffman tree, so we store it in the tree member.

<<Hufftree constructor code>>=
while (!pqueue.empty())
{
  Node* top = pqueue.top();
  pqueue.pop();
  if (pqueue.empty())
  {
    tree = top;
  }
  else
  {
    Node* top2 = pqueue.top();
    pqueue.pop();
    pqueue.push(new Node(top, top2));
  }
}

Finally, the encoding map is built.

<<Hufftree constructor code>>=
std::vector<bool> bitvec;
tree->fill(encoding, bitvec);

[edit] The destructor

The destructor is actually simple: The tree must be deleted; the map cares for itself. Since it is so simple, the destructor is implemented inline.

<<Hufftree public members>>=
~Hufftree() { delete tree; }

[edit] Encoding data

The encoding of data is done by concatenating the encodings of the symbols of the message. The message is given through a STL-style range of imput iterators, whose value_type must be convertible to DataType. The encodings are just looked up from the map.

<<Hufftree public members>>=
template<typename InputIterator>
 std::vector<bool> encode(InputIterator begin, InputIterator end);

<<class Hufftree member definitions>>=
template<typename DataType, typename Frequency>
 template<typename InputIterator>
 std::vector<bool> Hufftree<DataType, Frequency>::encode(InputIterator begin,
                                                         InputIterator end)
{
  std::vector<bool> result;
  while (begin != end)
  {
    typename encodemap::iterator i = encoding.find(*begin);
    result.insert(result.end(), i->second.begin(), i->second.end());
    ++begin;
  }
  return result;
}

For convenience, there's also a simple function encoding just one symbol.

<<Hufftree public members>>=
std::vector<bool> encode(DataType const& value)
{
  return encoding[value];
}

[edit] Decoding data

To decode the data, the Huffman tree is walked as directed by the input bit sequence. Whenever a leaf node is reached, the corresponding symbol is output, and the decoding starts again at the root node. Output of the decoded message goes through an output iterator which must accept assignment from DataType.

<<Hufftree public members>>=
template<typename OutputIterator>
 void decode(std::vector<bool> const& v, OutputIterator iter);

<<class Hufftree member definitions>>=
template<typename DataType, typename Frequency>
 template<typename OutputIterator>
 void Hufftree<DataType, Frequency>::decode(std::vector<bool> const& v,
                                            OutputIterator iter)
{
  Node* node = tree;
  for (std::vector<bool>::const_iterator i = v.begin(); i != v.end(); ++i)
  {
    node = *i? node->rightChild : node->leftChild;
    if (!node -> leftChild)
    {
      *iter++ = *(node->data);
      node = tree;
    }
  }
}

[edit] The test program

The test program creates a Huffman code for the English letter frequencies, outputs the code, encodes a simple string, and finally decodes the just-encoded string.

<<huffman.cc>>=
#include "huffman.h"
other includes

helper function

int main()
{
  main code
}

The letter frequencies are conveniently stored in a map. For the frequencies, the type double is used. While this is not necessarily a good idea for a real compression program, it's fine for the demonstration. The actual numbers are taken literally from Huffman coding (Python).

<<main code>>=
std::map<char, double> frequencies;
frequencies['e'] = 0.124167;
frequencies['a'] = 0.0820011;
frequencies['t'] = 0.0969225;
frequencies['i'] = 0.0768052;
frequencies['n'] = 0.0764055;
frequencies['o'] = 0.0714095;
frequencies['s'] = 0.0706768;
frequencies['r'] = 0.0668132;
frequencies['l'] = 0.0448308;
frequencies['d'] = 0.0363709;
frequencies['h'] = 0.0350386;
frequencies['c'] = 0.0344391;
frequencies['u'] = 0.028777;
frequencies['m'] = 0.0281775;
frequencies['f'] = 0.0235145;
frequencies['p'] = 0.0203171;
frequencies['y'] = 0.0189182;
frequencies['g'] = 0.0181188;
frequencies['w'] = 0.0135225;
frequencies['v'] = 0.0124567;
frequencies['b'] = 0.0106581;
frequencies['k'] = 0.00393019;
frequencies['x'] = 0.00219824;
frequencies['j'] = 0.0019984;
frequencies['q'] = 0.0009325;
frequencies['z'] = 0.000599;

While the header <map> is already included by huffman.h, usage of that class is not part of its interface, therefore it's a good idea to include it in the main program anyway.

<<other includes>>=
#include <map>

After the data has been set up, the Huffman tree can be built.

<<main code>>=
Hufftree<char, double> hufftree(frequencies.begin(), frequencies.end());

After that, the resulting Huffman code is output.

<<main code>>=
for (char ch = 'a'; ch <= 'z'; ++ch)
{
  std::cout << ch << ": " << hufftree.encode(ch) << "\n";
}

For this, we need the headers iostream (for std::cout) and ostream (for output operators)

<<other includes>>=
#include <iostream>
#include <ostream>

Now, hufftree.encode returns std::vector<bool>, but C++ doesn't provide an output operator for that, so we need our own version. It just copies the content of the vector to the stream.

<<helper function>>=
std::ostream& operator<<(std::ostream& os, std::vector<bool> vec)
{
  std::copy(vec.begin(), vec.end(), std::ostream_iterator<bool>(os, ""));
  return os;
}

For this code, we need some more includes: std::copy needs algorithm, std::ostream_iterator needs iterator.

<<other includes>>=
#include <algorithm>
#include <iterator>

After we've output the code, we also want to see if it works. Fot that we define a simple string to be encoded and then decoded again.

<<main code>>=
std::string source = "literateprogramming";

For this, we need to include the string header.

<<other includes>>=
#include <string>

Now we first output the original string, encode it, output the encoded bit stream, decode that bit stream, and finally output the decoded string.

<<main code>>=
std::cout << source << "\n";

std::vector<bool> encoded = hufftree.encode(source.begin(), source.end());
std::cout << encoded << "\n";

std::string destination;
hufftree.decode(encoded, std::back_inserter(destination));
std::cout << destination << "\n";

The program gives the following output:

a: 1111
b: 000001
c: 10010
d: 11000
e: 011
f: 01000
g: 110010
h: 10011
i: 1110
j: 000000111
k: 0000000
l: 0001
m: 01010
n: 1101
o: 1011
p: 00001
q: 0000001101
r: 1000
s: 1010
t: 001
u: 01011
v: 010010
w: 010011
x: 00000010
y: 110011
z: 0000001100
literateprogramming
0001111000101110001111001011000011000101111001010001111010100101011101101110010
literateprogramming
Download code
hijacker
hijacker
hijacker
hijacker