User:Wtrmute

From LiteratePrograms
Jump to: navigation, search
Creative Commons CC-Zero This user has made all of their contributions available under the Creative Commons CC0 1.0 Universal Public Domain Dedication, including those before the license migration to CC0.
This user has dedicated their contributions to the public domain by waiving all of his or her rights to them worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform any of their contributions, even for commercial purposes, all without asking permission.

Radix trie, version 2 (working)

Contents

[edit] Radix tree (C)

[edit] Introduction

A Radix Tree, or Radix Trie, or Crit-bit Tree, or PATRICIA Trie, or yet Compact Prefix Tree, is a data structure used to search a set of bit sequences (generally character strings, but optionally other bit-patterns like integers or IPv4 addresses). In this sense, they can be used for the same applications as hash tables can.

The advantages of radix trees over hash tables are that radix tries have worst-case complexity O(1), as opposed to a hash table's worst-case complexity of O(n) or O(log n) depending on whether the buckets are implemented as linked lists or red-black trees, for example. Also, since radix trees are trees, the user can walk it and return all key/value pairs in lexicographic order, or obtain all keys with a given prefix. Also, the space cost of radix trees is very deterministic: a radix tree with n items will have exactly n leaf nodes and n - 1 internal nodes, as befits a binary tree.

The advantages of a hash table over a radix tree are essentially that the buckets are all allocated at once, so for a hash table with no collisions, the data are all packed together and thus are very cache-friendly. Radix trees, having its nodes allocated on demand, may have them allocated all over the heap, so one may expect to see a lot more cache misses, unless the tree is completely filled once and then treated as read-only.

Radix trees can contain leaves with only keys, in which case they work as sets rather than dictionaries. In this article, we'll write a set of char *s and a map of char *s to void *.

[edit] Bookkeeping

The routines in this article depend only on the standard C99 headers <stdint.h> for datatype aliases and four functions from <string.h>: strlen(), strcmp(), strncmp() and memcpy(). Besides that, a heap allocator and deallocator are required, usually the system malloc() and free(), but which will be included later on if the specific implementations don't provide a different one.

<<Internal definitions>>=
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define ALLOC   malloc
#define DEALLOC free

We will be generating two sets of files: a header and an implementation file for each of the versions of the code, and a further test harness to check the implementation of each type, as below:

<<string_set.h>>=
#ifndef STRING_SET_H_DEFINED
#define STRING_SET_H_DEFINED

Set type declarations

Set method declarations

#endif
<<string_map.h>>=
#ifndef STRING_MAP_H_DEFINED
#define STRING_MAP_H_DEFINED

Map type declarations

Map method declarations

#endif
<<string_set.c>>=
Internal definitions

Types

Set specific types

Set specific static methods

Set specific public methods

<<string_map.c>>=
Internal definitions

Types

Map specific types

Map specific static methods

Map specific public methods

[edit] A bit of theory

A Radix tree is a kind of trie, which is a specialised tree designed for information retrieval. Notionally, a trie is a tree where each node has as many children as there are elements in the input alphabet (26 for strictly alphabetical tries, 36 for alphanumeric, 95 for printable ASCII, 245 for UTF-8, etc.). Then a given word is said to be in the trie if one can, starting from the root, follow the ith child (where i is the value of each character of the word in turn) and reach a valid node at the end, where an arbitrary value connected to each word may be placed.

For example, if a trie contains the word "tea", then one may follow the child corresponding to "t" from the root node, and from there follow the child corresponding to "e", and from there the child corresponding to "a" may be followed, and this last node will contain the information keyed to the word "tea".
An example of a trie

As one can see, this is extremely space-inefficient, as a word of length n taken from an alphabet with s characters will need n nodes to be stored, each having s children, which necessitates ns words of storage at least. This leads to the first simplification: any string representable in a computer can be viewed as a sequence of bits, which allows the use of binary nodes.

The second simplification can be understood by realising that, for any given binary node, it either has one child or two. If the nodes with two children store the ordinal number of the bit where its two children first differ, the other nodes with only one child become redundant: now one only needs to walk the tree, checking the value of the bits which actually differentiate between two sets of words in the tree. At the end, one must compare the full key again to verify that the key is the expected one, and not one that simply happens to have the correct bits set or reset, but it is still an O(1) operation on the number of keys in the trie.
Radix tree (by Claudio Rocchini)

A radix tree will then consist of a root node which may be:

  • empty, if the tree itself is empty;
  • a leaf node, if the tree contains only a single leaf;
  • an internal node, if the tree contains two or more leaves.

A leaf node is simply an array of characters which we will compare our search keys against as the last step of a search operation, and in the case of a map also a pointer to a value.

<<Map specific types>>=
struct leaf {
        void * value;
        char   key[];
};

An internal node contains a pair of void * pointers, which will point either to another internal node or to a leaf, and the information about the critical bit, which in this particular implementation is an unsigned 32-bit integer designating the ordinal number of the critical bit starting to count from the most significant bit of the first character.

<<Types>>=
struct node {
        void *   child[2];
        uint32_t cbit;
};

The root structures, besides the types, store a count of items (so we don't have to walk the tree to count the leaves) and, in the case of the map, a pointer to a function to deallocate the value pointers (if needed) when a value is replaced or removed.

<<Set specific types>>=
typedef struct {
        void *   root;
        uint32_t count;
} string_set_t;
<<Map specific types>>=
typedef struct {
        void *   root;
        uint32_t count;
        void (* destr)(void *);
} string_map_t;

We can define now a set of methods for creating and destroying these root structures, which are quite straightforward:

<<Set method declarations>>=
string_set_t *
string_set_create(void);

void
string_set_destroy(string_set_t *);

size_t
string_set_count(string_set_t *);
<<Set specific public methods>>=
string_set_t *
string_set_create(void) {
        string_set_t * result = (string_set_t *) ALLOC(sizeof(string_set_t));

        result->root = NULL;
        result->count = 0;

        return result;
}
<<Map method declarations>>=
string_map_t *
string_map_create(void);

void
string_map_destroy(string_map_t *);

size_t
string_map_count(string_map_t *);

deallocator_t
string_map_get_value_destructor(string_map_t *);

void
string_map_set_value_destructor(string_map_t *, deallocator_t);
<<Map specific public methods>>=
string_map_t *
string_map_create(void) {
        string_map_t * result = (string_map_t *) ALLOC(sizeof(string_map_t));

        result->root = NULL;
        result->count = 0;
        result->destr = NULL;

        return result;
}

[edit] Walking the tree

Walking the tree is a basic operation which is a component in other operations: adding, removing, and testing a tree. In it, we look for a leaf node which is our candidate for the operation in question. In our case, our iteration variable is p, which starts as t->root and ends up a leaf type (struct leaf_node * for map, plain char * for set).

First, let us declare a lot of local variables for our use:

keylen
stores the length of the key so we only have to check it once;
p
is the pointer which will travel down the tree, starting as the object pointed to by t->root and finishing as a pointer to an object of the leaf type;
q
is an alias of p typed as struct node * so we don't have to pollute our code with (struct node *) cast operators;
index
is the whole-byte part of the critical bit for a node, stored in the cbit member of that structure;
mask
is the bit-part of the critical bit for a node, actually an unsigned octet where all bits but the critical bit are set;
dir
will store the direction the desired leaf lies in, either 1 or 0, and will be used as an index into the child member of the struct node * structure; and
c
will store the character at position index of the key, or 0 if that position is past the end of the string.
<<local vars>>=
        const size_t keylen = strlen(key);
        uint8_t * p = t->root;
        struct node * q;
        uint32_t index;
        uint16_t dir = 0;
        uint8_t c, mask;

Once we have determined that the tree isn't empty, then p contains a point to either an internal node or a leaf node. Since we are working with an allocator which always allocates buffers with at least two byte aligment (standard WIN32/Linux malloc() actually allocates buffers with 16 byte alignment), we can use the low-order bit (which we expect to always be zero when allocated) to store a flag telling us the type of our object: having the low-order bit set will indicate an internal node, while having it reset indicates a leaf.

The algorithm is, then, checking whether p has its low order bit set and, if so, pointing to it (with the low-order bit cleared) with q; then determine the direction to go by checking the critical bit and have p point to the correct child pointer. Repeat until the low order bit of p is reset, in which case we have our candidate leaf.

<<Walk the tree>>=
        while (1 & (intptr_t) p) {
                q = (struct node *) (p - 1);

                Determine direction;

                p = q->child[dir];
        }

Determining the direction is also conceptually simple: first, split the critical bit into index and mask parts; if index doesn't point past the end of the key, store the byte of the key which is indexed by it in c, else zero; then use the masking trick below to figure out the direction:

The mask variable will have every bit set but the one we're interested in; notice that this means that its value is thus at most 0xFE. if we or it with the critical byte in c, it will either remain unchanged, if the critical bit was reset, or become all set otherwise (in which case, its value will always be 0xFF).

We then add one to the result: if the critical bit was set, we're adding 0xFF + 1 = 0x100; if it was reset, we're incrementing some number which is at most 0xFE yielding a number smaller than 0x100. In the former case, the 9th bit will be set, and in the latter case, it will be reset. We then shift the result to the right by 8 bits so only this bit is left on the low order bit. And this will be placed in dir, which is our index into the child array, above.

<<Determine direction>>=
        index = q->cbit >> 3;
        mask = ~(0x80 >> (q->cbit & 0x07));
        c = 0;
        if (index < keylen) c = (uint8_t) key[index];
        dir = (1 + (mask | c)) >> 8;

At the end of the while statement in the body of the <<Walk the tree>> chunk we are sure that p contains our best match for the key in our tree, because all critical bits from root down to the leaf match the respective bits in the key. All that is left is to actually do the string comparison and confirm or deny the match, and do the rest of the function-specific processing.

[edit] Testing the tree

Now we reach the stage where we deal with the first of the three primary operations of a radix tree, which I'm calling testing. It implies that we have an already populated tree and a key we either wish to test for membership (set) or retrieve the corresponding value (map). Conceptually, adding comes first, but as this is a simpler operation, we'll start with it.

<<TODO>>=
hijacker
hijacker
hijacker
hijacker