# Radix tree (C)

## Contents |

## [edit] Introduction

A **Radix Tree**, or **Radix Trie**, or **Crit-bit Tree**, or **PATRICIA Trie**, or yet **Compact Prefix Tree**, is a data structure used to search a set of bit sequences (generally character strings, but optionally other bit-patterns like integers or IPv4 addresses). In this sense, they can be used for the same applications as hash tables can.

The advantages of radix trees over hash tables are that radix tries have *worst-case* complexity O(1), as opposed to a hash table's worst-case complexity of O(*n*) or O(log *n*) depending on whether the buckets are implemented as linked lists or red-black trees, for example. Also, since radix trees are trees, the user can walk it and return all key/value pairs in lexicographic order, or obtain all keys with a given prefix. Also, the space cost of radix trees is very deterministic: a radix tree with *n* items will have exactly *n* leaf nodes and *n* - 1 internal nodes, as befits a binary tree.

The advantages of a hash table over a radix tree are essentially that the buckets are all allocated at once, so for a hash table with no collisions, the data are all packed together and thus are very cache-friendly. Radix trees, having its nodes allocated on demand, may have them allocated all over the heap, so one may expect to see a lot more cache misses, unless the tree is completely filled once and then treated as read-only.

Radix trees can contain leaves with only keys, in which case they work as sets rather than dictionaries. In this article, we'll write some routines to implement radix tries for a generic reference type, and then use those routines to implement both a set of `int`s and a map of `char *`s to `void *`.

## [edit] Generic routines

The routines in this article are almost free-standing, depending only on the standard C99 header `<stdint.h>` for more some datatype aliases. Besides that, a heap allocator is required, usually the system `malloc()`, but which will be included later on if the specific implementations don't provide a different one.

<<Internal definitions>>=#include <stdint.h>

The structure of the generic part of the implementation files looks like this:

<<Generic part>>=Internal definitions Internal structures Generic helper functions Generic algorithms

### [edit] Data model

The radix tree is made up of three types of structure: one for holding the root of the tree (which will be called `ROOTSTRUCT`), one for representing the internal nodes, and one for representing the leaves. The leaf type varies according to what kind of data the tree accepts, and should be ordinarily be a pointer to a structure, but can be a regular integer or a pointer to `char` in some cases. Our generic routines will use a preprocessor token, `LEAFTYPE`, to refer to this type, which will be defined in the concrete types.

When analysing radix trees, there are three major cases to consider: when the tree has no items, when it has a single item, and when it has more than one item. In the first case, the root object holds a `NULL` pointer; in the second case, it points directly to a leaf object; and in the third case, it points to an internal node. These three cases will have to be considered separately for each operation upon the tree.

Considering the above, the root struct must contain a reference to the root node of the tree, as well as a flag to tell if it is a leaf or an internal node. We will actually maintain this flag as a count of leaves within the tree, to help determine when the tree is empty in case of integer keys where zero is a valid key.

<<Internal structures>>=structROOTSTRUCT{uint32_t leafcount;union{struct_internal_node * node; LEAFTYPE leaf;}root;};

The internal nodes will always contain references to two children which have a common prefix in their key up to a given bit *b*, called the *critical bit* for that given subtree. Each subtree can also have either one leaf, or an internal node with two leaves or more attached to it. We will need, then, two references to internal nodes, two flags to tell whether each child is a leaf or an internal node, and the length of the common prefix. Here, we also define three preprocessor macros, `IS_LEAF`, `SET_LEAF`, and `SET_NODE`, to maintain the leaf flags.

<<Internal structures>>=struct_internal_node{uint32_t is_leaf : 2; uint32_t critbit : 30;union{struct_internal_node * node; LEAFTYPE leaf;}child[2];};

<<Internal definitions>>=#define IS_LEAF(node, dir) ((node)->is_leaf & (1 << (dir)))#define SET_LEAF(node, dir) ((node)->is_leaf |= (1 << (dir)))#define SET_NODE(node, dir) ((node)->is_leaf &= ~(1 << (dir)))

Finally, we know (at this stage) very little about the leaf type, which. The only things the generic algorithms need to be able to do with it, besides receiving it as a parameter and possibly returning it, are comparing two of them to decide at which bit their keys first differ (or if it doesn't), and retrieving the `i`th bit of the key for a given leaf. Finally, the algorithms might need to refer to a non-existing leaf (for example, if the search has a key that doesn't exist in the tree), represented by the preprocessor token `NO_LEAF`.

The operation for comparison of two leaves will be called `COMPARE`, will receive two objects of type `LEAFTYPE` as parameters, and will return the first bit at which they differ or `-1` if their keys are equivalent.

The operation for extracting a bit from a leaf's key will be called `DECIDE` and will receive a leaf, a bit position, and an extra parameter of type `EXTRA_ARG`, which is passed to the generic algorithms by their callers. For leaves with `char *` keys, this is generally the `strlen()` of the key to help determine where the key ends and avoid an illegal access, but can be whatever the concrete implementation of `DECIDE` requires. The operation returns either 0 or 1, which will be the value of the bit at the position requested.

### [edit] Searching

The first operation we will consider is searching for a given key in a tree. We will assume, for now, that we already have:

- a tree which has been previously initialised and possibly populated;
- a
`LEAFTYPE`object with the key we want to search for; - a third argument of type
`EXTRA_ARG`to be passed to`DECIDE`when it is called.

The operation will return the leaf which contains the requested key or `NO_LEAF` if no leaf with that key was found.

The three cases are:

- If the tree has no nodes, then it doesn't contain the requested key, whichever it may be. Return
`NO_LEAF`; - If the tree has a single node, a leaf, then
`COMPARE`it with the key we are seeking and return it if it is the key we want, or return`NO_LEAF`if it isn't; - If the tree has more than one node, walk down it, checking the critical bit for each internal node to know which subtree to look into, until you reach a leaf. Then treat that leaf as if it were the above case,
`COMPARE`ing with the desired key and returning it if it is equivalent.

<<Generic algorithms>>=staticLEAFTYPE _get(structROOTSTRUCT * tree, LEAFTYPE leaf, EXTRA_ARG aux){LEAFTYPE result;struct_internal_node * node; uint32_t dir;if(Check if tree is empty)returnNO_LEAF;if(Check if root points to a leaf){result = tree->root.leaf; Compare leaves and return}/* root points to a node */ node = tree->root.node; Walk tree looking for leaf Compare leaves and return}

We check whether a tree is empty by verifying if its leafcount is zero. If the leafcount is one, the root is pointing to a leaf.

<<Check if tree is empty>>=tree->leafcount == 0

<<Check if root points to a leaf>>=tree->leafcount == 1

The most basic operation when dealing with radix trees is walking down the internal nodes, comparing the critical bit for that internal node to the reference key, until a leaf is reached. The DECIDE operation is used here:

<<Walk tree looking for leaf>>=while(1){dir = DECIDE(leaf, node->critbit, aux);if(IS_LEAF(node, dir)){result = node->child[dir].leaf;break;}else{node = node->child[dir].node;}}

Finally, when we have a candidate leaf in `result`, we use `COMPARE` to check whether it is the leaf we want or not, and return either it or `NO_LEAF`.

<<Compare leaves and return>>=if(COMPARE(result, leaf) == -1)returnresult;elsereturnNO_LEAF;

And with this, the generic part of the search algorithm is completed. The algorithm takes a tree, a leaf with a key somewhere within it, and returns the leaf with an equivalent key (where equivalency is defined by `COMPARE` returning -1) from the tree, if it exists.

### [edit] Inserting

Now we will consider inserting a new leaf into a tree which has already been initialised. This operation may require, in case the tree is not already completely empty, a new internal node to be allocated. This requires us to consider that:

- the system might run out of memory, in which case we need to make sure both that the tree retains a valid structure and that the exceptional condition is properly signaled;
- the user might want to use a different allocator in place of the standard
`malloc()`.

The first issue will be solved by checking for the return value of the allocator before any changes are made to the tree and, if the allocation failed, setting a special reference parameter, similarly to what the `strtof` family of functions does with its second parameter. The second one will be dealt with by expecting a preprocessor token `ALLOC` which defaults to `malloc`, as well as a matching token `DEALLOC` which defaults to `free` for the deallocator.

<<Internal definitions>>=#ifndef ALLOC# include <stdlib.h># define ALLOC malloc#endif#ifndef DEALLOC# include <stdlib.h># define DEALLOC free#endif

The generic algorithm thus takes:

- a reference to a tree;
- the leaf to be added (the responsibility for allocating memory for the leaves is up to the specific functions, as they understand the requirements of each leaf type);
- an extra parameter of type
`EXTRA_ARG`to be passed to the`DECIDE`function; - a boolean value telling whether to replace or ignore the addition of a leaf if another is found with an equivalent key (collision policy);
- a pointer to an integer which will return -1 if an allocation error happened or 0 otherwise.

The algorithm will return a `LEAFTYPE` object which will usually be `NO_LEAF`, except when there already was a leaf with the same key as the one being added. In this case, the object returned is either the one that was replaced by the new leaf or the new leaf that didn't replace the existing one, depending on the collision policy parameter.

Here, the three cases are:

- when the tree is empty, simply add the new leaf at the root and update the leaf count;
- when the tree contains a single leaf, check which is the first bit where the new leaf and the existing one differ. If they don't differ at all, depending on the collision policy, either replace or don't replace the leaf and return the one that isn't connected to the tree for destruction by the specific code, if needed. If they do differ, allocate a new internal node and set its critical bit to the number above, and put both the new leaf and the existing leaf as its children. The root now points to the internal node and the leaf count is updated;
- when the tree contains more than one leaf, walk down it until reaching a leaf. Then compare this leaf with the one to be added; if their keys are equivalent, replace it or not as per the replacement policy parameter, returning the leaf which didn't make it into the tree, as per above. If they do differ, allocate a new internal node and place the new leaf in the appropriate child slot. Then walk down the tree again, looking for the first internal node which has a larger critical bit than your new one, insert that one in the other slot of the new internal node, and insert the new node in place of the one you attached to it.

<<Generic algorithms>>=staticLEAFTYPE _add(structROOTSTRUCT * tree, LEAFTYPE leaf, EXTRA_ARG aux,intshould_replace,int* error){LEAFTYPE result;struct_internal_node * node, * parent, * child; uint32_t dir, dir2; int32_t critbit; *error = 0;if(Check if tree is empty){tree->root.leaf = leaf; tree->leafcount ++;returnNO_LEAF;}elseif(Check if root points to a leaf){result = tree->root.leaf; critbit = COMPARE(leaf, result);if(critbit == -1){Treat collision in addition}else{Allocate new internal node and insert new leaf in it result = tree->root.leaf; Place previous leaf in the new node tree->root.node = node; tree->leafcount ++;returnNO_LEAF;}}/* else */ node = tree->root.node; Walk tree looking for leaf critbit = COMPARE(leaf, result);if(critbit == -1){Treat collision in addition}else{Allocate new internal node and insert new leaf in it Walk tree and splice new node in tree->leafcount ++;returnNO_LEAF;}}

The treatment of collision in addition will depend on the value of the `should_replace` parameter. This is the only case where the addition algorithm returns something other than `NO_LEAF`.

<<Treat collision in addition>>=if(should_replace){node->child[dir].leaf = leaf;returnresult;}elsereturnleaf;

The allocation of the new node needs to check for consistency and return setting `*error` to -1 if it fails. Also, the new leaf must be inserted in the correct subtree and the correct leaf type must be set.

<<Allocate new internal node and insert new leaf in it>>=node = (struct_internal_node *) ALLOC(sizeof(struct_internal_node));if(node == NULL){*error = -1;returnNO_LEAF;}else{node->critbit = critbit; dir = DECIDE(leaf, critbit, aux); node->child[dir].leaf = leaf; SET_LEAF(node, dir);}

In the case where there is only a single leaf after the root, the newly allocated node goes between it and the root pointer.

<<Place previous leaf in the new node>>=node->child[1 - dir].leaf = tree->root.leaf; SET_LEAF(node, 1 - dir);

In the other cases, the other branch of the new node will contain an internal node.

<<Insert child node into the new node>>=node->child[1 - dir].node = child; SET_NODE(node, 1 - dir);

In those cases, we must place the new node in the correct position such that it is in the correct position in the tree regarding critical bits and the critical bits always increase as one walks down the tree.

<<Walk tree and splice new node in>>=child = tree->root.node;if(Check if node should be inserted before child){Insert child node into the new node tree->root.node = node;}elsewhile(1){parent = child; dir2 = DECIDE(leaf, parent->critbit, aux);if(IS_LEAF(parent, dir2)){result = parent->child[dir2].leaf; Place previous leaf in the new node parent->child[dir2].node = node; SET_NODE(parent, dir2);}else{child = parent->child[dir2].node;if(Check if node should be inserted before child){Insert child node into the new node parent->child[dir2].node = node;break;}}}

<<Check if node should be inserted before child>>=node->critbit < child->critbit

### [edit] Deleting

Next, we will consider deleting a leaf from a previously initialised and populated tree. As with inserting and searching, the generic algorithm receives:

- A reference to a tree;
- A leaf object containing the key to be sought and removed;
- An extra parameter of type
`EXTRA_ARG`to pass to`DECIDE`.

It will return the leaf that was removed, if there is one which matches the key in the given leaf, so it may be destroyed by the specific code.

<<Generic algorithms>>=staticLEAFTYPE _del(structROOTSTRUCT * tree, LEAFTYPE leaf, EXTRA_ARG aux){LEAFTYPE result;struct_internal_node * node, * parent; uint32_t dir, dir2;if(Check if tree is empty)returnNO_LEAF;if(Check if root points to a leaf){result = tree->root.leaf;if(COMPARE(result, leaf) == -1){tree->root.leaf = NO_LEAF; tree->leafcount --;returnresult;}elsereturnNO_LEAF;}/* else */ node = tree->root.node; Walk tree looking for leafif(COMPARE(result, leaf) == -1){parent = tree->root.node; Walk tree and get parent of node Replace node with its other child tree->leafcount --; DEALLOC(node);returnresult;}elsereturnNO_LEAF;}

When removing a leaf which isn't tied directly at the root, we would end up with an internal node with only one subtree, which would cause problems when walking. Since internal nodes represent a split in the keys of their subtrees at a given bit, we can just throw that node away and replace it by its other child, still maintaining the invariant that an internal node always has two children.

<<Replace node with its other child>>=if(IS_LEAF(node, 1 - dir)){parent->child[dir2].leaf = node->child[1 - dir].leaf; SET_LEAF(parent, dir2);}else{parent->child[dir2].node = node->child[1 - dir].node;}

To do that, however, we must find the parent of the internal node in order to replace it as a child. It is the same as walking the tree to find a leaf, but we will always deal with internal nodes only, and we will compare the children to the node we want to determine if we found the parent.

<<Walk tree and get parent of node>>=while(1){dir2 = DECIDE(leaf, parent->critbit, aux);if(parent->child[dir2].node == node)break;elseparent = parent->child[dir2].node;}

### [edit] Iterating Forwards and in Reverse

This operation is not one which is not usual for either sets or maps/dictionaries, but it is something that radix trees can provide. If the bits of the keys are indexed by the critical-bit manipulating functions `COMPARE` and `DECIDE`, they can assume in the tree a natural order which can be recovered by a simple depth-first search on the tree.

For Unicode string keys, that indexing requires the bits to be ordered in ascending order by character index, and within a character they must be sorted from most significant to least significant bit. For unsigned integer keys, they must be sorted from most significant to least significant bit.

If another collating sequence is required, then the process complicates a bit. First, before any key is added, searched or deleted, the key must be mapped into a canonical form where increasing values of the bit correspond to higher numerical values of the characters; for example, if case independence is desired, then running `toupper()` or `tolower()` on the keys will make operations independent of case. However, for iterating, if one wants the callback to show the key as the user understands it, then you have to run the reverse mapping on the key in the tree before presenting it to the callback; if the mapping is not bijective (as `toupper()` and `tolower()` are), then both the original and the canonical form must be stored in the leaf, and the original is only used when it is supplied to the iteration callbacks.

The generic algorithm, in contrast, is very simple, and doesn't even require the use of the usual operations on `LEAFTYPE`, for which reason it only takes a tree and a callback to be called for each leaf. It does, however, use recursion, and for that reason we will remove the depth-first search into a separate function so it may call itself:

<<Generic helper functions>>=staticuint32_t _dfs(struct_internal_node * node, uint32_t (*leafcb)(LEAFTYPE), uint32_t (*inodecb)(struct_internal_node *), uint32_t direction, int32_t pre_post){uint32_t cur = direction;if((pre_post == -1) && (inodecb != NULL) && (inodecb(node) == 0))return0; Search the subtree in the `cur' directionif((pre_post == 0) && (inodecb != NULL) && (inodecb(node) == 0))return0; cur = 1 - cur; /* now the other child */ Search the subtree in the `cur' directionif((pre_post == 1) && (inodecb != NULL) && (inodecb(node) == 0))return0;return1;}

The `_dfs()` function takes a callback to be called on leaves and another on internal nodes; a `direction` parameter which tells which child we should process first: 0 if we're iterating forwards, 1 if we're iterating in reverse; and a `pre_post` parameter which tells whether the internal node callback should be called before, after or in the middle of calling the processing of the subtrees.

<<Search the subtree in the `cur' direction>>=if(IS_LEAF(node, cur)){if(leafcb(node->child[cur].leaf) == 0)return0;}else{if(_dfs(node->child[cur].node, leafcb, inodecb, direction, pre_post) == 0)return0;}

Of note is the fact that the callback takes a leaf and returns an `uint32_t` which may be either 1 if the iteration should continue or 0 if the algorithm should break out of it, which is why the DFS returns 0 if either the callback or its recursive instances return 0.

<<Generic algorithms>>=staticvoid_fwd(structROOTSTRUCT * tree, uint32_t (*cb)(LEAFTYPE)){Treat empty or single leaf case _dfs(tree->root.node, cb, NULL, 0, 0);}staticvoid_rev(structROOTSTRUCT * tree, uint32_t (*cb)(LEAFTYPE)){Treat empty or single leaf case _dfs(tree->root.node, cb, NULL, 1, 0);}

The empty case and the single leaf case are identical for both forward and reverse iteration, and simply either do nothing or call the callback upon the single leaf. In both cases, we return from the function so we don't have to wrap the rest of the function in braces.

<<Treat empty or single leaf case>>=if(Check if tree is empty)return;if(Check if root points to a leaf){cb(tree->root.leaf);return;}/* else */

### [edit] Initialisation and Disposal

Initialisation of the radix tree is trivial. Allocate a new `struct ROOTSTRUCT` structure, set its `leafcount` to zero, and return it. As it depends on the allocator, it may fail, as well, in which case the algorithm returns `NULL`.

<<Generic algorithms>>=staticstructROOTSTRUCT * _new(void){structROOTSTRUCT * result; result = (structROOTSTRUCT *) ALLOC(sizeof(structROOTSTRUCT));if(result != NULL) result->leafcount = 0;returnresult;}

Destruction is more complex, since we will have to free all leaves and internal nodes, as well as the root structure. For that, we will repurpose the `_dfs()` helper function to walk the tree, freeing leaves and internal nodes as we go.

<<Generic helper functions>>=staticuint32_t _dealloc_internal_node(struct_internal_node * node){DEALLOC(node);return1;}

<<Generic algorithms>>=staticvoid_end(structROOTSTRUCT * tree, uint32_t (*cb)(LEAFTYPE)){if(! Check if tree is empty){if(Check if root points to a leaf){cb(tree->root.leaf);}else{_dfs(tree->root.node, cb, _dealloc_internal_node, 0, 1);}}DEALLOC(tree);}

## [edit] Integer Set

So far, we have written a lot of generic code which cannot be used because it depends on a few definitions which haven't been provided yet. In this section, we will provide those definitions and generate two files (a header file and an implementation file) for a set of integers. Now, sets are trees whose leaves contain only their keys, so what would ordinarily be a *get* or *search* operation in a dictionary will be called a *test* operation here.

The header file will simply declare an opaque type `uint32_set` and five operations on it: `uint32_set_new`, `uint32_set_free`, `uint32_set_add`, `uint32_set_test`, `uint32_set_remove`, and `uint32_set_each`.

Note that we cannot declare uint32_set as a pointer to `struct ROOTSTRUCT`, since this preprocessor token won't be defined on the header file, so we just sub in the actual definition.

<<uint32_set.h>>=#ifndef _UINT32_SET_H_#define _UINT32_SET_H_#include <stdint.h>typedefstruct_uint32_set * uint32_set; uint32_set uint32_set_new(void);voiduint32_set_free(uint32_set set); uint32_t uint32_set_add(uint32_set set, uint32_t val); uint32_t uint32_set_test(uint32_set set, uint32_t val); uint32_t uint32_set_remove(uint32_set set, uint32_t val);voiduint32_set_each(uint32_set set, uint32_t (*cb)(uint32_t));#endif

Notice that the opaque type isn't declared as a pointer to `struct ROOTSTRUCT`; that struct cannot be visible outside of the implementation file or two different types based on radix trees won't be useable together. Therefore, we must remember to `typedef` into a unique name like `_uint32_set`.

<<uint32_set.c>>=#include <stdlib.h>#include <stdint.h>#include "uint32_set.h"

After the includes, we will have to define the preprocessor tokens the generic part of the implementation requires: `LEAFTYPE`, `NO_LEAF`, `COMPARE`, `DECIDE`, `EXTRA_ARG` and possibly `ALLOC` and `DEALLOC`, if we don't want to use `malloc()` and `free()` for them.

<<uint32_set.c>>=#define LEAFTYPE uint32_t#define ROOTSTRUCT _uint32_set#define NO_LEAF ((uint32_t) 0)#define COMPARE integer_compare#define DECIDE(n, b, dummy) (((n) >> (31 - (b))) & 1)#define EXTRA_ARG uint32_t

Note that in this case we will simply ignore the extra argument to `DECIDE`, but we still have to define its type `EXTRA_ARG`. Since it can be anything, we give it a base integer type and pass zero into every generic algorithm call which calls for a parameter of that type.

Also, we defined `DECIDE` as a macro, but `COMPARE` was defined as a token `integer_compare`, which is the name of a function which will have to be defined. Here is its definition:

<<uint32_set.c>>=staticint32_t integer_compare(uint32_t a, uint32_t b){uint32_t xor, mask, result; xor = a ^ b;for(mask = 0x80000000, result = 0; result < 32 && ! (xor & mask); mask >>= 1, result ++);if(result == 32)return-1;elsereturnresult;}

It simply XORs the two numbers together then checks, from most significant to least significant bit, until it finds a set bit. It then reports the number of the bit or -1 if it fell off the end of the number.

Having all this defined, we can finally include the generic part and define the _uint32_set type we refer to in the header:

<<uint32_set.c>>=Generic parttypedefstructROOTSTRUCT _uint32_set;

Now we can start on the implementation of the functions we defined on the header.

`uint32_set_new()` trivially calls `_new()`:

<<uint32_set.c>>=uint32_set uint32_set_new(void){return_new();}

`uint32_set_free()` doesn't actually have to deal with freeing leaves, as they aren't allocated separately on the heap, so we can simply pass it a dummy function which takes an `uint32_t` and returns 1:

<<uint32_set.c>>=staticuint32_t dummy_free_leaf(uint32_t arg){return1;}voiduint32_set_free(uint32_set set){_end(set, dummy_free_leaf);}

The lack of need for allocating leaves makes `add` similarly trivial; it is just convenient to remember that the third parameter is ignored in this specific case, so we just put in zero, and the fourth parameter, `should_replace`, is set to zero since there's no point in replacing one number with the exact same one. Finally, the `error` parameter is passed in by reference and checked for memory exhaustion errors:

<<uint32_set.c>>=uint32_t uint32_set_add(uint32_set set, uint32_t val){interror; _add(set, val, 0, 0, &error);if(error == -1)return0;elsereturn1;}

`test` similarly just passes through to `_get` and returns 1 if it didn't return `NO_LEAF`:

<<uint32_set.c>>=uint32_t uint32_set_test(uint32_set set, uint32_t val){return_get(set, val, 0) != NO_LEAF;}

`remove` can also safely ignore the return value of `_del`, since it returns a leaf that need not be freed:

<<uint32_set.c>>=uint32_t uint32_set_remove(uint32_set set, uint32_t val){_del(set, val, 0);return1;}

Finally, `uint32_set_each` receives the callback `cb` and passes it to `_fwd` trivially --- we could set up a flag to enumerate the numbers in decreasing order, but this will do for the purpose of this example:

<<uint32_set.c>>=voiduint32_set_each(uint32_set set, uint32_t (*cb)(uint32_t)){_fwd(set, cb);}

## [edit] String to Void * Map

Download code |