Image downsampling (C)

From LiteratePrograms
Jump to: navigation, search

This program is under development.
Please help to debug it. When debugging
is complete, remove the {{develop}} tag.


This article will describe a way to efficiently downscale an image, using linear interpolation to preserve all details from the source. We will be focusing on 8bit per sample bitmap, using K, RGB or RGBA color space. It will rely exclusively on CPU operation. The resizing can handle different aspect ratio between the source and resulting image.

This algorithm is specifically dedicated to down-scaling, not up-scaling. There is a infinite way of up-scaling a bitmap, the best method is usually context-dependent. Moreover methods range from very simple algorithm, like nearest-neighbor, to very complex like vectorized de-pixelization.

The implementation will be written in C99, using SDL for on-screen rendering and SDL_image for loading bitmap (which in itself rely on libjpeg and libpng internally).

Contents

[edit] Prototypes

Before talking about the algorithm itself, we will define what the input parameters will be:

<<DownScale.h>>=

#ifndef DOWN_SAMPLE_H
#define DOWN_SAMPLE_H

#include <stdint.h>

typedef uint8_t *                DATA8;
typedef uint16_t *               DATA16;
typedef uint32_t *               DATA32;
typedef struct Image_t *         Image;

struct Image_t
{
	int   width, height, bpp;
	int   stride, format;
	DATA8 bitmap;
};

int ImageDownScale(Image source, Image ret);
#endif

The function should resize the image 'source', into the user-allocated image 'ret'. The image_t datatype is flexible enough to allow you to only resize part of the source, into part of the destination (which is one key principle for down-scaling on-the-fly very large image, using lazy-rendering).

[edit] Stategy

The main idea behind this algorithm is to compute the surface covered in the source image by one pixel of the destination image, sum each channel independently and divide each sample by the surface. A careful use of temporary buffer will allow us to just do the right amount of computation.

Check out this image. It shows the process of resizing a 9x9px source image into a 2x2px bitmap. Each pixel from the destination bitmap covers a 4.5x4.5px surface area, that is 20.25px². To get the color of each pixel of the destination, we will sum each channel over the surface covered by one pixel, then simply divide by the surface. The image above highlights four different cases:

  1. Source pixel fully covered by destination (red)
  2. Source pixel vertically clipped by destination (yellow)
  3. Source pixel horizontally clipped by destination (orange)
  4. Source pixel horizontally and vertically clipped by destination (blue)

When a pixel is partially covered, we will have to only sum a fraction of each channel value. To compute this fraction, since we are doing linear interpolation, we'll rely on our trusty DDA (there will be some slight differences though). Except that DDA will provide a non-normalized error (which will range from 0 to Δx). To be a little bit more bit-shift friendly, it would be somewhat more efficient to have a normalized value (usually of power of 2, otherwise there would be a lot of divisions scattered all around the inner loop).

This can be easily obtained using a formula like err * normalized_value / Δx, but if we need a division to avoid using a few divisions later, this is not completely useless, although not completely useful either. However, since accumulated error within DDA increases/decreases linearly (modulo Δx), we can use a DDA to approximate a normalized error of another DDA.

To do this, just look at the iteration phase of a DDA :

iter->x ++;
iter->y += iter->sx;
iter->err -= iter->dx;
if (iter->err <= 0)
    iter->y += iter->sy, iter->err += iter->xe;

(Just for the record, y is the interpolated value, dx equals Δy%xe, sx is Δy/xe and xe equals Δx.)

This means that within the [0,xe] range, the error goes from 0 to xe*dx. Ignore the fact that the error is subtracted, instead of added: this is a small optimization trick, and for now also ignore the modulo (the "if" part). But instead of increasing the accumulated error by dx, we can normalize this increment using the previous formula:

dx * normalized_value / xe(=Δx)

Then, the interpolated range for the normalized value will be:

xe * dx * normalized_value / xe = dx * normalized_value

Therefore to interpolate the normalized error, we will have to iterate over the [0,xe] range and interpolate this into the range [0,dx*normalized_value]. The normalized error will iter.err % normalized_value.

This normalized error will be useful for the "sum" buffer in the algorithm below. This buffer hold the sum of all the scanlines from the source image that is covered by one scanline of the destination. The "total" buffer does the same for one pixel of the destination image. A careful use of these buffers can avoid useless computation, like this code to account for vertically clipped pixels:

tmp = *p * yerr;       // Partial value (p = source pixel)
total[3] += tmp + *s;  // Add to current total (s = sum buffer)
*s = (*p<<BITS) - tmp; // Reset sum buffer
s ++; p ++;            // Process next sample

We actually do the same strategy for the horizontal axis, and when we have to do both, the code becomes quite tricky, albeit based on the same principles. In the algorithm below we also use a special code path when there is no error compensation to apply.

[edit] Algorithm

<<DDA.c>>=
typedef struct Iter {
  int y, ye, sy, dy,
      x, xe, sx, dx,
      err;
} Iter;

static void InitDDA(Iter * iter, int xs, int xe, int ys, int ye)
{
	/* Pre-condition: xe > xs >= 0 */
	div_t q = div(iter->dy = ye - ys, xe);
	iter->y   = ys;
	iter->ye  = ye;
	iter->x   = xs;
	iter->xe  = xe;
	iter->err = xe;
	iter->dx  = abs(q.rem);
	iter->sx  = q.quot;
	iter->sy  = (ys < ye ? 1 : -1);
	if (xs > 0)
		iter->y   = ye-(iter->dy * (xe-xs) + iter->sy * iter->err)/xe,
		iter->err = (iter->err-iter->dx*xs) % xe + xe;
}

static inline void IterDDA(Iter * iter)
{
	iter->x ++;
	iter->y += iter->sx;
	iter->err -= iter->dx;
	if (iter->err <= 0)
		iter->y += iter->sy, iter->err += iter->xe;
}
<<Resize.c>>=
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include "DownScale.h"

DDA.c

#define	BITS         8
#define	VALUES       (1 << BITS)

int ImageDownScale(Image src, Image ret)
{
    DATA32 sum;
    int    i, y, w, h, chan, sz, surf, xerr;

    DATA8 out = ret->bitmap;
    DATA8 in  = src->bitmap;
    Iter  ypos, xpos, nerr;

    sz = src->width * src->bpp;
    if (ret->width == 0 || ret->height == 0) return 0;
    if (ret->width > src->width || ret->height > src->height) return 0;
    if (ret->width == src->width && ret->height == src->height) {
        for (y = ret->height; y > 0; y --, in += src->stride, out += ret->stride)
            memcpy(out, in, sz);
        return 1;
    }

    w    = ret->width;
    h    = ret->height;
    chan = src->bpp;
    sum  = calloc(sz, sizeof *sum);
    surf = (unsigned long long) src->width * src->height * VALUES / (w * h);
    y    = 0;

    if (sum == NULL) return 0;
    memset(&nerr, 0, sizeof nerr); nerr.err = 1;
    InitDDA(&ypos, 0, h, 0, src->height);
    InitDDA(&xpos, 0, w, 0, src->width); xerr = xpos.dx > 0;

    while (ypos.x < ypos.xe)
    {
        DATA8 d = out, p;
        int x, yerr;
        int total[4], tmp;
        DATA32 s;
        IterDDA(&ypos);
        yerr = (ypos.xe - ypos.err) * VALUES / ypos.xe;

        while (y < ypos.y)
        {
            for (p = in, i = sz, s = sum; i > 0; *s++ += (*p<<BITS), p++, i --);
            y ++; in += src->stride;
        }

        InitDDA(&xpos, 0, w, 0, src->width); x = 0;      IterDDA(&xpos);
        InitDDA(&nerr, 0, xpos.xe, 0, VALUES * xpos.dx); IterDDA(&nerr);
        memset(total, 0, sizeof total);
        if (yerr > 0)
        {
            #define MAX_255(ptr, val) { int z = val; *ptr = (z >= 255 ? 255 : z); ptr ++; }
            for (p = in, i = ret->width, s = sum; i > 0; ) {
                if (x < xpos.y) {
                    /* Vertical error compensation */
                    switch (chan) {
                    case 4: tmp = *p * yerr; total[3] += tmp + *s; *s = (*p<<BITS) - tmp; s ++; p ++;
                    case 3: tmp = *p * yerr; total[2] += tmp + *s; *s = (*p<<BITS) - tmp; s ++; p ++;
                            tmp = *p * yerr; total[1] += tmp + *s; *s = (*p<<BITS) - tmp; s ++; p ++;
                    case 1: tmp = *p * yerr; total[0] += tmp + *s; *s = (*p<<BITS) - tmp; s ++; p ++;
                    }
                    x ++;
                } else {
                    int err = nerr.y & (VALUES-1);
                    if (xerr == 0 || err == 0) {
                        switch (chan) {
                        case 4: MAX_255(d, (total[3] + (surf>>1)) / surf);
                        case 3: MAX_255(d, (total[2] + (surf>>1)) / surf);
                                MAX_255(d, (total[1] + (surf>>1)) / surf);
                        case 1: MAX_255(d, (total[0] + (surf>>1)) / surf);
                        }
                        memset(total, 0, sizeof total);
                    } else {
                        int k;
                        /* Vertical and horizontal error compensation */
                        for (k = chan-1; k >= 0; k --, p ++, s++) {
                            tmp = *p * yerr;
                            int tmp2 = tmp * err >> BITS;
                            int right = *s * err >> BITS;
                            MAX_255(d, (total[k] + tmp2 + (surf>>1) + right) / surf);
                            total[k] = *s - right + tmp - tmp2;
                            *s = (*p<<BITS) - tmp;
                        }
                        x++;
                    }
                    IterDDA(&nerr);
                    IterDDA(&xpos); i --;
                }
            }
            y ++; in += src->stride;
        }
        else /* No vertical error (maybe horizontal) */
        {
            for (i = ret->width, s = sum; i > 0; ) {
                if (x < xpos.y) {
                    /* No error compensation */
                    switch (chan) {
                    case 4: total[3] += *s; s ++;
                    case 3: total[2] += *s; s ++;
                            total[1] += *s; s ++;
                    case 1: total[0] += *s; s ++;
                    }
                    x ++;
                } else {
                    int err = nerr.y & (VALUES-1);
                    if (xerr == 0 || err == 0) {
                        switch (chan) {
                        case 4: MAX_255(d, (total[3] + (surf>>1)) / surf);
                        case 3: MAX_255(d, (total[2] + (surf>>1)) / surf);
                                MAX_255(d, (total[1] + (surf>>1)) / surf);
                        case 1: MAX_255(d, (total[0] + (surf>>1)) / surf);
                        }
                    } else {
                        /* Horizontal error compensation */
                        switch (chan) {
                        case 4: tmp = *s * err >> BITS; MAX_255(d, (total[3] + tmp + (surf>>1)) / surf); *s -= tmp; s++;
                        case 3: tmp = *s * err >> BITS; MAX_255(d, (total[2] + tmp + (surf>>1)) / surf); *s -= tmp; s++;
                                tmp = *s * err >> BITS; MAX_255(d, (total[1] + tmp + (surf>>1)) / surf); *s -= tmp; s++;
                        case 1: tmp = *s * err >> BITS; MAX_255(d, (total[0] + tmp + (surf>>1)) / surf); *s -= tmp; s++;
                        }
                        s -= chan;
                    }
                    IterDDA(&nerr);
                    IterDDA(&xpos); i --;
                    memset(total, 0, sizeof total);
                }
            }
            #undef MAX_255
            memset(sum, 0, sz<<2);
        }
        out += ret->stride;
    }
    free(sum);
    return 1;
}
<<ImageTest.c>>=
#include <stdio.h>
#include <stdlib.h>
#include <SDL/SDL.h>
#include <SDL/SDL_image.h>
#include "DownScale.h"

#define WINW    500
#define WINH    500

int main(int nb, char * argv[])
{
	if (SDL_Init(SDL_INIT_VIDEO) < 0)
		return 1;

	IMG_Init(IMG_INIT_PNG|IMG_INIT_JPG);
	SDL_EnableKeyRepeat(SDL_DEFAULT_REPEAT_DELAY, SDL_DEFAULT_REPEAT_INTERVAL);

	atexit(SDL_Quit);

	SDL_Surface * screen = SDL_SetVideoMode(WINW, WINH, 32, SDL_HWSURFACE|SDL_ANYFORMAT);
	if (! screen) return 1;

	SDL_WM_SetCaption("Image down sampling", NULL);

	char * file = nb > 1 ? argv[1] : "Rings1_small.png";
	SDL_Surface * img = IMG_Load(file);

	if (img == NULL) { perror(file); return 1; }

	struct Image_t source = {
		.bitmap = img->pixels, .stride = img->pitch, .width = img->w, .height = img->h,
		.bpp = img->format->BytesPerPixel
	};
	int w = 300;
	SDL_Surface * tmp = NULL;

	memset(screen->pixels, 0xff, screen->pitch * screen->h);
	SDL_Rect rect = {.x = (screen->w - img->w) >> 1, .y = (screen->h - img->h) >> 1};
	SDL_BlitSurface(img, NULL, screen, &rect);
	SDL_UpdateRect(screen, 0, 0, 0, 0);

	while (1)
	{
		SDL_Event event;
		while (SDL_WaitEvent(&event))
		{
			int inc = 0;
			switch (event.type) {
			case SDL_QUIT: return 0;
			case SDL_KEYDOWN:
				switch (event.key.keysym.sym) {
				case SDLK_ESCAPE: return 0;
				case SDLK_LEFT: case SDLK_UP: inc = -2; break;
				case SDLK_RIGHT: case SDLK_DOWN: inc = 2; break;
				default: continue;
				}
				if (inc)
				{
					w += inc;
					if (w > img->w) { w = img->w; continue; }
					if (w < 1) { w = 1; continue; }
					int h = w * img->h / img->w;
					h /= 2;
					if (h < 1) h = 1;

//					fprintf(stderr, "new dimension = %dx%d\n", w, h);

					if (w != img->w && h != img->h)
					{
						/* Cannot use ImageReduce() directly on screen: pixel format differs */
						tmp = SDL_CreateRGBSurface(SDL_SWSURFACE, w, h, img->format->BitsPerPixel,
							img->format->Rmask, img->format->Gmask, img->format->Bmask, img->format->Amask);

						struct Image_t dest = {
							.bitmap = tmp->pixels, .stride = tmp->pitch, .width = tmp->w, .height = tmp->h,
							.bpp = tmp->format->BytesPerPixel
						};

						ImageDownScale(&source, &dest);
					}
					else tmp = img;

					SDL_Palette * old = tmp->format->palette;
					tmp->format->palette = img->format->palette;
					memset(screen->pixels, 0xff, screen->pitch * screen->h);
					rect.x = (screen->w - tmp->w) >> 1;
					rect.y = (screen->h - tmp->h) >> 1;
					SDL_BlitSurface(tmp, NULL, screen, &rect);
					SDL_UpdateRect(screen, 0, 0, 0, 0);
					tmp->format->palette = old;
					if (tmp != img) SDL_FreeSurface(tmp);
				}
			}
		}
	}
	return 0;
}

#ifdef WIN32
#include <windows.h>
int WINAPI WinMain (HINSTANCE instance,
                    HINSTANCE previnst,
                    LPSTR args,
                    int wndState)
{
	return main(0, NULL);
}
#endif

[edit] Possible improvements

There is mostly two methods that can significantly improve the performance of this algorithm, without changing much of what is already there. Since they add quite a lot of overhead, we will just briefly describe the strategies. These improvements are usually only useful for large image: if one CPU is enough to handle the resizing, don't even bother to add more overhead.

First and by far the most efficient way to improve this method is to rely on multi-threading. You can split the image into horizontal band (always follow the way pixels are organized in memory: therefore avoid vertical split). Splitting the image means, that the resizing will not be as precise as if done as a whole, although most of the time you won't be able to tell the difference. One critical part to take care is synchronization. Avoid using mutexes, because contention potentially means that your process will be put on hold for as long as 10ms. Instead simple use an active loop, since it is unlikely that the other threads are very far apart in the processing. Simply use a waiting loop like this:

while (lines_processed_by_thread < total_lines_to_process);

Another method that can cut down processing time by 10 to 20%, is to use an exponential resizing by a factor of 2. The trick is: reducing an image by a factor 2 is very easy (from the CPU viewpoint). So, before passing the pixels to the linear reduction loop, first resize the image by a series of factor-2 reductions, then once the final reduction factor is between 1 and 2, use the linear interpolation loop. A small memory buffer will have to be allocated: 2^(step+1) scanlines of source->stride/2 pixels. This exponential resize will have to be intertwined within the linear reduction loop, which significantly increase the overall overhead. Also this only works when the reduction uses the same aspect for the horizontal and vertical axis (which is by far the most common use case). Also, like with threading, this method does not work very well with small images because of successive rounding induced by dividing by 2. For large image, the rounding is next to impossible to notice though.

[edit] Nearest neighbor implementation

Just for the record, here is the nearest neighbor algorithm for resizing any image (which works both for reducing and enlarging an image, using different aspect ratio). By using DDA, we can see that the inner loop is very simple, although unsurprisingly its quality is rather poor when it comes to large reduction factor.

int ImageDownScale(Image src, Image ret)
{
	DATA8 out = ret->bitmap;
	DATA8 in  = src->bitmap;
	Iter  ypos, xpos;
	int   chan;

	if (ret->width == 0 || ret->height == 0) return 0;
	if (ret->width == src->width && ret->height == src->height) {
		memcpy(ret->bitmap, src->bitmap, ret->stride * ret->height);
		return 1;
	}
	chan = src->bpp;
	InitDDA(&ypos, 0, ret->height, 0, src->height);
	while (ypos.x < ypos.xe)
	{
		DATA8 p, s;
		InitDDA(&xpos, 0, ret->width, 0, src->width);
		for (p = out, s = in; xpos.x < xpos.xe; s = in + xpos.y*chan)
		{
			switch (chan) {
			case 4: *p++ = *s++;
			case 3: *p++ = *s++;
			        *p++ = *s++;
			case 1: *p++ = *s++;
			}
			IterDDA(&xpos);
		}
		IterDDA(&ypos);
		out += ret->stride;
		in   = src->bitmap + ypos.y * src->stride;
	}
	return 1;
}
Download code
hijacker
hijacker
hijacker
hijacker