Word count (C Plus Plus)

Other implementations: Assembly Intel x86 Linux | C | C++ | Forth | Haskell | J | Lua | OCaml | Perl | Python | Python, functional | Rexx

An implementation of the UNIX wc tool.

The wc tool counts characters, words and lines in text files or stdin. When invoked without any options, it will print all three values. These options are supported:

  • -c - Only count characters
  • -w - Only count words
  • -l - Only count lines

If the tool is invoked without any file name parameters, it will use stdin.


The code as presented here assumes Unix file conventions; in particular, it assumes that lines end with a single newline character. Since the C++ standard guarantees that text-mode streams translate the platform's line ending into '\n', the program will also work correctly on systems that use a different single newline character (such as ASCII CR on classic Mac OS). However, it will give a wrong character count on systems that use multi-character line endings (such as Windows, which uses a CR/LF sequence to indicate a new line). It also assumes a single-byte encoding; no effort is made to handle multi-byte encoded files correctly.
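
To make the character-count caveat concrete, here is a minimal sketch that counts how many characters remain in a CR/LF-terminated buffer once each CR/LF pair is read as a single '\n', which is what a text-mode stream does on such systems. The helper name count_after_translation is hypothetical and not part of wc.cc.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of wc.cc): returns how many characters
// remain when every "\r\n" pair in raw is read as a single '\n'.
std::size_t count_after_translation(std::string const& raw)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < raw.size(); ++i)
  {
    // Skip the CR of a CR/LF pair; the LF that follows is still counted.
    if (raw[i] == '\r' && i + 1 < raw.size() && raw[i + 1] == '\n')
      continue;
    ++count;
  }
  return count;
}
```

For the ten-byte buffer "one\r\ntwo\r\n" the helper returns 8, so on such a system the reported character count would be two lower than the file size in bytes.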

The code was successfully tested on Linux.

The general structure of the program is as follows.

<<wc.cc>>=
includes
option struct
counter class
error reporting function
file processing function
main function


The main function

The main function first parses the options, and then loops through the files, calling a word count function for each of them. It also defines a variable holding the exit code to return.

<<main function>>=
int main(int argc, char* argv[])
{
  int exit_code = EXIT_SUCCESS;
  parse options
  process files
  return exit_code;
}

For EXIT_SUCCESS, we need the standard header <cstdlib>.

<<includes>>=
#include <cstdlib>

Now that we have all the necessary infrastructure in place, we continue with the actual implementation.

Parsing the options

The option parsing code is basically C-style. That's because the arguments are delivered in a C string array anyway, and advanced C++ language features would be overkill here.

First, some variables are defined to store which options are set. Since there are only three options, the simplest solution is to use a separate bool for each one. To pass them around easily, we collect those bools in a struct. Each of them defaults to false, so that they can be set individually. The case where none of them has been set is handled explicitly after parsing the options.

<<option struct>>=
struct option_struct
{
  bool count_characters;
  bool count_words;
  bool count_lines;
  option_struct(): count_characters(false), count_words(false), count_lines(false) {}
};

An option is recognized by starting with "-", but not being "-" itself (that's a file name argument naming standard input). Options have to come before file name arguments; after the first file name argument is found, all following arguments are considered file arguments, too, even if they start with "-". Thus, option parsing can stop as soon as a non-option is found. If an unsupported option is found, the program immediately fails with an appropriate error message.

Options can be combined, that is, wc -cl is the same as wc -c -l.

The loop condition depends on the short-circuit behaviour of C++'s logical and operator: argv[current_arg] is only inspected if current_arg is still a valid index. It also depends on the zero termination of C strings.

<<parse options>>=
int current_arg = 1;
option_struct options;
while(current_arg < argc
      && argv[current_arg][0] == '-'
      && argv[current_arg][1] != '\0')
{
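  char* arg = argv[current_arg];
  ++current_arg; // move on to the next argument; without this the loop never ends
  for (std::size_t position = 1; arg[position] != '\0'; ++position) // start right after the '-'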
  {
    switch(arg[position])
    {
    case 'c':
      options.count_characters = true;
      break;
    case 'w':
      options.count_words = true;
      break;
    case 'l':
      options.count_lines = true;
      break;
    default:
      std::cerr << "wc: invalid option: " << arg[position] << "\n";
      return EXIT_FAILURE;
    }
  }
}

For the error message, we need the headers iostream (for cerr) and ostream (for the output operator).

<<includes>>=
#include <iostream>
#include <ostream>

If no options were set, we actually want all of them instead.

<<parse options>>=
if (!(options.count_characters || options.count_words || options.count_lines))
  options.count_characters = options.count_words = options.count_lines = true;

Note that at the end of option parsing, current_arg is the index of the first non-option argument.
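
The parsing rules above can be tried out in isolation with a small sketch like the following. It is an illustration rather than the code used in wc.cc: the names parsed_flags and parse_flags are hypothetical, the arguments are passed as a vector of strings instead of argv, and an unknown option letter is only recorded instead of terminating the program. The "no options means all options" default is left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for option_struct, for illustration only.
struct parsed_flags
{
  bool characters;
  bool words;
  bool lines;
  bool valid;              // false if an unknown option letter was seen
  std::size_t first_file;  // index of the first non-option argument
};

// Hypothetical helper: args[0] plays the role of the program name.
parsed_flags parse_flags(std::vector<std::string> const& args)
{
  parsed_flags result = { false, false, false, true, 1 };
  std::size_t current = 1;
  // An option starts with '-' and has at least one more character,
  // so a lone "-" ends option parsing.
  while (current < args.size()
         && args[current].size() > 1
         && args[current][0] == '-')
  {
    std::string const& arg = args[current];
    // Every letter after the '-' is an option, so "-cl" sets two flags.
    for (std::size_t position = 1; position < arg.size(); ++position)
    {
      switch (arg[position])
      {
      case 'c': result.characters = true; break;
      case 'w': result.words = true; break;
      case 'l': result.lines = true; break;
      default:  result.valid = false; break;  // wc.cc exits here instead
      }
    }
    ++current;
  }
  result.first_file = current;
  return result;
}
```

Called with the arguments wc -cl - -w, this sketch sets the character and line flags and reports index 2 as the first file argument; the trailing -w is left alone because it follows a file name argument.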

Processing the files

After having parsed all the options, the actual work can start. For simplicity, all three counts are calculated unconditionally, but only the requested counts are output.

Since the three counts all are updated together, it makes sense to collect them in a single type counter, which will be defined later. We need one counter object for the total counts, and for each file we will get another one. The actual counting, reporting of errors and output of the results are done by separate functions.

When iterating through the files, there is one special case to handle: if no file was given at all, wc has to read standard input. Otherwise we iterate through the files given on the command line, where the name "-" again stands for standard input.

<<process files>>=
counter total;

if (current_arg == argc)
{
  if (!process_file(std::cin, NULL, total, options))
    exit_code = EXIT_FAILURE;
}
else
{
  for (int file_arg = current_arg; file_arg < argc; ++file_arg)
  {
    char const* filename = argv[file_arg];
    if (filename[0] == '-' && filename[1] == '\0')
    {
      if (!process_file(std::cin, filename, total, options))
        exit_code = EXIT_FAILURE;
      std::cin.clear();
    }
    else
    {
      std::ifstream file(filename);
      if (!process_file(file, filename, total, options))
        exit_code = EXIT_FAILURE;
    }
  }
}

For this, we need the header fstream.

<<includes>>=
#include <fstream>

If there was more than one file given on the command line, also report the totals.

<<process files>>=
if (argc - current_arg > 1)
  total.report(options, "total");

The counter class

By putting the three counts in a class, the update logic can be put there, too. To update the counts one character at a time, the class also has to keep track of some internal state. For the total, we need a way to add counters up. Finally, the class also knows how to output the results.

<<counter class>>=
class counter
{
public:
  counter(): characters(0), words(0), lines(0), counter internal state initializers {}
  counter& operator+=(counter const& other)
  {
    characters += other.characters;
    words += other.words;
    lines += other.lines;
    return *this;
  }
  void update(char next_char);
  void report(option_struct options, char const* filename);
private:
  int characters;
  int words;
  int lines;
  counter internal state
};

void counter::update(char next_char)
{
  update code
}

void counter::report(option_struct options, char const* filename)
{
  report code
}

The update member function does the actual work of incrementing the individual counters. For the character and line counters, it's trivial: the character counter is always incremented, the line counter whenever we see a '\n'.

<<update code>>=
  ++characters;
  if (next_char == '\n')
    ++lines;

The non-trivial part is the word counter. Words are sequences of non-whitespace characters, separated by whitespace. Thus, we have a new word whenever we see a non-whitespace character which follows a whitespace character. Therefore we have to store whether the previous character was whitespace.

<<counter internal state>>=
bool current_is_whitespace;

A non-whitespace character at the beginning of the file of course also starts a new word, thus we have to treat the file start as if there has been whitespace before it.

<<counter internal state initializers>>=
current_is_whitespace(true)

Now we can implement the word counter update. Note that the cast to unsigned char is necessary because std::isspace has undefined behaviour for negative arguments, which happens on implementations where char is signed for characters outside the standard ASCII range (such as 'ä').

<<update code>>=
  bool next_is_whitespace = std::isspace((unsigned char)next_char);
  if (current_is_whitespace && !next_is_whitespace)
    ++words;

Of course, we also have to update the internal state:

<<update code>>=
  current_is_whitespace = next_is_whitespace;

For std::isspace, we also need a header:

<<includes>>=
#include <cctype>
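
The word-counting rule can also be seen in isolation in the following minimal sketch, which applies the same whitespace-to-non-whitespace transition test to a whole string at once. The function count_words is a hypothetical helper for illustration and is not part of wc.cc.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of wc.cc): counts words in text by
// counting transitions from whitespace to non-whitespace.
int count_words(std::string const& text)
{
  int words = 0;
  bool previous_is_whitespace = true;  // the start of input counts as whitespace
  for (std::string::size_type i = 0; i < text.size(); ++i)
  {
    // Convert to unsigned char first: std::isspace is undefined for
    // negative values other than EOF.
    bool is_whitespace =
      std::isspace(static_cast<unsigned char>(text[i])) != 0;
    if (previous_is_whitespace && !is_whitespace)
      ++words;
    previous_is_whitespace = is_whitespace;
  }
  return words;
}
```

With this rule, leading, trailing and repeated whitespace do not produce extra words: count_words("  hello  world\n") yields 2.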

The report function just outputs the values requested by the options passed, followed by the passed filename if non-null, to std::cout.

<<report code>>=
if (options.count_lines)
  std::cout << std::setw(7) << lines << " ";
if (options.count_words)
  std::cout << std::setw(7) << words << " ";
if (options.count_characters)
  std::cout << std::setw(7) << characters << " ";
if (filename)
  std::cout << filename;
std::cout << std::endl;

For this, we need the header iomanip (for setw). The headers ostream (for the output operators) and iostream (for cout) have already been included for the main program.

<<includes>>=
#include <iomanip>
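
To see what the field width does to the output, the following sketch builds the same kind of report line in a string instead of writing it to std::cout. The function format_report is a hypothetical helper, not part of wc.cc, and for brevity it always shows all three counts.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not part of wc.cc): builds a report line with all
// three counts, followed by the file name if one is given.
std::string format_report(int lines, int words, int characters,
                          char const* filename)
{
  std::ostringstream out;
  // std::setw only affects the next item written, so it is repeated
  // before every count.
  out << std::setw(7) << lines << " "
      << std::setw(7) << words << " "
      << std::setw(7) << characters << " ";
  if (filename)
    out << filename;
  return out.str();
}
```

Each count is right-aligned in a field of seven characters, so the columns line up across files as long as no count needs more than seven digits; wider values are printed in full and push the rest of the line to the right.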

The file processing function

Now that we have the counting logic, we need to feed it. That is done in the function process_file. It expects an already opened input stream, the file name to be reported to the user, a counter object holding the current total, and the output options. It returns true if the file was successfully processed, and false otherwise. The filename argument may be NULL, in which case it is assumed the file has no name.

<<file processing function>>=
bool process_file(std::istream& file, char const* filename, counter& total, option_struct options)
{
  process_file body
}

First, process_file checks if the file was successfully opened, and returns an error if not.

<<process_file body>>=
if (!file)
{
  print_error(filename, "failed to open.");
  return false;
}

The function then tells the stream not to skip any whitespace, and then feeds each character to the counter object.

<<process_file body>>=
file >> std::noskipws;
char c;
counter result;
while (file >> c)
  result.update(c);

For this to work, we of course need the standard header istream for the input operators.

<<includes>>=
#include <istream>

The loop terminates when it fails to read another character. This can have two reasons: either we have hit the end of the file, in which case the file was successfully processed, or some problem occurred before reaching the end of the file. In the latter case, the code disregards the data collected so far and returns an error.

<<process_file body>>=
if (!file.eof())
{
  print_error(filename, "reading failed.");
  return false;
}

After this test, it's clear that the file was correctly processed, therefore the results are reported to the user and the total is updated. Also the error state of the stream is cleared and success returned.

<<process_file body>>=
result.report(options, filename);
total += result;
file.clear();
return true;
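
The reading pattern used by process_file (disable whitespace skipping, read until extraction fails, then check eof() to tell a normal end of input from an error) can be tried on any input stream. The sketch below is an illustration with a hypothetical helper name, count_all_characters, and it only counts characters rather than feeding a counter object.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying the function out

// Hypothetical helper (not part of wc.cc): reads every character from in,
// including whitespace, and stores the number read in count.
// Returns true only if reading stopped because the end of input was reached.
bool count_all_characters(std::istream& in, std::size_t& count)
{
  count = 0;
  in >> std::noskipws;  // without this, operator>> would skip whitespace
  char c;
  while (in >> c)
    ++count;
  bool reached_end = in.eof();
  in.clear();           // reset the error state so the stream can be reused
  return reached_end;
}
```

Reading from a std::istringstream holding "a b\n" gives a count of 4 and a return value of true, since the space and the newline are read like any other character. A stream that is already in a failed state yields a count of 0 and a return value of false.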

The error output function

The function print_error just outputs a simple error message to standard error. If the context argument is NULL, it prints "wc: message". Otherwise, it prints "wc: context: message". The message argument must not be NULL.

<<error reporting function>>=
void print_error(char const* context, char const* message)
{
  std::cerr << "wc: ";
  if (context)
    std::cerr << context << ": ";
  std::cerr << message << std::endl;
}