Word count (OCaml)

From LiteratePrograms

An implementation of the UNIX wc tool, in OCaml.

The wc tool counts characters, words and lines in text files or stdin. When invoked without any options, it will print all three values. These options are supported:

  • -c - Only count characters
  • -w - Only count words
  • -l - Only count lines

If the tool is invoked without any file name parameters, it will use stdin.

This is a somewhat more complex implementation of wc than other sample OCaml implementations on the web. This added complexity stems from two things:

  1. Adding command line options;
  2. Avoiding loading an entire file into memory to count it (and instead reading the file line-by-line).

Counting and printing

The core of wc's functionality is to count various quantities, and print the results to the screen. This task is accomplished by the wc function, with the help of several supporting functions:

<<counting and printing>>=
getCount

countFile

printCountLine

countAndPrintFile

wc

The wc function takes a group of option settings and a list of filenames, and for each file in the list prints the count totals specified by the options.

<<wc>>=
let wc opts = function

If the file list is empty, or contains only a "-", the input is assumed to arrive via stdin. We therefore simply get the count totals for the stdin handle, and print that count.

<<wc>>=
    []
  | ["-"] -> printCountLine opts (getCount stdin ("",0,0,0))

If the file list contains only a single filename, we simply open and count that file, and print the resulting count totals. The situation when the file list has multiple elements is a little more complex. In this case, we need to both find and print the count for each file in the list, and accumulate a total count for all of the files. This is accomplished by folding the function countAndPrintFile across the file list.

<<wc>>=
  | [file] -> printCountLine opts (countFile file)
  | files ->
      let totalcount = List.fold_left (countAndPrintFile opts) ("total",0,0,0) files in
        printCountLine opts totalcount

countAndPrintFile

At each step of the fold operation, countAndPrintFile takes the current total count and a filename, prints the results of counting the file, and returns a new total count (which serves as an input to the next step of the fold).

<<countAndPrintFile>>=
let countAndPrintFile opts (tf,tls,tws,tcs) file =
  let (_,ls,ws,cs) as count = countFile file in
    printCountLine opts count;
    (tf, tls+ls, tws+ws, tcs+cs)

The second argument to countAndPrintFile is a tuple consisting of a label (here "total") and three integers holding the running line, word, and character totals. countFile returns a tuple of the same shape for each individual file, with the filename in place of the label.
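
To see the accumulation in isolation, the sketch below folds over a hand-written list of counts instead of real files. The names addCount and sampleCounts, and the sample numbers, are made up for illustration; addCount mirrors only the arithmetic of countAndPrintFile and leaves out the printing.

```ocaml
(* A count is (name, lines, words, chars), mirroring the tuples in the text. *)
type count = string * int * int * int

(* Hypothetical pure counterpart of countAndPrintFile: add one file's
   counts to the running total and keep the total's label. *)
let addCount ((tf, tls, tws, tcs) : count) ((_, ls, ws, cs) : count) : count =
  (tf, tls + ls, tws + ws, tcs + cs)

(* Made-up per-file counts standing in for the results of countFile. *)
let sampleCounts : count list =
  [ ("a.txt", 3, 10, 52); ("b.txt", 1, 4, 20); ("c.txt", 0, 0, 0) ]

(* The fold starts from the same ("total",0,0,0) accumulator that wc uses. *)
let total = List.fold_left addCount ("total", 0, 0, 0) sampleCounts

let () =
  let (name, ls, ws, cs) = total in
  Printf.printf "\t%d\t%d\t%d\t%s\n" ls ws cs name
```

Because List.fold_left visits the list from left to right, the per-file lines in the real program appear in command-line order, and the result keeps the label of the initial accumulator.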

countFile

The countFile function takes a filename as its argument, and returns the line, word, and character counts for that file. The actual counting operation is handled by getCount, while countFile handles opening and closing the file to be counted.

<<countFile>>=
let countFile file =
  let hdl = open_in file in
  let count = getCount hdl (file,0,0,0) in
    close_in hdl;
    count

getCount

The getCount function is the workhorse of the counting operation. At each iteration, it tries to read a line from the open input channel hdl. If the attempted read shows that the end-of-file has been reached, getCount returns the total number of lines, words, and characters in the file. If a line was successfully read, the line count is incremented by 1, and the word and character counts are respectively increased by the numbers of words and characters in the line. The character count is further incremented by 1, to account for the fact that input_line strips the newline character off of any line it returns. As a consequence, a final line that lacks a terminating newline is still counted as a line and contributes one extra character.

We include a utility function words to split a string into words.

<<getCount>>=
let words =
  Str.split (Str.regexp "[ \t\n]+")

let rec getCount hdl (f,ls,ws,cs) =
  try
    let line = input_line hdl in
    let ls = ls + 1
    and ws = ws + List.length (words line)
    and cs = cs + String.length line + 1 in
      getCount hdl (f,ls,ws,cs)
  with End_of_file ->
    (f,ls,ws,cs)
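
One caveat worth noting: in the version above the recursive call sits inside the try block, so it is not a tail call and the stack grows with the number of lines read. A possible alternative, assuming OCaml 4.02 or later for the match ... with exception form, is sketched below. The names getCountTR, wordsOfLine and sampleCount are hypothetical, and wordsOfLine is a standard-library stand-in for the Str-based words function (assuming OCaml 4.04 or later for String.split_on_char), so that the sketch runs without loading str.cma.

```ocaml
(* Split on spaces, tabs and newlines, dropping empty pieces.
   Assumption: a stdlib-only stand-in for the Str-based words function. *)
let wordsOfLine line =
  line
  |> String.map (fun c -> if c = '\t' || c = '\n' then ' ' else c)
  |> String.split_on_char ' '
  |> List.filter (fun w -> w <> "")

(* Tail-recursive variant: only input_line is covered by the exception
   case, so the recursive call is in tail position. *)
let rec getCountTR hdl (f, ls, ws, cs) =
  match input_line hdl with
  | line ->
      getCountTR hdl
        (f,
         ls + 1,
         ws + List.length (wordsOfLine line),
         cs + String.length line + 1)
  | exception End_of_file -> (f, ls, ws, cs)

(* Write a two-line temporary file, count it, and clean up. *)
let sampleCount =
  let path = Filename.temp_file "wc_sketch" ".txt" in
  let oc = open_out path in
  output_string oc "hello world\nfoo\tbar  baz\n";
  close_out oc;
  let ic = open_in path in
  let count = getCountTR ic ("sample", 0, 0, 0) in
  close_in ic;
  Sys.remove path;
  count

let () =
  let (f, ls, ws, cs) = sampleCount in
  Printf.printf "\t%d\t%d\t%d\t%s\n" ls ws cs f
```

The sample file has 2 lines, 5 words and 25 characters, which matches what the counting rules described above predict: 11 + 1 characters for the first line and 12 + 1 for the second.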

printCountLine

The printCountLine function prints out the line, word, and character counts for the file f, in accordance with the option settings.

<<printCountLine>>=
let printCountLine opts (f,ls,ws,cs) =
  if opts.showLines then Printf.printf "\t%d" ls;
  if opts.showWords then Printf.printf "\t%d" ws;
  if opts.showChars then Printf.printf "\t%d" cs;
  Printf.printf "\t%s\n" f
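
To make the output layout concrete, the following sketch builds the same line as a string instead of printing it directly. The helper formatCountLine and the sample values are hypothetical; the opts record is repeated here only so that the fragment is self-contained.

```ocaml
(* Same record as in the option handling section, repeated so that this
   fragment stands on its own. *)
type opts = { showLines : bool; showWords : bool; showChars : bool }

(* Hypothetical helper: same layout as printCountLine, but returns the
   line as a string. Each selected count is preceded by a tab, and the
   name always comes last. *)
let formatCountLine opts (f, ls, ws, cs) =
  let field shown n = if shown then Printf.sprintf "\t%d" n else "" in
  field opts.showLines ls
  ^ field opts.showWords ws
  ^ field opts.showChars cs
  ^ Printf.sprintf "\t%s\n" f

(* Made-up counts, with only the line and character columns selected. *)
let linesAndChars =
  formatCountLine
    { showLines = true; showWords = false; showChars = true }
    ("notes.txt", 3, 7, 42)

let () = print_string linesAndChars
```

With only the line and character options set, the word column is skipped entirely rather than left blank, so the remaining columns shift left.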

Option handling

The handling of command line options makes use of the Arg module.

We first define a new record datatype opts, which contains three boolean fields representing the different options.

<<options type>>=
type opts = { showLines : bool;
              showWords : bool;
              showChars : bool }

Parsing of the command line flags makes use of the Arg.parse function, which unfortunately does not return anything, so the information parsed from the flags will have to be stored by mutating references. In the following we set up the things we need in order to call Arg.parse. We have four references, one boolean for each of the three flags and a list for the list of filenames.

We define a function process_arg, which will be used as the argument to Arg.parse for handling regular (non-flag) arguments; in this case it adds the filename to the list of files.

We also define an option description list. Each element is a tuple consisting of the short (e.g. -l) command line flag for each supported option, a specification for it to set the appropriate reference, and the usage message for each flag. We also define an additional option descriptor to handle the special argument "-", because otherwise Arg.parse will think it is a flag and not handle it correctly.

<<options setup>>=
let lines = ref false
and words = ref false
and chars = ref false
and files = ref [] in
let process_arg file =
  files := file :: !files in
let options =
  [ "-l", Set lines, "show line count";
    "-w", Set words, "show word count";
    "-c", Set chars, "show character count";
    "-", Unit (fun () -> process_arg "-"), "input via stdin" ] in

The parseOpts function is our main option-handling code. After setting up the references and the option list described above, it calls Arg.parse. It then inspects the flags: if all three are false, it sets them all to true, as per the default behavior of wc. Finally, it builds the opts record from the flag values and returns a tuple of that record and the list of files. Note that we reverse the filenames list because it was constructed in reverse (process_arg adds new items to the front of the list).

<<parseOpts>>=
let header = "Usage: wc [OPTION...] [files...]"

let parseOpts () =
  options setup
    parse options process_arg header;
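    (* begin ... end is needed here: without it, only the first
       assignment would be governed by the if, and words and chars
       would always be set to true *)
    if not !lines && not !words && not !chars then begin
      lines := true; words := true; chars := true
    end;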
    { showLines = !lines;
      showWords = !words;
      showChars = !chars }, List.rev !files
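
The behavior of the option handling can be checked without touching the real command line. The sketch below is a hypothetical variant, parseOptsFrom, that takes the argument array explicitly and uses Arg.parse_argv instead of Arg.parse; everything else follows the setup described above. The opts record is repeated so that the fragment is self-contained.

```ocaml
type opts = { showLines : bool; showWords : bool; showChars : bool }

(* Hypothetical variant of parseOpts that parses an explicit argument
   array. As with Sys.argv, element 0 is the program name and is skipped. *)
let parseOptsFrom argv =
  let lines = ref false
  and words = ref false
  and chars = ref false
  and files = ref [] in
  let process_arg file = files := file :: !files in
  let options =
    [ "-l", Arg.Set lines, "show line count";
      "-w", Arg.Set words, "show word count";
      "-c", Arg.Set chars, "show character count";
      "-", Arg.Unit (fun () -> process_arg "-"), "input via stdin" ] in
  (* A fresh ~current keeps repeated calls independent of Arg.current. *)
  Arg.parse_argv ~current:(ref 0) argv options process_arg
    "Usage: wc [OPTION...] [files...]";
  (* No flag given: fall back to showing all three counts. *)
  if not !lines && not !words && not !chars then begin
    lines := true; words := true; chars := true
  end;
  ({ showLines = !lines; showWords = !words; showChars = !chars },
   List.rev !files)

let () =
  let opts, files = parseOptsFrom [| "wc"; "-l"; "a.txt"; "b.txt" |] in
  (* prints: lines=true words=false chars=false files=a.txt,b.txt *)
  Printf.printf "lines=%b words=%b chars=%b files=%s\n"
    opts.showLines opts.showWords opts.showChars (String.concat "," files)
```

Passing the arguments in explicitly is purely a convenience for experimenting; the program itself only needs the command line, for which Arg.parse is sufficient.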

Putting it all together

The remainder of wc.ml is fairly straightforward. It does two things:

  1. Loads the necessary supporting library (the words function uses the Str regular-expression library, which is not loaded by default) and "opens" the Arg module, bringing its names into our namespace (otherwise we would have to prefix a lot of things with Arg.). Note that #load is a toplevel directive, so the file is meant to be run as a script by the ocaml toplevel rather than compiled.
  2. Defines a short "main function" that gathers the command line options, and then applies wc to the list of files provided on the command line.

Note that OCaml doesn't really have an explicit "main function"; it just executes all code at the top level of the source file. One convention is shown below, where we use let () =. This accomplishes two purposes:

  1. It allows us to avoid the ugly ;; terminator syntax, which is usually necessary to separate top-level expressions, but is not required before certain definition constructs, including let.
  2. It pattern-matches the result of the block of code against (), the only value of type unit, so the compiler checks that the block doesn't "return" anything meaningful; any value returned at the top level would be discarded anyway.

<<wc.ml>>=
#load "str.cma"
open Arg
  
options type

counting and printing

parseOpts            

let () =
  let opts, files = parseOpts () in
    wc opts files