How to write an awk script to create a simple tally report

What's with the name?

Unix began soon after the dawn of the computing age. Within its many stories and fables, the legend of awk comes straight from the famous Bell Labs and is named for its inventors, Aho, Weinstein, and Kernighan. (Yes, that Kernighan. The man!! You'd have to know a little something about the history of C to get that reference ...)

Awk is an interpreted language with a focus on text file analysis. Some of its notable strengths are support for regular expressions and associative arrays, similar to its successor Perl. If you have the opportunity to learn more about Perl and awk, you will see how they share similarities with their predecessor, the C programming language. Personally, I learned awk a year or two before beginning with Perl, because I found awk to be a little more approachable.

Within the GNU project, awk was rewritten to be royalty free and renamed to gawk (GNU awk) to distinguish from the original program. With a few exceptions, most awk scripts can be run by gawk and vice versa. For this tutorial, I will ignore the distinction between awk and gawk and simply refer to the interpreter as awk. Let's get started writing this awk script.

Getting started

If you use Mac OS or Linux, you have all you need to get started: awk comes preinstalled. If you're on a Windows computer, head over to Sourceforge and grab the Unxutils download. Once you get that unzipped and installed, it's time to open your favorite text editor and let's get started.

I think of awk as approaching input files with a focus on columns. (You might want to open this reference in a separate window to keep handy as you read the tutorial.) Input files are implicitly opened for reading. The structure of an awk script is

pattern { response }

where pattern can be a regular expression, or one of the keywords, BEGIN or END. As each line is read from the input file, if pattern matches the line, then awk evaluates response. In the screen cap image "Example of awk pattern and response", for every input line containing the string "JDubya" (case sensitive), the entire input string is printed to stdout.

In the awk reference, be sure to dive deeper into the idea of field separator (FS) and record separator (RS). These built-in variables tell the interpreter how to split up input files into lines (or records) and how to split those records into fields (or columns).

Built In variable
Definition
FS
field separator
RS
record separator
NF
number of fields in current record
quick list of definitions so far

Separating into records and fields

I mentioned that I think of awk as having a focus on columns. Consider an example. Leaving FS and RS at their default values (white space and newline, respectively), the following input line will be parsed into ten distinct fields, addressed as $1, $2, $3, ... $10.

Here are ten words separated by whitespace ended by newline

As awk parses this record, it stops at the newline (default value of RS). Fields are the information found in a record between the field separator (default of whitespace, such as the space or tab character). The example above yields field 1 (or $1) as "Here", $2 "are", $3 "ten" ... $9 "by", $10 "newline".

Command line settings, and how to invoke your script

To invoke your first awk script, my-first-script.awk, you need to tell the awk interpreter where to find your script with -f, and where to find your input file. Because we're dealing with Unix, input pipes are welcome. (Windows can handle pipes too, although to a lesser degree.) If you don't use redirection, you can explicitly define an input file as a command line parameter.

awk -f my-first-script.awk some-input-file

The built-in variables FS and RS change the way that awk parses input files. The keyword BEGIN tells awk to interpret the response before evaluating any input records. Built-in variables can be set in the BEGIN stanza, but that can get clunky.

A handy shortcut for setting the FS variable on the command line is to pass in -F"FS" where FS is the desired field separator. One of my favorite scenarios for using awk is to parse through CSVs, or text files of comma-separated-values. Let's say my script is called csv.awk, and my csv is called report.csv, and assume I have not adjusted any built-in variables within csv.awk.

awk -F, -f csv.awk report.csv

By invoking the -F command line switch in this way, I've told awk to interpret the instructions in csv.awk using the comma (,) as a field-separator.

team tally
team tally | Source
Team
Score
Green
313
Blue
436
Red
296
results from team tally

A little more advanced awk

In 2003 I posted to a DHCP mailing list with a link to an awk script for parsing DHCP lease files. (If you look at the archives, you'll see the link to www3. Don't bother trying to click it. That server has been offline for a while.)

In 2005 that same script got published with an update to the DHCP server bundle in the SUSE linux project.

I vaguely remember the conversation with the SUSE project maintainer, but I had completely forgotten about the DHCP mailing list. When I tried to find a link to that 2005 awk script, Google found the mailing list post from 2003. That's a long memory, Google!

I found the script and posted it in a code block at the end of this article. Post a comment if you have any questions about it, or if you're interested in seeing a full hub devoted to how it works.

Generate a simple tally from a CSV

Consider the example CSV in the code section below. Red, Blue, and Green teams are competing to capture points in the North, East, South, and West quadrants. The CSV reports how many points per team per quadrant, but who won?

Look over the data and consider whether the columns are normalized - do they change meaning between records? No. It looks like the first column is the team name, the second column is quadrant, and the third column is the score for that team-quadrant.

Variables in awk don't have to be declared or initialized. Just think up a name and throw it in there. When I tally the score, I don't really care what quadrant they scored in, because I assume that all points are equal and we just want a sum of the total points per team. (See the screen cap with the caption, "team tally".)

Note the pattern I used: NF == 3. By stating the number of fields I expect per record, I tell awk to skip over badly formed records. The statement c[$1] += $3 tells awk to accumulate the current value of column 3 in a slot named for the current value of column 1. For line 1 in the input, that is the same as saying "add 149 to the value stored in slot Red of array c". Since c didn't exist before line 1 in the input, it gets initialized with a value of 149 in a slot named Red. It turns out our scorekeepers were sloppy and gave us repeated entries for each team. We'll just assume they mean for them all to be added together. Line 2 from the input file takes the value of slot Red in array c from 149 to 183.

After all input files are processed, the script moves on to the stanza for keyword END. The for x in y operator creates a variable x to hold the values encountered as it iterates over the slots in y. Likewise, the variable team is created, and as awk walks through the named slots in c, it stores the name of each slot into team. To access the final value of each accumulator, refer to the named slot using c[team].

Try running the script against the sample input, and see if you get the same results I do. (See the table "results from team tally".)

One last thought to leave you with: we haven't yet discussed output field separators, but as it turns out, there is such a thing (OFS) with a default value of whitespace. I haven't encountered a command line option for modifying OFS apart from setting it in the BEGIN stanza. In the print statement, OFS is inserted between any variables that are separated by comma (i.e., print team, c[team] in the example). To print your results as a CSV, modify OFS to comma (,) in the BEGIN stanza.

Sample CSV data

Red,North,149
Red,North,34
Blue,North,154
Green,North,271
Green,North,6
Red,East,38
Blue,East,198
Blue,East,2
Green,East,24
Green,East,9
Red,South,1
Red,South,1
Blue,South,37
Blue,South,44
Green,South,1
Red,West,73
Blue,West,1
Green,West,2

leases.awk

#!/usr/bin/awk -f

# Author: Jeff Wilson
# Date:   2002
# License: GPL 3.0 ... no warranty, free to reuse in any way

# Expected input: /var/lib/dhcp/db/dhcpd.leases

# (For best results, pre-process the lease 
#  file with 'grep -v "uid \""')

# Usually invoked as, 'leases.awk /var/lib/dhcp/db/dhcpd.leases'

# Format of output (tab-delimited): 
#    ip,hardware[ip],compname[ip],state[ip],expiration-time-in-GMT[ip]

# set the RECORD SEPARATOR, RS, to "}" ... records span multiple lines
BEGIN {RS="}"}

# we only care about records that are greater than so-many-characters
# (why 5? I guess a CRLF may be 2 bytes ... isn't it just LF in BSD?)
length($0) > 5 { total++

  # only want record those variables we've captured,
  # so reset values to null
  endtime = ""
  hwaddr = ""
  cn = ""
  st = ""

  for(i=1;i<=NF;i++) {


    # if this field matches the word "lease"
    if($i ~ /lease/)

      # capture the next field into ipaddr
      ipaddr=$(i+1)

    # if this field matches the word "ethernet"
    else if($i ~ /ethernet/) {

      # get rid of the trailing semi-colon
      split($(i+1),arr,";")

      # and capture the hwaddr with ipaddr as key
      hwaddr=arr[1]
    }

    # if this field matches the word "client-hostname"
    else if($i ~ /client-hostname/) {

      # get rid of the enclosing quotes
      split($(i+1),arr,"\"")

      # capture the computer name
      cn=arr[2]
    }

    # if this field matches the word "binding"
    else if($i ~ /binding/) {

      # we don't care about what the next binding state is
      # so go on and process the rest of this record
      if($(i-1) ~ /next/) { # do nothing 
      }
      else {
        split($(i+2),arr,";")
        st=arr[1]
      }
    }

    # give me a timestamp or two
    else if($i ~ /ends/) {

      #if $i == "ends" then $i+2 is enddate, and $i+3 is endtime
      dtstmp = $(i+2);
      split($(i+3),arr,";")
      tmstmp = arr[1];
      endtime=sprintf("%s %s",dtstmp,tmstmp)
    }
  }
  if( length(hwaddr) > 0 )
    hardware[ipaddr]=hwaddr
  else
    hardward[ipaddr]="NONE"
  if( length(cn) > 0 )
    compname[ipaddr]=cn
  else
    compname[ipaddr]="NONE"
  if( length(st) > 0 )
    state[ipaddr]=st
  else
    state[ipaddr]="NONE"
  if( length(endtime) > 0 )
    mytime[ipaddr]=endtime
  else
    mytime[ipaddr]="NONE"
}

# for every ipaddr we captured, display ip, hardware, and compname
END { for(ip in hardware) {
       if(length(IP_ONLY)>0)
         print ip
       else
         printf("%s\t%s\t%s\t%s\t%s\n",\
                 ip,hardware[ip],compname[ip],state[ip],mytime[ip])
      }
}

More by this Author


Comments

No comments yet.

    Sign in or sign up and post using a HubPages Network account.

    0 of 8192 characters used
    Post Comment

    No HTML is allowed in comments, but URLs will be hyperlinked. Comments are not for promoting your articles or other sites.


    Click to Rate This Article
    working