Effective Perl Programming

Analysis Without Paralysis
by Joseph N. Hall
Joseph N. Hall is the author of Effective Perl Programming (Addison-Wesley, 1998). He teaches Perl classes, consults, and plays a lot of golf in his spare time.
Processing One Line at a Time

Many log files are organized so that each line is a separate "record" in the log. Generally, you want to process this type of file one line at a time. The idiom for this in Perl is the ubiquitous:
    open FILEHANDLE, "/my/file" or die "couldn't open: $!";
    while (<FILEHANDLE>) {
        # do something with the contents of $_
    }
    close(FILEHANDLE);

The while (<FILEHANDLE>) loop is a shorthand way of writing:
    while (defined($_ = <FILEHANDLE>)) {
        # do something with the contents of $_
    }

Both these snippets read a line at a time into $_ from the file opened as FILEHANDLE. Inside the while loop, you put whatever code is necessary to process a line of the file. For example, to print all the lines containing the word 5sigma, you could write:
    while (<FILEHANDLE>) {
        print if /\b5sigma\b/;   # print and // both default to $_
    }

You might choose to extract information during the loop and then print it out in some other form after the file has been completely read. Often, you will want to read data into a hash as part of this process. For example, to parse the passwd file and create hashes that map user names to user ids and vice versa -- a bit of makework, mind you, because this capability already exists in the built-in getpwnam and getpwuid operators -- you might write:
    open PASSWD, "/etc/passwd" or die "couldn't open passwd: $!";
    while (<PASSWD>) {
        chomp;                                   # strip the trailing newline
        my ($name, $dummy, $uid) = split /:/;    # split defaults to $_
        $uid{$name} = $uid;                      # add a new name/uid to %uid
        $name{$uid} = $name;                     # add a new uid/name to %name
    }
    close(PASSWD);
    for (sort keys %uid) { print "uid for $_ is $uid{$_}\n" }
    for (sort {$a <=> $b} keys %name) { print "name for $_ is $name{$_}\n" }
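If you want to see exactly what the split inside the loop does, you can try it on a single passwd-style line. This is a standalone sketch with a made-up entry, not part of the program above:

```perl
# Standalone sketch (invented passwd entry): split /:/ breaks the
# record into its colon-separated fields; we keep fields 1 and 3.
my $line = "joe:x:1001:100:Joe Bloe:/home/joe:/bin/sh";
my ($name, $dummy, $uid) = split /:/, $line;
print "$name => $uid\n";   # prints "joe => 1001"
```

Note that split takes an explicit second argument here because we are splitting $line rather than $_.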
Note that I am spelling foreach as for here; the foreach and for keywords are interchangeable in Perl. The split operator breaks each line of the password file into its constituent fields. We assign the first and third fields to $name and $uid, respectively, then use those values to create hashes. (Note that there is no conflict between the scalar variables $name and $uid and the hashes %name and %uid -- they are independent.) The last two lines print out the contents of the two hashes. Because the keys of %name are numeric user ids, they must be sorted in numeric order rather than the default "ASCIIbetical" (character-by-character) order; thus the sort block {$a <=> $b}.

Reading Multi-Line Records

You may occasionally encounter text files where records occupy several lines and are set off from one another by delimiting lines. Perl's scalar .. operator, also known as the "flip-flop" operator, is sometimes helpful in dealing with this type of file. Suppose, for example, that you are parsing a file consisting of records that look like the following:
    begin user joebloe
    name: Joseph N. Hall
    phone: 555-1212
    email: [email protected]
    end user

The following code will scan input one line at a time and print out only the record(s) for the user joebloe:
    while (<>) {   # read from standard input or files in @ARGV
        print if /^begin\s+user\s+joebloe\b/ .. /^end\s+user/;
    }

The flip-flop operator works by maintaining a "state" that is either true or false. Each flip-flop operator in a program has its own state. The flip-flop operator starts out yielding false, and first yields true when the lefthand expression evaluates to true. It then continues to yield true until the righthand expression evaluates to true, after which it yields false again until the lefthand expression next matches. It's a slightly obscure feature of Perl, but, as you can see, when it's right for the job it can yield very succinct programs.

Reading a File All at Once

Perl programmers tend to read files one line at a time -- Perl has a lot of features that work well on "line at a time" input, and if lines have a known maximum length, you can be assured that a program reading one line at a time can handle a file of any length. However, sometimes you may want to read the entire contents of a file all at once -- to do some multi-line pattern matching, or for efficiency, or "just because." The customary way to read all of a file is to clear the line separator variable $/. If $/ has the value undef, the line input operator <> will read the entire contents of input into a scalar rather than a single line from it. Here is an example where we read the password file all at once and create a hash of the names and user ids in one fell swoop:
    {
        open PASSWD, "/etc/passwd" or die "couldn't open passwd: $!";
        local $/;   # undefs $/ until the end of this block
        %uid = (<PASSWD> =~ /^(.*?):.*?:(.*?):/mg);   # all at once!
        close(PASSWD);
    }
    for (sort keys %uid) { print "uid for $_ is $uid{$_}\n" }

Note that $/ is a global variable, not a per-filehandle one, so there is no need to select PASSWD before changing it; localizing $/ inside the block is enough to confine the change. With $/ undefined, <PASSWD> reads the entire contents of the password file, and the /mg match in list context returns a list of name/uid pairs, which the assignment turns into the hash %uid.

Searching Simultaneously for Multiple Patterns

Sometimes you will want to search a file for lines matching one of several patterns. Certainly, you could write something like:
    while (<FILEHANDLE>) {
        print if /\bjoseph\b/i or /\bhall\b/i;
    }

You can interpolate variables into match operators if you want to specify patterns at runtime:
    ($pat1, $pat2) = qw((?i)\bjoseph\b (?i)\bhall\b);
    while (<FILEHANDLE>) {
        print if /$pat1/ or /$pat2/;   # (?i) gives case-insensitivity
    }

You have to be concerned about a couple of things when interpolating variables into match operators. First, the variables must contain legal regular-expression syntax. For example, if $pat1 in the example above contains :-), a fatal error will occur at runtime because /:-)/ is not a legal regular expression. (The quotemeta operator can be helpful in these cases -- see the perlfunc man page.) Second, when a match operator contains variables, the regular expression is recompiled each time that the match operator is used, generally resulting in slower performance. The /o ("compile once") option causes a regular expression containing variables to be compiled only once:
    ($pat1, $pat2) = qw((?i)\bjoseph\b (?i)\bhall\b);
    while (<FILEHANDLE>) {
        print if /$pat1/o or /$pat2/o;
    }

To get this to work with arbitrary lists of patterns, though, you need to resort to some trickery. The usual method is to use a string eval returning an anonymous subroutine in combination with a /o match operator. This makes it possible to construct a list of anonymous subroutines, each of which searches its argument for a particular pattern:
    @pats = qw((?i)\bjoseph\b (?i)\bnathan\b (?i)\bhall\b);
    @search = map {
        eval q{
            my $pat = $_;
            sub { $_[0] =~ /$pat/o }
        }
    } @pats;
    while (defined($line = <FILEHANDLE>)) {
        for (@search) {
            if ($_->($line)) { $count++; last }
        }
    }
    print "matches = $count\n";

You could also construct a single pattern that matches an alternation of the original list of patterns. That might appear to be more efficient at first, but in my benchmarks it doesn't seem to make a large difference. If you are using Perl 5.005, an alternative (and more readily comprehensible) means of interpolating regular expressions is available through the qr (quote regex) operator. When 5.005 is widely adopted, qr will become the most appropriate mechanism for engineering solutions to this type of problem.

Reading Data into Nested Structures

You can handle some common tasks by reading data into one or two ordinary hashes, but for more complex analysis tasks you may need to use nested hashes and/or arrays. In order to work with nested data structures, you will need an understanding of reference syntax (too complicated to cover here, sorry!). You should also understand auto-vivification in Perl. Auto-vivification is a mechanism by which structures linked by references are created automatically. To illustrate, let's suppose that the variable $stats has the value undef. Now, consider the following line of Perl:

    $stats->{$host} = {Bytes => $bytes};

We are using $stats like a hash reference. Even though $stats is undefined, when we assign a value to $stats->{$host}, Perl will automatically create the underlying hash and assign a reference to it to $stats. Now, $stats->{$host}{Bytes} will return whatever the value of $bytes was. Auto-vivification also works for arbitrarily deeply nested structures. We could have written the above as:

    $stats->{$host}{Bytes} = $bytes;

And, in fact, that's the idiomatic way to do it in Perl. Nested structures are useful when you must summarize or reorganize the data in a file.
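Auto-vivification is easy to watch in isolation. In this small sketch (the variable names follow the discussion above, but the values are invented), a single nested assignment turns an undefined scalar into a reference to a hash of hashes:

```perl
# Auto-vivification sketch: $stats starts out undefined; assigning
# through it conjures the intermediate hash references automatically.
my $stats;
my ($host, $bytes) = ('foo.bar', 5000);   # invented sample values
$stats->{$host}{Bytes} = $bytes;

print ref($stats), "\n";             # prints "HASH"
print $stats->{$host}{Bytes}, "\n";  # prints "5000"
```

The assignment itself creates both the outer hash (keyed by host) and the inner hash (keyed by Bytes); nothing was allocated by hand.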
As an example, let's look at analyzing httpd logs in Common Log Format (CLF). Let's create a list of all the different hosts that connected to the Web server on each day, and print total bytes for each:
    my $log = "access_log";
    open LOG, $log or die "Couldn't open $log: $!";
    my %bytes;
    while (<LOG>) {
        # split line into various fields
        my ($host, $date, $request, $status, $bytes) =
            /(\S+).*?\[([^:]+).*?\]\s+"(.*?)"\s+(\S+)\s+(\S+)/;
        # truncate host name to domain.domain if necessary
        ($host) = ($host =~ /([^.\n]+(?:\.[^.\n]+)?)$/) if $host =~ /[a-z]/i;
        next if $bytes =~ /\D/;   # skip if $bytes non-numeric, e.g. '-'
        $bytes{$date}{$host} += $bytes;
    }
    for my $date (sort keys %bytes) {
        print "$date:\n";
        for my $host (sort { $bytes{$date}{$b} <=> $bytes{$date}{$a} }
                      keys %{$bytes{$date}}) {
            print "  $host: $bytes{$date}{$host} bytes\n";
        }
    }

The first part of this program (the while loop) reads in the log file a line at a time, extracting the various interesting parts of each line. (We aren't using $status or $request here, but I left them in for clarity.) The hostname is cleaned up, and lines where no bytes were transferred are ignored; then the number of bytes is added to an "accumulator" in a nested hash. A transfer of 5,000 bytes on 02/Jan/1999 from a host named foo.bar would be added like this:

    $bytes{"02/Jan/1999"}{"foo.bar"} += 5000;

Auto-vivification will create the appropriate underlying hashes and references anew if there is no existing entry for that date and/or host. The second part of the program sorts and prints out the dates and hostnames in a useful format, ordered first by date (alphabetically, for simplicity's sake) and then in descending order by number of bytes transferred. I'll finish with one more example. This time, let's look through the log and print out stats for the five largest transfers:
    my $log = "/etc/httpd/logs/access_log";
    open LOG, $log or die "Couldn't open $log: $!";
    # initialize so -w is happy
    my @largest = map { +{ Bytes => 0 } } 1..5;
    while (<LOG>) {
        # split line into various fields
        my ($host, $time, $request, $status, $bytes) =
            /(\S+).*?\[(.*?)\]\s+"(.*?)"\s+(\S+)\s+(\S+)/;
        # truncate host name to domain.domain if necessary
        ($host) = ($host =~ /([^.\n]+(?:\.[^.\n]+)?)$/);
        next if $bytes =~ /\D/;   # skip if $bytes non-numeric, e.g. '-'
        # compare against the smallest of the five; re-sort if changed
        if ($largest[-1]{Bytes} <= $bytes) {
            @largest = sort { $b->{Bytes} <=> $a->{Bytes} }
                @largest[0..3],
                { Host => $host, Time => $time,
                  Request => $request, Bytes => $bytes };
        }
    }
    for (@largest) {
        print "$_->{Host}: $_->{Bytes} bytes on $_->{Time}",
              " for request $_->{Request}\n";
    }

In this program we're using nested structures to keep track of information about a list of the largest transfers found so far. $largest[0] is a reference to a hash containing information (host, time, request, bytes) about the largest transfer seen so far, $largest[1] contains information about the second-largest one seen so far, and so on. Whenever a transfer at least as large as the smallest entry in the list is encountered, it displaces that entry and the list is re-sorted. (Note that the comparison must be against $largest[-1], the smallest of the five, not $largest[0], the largest.) Both of these programs run reasonably quickly -- under a minute on 20MB log files on an older Sparc 20.

Summary

Perl is a powerful tool for analyzing and summarizing log files and other types of text databases. I've tried to show a few simple examples as well as some meatier ones. Of course, you don't always have to construct your own analysis code from scratch. There are CPAN modules that will help you analyze Web and other logs, so if you have a more complex analysis task, be sure to check there to see whether your problem might already be partially or completely solved for you.
Last changed: 15 Apr. 1999 jr