USENIX ;login: - the dark side of regular expressions

the dark side of regular expressions

hume_andrew
by Andrew Hume
President, USENIX Board of Directors
<[email protected]>

Recently, I was asked to look into a problem with a ksh script; it seemed to be hanging. The script looked innocuous, so I did the normal tracing thing via ksh -x. It seem to hang in a pipeline involving a sed command that was essentially doing a basename of the fourth field. The input file was all one line; the fields are separated by blanks. It looked like this:

systemx=; sed 1q $input
sat1a cserver
/gecko/rcv/compressed/IBMZZZZZ.YYYXXXX.IBL94.AOHOLD.G1589V00.20 30.19981119122414.mpd_emc.gz
/gecko/rcv/cn/IBMZZZZZ.YYYXXXX.IBL94.AOHOLD.G1589V00.2030.19981 119122414.mpd_emc.gz
mpd_emc 149266637 1181 Nov 19 12:24

So first I verified that the command was not hanging, but just taking a very long time. This also could give a baseline to evaluate any improvements we might make.

systemx=; sed 1000q $input | ptime sed 's:.*/$[^ ]*$.*:\1:' > /dev/null
real 2:17.423
user 2:13.372
sys 0.071

This is grisly. (The file was much larger than 1000 lines!) To reassure myself that it was a regular-expression problem, and not data related, I used nawk to filter out the fourth field.

systemx=; sed 1000q $input | ptime nawk '{print $4}' > /dev/null
real 0.079
user 0.061
sys 0.014

As we suspected, it wasn't data related. Let's try sed on the filtered text:

systemx=; sed 1000q $input | nawk '{print $4}' | ptime sed 's:.*/$[^ ]*$.*:\1:' > /dev/null
real 37.815
user 36.842
sys 0.057

Hmmm, a 4x improvement for processing 2.7x less data. This smells nonlinear. We can take advantage of the filtering to use a simpler pattern:

systemx=; sed 1000q $input | nawk '{print $4}' | ptime sed 's:.*/::' > /dev/null
real 0.165
user 0.147
sys 0.014

Yup, it looks like the backreferencing was the culprit all along. In general, backreferencing can take exponential time, but we rarely see such behavior. This time I guess we were lucky. Of course, now the pattern is so simple that we may as well do it all in nawk:

systemx=; sed 1000q $input | ptime nawk '{sub(".*/", "", $4); print $4}' > /dev/null
real 0.099
user 0.077
sys 0.014

Overall, the CPU (user) time is about 1700x faster, and as they say in the performance business, you eventually will notice factors of 1700. Ongoing, this change saved about six hours of CPU time per day. A good return on 10 minutes of real thought.