the dark side of regular expressions
systemx=; sed 1q $input
So first I verified that the command was not hanging, but just taking a very long time. This also gave a baseline against which to evaluate any improvements we might make.
systemx=; sed 1000q $input | ptime sed 's:.*/\([^ ]*\).*:\1:' > /dev/null
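For reference, here is what that substitution does to a single hypothetical line shaped like the log data (the real `$input` is not shown in the article, so the sample line below is invented):

```shell
# Hypothetical sample line; the actual $input data is not shown.
line='01:02:03 host GET /export/home/jr/bin/foo 200'
printf '%s\n' "$line" | sed 's:.*/\([^ ]*\).*:\1:'
# .*/ greedily matches through the LAST '/', \([^ ]*\) captures the
# basename, .* eats the rest; the replacement \1 keeps only the capture.
# prints: foo
```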
The timing was grisly. (The file was much larger than 1000 lines!) To reassure myself that this was a regular-expression problem, and not something in the data, I used nawk to extract just the fourth field.
systemx=; sed 1000q $input | ptime nawk '{print $4}' > /dev/null
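On the same hypothetical sample line, the filter simply prints the fourth whitespace-separated field (plain awk is used below; nawk is the Solaris name for the same tool):

```shell
# Hypothetical sample line; the actual $input data is not shown.
line='01:02:03 host GET /export/home/jr/bin/foo 200'
printf '%s\n' "$line" | awk '{print $4}'
# prints: /export/home/jr/bin/foo
```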
As we suspected, it wasn't data related. Let's try sed on the filtered text:
systemx=; sed 1000q $input | nawk '{print $4}' | ptime sed 's:.*/\([^ ]*\).*:\1:' > /dev/null
Hmmm, a 4x improvement for processing 2.7x less data. This smells nonlinear. We can take advantage of the filtering to use a simpler pattern:
systemx=; sed 1000q $input | nawk '{print $4}' | ptime sed 's:.*/::' > /dev/null
Yup, it looks like the backreferencing was the culprit all along. In general, backreferencing can take exponential time, but we rarely see such behavior. This time I guess we were lucky. Of course, now the pattern is so simple that we may as well do it all in nawk:
systemx=; sed 1000q $input | ptime nawk '{sub(".*/", "", $4); print $4}' > /dev/null
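As a quick sanity check before swapping one form in for another, all three variants agree on a hypothetical field value (the path below is invented; awk's sub() is applied to $1 here because the sample holds a single field):

```shell
# Hypothetical field value extracted from the log data.
field='/export/home/jr/bin/foo'
printf '%s\n' "$field" | sed 's:.*/\([^ ]*\).*:\1:'            # original capture-group form
printf '%s\n' "$field" | sed 's:.*/::'                         # simpler form: delete up to last '/'
printf '%s\n' "$field" | awk '{sub(".*/", "", $1); print $1}'  # awk sub() on the lone field
# each prints: foo
```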
Overall, the CPU (user) time is about 1700x faster, and as they say in the performance business, you will eventually notice factors of 1700. Ongoing, this change saved about six hours of CPU time per day. A good return on 10 minutes of real thought.
First posted: 14 Apr. 1999 jr. Last changed: 14 Apr. 1999 jr.