the tclsh spot
by Clif Flynt
Clif Flynt has been a professional programmer for almost twenty years, and a Tcl advocate for the past four. He consults on Tcl/Tk and Internet applications.
The last Tclsh Spot article showed how to write a simple HTML robot that gets the current price of a stock using the HTTP package and some simple regular expressions. The final code looked like this:
package require http
foreach symbol $argv {
set id [::http::geturl $url]
This is useful information, but there is a lot more info in the report from newsalert.com that it would be nice to get, and the regexp command seems like a good tool to use. The interesting part of the page from newsalert.com looks like this:
<td align="left" class="symprice">
<a href="/bin/charts?Symbol=SUNW">142 </a>
The obvious pattern is that all the interesting data sits between a > and a < symbol after the charts?Symbol= string. A simple brute-force approach would be to match each < stuff > Data < stuff > pattern with a regular expression something like this:

<[^>]+>([^<]+)<[^>]+>

This regular expression matches a < symbol, any other characters except a greater-than symbol up to the first > symbol (the <td align left . . . > tag), then any characters except a less-than symbol (the data) until the next <, and any characters except a greater-than symbol until the next > (the </td> tag). This style of regular expression is familiar to most folks who have used regular expressions with sed or grep, and it will work with all revisions of Tcl.

The regular-expression code was rewritten for Tcl 8.1. Among other improvements (such as support for Unicode), a question mark after a quantifier symbol (the + or *) changes the matching behavior from using the maximum number of characters that match the expression to using the minimum number of characters that match it. This lets us simplify the previous pattern to this:

<.+?>(.+?)<.+?>

This regular expression matches a < symbol, any other characters up to the first > symbol, then any characters up to the next <, and any characters until the next >. We can concatenate as many copies of this pattern as we need to collect the change, percentage change, high, low, etc. This is conceptually simple but creates a long, incomprehensible regular expression.

Another new feature in the Tcl 8.1 and 8.2 interpreters is the -expanded flag. The -expanded flag causes the regular-expression parser to ignore whitespace and comments. Thus, instead of an ugly, long regular expression, we can write
regexp -expanded {arts\?Symbol=(.+?)"> # Get the symbol
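To see the pieces discussed so far side by side, here is a small self-contained sketch. The sample row is made up for illustration (the real newsalert.com markup may differ), and the variable names are arbitrary. It compares greedy and non-greedy matching, checks that the brute-force and non-greedy patterns capture the same data, and then writes the non-greedy pattern with -expanded.

```tcl
# Made-up sample row; the real page markup may differ.
set html {<td class="symprice">142</td><td class="change">+3</td>}

# Greedy: .+ takes as many characters as it can, so the match
# runs from the first < to the last > in the string.
regexp {<.+>} $html greedy

# Non-greedy: .+? takes as few characters as it can, so the match
# stops at the first > that follows.
regexp {<.+?>} $html lazy

# The brute-force pattern and the non-greedy pattern capture the same data.
regexp {<[^>]+>([^<]+)<[^>]+>} $html all old
regexp {<.+?>(.+?)<.+?>} $html all new

# The same non-greedy pattern, spread out and commented with -expanded.
regexp -expanded {
    <.+?>     # opening tag
    (.+?)     # the data we want
    <.+?>     # closing tag
} $html all price

puts "greedy:      $greedy"
puts "non-greedy:  $lazy"
puts "brute force: $old"
puts "non-greedy:  $new"
puts "expanded:    $price"
```

With a short pattern the expanded form is overkill, but it shows that the whitespace and # comments inside the braces are ignored by the parser.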
This is readable, but still longer than seems necessary. Tcl regular expressions also let us declare how many times an atom repeats. An atom can be a single character or a regular expression enclosed in parentheses. The number of times the atom must match is declared by following the atom with a value inside curly braces.
{val} Match the pattern exactly val times.
This lets us shorten the regular expression by removing the repeated pattern of a tag followed by unnecessary characters.
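As a rough sketch of the idea, using a made-up three-cell row rather than the real page, a repeat count of 2 skips the first two cells so that the next cell can be captured. The variable names here are only for illustration.

```tcl
# Made-up row with three cells: symbol, price, change.
set html {<td>SUNW</td><td>142</td><td>+3</td>}

# The parenthesized atom (one "tag, data, tag" cell) is matched exactly
# twice, then the third cell's data is captured.  Because the repeated
# atom is in capturing parentheses, it needs a variable of its own.
regexp {(<.+?>.+?<.+?>){2}<.+?>(.+?)<.+?>} $html all dummy change

puts "change: $change"
```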
The dummy variable is there to catch the part of the string that's matched by the duplicated regular expression. We won't be using that data, so there is really no need to collect it. A parenthesized expression can be made "non-capturing" by following the left parenthesis with a question mark and a colon, that is (?:, before continuing with the regular expression.
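A minimal sketch of the non-capturing form, again on a made-up three-cell row: with (?: in place of the plain left parenthesis, the repeated atom no longer fills a variable, so the dummy variable goes away.

```tcl
# Made-up row with three cells: symbol, price, change.
set html {<td>SUNW</td><td>142</td><td>+3</td>}

# (?: ... ){2} still skips two cells, but captures nothing, so the
# only variables after the full match are for the data we want.
regexp {(?:<.+?>.+?<.+?>){2}<.+?>(.+?)<.+?>} $html all change

puts "change: $change"
```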
This technique makes a fairly comprehensible regular expression, but we've only gotten as far as the price change. The code doesn't handle high, low, date, volume, etc. More fields can be added to the regular expression by just extending this pattern, but that will get long. We can use the match count descriptor to extract successive fields from a regular expression into variables by using a loop like this:
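A minimal sketch of such a loop follows. The sample row, the field names, and the quote array are made up for illustration, and the first cell (the symbol) is simply skipped rather than matched against the charts?Symbol= string.

```tcl
# Made-up row: symbol, price, change, percentage change.
set html {<td>SUNW</td><td>142</td><td>+3</td><td>2.1%</td>}

# On each pass, skip $i complete cells and capture the next one.
set i 1
foreach name {price change percent} {
    # Double quotes let $i be substituted into the repeat count;
    # the braces around it are passed through to the regexp parser.
    set pattern "(?:<.+?>.+?<.+?>){$i}<.+?>(.+?)<.+?>"
    regexp $pattern $html all quote($name)
    incr i
}

foreach name {price change percent} {
    puts "$name: $quote($name)"
}
```

Note that every pass starts matching from the beginning of the string again.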
This is fairly short and uses some nice new features of the regexp command. The only problem is that it parses the full HTML page to extract each and every item. That's only eight passes, but still . . .

The data that we want to get from this page are the only characters not inside a tag. If we just strip the tag info away from the text, we'll be left with the data we want. There are other Tcl commands that use the regular-expression engine. One of these is regsub.
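As a quick sketch of regsub on its own, here every tag in a made-up row is replaced with a single space. The row and the variable names are assumptions for the example, not taken from the real page.

```tcl
# Made-up row with three cells.
set row {<td>SUNW</td><td>142</td><td>+3</td>}

# -all replaces every match, not just the first.  With a variable name
# as the last argument, regsub stores the new string there and returns
# the number of substitutions it made.
set count [regsub -all {<.+?>} $row { } text]

puts "$count tags removed"
puts "left over: [string trim $text]"
```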
In this case, we can use the regexp command to strip away most of the HTML page, leaving us with several lines of HTML data that describe a single row in the table. Our script can use the regsub command to strip the tag information out of that string, and finally convert the multiple lines of data into a list with the split command.
regexp -expanded "arts.Symbol=(.+?)\"> # Get the symbol
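The three steps described above (regexp to isolate the row, regsub to strip the tags, split to build a list) can be sketched end to end like this. The page fragment is made up and much smaller than the real one, and the assumption that each cell sits on its own line is mine.

```tcl
# Made-up page fragment; the real page is larger and may be laid out differently.
set page {<table>
<tr><td><a href="/bin/charts?Symbol=SUNW">SUNW</a></td>
<td>142</td>
<td>+3</td>
</tr>
</table>}

# 1. regexp: keep the symbol and the rest of that table row.
#    A dot matches a newline by default, so the row can span lines.
regexp {arts\?Symbol=(.+?)">(.+?)</tr>} $page all symbol row

# 2. regsub: delete every tag, leaving one data item per line.
regsub -all {<.+?>} $row {} text

# 3. split: turn the remaining lines into a Tcl list.
set fields [split [string trim $text] \n]

puts "symbol: $symbol"
puts "fields: $fields"
```

The page is scanned once by regexp, and the later steps only touch the short row string.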
This is a fairly small set of code for parsing the information out of these HTML pages. A robot using this parsing code would resemble this:
package require http
::http::config -proxyhost 56.0.0.2 -proxyport 8000
puts [format \
foreach symbol $argv {
set id [::http::geturl $url]
regexp -expanded "arts.Symbol=(.+?)\"> # Get the symbol
puts [eval format \
This robot will generate output resembling:
Again, this robot can be put into a crontab entry to put stock quotes into your mailbox.
But now that I can collect daily quotes, I want to analyze and view the
data. The next article will discuss saving the data and using the BLT
graph widget to view it.
Last changed: 24 Jul. 2000 mc