USENIX ;login: - Programming

effective perl programming

Perl and SQL Databases: A Tasty TiDBIt

by Joseph N. Hall

Joseph N. Hall is the author of Effective Perl Programming (Addison-Wesley, 1998). He teaches Perl classes, consults, and plays a lot of golf in his spare time.

For years, programmers have used text files as databases. UNIX is rife with examples: the passwd and group files, for instance, as well as many others in /etc. Text files work well for small amounts of data — a dozen rows or perhaps a hundred or more — but become cumbersome at larger sizes. They're slow to access, they can't be written simultaneously by multiple users, and they're tedious to edit. The lack of any inherent structuring in their contents also limits the usefulness of text files.

If you're keeping a database in text files, and things aren't working out, the obvious alternative is a "real" database, which nowadays means an SQL database. (There are a few intermediate alternatives, like DBM files, but not many problems fit that niche.) However, in years past, an SQL database wasn't an attractive solution for an everyday problem. Database servers were expensive and not really designed with small- to medium-sized chores in mind.

But all this has changed! If you are working on a standard UNIX platform, you can build and install any one of several open-source SQL database servers in an hour or two. Even better, you can talk to it directly with Perl through a straightforward "DBI" interface. Nowadays, using SQL databases from within your Perl scripts isn't just possible — it's a good idea.

DBI and DBD
The DBI module is a "database-independent interface" to many different SQL-based databases. Mainstream commercial products like Oracle, Informix, and Sybase are well supported. However, for the purpose of this column I'm going to focus on MySQL, which is a well-known, high-performance, open-source alternative.

Each different database has its own DBD (Database Driver) module. Oracle has DBD::Oracle, Sybase has DBD::Sybase, MySQL has DBD::mysql, and so on. Each DBD provides an interface between the corresponding database client library and the database-independent DBI module.

We'll use two different DBDs in the examples that follow: DBD::mysql and a simpler alternative called DBD::CSV, which applies the DBI interface to text files in CSV (comma-separated values) format. The examples assume that you know some basic SQL. If you don't, there are many good books on the topic; one of my favorite introductory texts is The LAN Times Guide to SQL. You can also find SQL tutorials on the Internet.

Installing MySQL
Obtain a copy of the MySQL source tarball from one of the mirrors pointed to by <http://www.mysql.com>. Unpack it, then run configure and make as directed in the install instructions. You may want to install it underneath your home directory using the --prefix option to configure. You could also skip the build process (it takes half an hour or so on a moderately fast single-user machine) and use a binary tarball instead. Either way, when you have it built and installed, you have to initialize the grant tables with the mysql_install_db command:

   % scripts/mysql_install_db

You can now start the MySQL server. Because we're just playing around with it, let's run it on a different port and socket for now:

   % setenv MYSQL_TCP_PORT 4001
   % setenv MYSQL_UNIX_PORT /tmp/mysql.login.sock
   % scripts/safe_mysqld &

With the server running, create a database called "test_foo," which we'll use later:

   % bin/mysqladmin -p create test_foo
   Database "test_foo" created.

You can see how the server is doing by running the MySQL client. (If you do this later from a new shell, you'll need to set the same environment variables first.) Try the status command:

   % bin/mysql
   Welcome to the MySQL monitor. Commands end with ; or \g.
   Your MySQL connection id is 12 to server version: 3.22.27
   ...
   mysql> status
   - - - - - - -
   bin/mysql Ver 9.36 Distrib 3.22.27, for sun-solaris2.7 (sparc)
   ...

That's all you have to do for now. This installation has no security, but that's a problem you can resolve later if you decide to use MySQL for real.

Installing DBI
The following (abbreviated) instructions assume you have full administrative control over a Perl installation. It doesn't have to be your machine's "main" installation. If you want, build Perl in your home directory or some other convenient location before proceeding.

First, fire up the CPAN shell from the appropriate copy of Perl:

   % /wherever/my/binary/is/perl -MCPAN -e shell

You may have to configure CPAN if this is your first time using it. Make your life easier by setting the prerequisites_policy config variable to "follow". Once in the CPAN shell, verify that you can find the DBI module:

   cpan> i DBI
   Bundle Bundle::DBI (TIMB/DBI-1.13.tar.gz)
   Module DBI (TIMB/DBI-1.13.tar.gz)

If so, go ahead and build and install it:

   cpan> make DBI
   ... output omitted ...
   cpan> install DBI
   ... output omitted ...

So far, so good. Now, install some DBDs. Start with DBD::CSV, installing its prerequisite, the Text::CSV_XS module, first.

   cpan> install Text::CSV_XS
   cpan> install Bundle::DBD::CSV

Even if you haven't got MySQL working, you'll be able to get a feel for DBI with DBD::CSV. Speaking of which, to build the MySQL DBD:

   cpan> make Bundle::DBD::mysql

When you are asked which database to install support for, answer "1" for MySQL only (unless you also happen to have mSQL installed). When asked for the host and port, if you are running MySQL on an alternative port as suggested above, respond with "localhost:4001" (or whatever value you used). Assuming the make went smoothly, test and install the MySQL DBD:

   cpan> install Bundle::DBD::mysql

NOTE: The DBD bundles reinstall DBI, at least in some cases. This is normal, if seemingly boneheaded.

So It's Installed — Now What?
Let's use DBI as the basis for a simple mail-filtering application. Our eventual goal will be to create a program that parses a mail message and returns a zero exit status if the message is known to come from an "approved" address, or a nonzero status otherwise. A program like this can be used by a delivery agent to accept or bounce incoming email, or at the least to divert "unapproved" messages into a different folder. We'll determine whether an originating address is approved by looking up the sender's host in a database.

We'll start with a very simple schema consisting of a single column, HOST, containing approved host names. To make this even simpler, let's start with DBD::CSV. Here is a Perl program that will "connect" to the "database" (really it's just a bunch of files) and create a table for us:

   use DBI;
   my $dbh = DBI->connect("DBI:CSV:f_dir=csv")
    or die "couldn't connect";
   $dbh->do(q(
    CREATE TABLE APPROVED_HOST (
     HOST CHAR(128)
    )
   )) or die;

The argument to the connect method is the "data source" (DSN) string. This tells DBI which driver to use (CSV in this case). It also supplies additional arguments that are passed into the driver itself. In this example, we've supplied the argument f_dir=csv, which instructs the CSV driver to create its text files in the subdirectory csv. If connect fails, it will return false, and we die because there is no particular point in continuing. The connect method returns a database handle, which we store in the variable $dbh. Database handles represent active connections.

The do method is one of several ways of executing SQL statements. It takes a string and passes it to the driver for execution. Again, it returns true or false indicating success or failure, respectively. Note that we've quoted the argument to do with the generalized single quote syntax q() — this isn't strictly necessary, but it makes the code easier to read.

After this program runs, the csv directory will contain a file named APPROVED_HOST, after the table. For now it holds only a single line with the name of the table's one column, HOST, but we'll fix that in a moment.

Now, let's write a program, called approve, to insert an approved host name in the table. This is also straightforward:

   use DBI;
   my $dbh = DBI->connect("DBI:CSV:f_dir=csv")
    or die "couldn't connect";

   my $host = shift or die "usage: approve host\n";
   $dbh->do(q(INSERT INTO APPROVED_HOST VALUES (?)),
    undef, $host);

Use approve like this:

   % approve foo.bar.com

Here we are using the multi-argument form of the do method. The second argument is a hashref of "attributes" that isn't often needed (just put undef in it). The remaining arguments are "bind values" that are bound to placeholders in the SQL argument. Each question mark in the first argument is a placeholder. When DBI executes the SQL statement, it replaces the placeholders in it with their corresponding bind values (SQL escaping them in the process). In this example, there is a single placeholder (the value in the INSERT statement) and a single bind value that gets plugged into it (the $host variable).
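When several values go through the same statement, the placeholder idea extends naturally: prepare the SQL once and execute it once per bind value. The following is a minimal sketch under that assumption. The helper name approve_hosts is hypothetical, and $dbh is assumed to be a connected database handle like the one returned by connect above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DBI): insert several host names
# through one prepared statement. $dbh is assumed to be a connected
# DBI database handle.
sub approve_hosts {
    my ($dbh, @hosts) = @_;

    # The single ? is a placeholder; one value is bound to it per execute.
    my $sth = $dbh->prepare(q(INSERT INTO APPROVED_HOST VALUES (?)))
        or die "prepare failed";

    my $inserted = 0;
    for my $host (@hosts) {
        # The bind value travels as data, so a name containing a quote
        # character cannot change the meaning of the SQL statement.
        $sth->execute($host) or die "insert failed for $host";
        $inserted++;
    }
    return $inserted;
}
```

A call such as approve_hosts($dbh, @ARGV) would then let approve accept any number of host names on the command line.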

Our last simple example is a program called ok, which prints "yes" or "no" depending on whether or not its argument is an approved host:

   use DBI;
   my $dbh = DBI->connect("DBI:CSV:f_dir=csv")
    or die "couldn't connect";
   my $host = shift or die "usage: ok host\n";

   my ($h) = $dbh->selectrow_array(q(
    SELECT * FROM APPROVED_HOST WHERE HOST = ?
   ), undef, $host);
   if ($h) {
    print "yes\n";
   } else {
    print "no\n";
   }

The selectrow_array method is a convenient way to run an SQL query statement when you need only the first row of the result. The row is returned as a list. If the query returns zero rows, selectrow_array returns an empty list. We use this to determine whether the host was found in the table and then print out "yes" or "no" accordingly.

We could have more sensibly used a COUNT here, but the CSV driver, which is very basic, doesn't support it.
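Because an empty result leaves the first list element undefined, the lookup can also be wrapped in a small function that tests for definedness rather than truth. This is only a sketch: is_approved is a hypothetical name, and $dbh is assumed to be a handle connected as in the programs above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $host has a row in APPROVED_HOST,
# 0 otherwise. $dbh is assumed to be a connected DBI database handle.
sub is_approved {
    my ($dbh, $host) = @_;

    # selectrow_array returns the first row as a list, or an empty list
    # when nothing matches, which leaves $found undefined.
    my ($found) = $dbh->selectrow_array(q(
        SELECT * FROM APPROVED_HOST WHERE HOST = ?
    ), undef, $host);

    # Check definedness rather than truth, so that a stored value Perl
    # treats as false (the string "0", for instance) still counts.
    return defined $found ? 1 : 0;
}
```

With this in place, the body of ok reduces to print is_approved($dbh, $host) ? "yes\n" : "no\n".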

Connecting with DBD::mysql
Let's rewrite the programs above to use DBD::mysql. We'll start with the program to create a table. The only change that's absolutely necessary is the connect method:

   my $dbh = DBI->connect("DBI:mysql:database=test_foo;" .
     "mysql_socket=/tmp/mysql.login.sock")
    or die "couldn't connect";

The first part of the DSN string has changed from DBI:CSV to DBI:mysql. The rest of the DSN string is DBD-specific. The MySQL DBD allows quite a few different options. By default it connects to a MySQL server running on the local host through a UNIX socket. Because we started the server on a different (nonstandard) socket, we have to specify a value for mysql_socket. Setting the MYSQL_UNIX_PORT variable would also work, as would using a config file.

Other optional arguments for the connect method include user and password. We're using the defaults, which is fine for our test database.
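One way to keep the DSN readable as options accumulate is to assemble it from named parts. The sketch below assumes the test_foo database and the socket path used earlier; mysql_dsn is a hypothetical helper, and the user name and password in the closing comment are placeholders, not real accounts.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a DBD::mysql data source string from
# named options. Only options that are actually supplied are included.
sub mysql_dsn {
    my (%opt) = @_;
    my @parts;
    for my $key (qw(database host port mysql_socket)) {
        push @parts, "$key=$opt{$key}" if defined $opt{$key};
    }
    return 'DBI:mysql:' . join(';', @parts);
}

# Use the MYSQL_UNIX_PORT environment variable if it is set, otherwise
# fall back to the socket path chosen when the server was started.
my $dsn = mysql_dsn(
    database     => 'test_foo',
    mysql_socket => $ENV{MYSQL_UNIX_PORT} || '/tmp/mysql.login.sock',
);
print "$dsn\n";

# With DBI loaded, the user name and password follow the DSN as the
# second and third arguments to connect:
#   my $dbh = DBI->connect($dsn, 'someuser', 'somepassword')
#       or die "couldn't connect";
```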

Let's change the schema while we're at it. We'll make HOST a primary key, and add some DATETIMEs so that we can keep track of when approvals are created and expire them after a period of time.

   $dbh->do(q(DROP TABLE IF EXISTS APPROVED_HOST));
   $dbh->do(q(
    CREATE TABLE APPROVED_HOST (
     HOST VARCHAR(128) PRIMARY KEY,
     APPROVED_DATE DATETIME NOT NULL,
     EXPIRE_DATE DATETIME NOT NULL
    )
   )) or die;
   $dbh->disconnect;

The call to disconnect is a very good idea and avoids inconsistent operation and warning messages. To save space, though, I won't always show it. Next, let's look at a revised version of the approve program:

   use DBI;
   use POSIX;
   my $dbh = DBI->connect #... as before

   my $host = shift or die "usage: approve host\n";
   my $now_td = strftime("%Y-%m-%d", localtime);
   my $later_td = strftime("%Y-%m-%d",
    localtime(time+24*60*60*180));

   $dbh->do(q(INSERT INTO APPROVED_HOST
     (HOST, APPROVED_DATE, EXPIRE_DATE) VALUES (?, ?, ?)
    ), undef, $host, $now_td, $later_td);
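The date handling in approve can be pulled into a few small functions, which is convenient once several programs need the same strings. This is a sketch only; sql_date, sql_datetime, and days_after are hypothetical names, not part of POSIX or DBI.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format a UNIX time as a "YYYY-MM-DD" date string.
sub sql_date {
    my ($time) = @_;
    return strftime("%Y-%m-%d", localtime($time));
}

# The same idea with the time of day included, matching the full
# "YYYY-MM-DD HH:MM:SS" form of a DATETIME value.
sub sql_datetime {
    my ($time) = @_;
    return strftime("%Y-%m-%d %H:%M:%S", localtime($time));
}

# The date the given number of days after $time.
sub days_after {
    my ($time, $days) = @_;
    return sql_date($time + $days * 24 * 60 * 60);
}

my $now = time;
print "approved: ", sql_date($now), "\n";
print "expires:  ", days_after($now, 180), "\n";
```

Note that a date-only string stored in a DATETIME column is taken to mean midnight on that day, so an approval stops matching EXPIRE_DATE > ? on the expiry date itself.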

The localtime and POSIX strftime functions are handy when converting UNIX times to formats that can be understood by databases. I insert a "now" date as well as a date 180 days in the future. The dates are in "YYYY-MM-DD" format, which is readily understood by both humans and MySQL. Next, the ok program:

   # use statements and connect omitted ...

   my $host = shift or die "usage: ok host\n";
   my $now_td = strftime("%Y-%m-%d", localtime);
   my ($count) = $dbh->selectrow_array(q(
    SELECT COUNT(*) FROM APPROVED_HOST
    WHERE HOST = ? AND EXPIRE_DATE > ?
   ), undef, $host, $now_td);
   if ($count) {
    print "yes\n";
   } else {
    print "no\n";
   }

This works like the previous version of ok, except that it also checks to see that the approval hasn't expired, and it uses a count of the matching rows (there should be only one anyway). Next, let's look at a program called approved that lists all the currently approved hosts:

   # use statements and connect omitted ...

   my $now_td = strftime("%Y-%m-%d", localtime);
   my $sth = $dbh->prepare(q(
    SELECT HOST, EXPIRE_DATE FROM APPROVED_HOST
    WHERE EXPIRE_DATE > ?
    ORDER BY HOST
   ));
   $sth->execute($now_td);
   my ($host, $expire_date);
   while (($host, $expire_date) = $sth->fetchrow_array) {
    printf "%30s expires %s\n", $host, $expire_date;
   }

This program is the first we've looked at that uses a query that will return multiple rows. There are several ways of working with such queries. In general, you first "prepare" the SQL statement into a statement handle. Then you execute the prepared statement and iterate over the rows in the result. The prepare method returns a statement handle object ($sth in this case). After calling the execute method on the statement handle, we read the resulting table with the fetchrow_array method. There are a number of alternative ways of handling query results — see the DBI documentation for more information.
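One of those alternatives is selectall_arrayref, which prepares, executes, and fetches in a single call. The sketch below rewrites the listing that way; approved_hosts and print_approved are hypothetical names, and $dbh is assumed to be a handle connected as before.

```perl
use strict;
use warnings;

# Hypothetical helper: return one [host, expire_date] pair per approval
# that has not expired as of $now_td (a "YYYY-MM-DD" string).
sub approved_hosts {
    my ($dbh, $now_td) = @_;

    # selectall_arrayref returns a reference to an array that holds one
    # array reference per result row.
    my $rows = $dbh->selectall_arrayref(q(
        SELECT HOST, EXPIRE_DATE FROM APPROVED_HOST
        WHERE EXPIRE_DATE > ?
        ORDER BY HOST
    ), undef, $now_td) or die "query failed";

    return @$rows;
}

# Print the rows in the same layout as the approved program.
sub print_approved {
    my ($dbh, $now_td) = @_;
    for my $row (approved_hosts($dbh, $now_td)) {
        my ($host, $expire_date) = @$row;
        printf "%30s expires %s\n", $host, $expire_date;
    }
}
```

This form pulls the whole result into memory at once, which is fine for a short list of hosts; the prepare, execute, and fetchrow_array loop is the better fit when a result may be large.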

Our last program, which accomplishes the promised task of "approving" mail messages, requires the Mail::Internet and Mail::Address modules, both part of the MailTools distribution on CPAN:

   use DBI;
   use POSIX;
   use Mail::Internet;
   use Mail::Address;
   my $dbh = DBI->connect( # ... as before

   my $now_td = strftime("%Y-%m-%d", localtime);
   my $mail = Mail::Internet->new(\*STDIN) or die "can't parse message";
   my ($from_a) = Mail::Address->parse($mail->head->get('From'));
   my $host = $from_a->host;
   my ($count) = $dbh->selectrow_array(q(
    SELECT COUNT(*) FROM APPROVED_HOST
    WHERE HOST = ? AND EXPIRE_DATE > ?
   ), undef, $host, $now_td);
   $dbh->disconnect;
   exit($count ? 0 : 1);

We read the message from standard input, then use a few lines of Mail::Internet voodoo to extract the host name from the From: line. Then we look up the host name in the database and return an exit status of 0 or 1 depending on whether or not it is approved.
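The filter only ignores expired approvals; it never deletes them. A companion cleanup step might look like the following sketch, where purge_expired is a hypothetical name and $dbh is assumed to be a handle connected as before.

```perl
use strict;
use warnings;

# Hypothetical helper: delete approvals whose EXPIRE_DATE has passed as
# of $now_td (a "YYYY-MM-DD" string) and return how many were removed.
sub purge_expired {
    my ($dbh, $now_td) = @_;

    # For a non-SELECT statement, do returns the number of rows affected.
    # Zero rows comes back as the string "0E0", which is true in boolean
    # context but numerically zero, so "or die" fires only on an error.
    my $rows = $dbh->do(q(
        DELETE FROM APPROVED_HOST WHERE EXPIRE_DATE <= ?
    ), undef, $now_td) or die "delete failed";

    return $rows + 0;   # turn "0E0" into a plain 0
}
```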

Databases: Free and Easy!
There are many more details to consider in a production version of this system — error handling, for example. But that'll have to wait for a future column.

Meanwhile, I hope that with these examples I've shown you that nowadays SQL databases are both inexpensive (free!) and easy to use. If you need a safe, organized place to store some data — no matter whether you have a little or a lot — consider doing it with Perl and an SQL database.