Software Patent Institute Report
by Roland J. Cole
Executive Director, Software Patent Institute
Thanks to the USENIX grant, we were able to recreate the system SPI had developed for adding documents to its database. With increased experience on our part, and with the lower prices and higher performance of current computer hardware and software, the new system is much faster and more efficient than the SPI system of a few years ago. As we discuss in more detail below, we are now completing the processing of technical reports at a rate of sometimes more than 200 per month. We ended the grant period with the database already expanded significantly in scope (thanks to the bibliographies already loaded) and expanding rapidly in size and institutional variety, thanks to all the technical reports. By continuing at this pace through the end of 1999, with no additional resources from USENIX, we have more than doubled the number of documents that were in the database at the end of 1998. We are tremendously grateful and are sure this collaboration between USENIX and SPI has extended the value of the database to all concerned.
The SPI Database Development Process
Once we have both permission and physical possession of the documents, we scan paper documents with a sheet-fed scanner. We then use a software program (Caere OmniPage) to convert each scanned image into text, albeit with varying degrees of accuracy, a process commonly called "OCR" (optical character recognition, although the software uses language analysis as well as character analysis). If a document is difficult to OCR, we make our initial corrections in OmniPage, a process we call "proofing." The goal is not absolute correspondence to the original (that would take far too much time and other resources), but correspondence "good enough" that the search engine will work acceptably. Our goal is successful searching: the database functions more or less as an extended "card catalog" rather than as a document delivery system, recognizing that our documents are not readily available for electronic searching unless we make them so. To use our resources most effectively, we omit nontextual material such as mathematical equations, segments of program code, figures, diagrams, and pictures, although we try to preserve the captions of such items.
If a document arrives in electronic form, we can obviously skip the scanning, OCRing, and proofing stages. We replace them with a stage we call "conversion," since we receive a variety of electronic formats and need to end up with ASCII text.
Our next step is to finish whatever correction of the document we feel is necessary and to add the HTML-like (hypertext markup language) codes we use to identify headings, captions, and the like. We are trying to preserve the structure of the document (chapters, sections, and paragraphs) and to identify page breaks. We use Microsoft Word for this process and call this activity "formatting." Both paper documents and electronic documents go through this stage. We then add a front end to the document that contains the citation material, such as author, publisher, and so forth.
Then we run the documents through a program written by SPI to check our formatting, a process we call "cleancheck." This program checks for mismatched tags, incorrect syntax, and the like. During the USENIX grant, we were able to convert this into an email process: our document specialist emails the supposedly formatted document from her computer in Kansas City to the SPI server in Michigan and receives a return email (usually within seconds) that either tells her the document is OK or lists the errors needing correction. Next, this same program processes correctly formatted documents into records for the database, intelligently dividing documents into segments of approximately 10,000 bytes while keeping paragraphs, pages, and sections intact as much as possible. The database program then takes the resulting collection of records and loads them into the database. We lump these last two steps together as the "loading stage."
As we said in our application for the grant, we had developed and honed this generic process from 1992 through 1997, but financial support had dropped to a level that enabled us to keep the database online but not to continue this process, let alone improve it. We processed approximately ten documents during the December 1997 through September 1998 period. Thanks to the USENIX grant, we were able to start moving some 18,400 documents through our process during the December 1998 through November 1999 period.
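To make the later stages concrete, here is a minimal sketch, in Python, of the kind of checks the "cleancheck" and loading stages perform: verifying that the HTML-like structure tags are balanced, and splitting a formatted document into records of roughly 10,000 bytes without breaking paragraphs. The tag names, the helper names (check_tags, segment), and the exact splitting rule are assumptions made for illustration; this is not SPI's actual program.

```python
import re

SEGMENT_TARGET = 10_000  # approximate record size in bytes, per the process above

# Hypothetical tag set; the report does not list SPI's actual tag names.
STRUCTURE_TAGS = ("section", "heading", "caption", "page")

def check_tags(text, tags=STRUCTURE_TAGS):
    """Report mismatched or unclosed HTML-like structure tags ("cleancheck"-style)."""
    errors, stack = [], []
    for match in re.finditer(r"<(/?)(\w+)>", text):
        closing, name = match.group(1) == "/", match.group(2).lower()
        if name not in tags:
            continue  # ignore tags outside the structural set
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            errors.append(f"unexpected </{name}> at offset {match.start()}")
    errors.extend(f"unclosed <{name}>" for name in stack)
    return errors

def segment(text, target=SEGMENT_TARGET):
    """Split text into records of roughly `target` bytes,
    breaking only at blank lines (paragraph boundaries) when possible."""
    paragraphs = re.split(r"\n\s*\n", text)
    records, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate.encode("utf-8")) > target:
            records.append(current)  # flush before exceeding the target size
            current = para
        else:
            current = candidate
    if current:
        records.append(current)
    return records

if __name__ == "__main__":
    sample = "<section><heading>Title</heading>\n\nBody text...\n\n</section>"
    print(check_tags(sample))          # [] if the tags balance
    print(len(segment(sample * 500)))  # number of ~10 KB records
```

Breaking only at blank lines is one plausible way to keep paragraphs intact while staying near the 10,000-byte target; an oversized paragraph simply becomes its own, larger record.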
Along the way, we were able to restart our process and tune it to take advantage of major decreases in price and increases in performance in hardware and software between 1997 and 1999.
Before and After Process Improvements
1. Upgraded hardware and software to the "state of the art" for standalone systems.
2. Redesigned the process so that steps could be separated and assigned to different people in different locations.
3. The net effect of 1 and 2 was to move from 10 finished pages per day to some 200 finished pages per day, with intermediate progress of several thousand pages per day of scanning and OCRing.
4. Catalogued and started processing the 18,400 documents already in our possession, including some 36 boxes of IBM material obtained from the University of Michigan and some 110 boxes of technical reports obtained from Yale University.
5. The material added during this period includes the entire series of computer bibliographies compiled by ACM from 1960 through 1994, plus technical reports from several dozen institutions, thus greatly expanding the scope of the database in both subject matter and institutional source.
Report of Documents in Process
The first number is the count; the second is the stage of the process those documents are in. (Please note: all counts are approximate.) We have also continued processing since this table was prepared, in part because USENIX has agreed to renew its grant for 2000. In particular, we have scanned several thousand additional documents and loaded several hundred more.
Status legend: 1 = permission to be asked
Document Source: count @ status
Totals by Status as of 9/30/99: 1 (permission to be asked) = 6,533
Total Processed at Least in Part from December 1998 Through September 1999
With the renewal of the USENIX grant for 2000, we expect to increase the speed of the process once again, especially with major steps forward in hardware speed, and to at least double, if not quadruple, the total number of documents in the database during the year. The database is available without charge from SPI at <http://www.spi.org>, and we encourage everyone who is interested to explore what we are creating.