Software Patent Institute Report
by Roland J. Cole
Executive Director, Software Patent Institute
Thanks to the USENIX grant, we were able to recreate the system SPI had developed for adding documents to its database. With increased experience on our part, and with the lower prices and higher performance of current computer hardware and software, the new system is much faster and more efficient than the SPI system of a few years ago. As we discuss in more detail below, we are now completing the processing of technical reports at a rate of sometimes more than 200 per month. We ended the grant period with the database already expanded significantly in scope (thanks to the bibliographies already loaded) and expanding rapidly in size and institutional variety, thanks to all the technical reports. By continuing at this pace through the end of 1999, with no additional resources from USENIX, we have more than doubled the number of documents that were in the database at the end of 1998. We are tremendously grateful and are sure this collaboration between USENIX and SPI has extended the value of the database to all concerned.
The SPI Database Development Process
Once we have both permission and physical possession of the documents, we scan paper documents with a sheet-fed scanner. We then use a software program (Caere OmniPage) to convert each scanned image into text, albeit with varying degrees of accuracy, a process commonly called "OCR" (optical character recognition, although the software uses language analysis as well as character analysis). If a document is difficult to OCR, we make our initial corrections in OmniPage, a process we call "proofing." The goal is not absolute correspondence to the original (that would take far too much time and other resources), but correspondence "good enough" that the search engine will work acceptably. Our goal is successful searching: the database functions more or less as an extended "card catalog" rather than as a document delivery system, recognizing that our documents are not readily available for electronic searching unless we make them so. To use our resources most effectively, we omit nontextual material such as mathematical equations, segments of program code, figures, diagrams, and pictures, although we try to preserve the captions of such items.
If a document arrives in electronic form, we can obviously skip the scanning, OCRing, and proofing stages. We replace them with a stage we call "conversion," since we receive a variety of electronic formats and need to end up with ASCII text.
Our next step is to finish whatever correction of the document we feel is necessary and to add the HTML-like (hypertext markup language) codes we use to identify headings, captions, and the like. We are trying to preserve the structure of the document (chapters, sections, and paragraphs) and to identify page breaks. We use Microsoft Word for this process and call this activity "formatting." Both paper documents and electronic documents go through this stage. We then add a front end to the document that contains the citation material, such as author, publisher, and so forth.
Then we run the documents through a program written by SPI to check our formatting, a process we call "cleancheck." This program checks for mismatched tags, incorrect syntax, and the like. During the USENIX grant, we were able to convert this into an email process: our document specialist emails the supposedly formatted document from her computer in Kansas City to the SPI server in Michigan and receives a return email (usually within seconds) that either tells her the document is OK or lists the errors needing correction. Next, this same program processes correctly formatted documents into records for the database, intelligently dividing documents into segments of approximately 10,000 bytes while keeping paragraphs, pages, and sections intact as much as possible. The database program then takes the resulting collection of records and loads them into the database. We lump these last two steps together as the "loading stage."
As we said in our application for the grant, we had developed and honed this generic process from 1992 through 1997, but financial support had dropped to a level that enabled us to keep the database online but not to continue this process, let alone improve it. We processed approximately ten documents during the December 1997 through September 1998 period. Thanks to the USENIX grant, we were able to start moving some 18,400 documents through our process during the December 1998 through November 1999 period.
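To make the later stages concrete, here is a minimal sketch, in Python, of the kind of checks the "cleancheck" and loading stages perform: verifying that the HTML-like structure tags are balanced, and splitting a formatted document into records of roughly 10,000 bytes without breaking paragraphs. The tag names, the helper names (check_tags, segment), and the exact splitting rule are assumptions made for illustration; this is not SPI's actual program.

```python
import re

SEGMENT_TARGET = 10_000  # approximate record size in bytes, per the process above

# Hypothetical tag set; the report does not list SPI's actual tag names.
STRUCTURE_TAGS = ("section", "heading", "caption", "page")

def check_tags(text, tags=STRUCTURE_TAGS):
    """Report mismatched or unclosed HTML-like structure tags ("cleancheck"-style)."""
    errors, stack = [], []
    for match in re.finditer(r"<(/?)(\w+)>", text):
        closing, name = match.group(1) == "/", match.group(2).lower()
        if name not in tags:
            continue  # ignore tags outside the structural set
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            errors.append(f"unexpected </{name}> at offset {match.start()}")
    errors.extend(f"unclosed <{name}>" for name in stack)
    return errors

def segment(text, target=SEGMENT_TARGET):
    """Split text into records of roughly `target` bytes,
    breaking only at blank lines (paragraph boundaries) when possible."""
    paragraphs = re.split(r"\n\s*\n", text)
    records, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate.encode("utf-8")) > target:
            records.append(current)  # flush before exceeding the target size
            current = para
        else:
            current = candidate
    if current:
        records.append(current)
    return records

if __name__ == "__main__":
    sample = "<section><heading>Title</heading>\n\nBody text...\n\n</section>"
    print(check_tags(sample))          # [] if the tags balance
    print(len(segment(sample * 500)))  # number of ~10 KB records
```

Breaking only at blank lines is one plausible way to keep paragraphs intact while staying near the 10,000-byte target; an oversized paragraph simply becomes its own, larger record.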
Along the way, we were able to restart our process and tune it to take advantage of major decreases in price and increases in performance in hardware and software between 1997 and 1999.
Before and After Process Improvements
1. Upgraded hardware and software to the "state of the art" for standalone systems.
2. Redesigned the process so that steps could be separated and assigned to different people in different locations.
3. The net effect of 1 and 2 was to move from 10 finished pages per day to some 200 finished pages per day, with intermediate progress of several thousand pages per day of scanning and OCRing.
4. Catalogued and started processing the 18,400 documents already in our possession, including some 36 boxes of IBM material obtained from the University of Michigan and some 110 boxes of technical reports obtained from Yale University.
5. The material added during this period includes the entire series of computer bibliographies compiled by ACM from 1960 through 1994, plus technical reports from several dozen institutions, thus greatly expanding the scope of the database in both subject matter and institutional source.
Report of Documents in Process
The first number is the count; the second is the stage of the process those documents are in. (Please note: all counts are approximate.) We have also continued processing since this table was prepared, in part because USENIX has agreed to renew its grant for 2000. In particular, we have scanned several thousand additional documents and loaded several hundred more.
Status legend: 1 = permission to be asked
Document Source: count @ status
Totals by Status as of 9/30/99: 1 (permission to be asked) = 6,533
Total Processed at Least in Part from December 1998 Through September 1999
With the renewal of the USENIX grant for 2000, we expect to increase the speed of the process once again, especially with major steps forward in hardware speed, and to at least double, if not quadruple, the total number of documents in the database during the year. The database is available without charge from SPI at <http://www.spi.org>, and we encourage everyone who is interested to explore what we are creating.