Recently we had a new client come to us looking for help with several subscription-based VuFind sites they manage, and ultimately to have us host them as part of our managed hosting service. This client had a unique challenge for us: 3 million records, available as tab-separated text files of up to 70,000 records each.
Most of the data sets we work with are relatively small: libraries with a few thousand records, archives with a few tens of thousands, and every so often, databases of a few hundred thousand, like those in the Arctic Health bibliography.
While VuFind and the Apache Solr search engine that powers it (and also powers our Andornot Discovery Interface) have no trouble with that volume of records, transforming the data from hundreds of tab-separated text files into something Solr can use, in an efficient manner, was a pleasant challenge.
VuFind has excellent tools for importing traditional library MARC records, using the SolrMarc tool to post data to Solr. For other types data, such as records exported from DB/TextWorks databases, we’ve long used the PHP-based tools in VuFind that use XSLTs to transform XML into Solr's schema and post it to Solr. While this has worked well, XSLTs are especially difficult to debug, so we considered alternatives.
For this new project, we knew we needed to write some code to manipulate the 3 million records in tab-separated text files into XML, and we knew from our extensive experience with Solr that it's best to post small batches of records at a time, in separate files, rather than one large post of 3 million! So we wrote a python script to split up the source data into separate files of about 1,000 records each, and also remove invalid characters that had crept in to the data over time (this data set goes back decades and has likely been stored in many different character encodings on many different systems, so it's no surprise there were some gremlins).
Once the script was happily creating Solr-ready XML files, rather than use VuFind's PHP tools and an XSLT to index the data, it just seemed more straightforward to push the XML directly to Solr. For this, we wrote a bash shell script that uses the post tool that ships with Solr to iterate through the thousands of data files and push each to Solr, logging the results.
The combination of a python script to convert the tab-separated text files into Solr-ready XML and a bash script to push it to Solr worked extremely well for this project. Python is lightning fast at processing text and pushing data directly to Solr is definitely faster than invoking XSLT transformations.
This approach would work well for any data. Python is a very forgiving language to develop with, making it easy and quick to write scripts to process any data source. In fact, since this project, we've used Python to manipulate a FileMaker Pro database export for indexing in our Andornot Discovery Interface (also powered by Apache Solr) and to harvest data from the Internet Archive and Online Archive of California, for another Andornot Discovery Interface project (watch this blog for news of both when they launch).
We look forward to more challenges like this one! Contact us for help with your own VuFind, Solr and similar projects.