Converting DB/TextWorks Data to MARC: Not as Easy as it Sounds!

by Jonathan Jacobsen Wednesday, April 25, 2018 4:55 PM

We recently had a client ask us if we could help them convert their DB/TextWorks library catalogue database to MARC format, for submission to another search system.

Our initial reaction was "Sure, easy-peasy." After all, we're librarians, we know MARC well, we're experts in DB/TextWorks, and we've done this before. How hard could it be?

Ha! Famous last words, and all that.

To be fair, there are a few wrinkles:

  1. The conversion is not a one-time affair, but rather something the client would like automated and running on a regular basis.
  2. Not all records in the database are to be converted.
  3. MARC is not the simplest of formats. Whether MARC or MARCXML, there is a fairly rigid set of rules that must be followed to create 100% valid, standards-compliant records.

Nonetheless, since MARC has been around rather a long time, there is a plethora of tools available for creating MARC records from other data sources: Inmagic's own MARC Transformer, which works directly with DB/TextWorks databases; MARCEdit; Balboa Software's Data Magician; and various free or open source tools from the U.S. Library of Congress and other agencies.

We also have our Andornot Data Extraction Utility for automatically exporting data from a DB/TextWorks database and manipulating it into a variety of formats.

Ever optimistic, we figured some combination of these tools could be strung together in sequence without too much effort to create a solution for this client. We wanted to avoid developing a DB/TextWorks-to-MARC conversion program from scratch as it would be quite time consuming, mostly due to the MARC format requirements themselves.

Several tools, upon closer investigation, proved to be too ancient to run reliably on a modern Windows server, or couldn't work with the current version of DB/TextWorks. Others proved almost impossible to use in an automated fashion. They could be useful in a one-time manual conversion to MARC format, but not in the hands-off, automated workflow we needed.

The exploration of this issue was an interesting exercise in seeing how old data formats and old programs age and become harder to work with.

The recipe that baked the cake in the end was:

  1. Use the Andornot Data Extraction Utility and the Inmagic ODBC driver to extract data from the DB/TextWorks database to a pseudo-MARC plain text format. 
  2. Use a custom-developed PowerShell script to manipulate the records in this file to handle some of the quirks of the ODBC output and to more closely adhere to the MARC format.
  3. Use the command line interface to MARCEdit to convert the pseudo-MARC to MARC Communications Format files.
  4. Upload the MARC files over FTP to the destination server.
  5. Manage all the moving parts through a PowerShell script (sketched below) and log the steps and results to a file for easy troubleshooting in case of problems.
  6. Run the script nightly as a scheduled task on a Windows server.
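
To give a sense of what the nightly job looks like, here is a rough sketch of the orchestration script. Everything in it is a placeholder: the extraction utility's command line, the clean-up step, the cmarcedit.exe switches, the paths and the credentials are illustrative assumptions rather than the exact commands from this project, so check each tool's documentation before borrowing any of it.

    # Sketch of the nightly DB/TextWorks-to-MARC pipeline (all paths, switches and
    # credentials below are placeholders, not the real ones from this project).
    $work = "C:\jobs\marc-export"
    $log  = Join-Path $work "marc-export.log"
    function Write-Log($message) { "$(Get-Date -Format s)  $message" | Out-File $log -Append }

    try {
        Write-Log "Starting MARC export"

        # 1. Extract records to a pseudo-MARC text file (hypothetical command line
        #    for the Andornot Data Extraction Utility).
        & (Join-Path $work "DataExtractionUtility.exe") /profile (Join-Path $work "marc-export.xml")
        Write-Log "Extraction complete"

        # 2. Tidy the ODBC output so it more closely follows the MARC mnemonic format
        #    (a real script would do several fix-ups here; this is just a token example
        #    that strips trailing whitespace from each line).
        (Get-Content (Join-Path $work "export.mrk") -Raw) -replace "[ \t]+`r`n", "`r`n" |
            Set-Content (Join-Path $work "clean.mrk")
        Write-Log "Clean-up complete"

        # 3. Convert the mnemonic file to MARC communications format with MARCEdit's
        #    command line tool (switches shown are assumptions; verify against the docs).
        & "C:\Program Files\MarcEdit 7\cmarcedit.exe" -s (Join-Path $work "clean.mrk") `
            -d (Join-Path $work "catalogue.mrc") -make
        Write-Log "MARC conversion complete"

        # 4. Upload the finished file over FTP.
        $client = New-Object System.Net.WebClient
        $client.Credentials = New-Object System.Net.NetworkCredential("username", "password")
        $client.UploadFile("ftp://destination.example.org/catalogue.mrc", (Join-Path $work "catalogue.mrc"))
        Write-Log "Upload complete"
    }
    catch {
        Write-Log "ERROR: $_"
    }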

When written like that, in hindsight, it sounds so simple. And in the end, it was, and works well. But the journey to arrive at this solution was one of the more challenging small projects we've undertaken, considering how simple the task sounds at first.

We hope this will help you if you have a similar project, but don't hesitate to ask us for help, now that we've worked through this.

Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

by Peter Tyrrell Friday, June 23, 2017 1:21 PM

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnprior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer in the PDF, and most of my efforts are directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf, or via full paths like x:\path\to\document.pdf that have no meaning on the server where Solr resides. There are a variety of tricks which transform those file paths to something Solr can use, and I needn't get into them here. The problem I really want to talk about is the problem of scale.

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here but assume the database does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index-time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. I set up a script that creates a PDF-to-text-file mirror.

The script queries the database for PDF file paths, checks file paths for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. Whenever the script is run after the first time, it checks to see if a matching text file already exists for a PDF. If yes, the PDF is only processed if its modify date is newer than its text file doppelgänger. It may take days for the initial run to finish, but once it has, only additional or modified PDFs have to be processed on subsequent runs.
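
The actual script is in the gist linked below; the following is only a stripped-down sketch of the mirroring idea. It walks a folder tree rather than querying the database for file paths, and it assumes a command-line PDF-to-text converter such as pdftotext is on the PATH; all paths are invented.

    # Sketch: maintain a text-file mirror of a PDF folder tree (paths are placeholders).
    $pdfRoot  = "\\fileserver\documents"
    $textRoot = "D:\textmirror"

    Get-ChildItem -Path $pdfRoot -Recurse -Filter *.pdf | ForEach-Object {
        # Mirror the folder hierarchy and swap the extension for .txt.
        $relative = $_.FullName.Substring($pdfRoot.Length).TrimStart('\')
        $txtPath  = Join-Path $textRoot ([System.IO.Path]::ChangeExtension($relative, ".txt"))
        New-Item -ItemType Directory -Path (Split-Path $txtPath) -Force | Out-Null

        # Skip PDFs whose text doppelgänger is already up to date.
        if ((Test-Path $txtPath) -and ((Get-Item $txtPath).LastWriteTime -ge $_.LastWriteTime)) {
            return  # acts as 'continue' inside ForEach-Object
        }

        # Extract the text layer (pdftotext is one option; Tika's command-line app would also work).
        & pdftotext $_.FullName $txtPath
    }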

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.
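
In our case the Discovery Interface's own harvesting pipeline handles that ingestion, but stripped to its essence, indexing a text file is just an update call to Solr. A minimal illustration, with invented core name, URL and field names:

    # Illustrative only: index one extracted text file as a Solr document.
    # Core name, URL and field names are placeholders for whatever schema is in use.
    $doc = @{
        id      = "1998-report"
        content = (Get-Content "D:\textmirror\1998\report.txt" -Raw)
    } | ConvertTo-Json

    Invoke-RestMethod -Uri "http://localhost:8983/solr/documents/update/json/docs?commit=true" `
        -Method Post -ContentType "application/json" -Body $doc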

The script is for Windows, is written in PowerShell, and is available as a Github gist.

Tags: PowerShell | Solr | Tika

IIS Application Pool Resurrection Script

by Peter Tyrrell Monday, May 25, 2015 10:45 AM

Overview

Default IIS application pool settings allow for no more than 5 uncaught exceptions within 5 minutes (IIS calls this Rapid-Fail Protection), and when this magic number is reached, the application pool shuts itself down. Uncaught exceptions are somewhat rare for us in the web applications we write, because we have frameworks that catch and log errors. Some of our older web applications do suffer from uncaught exceptions, however, as does Inmagic WebPublisher on servers where we host clients that use that software.
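
For reference, those thresholds live in each application pool's Rapid-Fail Protection settings and can be inspected or adjusted, for example with appcmd (the pool name below is a placeholder):

    # Illustrative only: view and raise the Rapid-Fail Protection crash threshold for one pool.
    $appcmd = "$env:windir\system32\inetsrv\appcmd.exe"
    & $appcmd list apppool "MyAppPool" /text:failure.rapidFailProtectionMaxCrashes
    & $appcmd set apppool "MyAppPool" /failure.rapidFailProtectionMaxCrashes:10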

It used to be that text alerts would wake us up in the middle of the night, screaming that sites dependent on WebPublisher were down, and we would remote in to the server to restart the relevant application pool. Well, that was pretty much untenable, so I wrote a script that automatically restarts the application pool, triggered when the pool's shutdown is recorded in the Windows Application Event Log. A caveat here: application pools usually shut themselves down for good reason, so don't apply this script as a band-aid if you can fix the underlying causes instead.

Prerequisites

  • PowerShell v2 (get current version with $PSVersionTable.PSVersion).
  • PowerShell execution policy must allow the script to run (e.g. Set-ExecutionPolicy RemoteSigned or Set-ExecutionPolicy Unrestricted).

Install the Script

  1. Register a new Windows Application Event Log source called 'AppPool Resurrector'. Do it manually or use my PowerShell script (a one-line equivalent is sketched after this list).
  2. Put the AppPoolResurrector.ps1 script somewhere on the server, and take note of the name of the application pool you want to monitor.
  3. Create one task in Windows Task Scheduler for each application pool you want to monitor:
    1. Trigger is 'On an event', with Log: Application, Source: Application Error, Event ID: 1000
    2. Action is 'Start a program', Program/script: PowerShell, Add arguments: -command "& 'c:\path\to\apppoolresurrector.ps1' 'name-of-app-pool'"
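
Registering the event source in step 1 boils down to a single cmdlet, shown here as a sketch; it needs an elevated PowerShell session:

    # Sketch of step 1: register the event source the resurrector script will log to.
    New-EventLog -LogName Application -Source "AppPool Resurrector"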

Note that when triggered, the script checks whether the named application pool is still running and restarts it only if necessary. There will be times it is triggered by a log event only to find that the application pool is fine, probably because the event was unrelated to that pool in the first place.

Script Content
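
The original post embeds the full script at this point; it isn't reproduced in this text. As a minimal sketch of the same idea, assuming the WebAdministration module that ships with IIS 7 and later:

    # AppPoolResurrector.ps1 (sketch): restart the named application pool if it has stopped.
    # A fuller script would also log to the 'AppPool Resurrector' event source registered earlier.
    param(
        [Parameter(Mandatory = $true)]
        [string]$AppPoolName
    )

    Import-Module WebAdministration

    if ((Get-WebAppPoolState -Name $AppPoolName).Value -eq "Stopped") {
        Start-WebAppPool -Name $AppPoolName
        Write-EventLog -LogName Application -Source "AppPool Resurrector" -EntryType Warning `
            -EventId 1 -Message "Application pool '$AppPoolName' was stopped and has been restarted."
    }
    # Otherwise there is nothing to do: the triggering event was probably unrelated to this pool.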
