Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

by Peter Tyrrell Friday, June 23, 2017 1:21 PM

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnprior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer from a PDF, so most of my effort is directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf, or via full paths like x:\path\to\document.pdf that have no meaning on the server where Solr resides. There are a variety of tricks for transforming those file paths into something Solr can use, and I needn't get into them all here. The problem I really want to talk about is the problem of scale.
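
As a rough illustration, the simplest of those tricks is a straight prefix swap, remapping the drive-letter prefix stored in the database to a UNC path the indexing server can actually reach. The drive letter and UNC prefix below are invented placeholders, not values from any real configuration:

    # Remap a database-style path to one the indexing server can resolve.
    # Both prefixes here are placeholders.
    $dbPath   = 'x:\path\to\document.pdf'
    $solrPath = $dbPath -replace '^x:\\', '\\fileserver\docs\'
    # $solrPath is now \\fileserver\docs\path\to\document.pdf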

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here but assume the database does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. So I set up a script that creates a PDF-to-text-file mirror.

The script queries the database for PDF file paths, checks each path for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. On every run after the first, the script checks whether a matching text file already exists for each PDF; if one does, the PDF is reprocessed only when its modified date is newer than that of its text file doppelgänger. The initial run may take days to finish, but once it has, only new or modified PDFs need to be processed on subsequent runs.
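
For the curious, here is a minimal sketch of the mirroring loop, assuming Apache Tika's command-line app (tika-app.jar) and Java are available. All paths are placeholders, and the real script in the gist does more (database query, path validation and so on):

    # Sketch only: build/refresh a PDF-to-text-file mirror with Tika.
    $sourceRoot = 'X:\pdfs'              # root folder of the source PDFs (placeholder)
    $mirrorRoot = 'D:\pdftext'           # root folder of the text file mirror (placeholder)
    $tikaJar    = 'C:\tika\tika-app.jar' # Tika command-line app (placeholder)

    Get-ChildItem -Path $sourceRoot -Filter *.pdf -Recurse | ForEach-Object {
        # Mirror path: same folder hierarchy, .txt extension instead of .pdf
        $relative = $_.FullName.Substring($sourceRoot.Length).TrimStart('\')
        $txtPath  = Join-Path $mirrorRoot ([System.IO.Path]::ChangeExtension($relative, '.txt'))

        # Skip this PDF if an up-to-date text file already exists
        if ((Test-Path $txtPath) -and ((Get-Item $txtPath).LastWriteTime -ge $_.LastWriteTime)) {
            return
        }

        # Make sure the mirror folder exists, then extract the text layer with Tika
        New-Item -ItemType Directory -Path (Split-Path $txtPath) -Force | Out-Null
        & java -jar $tikaJar --text $_.FullName | Out-File -FilePath $txtPath -Encoding UTF8
    }

Because a subsequent run is little more than a timestamp comparison against the mirror, re-running the script is cheap, which is what makes the nightly full Solr refresh feasible.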

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.
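
How the text mirror actually reaches Solr depends on the site; in our case the Discovery Interface harvests it through its own pipeline. As a hedged illustration of the idea, though, Solr's bundled SimplePostTool can walk a folder of .txt files and post each one (the collection name and paths below are invented):

    # Hedged example only: post the text mirror to a Solr collection named 'documents'
    # using the SimplePostTool that ships with Solr. Collection name and paths are placeholders.
    & java '-Dc=documents' '-Dauto=yes' '-Drecursive=yes' `
        -jar 'C:\solr\example\exampledocs\post.jar' 'D:\pdftext'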

The script is for Windows, written in PowerShell, and available as a GitHub gist.

Tags: PowerShell | Solr | Tika

Andornot's June 2017 Newsletter Available: News, Tips and Tricks for Libraries, Archives and Museums

by Jonathan Jacobsen Thursday, June 22, 2017 8:54 AM

Andornot's June 2017 Newsletter has been emailed to subscribers and is available to read here, with news, tips and tricks for libraries, archives and museums.

In This Issue

  • Andornot News
  • Andornot's Latest Projects
  • Tips, Tricks and Ideas
  • Other News

Tags: newsletters

Richmond Archives Adds Name Origins Resource to Online Search

by Jonathan Jacobsen Tuesday, June 06, 2017 9:51 AM

I live in Richmond, part of the Metro Vancouver Regional District, and have an interest in local history, so I was particularly interested when Andornot was asked by the City of Richmond Archives to help with a project on the origins of Richmond place names. 

The City of Richmond Archives is a long-time user of Inmagic DB/TextWorks for managing their collections, and was instrumental in developing the set of linked databases that became our Andornot Archives Starter Kit. Over the past couple of years we’ve helped the Archives upgrade their Inmagic WebPublisher-based online search system, which is available at http://archives.richmond.ca/archives/descriptions/

The new Name Origins search, available at http://archives.richmond.ca/archives/places/, features almost 500 records (and growing) that document and describe the history of Richmond streets, roads, bridges, neighbourhoods, and other landmarks. It’s easy to search by keyword or by type of place, and whenever possible, a Google map of the named place is shown. This database is updated by the Friends of the Richmond Archives, volunteers with a passion for local history. Launching this new database online was made possible through the Richmond Canada 150 Community Celebration Grant Allocations.

As I worked on the web search interface to the database, I couldn’t help but search for places in my neighbourhood and around Richmond, and become captivated by their history. Now community members can access this information 24/7 and learn the history behind the names of streets, areas, and landmarks in their community.

Contact Andornot for options for your Inmagic databases and for search engines and other software to make your collections accessible online.

CHLA Conference a Success for Andornot Grant Recipient Mark Goodwin

by Mark Goodwin Tuesday, May 30, 2017 6:32 PM

I had the privilege of being selected as the recipient of Andornot's 2017 Professional Development Grant to fund my attendance at the Canadian Health Libraries Association (CHLA) conference in Edmonton, Alberta. Having recently started a position as a Reference Librarian at the BC Cancer Agency, I saw the conference as an opportunity to grow as a new health information professional - and to mine exhibitors for free swag (thanks Andornot!).

As a first time attendee, I made it a priority to take advantage of every networking opportunity available. I acted as a CHLA Social Media Ambassador and attended social events like the First Timer's Reception. All of this provided excellent avenues for forging connections with colleagues in BC and across Canada. Free cocktails are always a plus, too.

One of my conference highlights was University of Alberta Professor Tim Caulfield's keynote on celebrity culture and its (spoiler: mostly negative) influence on public health, which left me feeling inspired to be more involved socially as a champion for evidence-based information. The discussion continued during an interactive session around the prevalence of fake news and pseudoscience. One of my main takeaways from all this? The power of personal stories. Health professionals often combat bad information with a 'just the facts' approach. A more effective technique is to focus on personal narratives, and then use facts and evidence to reinforce the message.

I also discovered a number of health information resources that will be extremely useful to my work in a practical sense. Sessions and courses covered everything from research data management tools to health app reviews. You know you're in the right continuing education course when your instructor has the Twitter handle @Grampa_Data!

I love being a health librarian because it allows me to help others - and my experience at this conference will help me succeed in doing that. Mission accomplished in the swag department as well - I have enough tote bags and water bottles to last me at least a year.

My deepest thanks go out to Andornot. I wouldn't have been able to attend this event without their generous support!

Twitter: @MarkJWGoodwin

LinkedIn: www.linkedin.com/in/markjwgoodwin

Our Awards Banquet table featured librarians from coast to coast. Photo by @katmil2020

BC Health Librarians busy 'networking.' Photo by @Librownian

Me with Tim Caulfield, author of Is Gwyneth Paltrow Wrong About Everything?: When Celebrity Culture and Science Clash

Tags: events | funding

Stanford's King Institute Launches New Documents Search Engine

by Jonathan Jacobsen Thursday, May 11, 2017 1:03 PM

Last year, Andornot had the pleasure of working with the King Institute at Stanford University on their archival database of tens of thousands of speeches, sermons, letters, and other documents by and about Martin Luther King, Jr. 

Known as OKRA (Online King Records Access), the database includes descriptive information as well as holdings details for these resources held at repositories all over the United States. 

In that first project, we conducted a major rebuild of their DB/TextWorks-based databases to make them more usable by staff and students at the Institute.

This year, we replaced the web-based search interface for this resource with one built on our Andornot Discovery Interface.

The new search interface is available at http://okra.stanford.edu and offers researchers features that will greatly help their work, such as:

  • type-ahead suggestions of names, places and topics as a user starts a search;
  • spelling corrections and search suggestions;
  • a sophisticated search engine that presents the most relevant results first (with an option to re-sort by title or date);
  • facets to easily refine a search by name, place, topic, date and other aspects of the data;
  • handy tools for saving and bookmarking records, emailing them, or sharing them on social media; and
  • an advanced search form for constructing highly specific searches, or for simply browsing all available names, topics, places and other key indexes of the data.

The new search engine adopts the same layout and design as the main King Institute website, for a seamless transition between the two.

Contact Andornot for data management and search solutions similar to this one.
