Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

by Peter Tyrrell Friday, June 23, 2017 1:21 PM

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnrpior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer in the PDF and most of my efforts are directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf. Or, when the documents are pointed to with full paths like x:\path\to\document.pdf, but those paths have no meaning on the server where Solr resides. There are a variety of tricks which transform those file paths to something Solr can use, and I needn't get into them here. The problem I really want to talk about is the problem of scale.

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here but assume the database does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index-time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. I set up a script that creates a PDF to text file mirror.

The script queries the database for PDF file paths, checks file paths for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. Whenever the script is run after the first time, it checks to see if a matching text file already exists for a PDF. If yes, the PDF is only processed if its modify date is newer than its text file doppelgänger. It may take days for the initial run to finish, but once it has, only additional or modified PDFs have to be processed on subsequent runs.

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.

The script is for Windows, is written in PowerShell, and is available as a Github gist.

Tags: PowerShell | Solr | Tika

Transformer order in Solr DataImportHandler

by Peter Tyrrell Wednesday, November 12, 2014 12:03 PM

It has taken me years to realize this, but the order in which transformer types are listed in a Solr DataImportHandler (DIH) entity takes precedence over the order in which transformations are written within the entity. It’s just counterintuitive to expect line 2 to act before line 1.

Mixing and matching transformer types can be fraught with peril if you don’t realize this, especially if you expect one transformer to work with the output of another type.

Me, I have pretty much avoided this pitfall in recent times by moving all transformations to a script transformer, but I still have to work with examples like the one above.

Tags: Solr

Make numbers behave when sorting alphanumerically in Solr

by Peter Tyrrell Monday, November 03, 2014 10:09 AM

Problem

Numbers mixed with alphabetic characters are sorted lexically in Solr. That means that 10 comes before 2, like this:

  • Title No. 1
  • Title No. 10
  • Title No. 100
  • Title No. 2

Solution

To force numbers to sort numerically, we need to left-pad any numbers with zeroes: 2 becomes 0002, 10 becomes 0010, 100 becomes 0100, et cetera. Then even a lexical sort will arrange values like this:

  • Title No. 1
  • Title No. 2
  • Title No. 10
  • Title No. 100

The Field Type

This alphanumeric sort field type converts any numbers found to 6 digits, padded with zeroes. (If you expect numbers larger than 6 digits in your field values, you will need to increase the number of zeroes when padding.)

The field type also removes English and French leading articles, lowercases, and purges any character that isn’t alphanumeric. It is English-centric, and assumes that diacritics have been folded into ASCII characters.

Sample output

Title No. 1 => titleno000001
Title No. 2 => titleno000002
Title No. 10 => titleno000010
Title No. 100 => titleno000100

Tags: Solr

Solr atomic updates as told by the Ancient Mariner

by Peter Tyrrell Thursday, October 30, 2014 1:40 PM

I just have to share this voyage of discovery, because I have wallowed in the doldrums of despair and defeat the last couple of days, only finding the way this morning, in 15 minutes, after sleeping on it. Isn't that always the way?

My Scylla and Charybdis were a client's oral history master and tracks textbases. The master record becomes the primary document in Solr, while the tracks atomically update that document. We've done this before: each track contributes an audio file to the document's list of media. No problem, it's easy to append something new to a primary document.

However, each track also has its own subjects, names and places, depending on the contents of the audio track. These also need to be appended to the primary document. Easy, right? Well, no. It is easy to blindly append something, but you start getting repeats in the primary document. For instance, if the name 'Blackbeard' is in the metadata for 8 out of 10 tracks, the primary document ends up with name=Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard. You get the picture.

Okay, so let's look in the existing primary record to see if Blackbeard already... oh, wait. You can't get at the existing values while doing an atomic update. Hm.

Ah, we can 'remove' values matching Blackbeard, then 'add' Blackbeard. That should work. And it does. But what about multiple entries coming out of Inmagic like 'Blackbeard|Kidd, William'? Dang it: that string doesn't match anything, so neither name gets removed, and we're back to multiples of each name. We'll need to script a split on the pipe before remove/add.

Split happening: great, great. Now 'Blackbeard' and 'Kidd, William' are going in nicely without duplication. Oh. But wait, what about when multiple textbase fields map to the same Solr field? For example, HistoricNeighbourhood and PlanningArea => place?

And here the tempest begins. It's relatively simple to deal with multiple mappings, or multiple Inmagic entries. But not both. The reason is that now the object representing all the possible values is a Java ArrayList, which doesn't translate perfectly to any javascript type. You can't treat it like an array and deal with the values separately, nor can you treat it like a string and split it to create an array. You can't enumerate it, you can't cast it, it's a black box that is elusive beyond imagining.

Everything I tried, failed. It was dismal. It was all the more maddening because it seemed like it should have been such a simple thing. "Appearances can be deceiving!" shouted the universe, putting its boot-heel to my backside again and again.

Finally this morning, a combination of transformers (including regex) saved my bacon and I am eating the bacon and now I want to lie down for a while, under a blanket made of bacon.

The Technical

I'm using a RegexTransformer to do the splits, THEN a script transformer to remove-and-append.

In Solr DataImportHandler config XML:

 

<entity 
    name="atomic-xml"
    processor="XPathEntityProcessor"
    datasource="atomic"
    stream="true"
    transformer="RegexTransformer,script:atomicTransform"
    useSolrAddSchema="true"
    url="${atomic.fileAbsolutePath}"
    xsl="xslt/dih.xsl"
>
    <!--
        Sequential order of transformers important: regex split, THEN script transform.
        Handles multiple entries plus multiple mappings. E.g.
        <field name="name_ignored">Kyd, William|Teach, Edward</field>
        <field name="name_ignored">Rackham, John</field>
    -->
    <field column="name_ignored" sourceColName="name_ignored" splitBy="\|" />
    <field column="place_ignored" sourceColName="place_ignored" splitBy="\|" />
    <field column="topic_ignored" sourceColName="topic_ignored" splitBy="\|" />

</entity>

 

In Solr DIH script transformer:

 

var atomic = {};

atomic.appendTo = function (field, row) {

    var val = row.get(field + '_ignored');
    if (val === null) return;

    var hash = new java.util.HashMap();
    hash.put('remove', val);
    hash.put('add', val);
    row.put(field, hash);

};

var atomicTransform = function (row) {
    atomic.appendTo('name', row);
    atomic.appendTo('topic', row);
    atomic.appendTo('place', row);    
    return row;
};

 

Tags: Inmagic | javascript | Solr

Advanced autocomplete with Solr Ngrams

by Peter Tyrrell Wednesday, July 03, 2013 3:11 PM

Overview

The following approach is a good one if you require:

  • phrase suggestions, not just words
  • the ability to match user input against multiple fields
  • multiple fields returned
  • multiple field values to make up a unique suggestion
  • suggestion results collapsed (grouped) on a field or fields
  • the ability to filter the query
  • images with suggestions


I needed a typeahead suggestion (autocomplete) solution for a textbox that searches titles. In my case, I have a lot of magazines that are broken down so that each page is a document in the Solr index, and has metadata that describes its parentage. For example, page 1 of Dungeon Magazine 100 has a title: "Dungeon 100"; a collection; "Dungeon Magazine"; and a universe: "Dungeons and Dragons". (Yes, all the material in my index is related to RPG in some way.) A magazine like this might consist of 70 pages or so, whereas a sourcebook like the Core Rulebook for Pathfinder, a D&D variant, boasts 578, so title suggestions have to group on title and ignore counts. Further, the Warhammer 40k game Dark Heresy also has a Core Rulebook, so title suggestions have to differentiate between them.

To build this typeahead solution, I:

  • added new Solr field types to schema.xml to support ngram matching
  • added a /suggest handler to solrconfig.xml that weights matches appropriately
  • bound the suggestions in JSON format to Twitter's typeahead.js

 

Example 1: two core rulebooks.

 

Example 2: "dark" matching in Title and Collection

 

 

Add new field types to Solr schema.xml

 

text_suggest_ngram

For partial matches that will be boosted lower than exact or left-edge matches, e.g. match 'bro' in "A brown fox".

 

 

text_suggest_edge

For left-edge matches, e.g. match 'A bro' but not 'brown' in "A brown fox".

 

text_suggest

For whole term matches. These will be weighted the highest.

These field types are taken lock, stock and barrel from https://github.com/cominvent/autocomplete. In that project, the suggest engine takes the form of an entirely separate core - I have simplified matters for myself. Great stuff, though.

 

Make copies of relevant fields in Solr schema.xml

As noted above, the fields in play for me are title, collection, and universe. Note I am also making a string copy of each to group on.

 

Add /suggest request handler to solrconfig.xml

The /suggest handler looks for user input matches within the suggest fields defined in the qf parameter. Each field has a boost assigned: the higher the boost number, the more a match on that field will contribute to the final document score. I found I had to play around with the boost numbers relative to each other before getting the behaviour I really wanted. Boosting the whole-term text_suggest fields highest was not an automatic route to success. Your mileage may vary.

The pf parameter is additional to qf: it boosts documents in cases where user input terms appear in close proximity.

Above, I mentioned that a Solr document in this index is equated with a single page from a book. If a book is 50 pages long, then a naive suggester is going to return 50 documents when that book's title is matched. The suggest handler avoids that problem by collapsing (grouping) on the fields in play, which explains why the universe field is referenced there, even though it's not being used to match query input. With grouping, a unique suggestion consists of universe+collection+title. Note that group.sort and sort parameters differ. The former must produce valid groups, while the latter determines order in which suggestions are displayed to the user.

Conclusion

In a future post, I will describe how I bound the results from the /suggest handler to Twitter's typeahead.js on the front end to produce what is seen in the examples seen in the screenshots above.

Tags: Solr

Month List