Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

by Peter Tyrrell Friday, June 23, 2017 1:21 PM

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnprior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer in the PDF and most of my efforts are directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf. Or, when the documents are pointed to with full paths like x:\path\to\document.pdf, but those paths have no meaning on the server where Solr resides. There are a variety of tricks which transform those file paths to something Solr can use, and I needn't get into them here. The problem I really want to talk about is the problem of scale.
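For the curious, here is a minimal sketch in Python of one such trick: treat the stored value as a Windows path, strip a known client-side prefix, and rejoin what is left under a folder the Solr server can read. The function name and its three parameters are hypothetical, made up for illustration; this is not the code we actually use.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_server_path(stored, client_root, server_root):
    """Map a file path stored in a database record to a path on the Solr server.

    All three arguments are illustrative assumptions: `stored` is the value
    from the record, `client_root` is the prefix that only makes sense on the
    client's network, and `server_root` is where the same files are visible
    to the server.
    """
    path = PureWindowsPath(stored)
    if path.is_absolute():
        # Full path like x:\path\to\document.pdf: drop the client-side prefix.
        # relative_to raises ValueError if the path is not under client_root.
        relative = path.relative_to(PureWindowsPath(client_root))
    else:
        # Relative path like where\is\this\document.pdf: use it as-is.
        relative = path
    return PurePosixPath(server_root).joinpath(*relative.parts)
```

Using the pure path classes means the mapping works the same whether the code runs on Windows or not, since nothing touches the file system.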

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here but assume the database does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index-time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. I set up a script that creates a PDF to text file mirror.

The script queries the database for PDF file paths, checks file paths for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. Whenever the script is run after the first time, it checks to see if a matching text file already exists for a PDF. If yes, the PDF is only processed if its modify date is newer than its text file doppelgänger. It may take days for the initial run to finish, but once it has, only additional or modified PDFs have to be processed on subsequent runs.
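The logic described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the PowerShell script itself: the function names, the .txt extension, and the `extract_text` callable (a stand-in for whatever tool pulls the text layer out of a PDF) are all hypothetical.

```python
from pathlib import Path


def mirror_path(pdf, pdf_root, text_root):
    """Return the text file that mirrors `pdf`: same folders, same name, .txt."""
    relative = Path(pdf).relative_to(pdf_root)
    return Path(text_root) / relative.with_suffix(".txt")


def needs_extraction(pdf, text_file):
    """True if the text file is missing or older than the PDF."""
    text_file = Path(text_file)
    if not text_file.exists():
        return True
    return Path(pdf).stat().st_mtime > text_file.stat().st_mtime


def refresh_mirror(pdf_paths, pdf_root, text_root, extract_text):
    """Extract text for new or modified PDFs only; return the PDFs processed.

    `extract_text` is a placeholder callable (PDF path -> str); the actual
    extraction tool is not specified here.
    """
    processed = []
    for pdf in map(Path, pdf_paths):
        if not pdf.is_file():
            continue  # invalid file path in the database: skip it
        target = mirror_path(pdf, pdf_root, text_root)
        if needs_extraction(pdf, target):
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(extract_text(pdf), encoding="utf-8")
            processed.append(pdf)
    return processed
```

The first call processes everything; later calls return an empty list unless a PDF has been added or its modify date has moved past that of its text file.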

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.

The script is for Windows, is written in PowerShell, and is available as a GitHub gist.

Tags: PowerShell | Solr | Tika

Shortening 2: Peter’s Flaky Pastry Recipe

by Peter Tyrrell Wednesday, September 21, 2016 9:51 AM

I use shortening in my pies, and they are reckoned to be very good, if I do say so myself. Here is my flaky pastry recipe.

3 cups all-purpose flour (400g, 14.4 oz)
0.5 cups unsalted butter (114g, 4 oz)
0.5 cups shortening (114g, 4 oz)
1 tbsp granulated sugar (15 ml)
1 tsp salt (5 ml)
1 cup water

1 beaten egg
1-2 tbsp sugar

Mix the dry ingredients. Cut the butter and shortening into acorn-sized lumps. Using a mixer, pastry knife or a pair of table knives, mix in the fat until the butter lumps are the size of small peas. You can hand-fondle any remaining lumps to size. Don’t overmix, as can occur when you use a mixer. If the dough has the consistency of breadcrumbs, you’ve gone too far. In fact, when using a mixer, I turn it off early and do the rest by hand. Just to be sure. Those little lumps of fat are going to create pockets in the pastry while in the oven, which is where the pastry’s flake comes from. If the butter and shortening are mixed too thoroughly into the flour, you’ll wind up with a dense, heavy pastry.

Add the water bit by bit while mixing. (A mixer is invaluable here.) Watch the dough carefully, because you may not need all the water. You want the dough moist enough to clump together, but not wet. How much water the pastry will want depends on the humidity, temperature, and probably the phase of the moon. Temperamental stuff, pastry. When I make pies at our summer cabin, I always need to add the full amount of water, but at home, never. And again, do not overmix.

Dump out the dough onto a floured surface and knead it gently by folding it over 5 or 6 times, just enough so it is holding together. Overmixing or too much kneading at this stage will lead to tough and chewy pastry, because you will have over-activated the gluten in the flour.

Divide the dough into two halves, wrap with cling film plastic, and put in the refrigerator for at least an hour. If you’re in a hurry and don’t have that much time, you probably shouldn’t have tried to make pies today.

Make your filling, and put that in the refrigerator too. Side note: whatever your filling, be sure to mitigate its moisture content with enough flour, cornstarch, chia seeds or what have you, and avoid adding excess liquid when ladling your filling into the pie. Too much liquid and your pie will come out of the oven with a soggy bottom.

When your dough has chilled long enough, haul out one half and roll it out on a floured surface to fit your pie pan. Ceramic pie pans are best because they conduct and evenly distribute heat super well. However, glass pans are fine, plus they allow you to check the bottom of the pie as it bakes, which is arguably more important when you are still getting used to a recipe. The dough should hang over the edge of the pie pan.

Add filling. As above, the less liquid the better. Put the uncovered pie in the fridge.

Roll out the second half of the dough on a floured surface and cover the filling, so that the dough hangs over the edge of the pie pan. You want enough so that you can pinch and roll the bottom and top dough together to create a seal, and that raised crust around the edge. Cut off any excess before your pinchrolling activity or you’ll end up with an uneven or overly thick crust.

I press my thumb into the crust to create a sort of scallop pattern. Do whatever you must, just make sure the crust seals the top and bottom together.

Beat an egg and brush it lightly onto the pie surface to create a lovely browning effect in the oven. Sprinkle sugar on the top also if you’re into that.

Cut some blowholes into the pie with a sharp knife so it can breathe while baking. Don’t do this and you can expect exploded pie guts all over your oven. I used to put fancy scrollwork into my pies for vents but now just stab them with XXXs.

Bake at 375 F (190 C) for about an hour. Check the pie after 50 minutes. When ready to come out, the pie should have brown highlights, and the bottom—if you can check through a glass pan—should be a golden brown. The filling will probably bubble out of the vents a bit. Don’t be afraid to keep baking for 10 or even 15 minutes past the hour if that’s what it needs. You’re more likely to underbake than overbake, in my experience.

Let cool, then serve it forth.

General Tip: Keep the ingredients cold, even going so far as to put them in the refrigerator or freezer before you begin. While you’re working, everything you don’t need immediately should go back in the refrigerator until you do. Even put ice cubes in your water. Really.

Tags: tips

IIS Application Pool Resurrection Script

by Peter Tyrrell Monday, May 25, 2015 10:45 AM

Overview

Default IIS application pool settings allow for no more than 5 uncaught exceptions within 5 minutes, and when this magic number is reached, the application pool shuts itself down. Uncaught exceptions are somewhat rare for us in the web applications we write because we have frameworks that catch and log errors. Some of our older web applications suffer from uncaught exceptions however, and so does Inmagic Webpublisher on servers where we host clients that use that software.

It used to be that text alerts would wake us up in the middle of the night screaming that sites dependent on Webpublisher were down, and we would remote in to the server to restart the relevant application pool. Well, that was pretty much untenable, so I wrote a script to restart the application pool automatically that would trigger when the application pool's shutdown was recorded in the Windows Application Event Log. A caveat here is that application pools usually shut themselves down for good reason - you shouldn't apply this script as a bandaid if you can fix the underlying causes.

Prerequisites

  • PowerShell v2 (get current version with $PSVersionTable.PSVersion).
  • PowerShell execution policy must allow the script to run (e.g. Set-ExecutionPolicy RemoteSigned or Set-ExecutionPolicy Unrestricted).

Install the Script

  1. Register a new Windows Application Event Log Source called 'AppPool Resurrector'. Do it manually or use my PowerShell script.
  2. Put the AppPoolResurrector.ps1 script somewhere on the server, and take note of the name of the application pool you want to monitor.
  3. Create a new task in Windows Task Scheduler once per application pool you want to monitor.
    1. Trigger is 'On an Event', Event ID: 1000, Source: Application Error, Log: Application
    2. Action is 'Start a program', Program/script: PowerShell, Add arguments: -command "& 'c:\path\to\apppoolresurrector.ps1' 'name-of-app-pool'"

Note the script activates to check whether the named application pool is still running, and then proceeds to restart it if necessary. There will be times it is activated by a log event to find that the application pool is fine, probably because the log event was unrelated to the application pool in the first place.
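The check-then-restart behaviour can be sketched like this. It is a Python illustration, not the PowerShell script, and it rests on some assumptions: that appcmd.exe lives in its default location, that its list output includes state:Started for a running pool, and that the function names (all hypothetical) suit your setup. Writing to the 'AppPool Resurrector' event log source is left out.

```python
import subprocess

# Default location of IIS's appcmd.exe; adjust if Windows lives elsewhere.
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"


def is_started(pool_name):
    """Ask appcmd to list the pool and look for 'state:Started' in its output."""
    result = subprocess.run(
        [APPCMD, "list", "apppool", pool_name],
        capture_output=True, text=True, check=False,
    )
    return "state:Started" in result.stdout


def start_pool(pool_name):
    """Ask appcmd to start the pool; raises if appcmd reports failure."""
    subprocess.run(
        [APPCMD, "start", "apppool", f"/apppool.name:{pool_name}"],
        check=True,
    )


def resurrect(pool_name, is_started=is_started, start_pool=start_pool):
    """Restart the pool only if it is not running; return True if restarted."""
    if is_started(pool_name):
        return False  # triggered by an unrelated log event; nothing to do
    start_pool(pool_name)
    return True
```

Passing the two helpers in as parameters keeps the decision separate from the IIS calls, so the logic can be exercised on a machine without IIS.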

Script Content

Beware NameValueCollection.ToString, and DontUsePercentUUrlEncoding

by Peter Tyrrell Monday, February 02, 2015 12:48 PM

The Quick

Tell ASP.NET never to use %UnicodeValue notation when URL encoding by putting the following appSetting in web.config:

<add key="aspnet:DontUsePercentUUrlEncoding" value="true" />


The Slow

Sometimes, such as when calling NameValueCollection.ToString(), the resulting string gets URL-encoded for you. However, URL encoding in .NET defaults to a %Unicode notation, which, according to MSDN's own warning now attached to the obsolete-as-of-4.5 HttpUtility.UrlEncodeUnicode() method, "produces non-standards-compliant output and has interoperability issues." Even so, if you are targeting .NET Framework 4.5 in your project, NameValueCollection.ToString() will still use that obsolete method, and you will get %u00XX style encoding in your URLs.

Telltale signs of "interoperability issues" include the word Français being encoded Fran%u00e7ais, which then blows up the search engine you lovingly built that runs on Java and Apache Solr.
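To see the difference between the two notations side by side, here is a small Python sketch. The `percent_u_encode` helper is hypothetical and only mimics the %u notation for non-ASCII characters; it is not how ASP.NET implements it. The standard library's `quote` shows the UTF-8 percent-encoding that other systems generally expect.

```python
from urllib.parse import quote


def percent_u_encode(text):
    """Mimic the non-standard %uXXXX notation for non-ASCII characters.

    Illustration only: ASCII characters are passed through untouched, which
    is a simplification of what ASP.NET actually does.
    """
    return "".join(
        ch if ord(ch) < 128 else "%u{:04x}".format(ord(ch)) for ch in text
    )


word = "Français"
print(percent_u_encode(word))  # Fran%u00e7ais
print(quote(word))             # Fran%C3%A7ais
```

A decoder that expects two hexadecimal digits after every % sign has no sensible way to read %u00e7, which is presumably why the first form causes trouble downstream.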

Tags: ASP.NET

Transformer order in Solr DataImportHandler

by Peter Tyrrell Wednesday, November 12, 2014 12:03 PM

It has taken me years to realize this, but the order in which transformer types are listed in a Solr DataImportHandler (DIH) entity takes precedence over the order in which transformations are written within the entity. It’s just counterintuitive to expect line 2 to act before line 1.

Mixing and matching transformer types can be fraught with peril if you don’t realize this, especially if you expect one transformer to work with the output of another type.

Me, I have pretty much avoided this pitfall in recent times by moving all transformations to a script transformer, but I still have to work with examples like the one above.

Tags: Solr
