Adjusting Solr relevancy ranking for good metadata in the Andornot Discovery Interface

by Peter Tyrrell Thursday, January 18, 2018 4:00 PM

I learned an interesting lesson about Solr relevancy tuning when a client asked us to improve their search results. For example, a search for chest tube ranked a record titled "Heimlich Valve" above one titled "Understanding Chest Tube Management," and a search for diabetes put "Novolin-Pen Quick Guide" above "My Diabetes Toolkit Booklet."

Solr was using the usual default AnDI (Andornot Discovery Interface) boosts, so what was going wrong?

AnDI default boosts (pf is phrase matching):
qf=title^10 name^7 place^7 topic^7 text
pf=title^10 name^7 place^7 topic^7 text

The high-scoring records without terms in their titles had topic = "chest tube" or topic = "diabetes", yes, but so did the second-place records with the terms in their titles! Looking at the boosts, you would think that the total relevancy score would be a sum of (title score) plus (topic score) plus the others.

Well, you'd be wrong.

In Solr DisMax queries, the total relevancy score is not the sum of contributing field scores. Instead, the highest individual contributing field score takes precedence. It’s a winner-takes-all situation. Oh.

In the samples above, the boost on the incidence of "chest tube" or "diabetes" in the topic field was enough to overcome the title field's contribution, in the context of Solr's TF-IDF scoring algorithm. That is, it's not just a matter of "the term is there" versus "the term is not there": a field's score is proportional to how often the query terms appear in that field, and inversely proportional to how common those terms are across the whole collection of documents. Field and document length matter too.
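For the curious, here is a rough sketch of Lucene's classic TF-IDF similarity (Solr's default scorer at the time), omitting query-level normalization and boosts:

field score = sum over query terms t of: tf(t) * idf(t)^2 * fieldNorm
tf(t) = sqrt(number of times t appears in the field)
idf(t) = 1 + log(numDocs / (docFreq(t) + 1))
fieldNorm ≈ 1 / sqrt(number of terms in the field)

That fieldNorm factor shrinks as the field gets longer, which is why short, pithy fields can punch above their weight.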

So I could just ratchet up the boost on the title field and be done with it, right? Well, maybe.

As someone else* has said: DisMax is great for finding a needle in a haystack. It’s just not that good at searching for hay in a haystack.

The client’s collection has a small number of records, and the records themselves are quite short, consisting of a handful of highly focused metadata. The title and topic fields are pithy and the titles are particularly good at summarizing the “aboutness” of the record, so I focused on those aspects when re-arranging relevancy boosts.

New Solr field type: *_notf, a text field for title and topic that does not retain term frequencies or term positions. This means a term hit will not be correlated to term frequency in the field. It is not necessary to take term frequency into account in a title because the title’s “aboutness” isn’t related to the number of times a term appears in it. The logic of term frequency makes sense in the long text of an article, say, but not in the brief phrase that is a title. Or topic.

New Solr fields: title_notf, topic_notf
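A sketch of what this might look like in schema.xml. The analyzer chain here is illustrative rather than our exact production configuration; the key attribute is omitTermFreqAndPositions:

<fieldType name="text_notf" class="solr.TextField" omitTermFreqAndPositions="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<dynamicField name="*_notf" type="text_notf" indexed="true" stored="false"/>

<copyField source="title" dest="title_notf"/>
<copyField source="topic" dest="topic_notf"/>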

Updated boosts (pf is phrase matching):
qf=title_notf^10 topic_notf^7 text
pf=title^10 topic^7

Note that phrase matching still uses the original version of the title and topic fields, because they index term positions. Thus they can score higher when the terms chest and tube appear together as the phrase “chest tube”.

Also, I added a tie=1.0 parameter to the DisMax scoring, so that the total relevancy score of any given record will be the sum of contributing field scores, like I expected in the first place.

total score = max(field scores) + tie * sum(other field scores)
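Putting it all together, the parameters might be set as eDisMax defaults in solrconfig.xml along these lines (the handler name is illustrative):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_notf^10 topic_notf^7 text</str>
    <str name="pf">title^10 topic^7</str>
    <str name="tie">1.0</str>
  </lst>
</requestHandler>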

So, lesson learned. Probably. And the lesson has particular importance to me because the vast majority of our clients are libraries, archives or museums who spend time honing their metadata rather than relying on keyword search across masses of undifferentiated text. Must. Respect. Cataloguer.

Further Reading

Getting Dissed by Dismax – Why your incorrect assumptions about dismax/edismax are hurting search relevancy

Title Search: when relevancy is only skin deep

* Doug Turnbull, author of both articles above.

Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

by Peter Tyrrell Friday, June 23, 2017 1:21 PM

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnprior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer in the PDF and most of my efforts are directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf. Or, when the documents are pointed to with full paths like x:\path\to\document.pdf, but those paths have no meaning on the server where Solr resides. There are a variety of tricks which transform those file paths to something Solr can use, and I needn't get into them here. The problem I really want to talk about is the problem of scale.

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here; assume it does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index-time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. I set up a script that creates a PDF to text file mirror.

The script queries the database for PDF file paths, checks file paths for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. Whenever the script is run after the first time, it checks to see if a matching text file already exists for a PDF. If yes, the PDF is only processed if its modify date is newer than its text file doppelgänger. It may take days for the initial run to finish, but once it has, only additional or modified PDFs have to be processed on subsequent runs.
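A simplified sketch of that mirroring approach in PowerShell (not the actual gist linked below; this version walks a folder tree rather than querying a database for paths, and assumes a pdftotext.exe utility such as Xpdf's or Poppler's is on the PATH, with placeholder folder paths):

$sourceRoot = 'X:\pdfs'
$mirrorRoot = 'X:\pdf-text-mirror'

Get-ChildItem -Path $sourceRoot -Filter *.pdf -Recurse | ForEach-Object {
    # Mirror the folder hierarchy, swapping .pdf for .txt
    $relative = $_.FullName.Substring($sourceRoot.Length).TrimStart('\')
    $txtPath = Join-Path $mirrorRoot ([System.IO.Path]::ChangeExtension($relative, '.txt'))

    $txtDir = Split-Path $txtPath -Parent
    if (-not (Test-Path $txtDir)) { New-Item -ItemType Directory -Path $txtDir | Out-Null }

    # Only process new or modified PDFs; skip those whose text mirror is up to date
    if ((Test-Path $txtPath) -and ((Get-Item $txtPath).LastWriteTime -ge $_.LastWriteTime)) { return }

    # Extract the text layer
    & pdftotext.exe $_.FullName $txtPath
}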

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.

The script is for Windows, is written in PowerShell, and is available as a Github gist.

Tags: PowerShell | Solr | Tika

Shortening 2: Peter’s Flaky Pastry Recipe

by Peter Tyrrell Wednesday, September 21, 2016 9:51 AM

I use shortening in my pies, and they are reckoned to be very good, if I do say so myself. Here is my flaky pastry recipe.

3 cups all-purpose flour (400g, 14.4 oz)
0.5 cups unsalted butter (114g, 4 oz)
0.5 cups shortening (114g, 4 oz)
1 tbsp granulated sugar (15g)
1 tsp salt (5g)
1 cup water

1 beaten egg
1-2 tbsp sugar

Mix the dry ingredients. Cut the butter and shortening into acorn sized lumps. Using a mixer, pastry knife or a pair of table knives, mix in the fat until the butter lumps are the size of small peas. You can hand-fondle any remaining lumps to size. Don’t overmix, as can occur when you use a mixer. If the dough has the consistency of breadcrumbs, you’ve gone too far. In fact, when using a mixer, I turn it off early and do the rest by hand. Just to be sure. Those little lumps of fat are going to create pockets in the pastry while in the oven, which is where the pastry’s flake comes from. If the butter and shortening are mixed too thoroughly into the flour, you’ll wind up with a dense, heavy pastry.

Add the water bit by bit while mixing. (A mixer is invaluable here.) Watch the dough carefully, because you may not need all the water. You want the dough moist enough to clump together, but not wet. How much water the pastry will want depends on the humidity, temperature, and probably the phase of the moon. Temperamental stuff, pastry. When I make pies at our summer cabin, I always need to add the full amount of water, but at home, never. And again, do not overmix.

Dump out the dough onto a floured surface and knead it gently by folding it over 5 or 6 times, just enough so it is holding together. Overmixing or too much kneading at this stage will lead to tough and chewy pastry, because you will have over-activated the gluten in the flour.

Divide the dough into two halves, wrap with cling film plastic, and put in the refrigerator for at least an hour. If you’re in a hurry and don’t have that much time, you probably shouldn’t have tried to make pies today.

Make your filling, and put that in the refrigerator too. Side note: whatever your filling, be sure to mitigate its moisture content with enough flour, cornstarch, chia seeds or what have you, and avoid adding excess liquid when ladling your filling into the pie. Too much liquid and your pie will come out of the oven with a soggy bottom.

When your dough has chilled long enough, haul out one half and roll it out on a floured surface to fit your pie pan. Ceramic pie pans are best because they conduct and evenly distribute heat super well. However, glass pans are fine, plus they allow you to check the bottom of the pie as it bakes, which is arguably more important when you are still getting used to a recipe. The dough should hang over the edge of the pie pan.

Add filling. As above, the less liquid the better. Put the uncovered pie in the fridge.

Roll out the second half of the dough on a floured surface and cover the filling, so that the dough hangs over the edge of the pie pan. You want enough so that you can pinch and roll the bottom and top dough together to create a seal, and that raised crust around the edge. Cut off any excess before your pinchrolling activity or you’ll end up with an uneven or overly thick crust.

I press my thumb into the crust to create a sort of scallop pattern. Do whatever you must, just make sure the crust seals the top and bottom together.

Beat an egg and brush it lightly onto the pie surface to create a lovely browning effect in the oven. Sprinkle sugar on the top also if you’re into that.

Cut some blowholes into the pie with a sharp knife so it can breathe while baking. Don’t do this and you can expect exploded pie guts all over your oven. I used to put fancy scrollwork into my pies for vents but now just stab them with XXXs.

Bake at 375 F (190 C) for about an hour. Check the pie after 50 minutes. When ready to come out, the pie should have brown highlights, and the bottom—if you can check through a glass pan—should be a golden brown. The filling will probably bubble out of the vents a bit. Don’t be afraid to keep baking for 10 or even 15 minutes past the hour if that’s what it needs. You’re more likely to underbake than overbake, in my experience.

Let cool, then serve it forth.

General Tip: Keep the ingredients cold, even going so far as to put them in the refrigerator or freezer before you begin. While you’re working, everything you don’t need immediately should go back in the refrigerator until you do. Even put ice cubes in your water. Really.

Tags: tips

IIS Application Pool Resurrection Script

by Peter Tyrrell Monday, May 25, 2015 10:45 AM

Overview

Default IIS application pool settings (the Rapid-Fail Protection feature) allow no more than 5 worker process failures, such as crashes from uncaught exceptions, within 5 minutes, and when this magic number is reached, the application pool shuts itself down. Uncaught exceptions are somewhat rare for us in the web applications we write because we have frameworks that catch and log errors. Some of our older web applications suffer from uncaught exceptions however, and so does Inmagic Webpublisher on servers where we host clients that use that software.
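You can inspect those limits per application pool with the IIS PowerShell provider. A read-only peek, assuming the WebAdministration module that ships with IIS ('DefaultAppPool' is a placeholder):

Import-Module WebAdministration
# Rapid-Fail Protection settings for one pool
Get-ItemProperty 'IIS:\AppPools\DefaultAppPool' -Name failure |
    Select-Object rapidFailProtection, rapidFailProtectionMaxCrashes, rapidFailProtectionInterval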

It used to be that text alerts would wake us up in the middle of the night screaming that sites dependent on Webpublisher were down, and we would remote in to the server to restart the relevant application pool. Well, that was pretty much untenable, so I wrote a script to restart the application pool automatically that would trigger when the application pool's shutdown was recorded in the Windows Application Event Log. A caveat here is that application pools usually shut themselves down for good reason - you shouldn't apply this script as a bandaid if you can fix the underlying causes.

Prerequisites

  • PowerShell v2 (get current version with $PSVersionTable.PSVersion).
  • PowerShell execution policy must allow the script to run (i.e. Set-ExecutionPolicy RemoteSigned or Set-ExecutionPolicy Unrestricted).

Install the Script

  1. Register a new Windows Application Event Log Source called 'AppPool Resurrector'. Do it manually or use my PowerShell script (a sample one-liner appears after this list).
  2. Put the AppPoolResurrector.ps1 script somewhere on the server, and take note of the name of the application pool you want to monitor.
  3. Create a new task in Windows Task Scheduler for each application pool you want to monitor:
    1. Trigger is 'On an Event'. Event ID: 1000, Source: Application Error, Log: Application
    2. Action is 'Start a program'. Program/script: PowerShell, Add arguments: -command "& 'c:\path\to\apppoolresurrector.ps1' 'name-of-app-pool'"
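For step 1, the event log source can be registered with a single cmdlet from an elevated PowerShell prompt (a sketch, not necessarily the author's registration script):

New-EventLog -LogName Application -Source 'AppPool Resurrector'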

Note that when triggered, the script checks whether the named application pool is still running, and restarts it only if necessary. There will be times it is triggered by a log event only to find that the application pool is fine, probably because the event was unrelated to the application pool in the first place.

Script Content
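A minimal sketch of what AppPoolResurrector.ps1 might look like (not necessarily the original; assumes the WebAdministration module and the event log source registered above):

param([Parameter(Mandatory = $true)][string]$AppPoolName)

Import-Module WebAdministration

# Only act if the named application pool has actually stopped
if ((Get-WebAppPoolState -Name $AppPoolName).Value -eq 'Stopped') {
    Start-WebAppPool -Name $AppPoolName
    # Leave a trail in the Application log under the source from step 1
    Write-EventLog -LogName Application -Source 'AppPool Resurrector' `
        -EventId 1 -EntryType Information `
        -Message "Restarted application pool '$AppPoolName'."
}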

Beware NameValueCollection.ToString, and DontUsePercentUUrlEncoding

by Peter Tyrrell Monday, February 02, 2015 12:48 PM

The Quick

Tell ASP.NET never to use %UnicodeValue notation when URL encoding by putting the following appSetting in web.config:

<add key="aspnet:DontUsePercentUUrlEncoding" value="true" />


The Slow

Sometimes .NET URL encodes values for you, such as when you call ToString() on the NameValueCollection returned by HttpUtility.ParseQueryString(). However, URL encoding in .NET defaults to a %Unicode notation which, according to MSDN's own warning attached to the obsolete-as-of-4.5 HttpUtility.UrlEncodeUnicode() method, "produces non-standards-compliant output and has interoperability issues." Therefore, even if you are targeting .NET Framework 4.5 in your project, NameValueCollection.ToString() will still use that obsolete method, and you will get %u00XX style encoding in your URLs.

Telltale signs of "interoperability issues" include the word Français being encoded Fran%u00e7ais, which then blows up the search engine you lovingly built that runs on Java and Apache Solr.
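The behavior is easy to reproduce from PowerShell, which runs on the .NET Framework (a sketch; the output shown assumes the appSetting above is not applied):

Add-Type -AssemblyName System.Web

# ParseQueryString returns a NameValueCollection whose ToString() URL encodes
$qs = [System.Web.HttpUtility]::ParseQueryString('')
$qs['word'] = 'Français'
$qs.ToString()   # word=Fran%u00e7ais under the legacy %Unicode encoder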

Tags: ASP.NET
