How To Import Excel Data into VuFind

by Jonathan Jacobsen Tuesday, January 08, 2019 4:52 PM

Recently we had a new client come to us looking for help with several subscription-based VuFind sites they manage, and ultimately to have us host them as part of our managed hosting service. This client had a unique challenge for us: 3 million records, available as tab-separated text files of up to 70,000 records each.

Most of the data sets we work with are relatively small: libraries with a few thousand records, archives with a few tens of thousands, and every so often, databases of a few hundred thousand, like those in the Arctic Health bibliography.

While VuFind and the Apache Solr search engine that powers it (and also powers our Andornot Discovery Interface) have no trouble with that volume of records, transforming the data from hundreds of tab-separated text files into something Solr can use, in an efficient manner, was a pleasant challenge.

VuFind has excellent tools for importing traditional library MARC records, using the SolrMarc tool to post data to Solr. For other types data, such as records exported from DB/TextWorks databases, we’ve long used the PHP-based tools in VuFind that use XSLTs to transform XML into Solr's schema and post it to Solr. While this has worked well, XSLTs are especially difficult to debug, so we considered alternatives.

For this new project, we knew we needed to write some code to manipulate the 3 million records in tab-separated text files into XML, and we knew from our extensive experience with Solr that it's best to post small batches of records at a time, in separate files, rather than one large post of 3 million! So we wrote a python script to split up the source data into separate files of about 1,000 records each, and also remove invalid characters that had crept in to the data over time (this data set goes back decades and has likely been stored in many different character encodings on many different systems, so it's no surprise there were some gremlins).

Once the script was happily creating Solr-ready XML files, rather than use VuFind's PHP tools and an XSLT to index the data, it just seemed more straightforward to push the XML directly to Solr. For this, we wrote a bash shell script that uses the post tool that ships with Solr to iterate through the thousands of data files and push each to Solr, logging the results.

The combination of a python script to convert the tab-separated text files into Solr-ready XML and a bash script to push it to Solr worked extremely well for this project. Python is lightning fast at processing text and pushing data directly to Solr is definitely faster than invoking XSLT transformations.

This approach would work well for any data. Python is a very forgiving language to develop with, making it easy and quick to write scripts to process any data source. In fact, since this project, we've used Python to manipulate a FileMaker Pro database export for indexing in our Andornot Discovery Interface (also powered by Apache Solr) and to harvest data from the Internet Archive and Online Archive of California, for another Andornot Discovery Interface project (watch this blog for news of both when they launch).

We look forward to more challenges like this one! Contact us for help with your own VuFind, Solr and similar projects.

Java 11 date parsing? Locale, locale, locale.

by Peter Tyrrell Monday, January 07, 2019 11:39 AM

Java is undergoing some considerable licensing changes, prompting us to plan an all-out move from Oracle Java 8 to OpenJDK Java 11 this Spring for every Solr instance we host. I have been running covertly about the hills setting traps for Java 11.0.1 to see what I might snare before unleashing it on our live servers. I caught something this week.

Dates! Of course it's about parsing dates! I noticed that the Solr Data Import Handler (DIH) transforms didn't handle making created dates during ingest. (In DIH, we use a script transformer and manipulate some Java classes with javascript. This includes the parsing of dates from text.) Up until now, our DIH has used an older method of parsing dates with a Java class called SimpleDateFormat. If you look for info on parsing dates in Java, you will find years and years of advice related to that class and its foibles, and then you will notice that in recent times experts advise using the java.time classes introduced in Java 8. Since SimpleDateFormat didn't work during DIH, I assumed that SimpleDateFormat was deprecated in Java 11 (it isn't actually), and moved to convert the relevant DIH code to use java.time.

Many hours passed here, during which the output of two lines of code* made no goddamn sense at all. The javadocs that describe the behaviour of java.time classes are completely inadequate, with their stupid little "hello, world" examples, when dates are tricky, slippery, malicious dagger-worms of pure hatred. Long story short, a date like '2004-09-15 12:00:00 AM' produced by Inmagic ODBC from a DB/Textworks database could not be parsed. The parser choked on the string at "AM," even though my match pattern was correct: 'uuuu-MM-dd hh:mm:ss a'. Desperate to find the tiniest crack to exploit, I changed every variable I could think of one at a time. That was how I found that, when I switched to Java 8, the same exact code worked. Switch back to Java 11. Not working. Back to Java 8. Working. WTF?

I thought, maybe the Nashorn scripting engine that allows javascript to be interpreted inside the Java JVM is to blame, because this scenario does involve Java inside javascript inside Java, which is weird. So I set up a Java project with Visual Studio Code and Maven and wrote some unit tests in pure Java. (That was pretty fun. It was about the same effort as ordering a pizza in Italian when you don’t speak Italian: everything about the ordering process was tantalizingly familiar but different enough to delay my pizza for quite some time.) The problem remained: parsing worked as-written in Java 8, but not Java 11.

I started writing a Stack Overflow question. In so doing, I realized I hadn't tried an overload method of java.time.format.DateTimeFormatter.ofPattern() which takes a locale. I had already dotted many i's and crossed a thousand t's, but I wanted to really impress anyone reading the question that I had done my homework, because I hate looking ignorant, so I wrote another unit test that passed in Locale.ENGLISH and, ohmigawd, that solved the problem entirely. If you have been following along, that means that "AM/PM" could not be understood by the parser, even with the right pattern matcher, without the context of a locale, and obviously the default locale used by the simpler version of DateTimeFormatter.ofPattern() was inadequate to the task. I tested and Locale.ENGLISH and Locale.US both worked with "AM/PM" but Locale.CANADA did not. Likely the latter is my default locale, because I do reside in Canada. Really? Really, Java? We have AM and PM here in the Great White North, I assure you.

I don’t know if this a bug in Java 11. I’m merely happy to have understood the problem at this point. Just another day in the developer life, eh? Something that should be a snap becomes a grueling carnival ride that deposits you at the exit, white-faced and shaking, with an underwhelming sense of minor accomplishment. How do you explain to people that you spent 8 hours teaching a computer to treat an ordinary date as a date? Write a blog post, I guess. Winking smile

* Two lines of code. 8 hours of frustration. Here it is, ready?

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class App {

    public LocalDateTime Parse (String dateText, String pattern) {

        DateTimeFormatter parser = DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH);
        LocalDateTime date = LocalDateTime.parse(dateText, parser);
        return date;

    }
}

Restore Local Params in Solr 7.5

by Peter Tyrrell Thursday, December 06, 2018 12:13 PM

Local params are a way of modifying Solr's query parser within a query, setting all related parameters with a shorthand syntax. Super convenient for modifying query behaviour on the fly; local params are disabled by default in Solr 7.5, when employing the edismax query parser.

TL;DR

Modify the request handler in solrconfig.xml to add the so-called 'magic field' uf=_query_ back to the uf (User Fields) parameter to restore the kind of local params behaviour that was default prior to Solr 7.5.

<str name="uf">* _query_</str>

The above allows users to do fielded searches on any field, plus allows them to use local params in their queries.

Why Local Params? An Example

Say we are developers who want to use the MoreLikeThis feature of Solr. There are multiple ways of setting this up, as described in the Solr Reference Guide. But, say we are also developers who are using SolrNet to create requests to and handle responses from Solr. (As indeed is the case for use here at Andornot, in our Solr-backed Discovery Interface.)

One of SolrNet's strengths is that it maps Solr responses to strongly-typed classes. On the other hand, its weakness is that you can really only map query result documents to one strongly-typed class. (Not strictly true, but true from a practical, please-don't-make-me-do-contortions point of view.)

No response from Solr can deviate too far from that mapping. Other bits can be tacked on to the response and be handled by SolrNet (highlighting, spellcheck, facets, etc.), but these must be components that are somehow related to the context of the documents in the main response. In the case of MoreLikeThis, you have to set up the component so that each query result document generates a list of 'like' documents. Having to generate such a list for each document returned slows down the query response time and bloats the size of the response. Quite unnecessary, in my opionion. I much prefer to generate the list of 'like' documents on the fly when the user has asked for them. An easy way of doing that without messing with the SolrNet mapping setup is to use local parameters.

Say our user finds an intriguing book in their search results called "More Armadillos Stacked on a Bicycle". Perhaps, our user muses, this book is a sequel to a previous publication regarding such matters. They feel a thrill of anticipation as they click on a 'More Like This' link. (I know I would.)

{ 
   'id': 123, 
   'title': 'More Armadillos Stacked on a Bicycle', 
   'topic': [ 
      'armadillos', 
      'fruitless pursuits', 
      'bicycles' 
   ] 
}

When using local params, the 'More Like This' query can use the same Solr request handler and all the parameters embedded within it, but swap out the query parser for the MLTQParser. The bits that are needed to complete the MoreLikeThis request are passed in via local param syntax, still within the main query parameter! (Perhaps I did not need that exclamation mark, but the armadillos-upon-a-bicycle adrenaline has not yet worn off.)

/select?q={!mlt qf=title,topic mintf=1 mindf=1}123

The local params syntax above says "find other documents like id=123 where extracted keywords from its various fields find matches in title and topic." The convenient part for the developer using SolrNet is that the response maps neatly to the kind of response we expect from a regular query: a set of result documents mapped to a strongly-typed class, which makes the response easy to handle and display using existing mechanisms.

Why Not Local Params?

I suppose we can imagine a clever and malicious user who is able to use the power of local params to hack Solr queries in order to get at information that perhaps they otherwise shouldn't. If, as a developer, you need to ensure that users are limited in their scope, then disabling local params and even further locking down the uf (User Fields) parameter to deny certain fielded searches is right and good.

Adjusting Solr relevancy ranking for good metadata in the Andornot Discovery Interface

by Peter Tyrrell Thursday, January 18, 2018 4:00 PM

I learned an interesting lesson about Solr relevancy tuning due to a request from a client to improve their search results. A search for chest tube was ranking a record titled "Heimlich Valve" over a record titled "Understanding Chest Tube Management," and a search for diabetes put "Novolin-Pen Quick Guide" above "My Diabetes Toolkit Booklet," for example.

Solr was using the usual default AnDI (Andornot Discovery Interface) boosts, so what was going wrong?

Andi default boosts (pf is phrase matching):
qf=title^10 name^7 place^7 topic^7 text
pf= title^10 name^7 place^7 topic^7 text

The high-scoring records without terms in their titles had topic = "chest tube" or topic = "diabetes", yes, but so did the second-place records with the terms in their titles! Looking at the boosts, you would think that the total relevancy score would be a sum of (title score) plus (topic score) plus the others.

Well, you'd be wrong.

In Solr DisMax queries, the total relevancy score is not the sum of contributing field scores. Instead, the highest individual contributing field score takes precedence. It’s a winner-takes-all situation. Oh.

In the samples above, the boost on the incidence of “chest tube” or “diabetes” in the topic field was enough to overcome the title field's contribution, in the context of Solr’s TF-IDF scoring algorithm. I.e. it’s not just a matter of “the term is there” versus “the term is not there”, instead the score is proportional to the number of query terms the field contains and inversely proportional to the number of times those query terms appear across the whole collection of documents. Field and document length matters. Also whether the term appears nearer the front of the text.

So I could just ratchet up the boost on the title field and be done with it, right? Well, maybe.

As someone else* has said: DisMax is great for finding a needle in a haystack. It’s just not that good at searching for hay in a haystack.

The client’s collection has a small number of records, and the records themselves are quite short, consisting of a handful of highly focused metadata. The title and topic fields are pithy and the titles are particularly good at summarizing the “aboutness” of the record, so I focused on those aspects when re-arranging relevancy boosts.

New Solr field type: *_notf, a text field for title and topic that does not retain term frequencies or term positions. This means a term hit will not be correlated to term frequency in the field. It is not necessary to take term frequency into account in a title because the title’s “aboutness” isn’t related to the number of times a term appears in it. The logic of term frequency makes sense in the long text of an article, say, but not in the brief phrase that is a title. Or topic.

New Solr fields: title_notf, topic_notf

Updated boosts (pf is phrase matching):
qf=title_notf^10 topic_notf^7 text
pf=title^10 topic^7

Note that phrase matching still uses the original version of the title and topic fields, because they index term positions. Thus they can score higher when the terms chest and tube appear together as the phrase “chest tube”.

Also, I added a tie=1.0 parameter to the DisMax scoring, so that the total relevancy score of any given record will be the sum of contributing field scores, like I expected in the first place.

total score = max(field scores) + tie * sum(other field scores)

So, lesson learned. Probably. And the lesson has particular importance to me because the vast majority of our clients are libraries, archives or museums who spend time honing their metadata rather than relying on keyword search across masses of undifferentiated text. Must. Respect. Cataloguer.

Further Reading

Getting Dissed by Dismax – Why your incorrect assumptions about dismax/edismax are hurting search relevancy

Title Search: when relevancy is only skin deep

* Doug Turnbull, author of both articles above.

Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

by Peter Tyrrell Friday, June 23, 2017 1:21 PM

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnrpior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer in the PDF and most of my efforts are directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf. Or, when the documents are pointed to with full paths like x:\path\to\document.pdf, but those paths have no meaning on the server where Solr resides. There are a variety of tricks which transform those file paths to something Solr can use, and I needn't get into them here. The problem I really want to talk about is the problem of scale.

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here but assume the database does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index-time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. I set up a script that creates a PDF to text file mirror.

The script queries the database for PDF file paths, checks file paths for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. Whenever the script is run after the first time, it checks to see if a matching text file already exists for a PDF. If yes, the PDF is only processed if its modify date is newer than its text file doppelgänger. It may take days for the initial run to finish, but once it has, only additional or modified PDFs have to be processed on subsequent runs.

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.

The script is for Windows, is written in PowerShell, and is available as a Github gist.

Tags: PowerShell | Solr | Tika

Month List