Transformer order in Solr DataImportHandler

by Peter Tyrrell Wednesday, November 12, 2014 12:03 PM

It has taken me years to realize this, but the order in which transformer types are listed in the transformer attribute of a Solr DataImportHandler (DIH) entity takes precedence over the order in which transformations are written within the entity. It’s counterintuitive: you don’t expect line 2 to act before line 1.

Mixing and matching transformer types can be fraught with peril if you don’t realize this, especially if you expect one transformer to work with the output of another type.
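A toy model in JavaScript may make the pitfall concrete. This is not Solr code: the two transformer functions, the field definitions and the sample row are all invented for illustration, and the only behaviour borrowed from DIH is that each transformer in the list runs in turn over every field it recognizes.

```javascript
// Toy model of DIH transformer ordering (hypothetical, not Solr's implementation).
// Each "transformer" scans ALL field definitions and acts only on those it understands.
function regexTransformer(row, fields) {
  for (const f of fields) {
    if (f.regex !== undefined) {
      row[f.column] = String(row[f.sourceColName]).replace(new RegExp(f.regex), f.replaceWith);
    }
  }
  return row;
}

function templateTransformer(row, fields) {
  for (const f of fields) {
    if (f.template !== undefined) {
      // Replace ${name} with the current row value, or '' if it is not set yet.
      row[f.column] = f.template.replace(/\$\{(\w+)\}/g, (match, name) =>
        row[name] === undefined ? '' : String(row[name]));
    }
  }
  return row;
}

// Transformers run in listed order; the order of the field definitions is irrelevant.
function runEntity(transformers, fields, row) {
  return transformers.reduce((current, transform) => transform(current, fields), { ...row });
}

const fields = [
  { column: 'label', template: 'Item ${code}' },                           // written first
  { column: 'code', sourceColName: 'raw', regex: '^0+', replaceWith: '' }  // written second
];
const row = { raw: '00042' };

const regexFirst = runEntity([regexTransformer, templateTransformer], fields, row);
const templateFirst = runEntity([templateTransformer, regexTransformer], fields, row);

console.log(regexFirst.label);    // "Item 42": the second field was processed first
console.log(templateFirst.label); // "Item ": the template ran before code existed
```

The field written second feeds the field written first only when its transformer type comes first in the list.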

Me, I have pretty much avoided this pitfall in recent times by moving all transformations to a script transformer, but I still have to work with configurations that mix transformer types.

Tags: Solr

Make numbers behave when sorting alphanumerically in Solr

by Peter Tyrrell Monday, November 03, 2014 10:09 AM


Numbers mixed with alphabetic characters are sorted lexically in Solr. That means that 10 comes before 2, like this:

  • Title No. 1
  • Title No. 10
  • Title No. 100
  • Title No. 2


To force numbers to sort numerically, we need to left-pad any numbers with zeroes: 2 becomes 0002, 10 becomes 0010, 100 becomes 0100, et cetera. Then even a lexical sort will arrange values like this:

  • Title No. 1
  • Title No. 2
  • Title No. 10
  • Title No. 100

The Field Type

This alphanumeric sort field type converts any numbers found to 6 digits, padded with zeroes. (If you expect numbers larger than 6 digits in your field values, you will need to increase the number of zeroes when padding.)

The field type also removes English and French leading articles, lowercases, and purges any character that isn’t alphanumeric. It is English-centric, and assumes that diacritics have been folded into ASCII characters.

Sample output

Title No. 1 => titleno000001
Title No. 2 => titleno000002
Title No. 10 => titleno000010
Title No. 100 => titleno000100
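
The same normalisation can be sketched outside Solr in a few lines of JavaScript. This is only an approximation of the field type described above: the list of leading articles, the order of the steps and the function name alphaNumericSortKey are assumptions made for the example.

```javascript
// Sketch of the sort-key normalisation (not the Solr field type itself).
// Assumption: this article list is illustrative; input is already ASCII-folded.
const LEADING_ARTICLE = /^(?:(?:the|an|a|les|le|la|une|un|des)\s+|l')/;

function alphaNumericSortKey(value, width = 6) {
  return value
    .toLowerCase()
    .replace(LEADING_ARTICLE, '')                               // drop leading article
    .replace(/\d+/g, (digits) => digits.padStart(width, '0'))   // left-pad every number
    .replace(/[^a-z0-9]/g, '');                                 // purge non-alphanumerics
}

const titles = ['Title No. 10', 'Title No. 2', 'Title No. 100', 'Title No. 1'];

// A plain lexical comparison of the keys now yields numeric order.
const sorted = [...titles].sort((a, b) => {
  const ka = alphaNumericSortKey(a);
  const kb = alphaNumericSortKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0;
});

console.log(alphaNumericSortKey('Title No. 10')); // titleno000010
console.log(sorted); // [ 'Title No. 1', 'Title No. 2', 'Title No. 10', 'Title No. 100' ]
```

Padding happens before the purge so that separate numbers in one value ("No. 1 2") do not run together into a single number.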

Tags: Solr

Solr atomic updates as told by the Ancient Mariner

by Peter Tyrrell Thursday, October 30, 2014 1:40 PM

I just have to share this voyage of discovery, because I have wallowed in the doldrums of despair and defeat the last couple of days, only finding the way this morning, in 15 minutes, after sleeping on it. Isn't that always the way?

My Scylla and Charybdis were a client's oral history master and tracks textbases. The master record becomes the primary document in Solr, while the tracks atomically update that document. We've done this before: each track contributes an audio file to the document's list of media. No problem, it's easy to append something new to a primary document.

However, each track also has its own subjects, names and places, depending on the contents of the audio track. These also need to be appended to the primary document. Easy, right? Well, no. It is easy to blindly append something, but you start getting repeats in the primary document. For instance, if the name 'Blackbeard' is in the metadata for 8 out of 10 tracks, the primary document ends up with name=Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard. You get the picture.

Okay, so let's look in the existing primary record to see if Blackbeard already... oh, wait. You can't get at the existing values while doing an atomic update. Hm.

Ah, we can 'remove' values matching Blackbeard, then 'add' Blackbeard. That should work. And it does. But what about multiple entries coming out of Inmagic like 'Blackbeard|Kidd, William'? Dang it: that string doesn't match anything, so neither name gets removed, and we're back to multiples of each name. We'll need to script a split on the pipe before remove/add.
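
To see why the split matters, here is a toy model in JavaScript, with a plain array standing in for the multi-valued Solr field. The function applyRemoveThenAdd and the sample track data are made up for illustration, and the model assumes the remove is applied before the add.

```javascript
// Toy model (not Solr): a multi-valued field is just an array of strings.
// Assumption: 'remove' takes effect before 'add' within one update.
function applyRemoveThenAdd(existing, incoming) {
  const kept = existing.filter((value) => !incoming.includes(value)); // remove
  return kept.concat(incoming);                                       // add
}

// Hypothetical name entries from three tracks, pipe-delimited as exported.
const tracks = ['Blackbeard', 'Blackbeard|Kidd, William', 'Kidd, William'];

let names = [];
for (const raw of tracks) {
  // Split on the pipe first so each name can match an existing value.
  names = applyRemoveThenAdd(names, raw.split('|'));
}

console.log(names); // [ 'Blackbeard', 'Kidd, William' ]
```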

Split happening: great, great. Now 'Blackbeard' and 'Kidd, William' are going in nicely without duplication. Oh. But wait, what about when multiple textbase fields map to the same Solr field? For example, HistoricNeighbourhood and PlanningArea => place?

And here the tempest begins. It's relatively simple to deal with multiple mappings, or multiple Inmagic entries. But not both. The reason is that now the object representing all the possible values is a Java ArrayList, which doesn't translate perfectly to any javascript type. You can't treat it like an array and deal with the values separately, nor can you treat it like a string and split it to create an array. You can't enumerate it, you can't cast it, it's a black box that is elusive beyond imagining.

Everything I tried, failed. It was dismal. It was all the more maddening because it seemed like it should have been such a simple thing. "Appearances can be deceiving!" shouted the universe, putting its boot-heel to my backside again and again.

Finally this morning, a combination of transformers (including regex) saved my bacon and I am eating the bacon and now I want to lie down for a while, under a blanket made of bacon.

The Technical

I'm using a RegexTransformer to do the splits, THEN a script transformer to remove-and-append.

In Solr DataImportHandler config XML:


    <!--
        Sequential order of transformers important: regex split, THEN script transform.
        Handles multiple entries plus multiple mappings. E.g.
        <field name="name_ignored">Kyd, William|Teach, Edward</field>
        <field name="name_ignored">Rackham, John</field>
    -->
    <field column="name_ignored" sourceColName="name_ignored" splitBy="\|" />
    <field column="place_ignored" sourceColName="place_ignored" splitBy="\|" />
    <field column="topic_ignored" sourceColName="topic_ignored" splitBy="\|" />



In Solr DIH script transformer:


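var atomic = {};

// Remove-then-append: 'remove' clears existing copies of the incoming values,
// 'add' puts them back once, so repeats do not accumulate.
atomic.appendTo = function (field, row) {

    // Values have already been split on the pipe by the RegexTransformer.
    var val = row.get(field + '_ignored');
    if (val === null) return;

    var hash = new java.util.HashMap();
    hash.put('remove', val);
    hash.put('add', val);
    row.put(field, hash);
};

// Entry point for the DIH script transformer.
var atomicTransform = function (row) {
    atomic.appendTo('name', row);
    atomic.appendTo('topic', row);
    atomic.appendTo('place', row);
    return row;
};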

Tags: Inmagic | javascript | Solr

Advanced autocomplete with Solr Ngrams

by Peter Tyrrell Wednesday, July 03, 2013 3:11 PM


The following approach is a good one if you require:

  • phrase suggestions, not just words
  • the ability to match user input against multiple fields
  • multiple fields returned
  • multiple field values to make up a unique suggestion
  • suggestion results collapsed (grouped) on a field or fields
  • the ability to filter the query
  • images with suggestions

I needed a typeahead suggestion (autocomplete) solution for a textbox that searches titles. In my case, I have a lot of magazines that are broken down so that each page is a document in the Solr index, and has metadata that describes its parentage. For example, page 1 of Dungeon Magazine 100 has a title: "Dungeon 100"; a collection: "Dungeon Magazine"; and a universe: "Dungeons and Dragons". (Yes, all the material in my index is related to RPG in some way.) A magazine like this might consist of 70 pages or so, whereas a sourcebook like the Core Rulebook for Pathfinder, a D&D variant, boasts 578 pages, so title suggestions have to group on title and ignore counts. Further, the Warhammer 40k game Dark Heresy also has a Core Rulebook, so title suggestions have to differentiate between them.

To build this typeahead solution, I:

  • added new Solr field types to schema.xml to support ngram matching
  • added a /suggest handler to solrconfig.xml that weights matches appropriately
  • bound the suggestions in JSON format to Twitter's typeahead.js


Example 1: two core rulebooks.


Example 2: "dark" matching in Title and Collection



Add new field types to Solr schema.xml



For partial matches that will be boosted lower than exact or left-edge matches, e.g. match 'bro' in "A brown fox".




For left-edge matches, e.g. match 'A bro' but not 'brown' in "A brown fox".



For whole term matches. These will be weighted the highest.

These field types are taken lock, stock and barrel from another project. In that project, the suggest engine takes the form of an entirely separate core; I have simplified matters for myself. Great stuff, though.
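
The three kinds of match can be illustrated with a rough JavaScript sketch. It does not reproduce the Solr analyzers: the function names and the gram sizes are assumptions, chosen only so that the "A brown fox" examples above come out as described.

```javascript
// Rough illustration of the three matching styles (not the Solr analyzers).
// Assumption: gram sizes are arbitrary choices for this example.

// Partial matches: every substring of every word, within the size limits.
function ngrams(text, min, max) {
  const grams = new Set();
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    for (let n = min; n <= Math.min(max, word.length); n++) {
      for (let i = 0; i + n <= word.length; i++) {
        grams.add(word.slice(i, i + n));
      }
    }
  }
  return grams;
}

// Left-edge matches: prefixes of the whole phrase, spaces included.
function edgeNgrams(text, min, max) {
  const phrase = text.toLowerCase();
  const grams = new Set();
  for (let n = min; n <= Math.min(max, phrase.length); n++) {
    grams.add(phrase.slice(0, n));
  }
  return grams;
}

// Whole-term matches: complete words only.
function wholeTerms(text) {
  return new Set(text.toLowerCase().split(/\s+/).filter(Boolean));
}

const title = 'A brown fox';
console.log(ngrams(title, 2, 15).has('bro'));       // true
console.log(edgeNgrams(title, 1, 50).has('a bro')); // true
console.log(edgeNgrams(title, 1, 50).has('brown')); // false
console.log(wholeTerms(title).has('brown'));        // true
console.log(wholeTerms(title).has('bro'));          // false
```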


Make copies of relevant fields in Solr schema.xml

As noted above, the fields in play for me are title, collection, and universe. Note I am also making a string copy of each to group on.


Add /suggest request handler to solrconfig.xml

The /suggest handler looks for user input matches within the suggest fields defined in the qf parameter. Each field has a boost assigned: the higher the boost number, the more a match on that field will contribute to the final document score. I found I had to play around with the boost numbers relative to each other before getting the behaviour I really wanted. Boosting the whole-term text_suggest fields highest was not an automatic route to success. Your mileage may vary.

The pf parameter is additional to qf: it boosts documents in cases where user input terms appear in close proximity.
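
For illustration, here is roughly what such a set of parameters might look like if passed on the request instead of being set as handler defaults. Only the parameter names (defType, q, qf, pf, wt) are Solr's own; the field names, the boost numbers and the use of the edismax parser are assumptions for this sketch, not the values from my configuration.

```javascript
// Sketch: build query parameters for a /suggest request.
// Assumptions: field names and boosts are hypothetical; edismax is assumed.
function buildSuggestParams(userInput) {
  const params = new URLSearchParams();
  params.set('defType', 'edismax');
  params.set('q', userInput);
  // qf: fields to match against, each with its own boost.
  params.set('qf', 'title_suggest^30 title_suggest_edge^50 title_suggest_ngram^10');
  // pf: extra boost when the input terms appear close together.
  params.set('pf', 'title_suggest_edge^50');
  params.set('wt', 'json');
  return params;
}

const params = buildSuggestParams('dark her');
console.log('/suggest?' + params.toString());
```

Swapping the relative boost numbers in qf is the kind of tuning referred to above.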

Above, I mentioned that a Solr document in this index is equated with a single page from a book. If a book is 50 pages long, then a naive suggester is going to return 50 documents when that book's title is matched. The suggest handler avoids that problem by collapsing (grouping) on the fields in play, which explains why the universe field is referenced there, even though it's not being used to match query input. With grouping, a unique suggestion consists of universe+collection+title. Note that the group.sort and sort parameters differ. The former must produce valid groups, while the latter determines the order in which suggestions are displayed to the user.
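
The effect of collapsing can be shown with a small JavaScript model. It is not how Solr implements grouping; the function collapseSuggestions and the sample page documents (including their collection and universe values) are invented for the example.

```javascript
// Toy model of collapsing page documents into unique suggestions.
// Assumption: one suggestion per distinct universe + collection + title.
function collapseSuggestions(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const key = [doc.universe, doc.collection, doc.title].join('\u0000');
    if (!groups.has(key)) {
      groups.set(key, { universe: doc.universe, collection: doc.collection, title: doc.title });
    }
  }
  return [...groups.values()];
}

// Hypothetical page-level documents: two books that share a title.
const pages = [
  { universe: 'Pathfinder', collection: 'Pathfinder RPG', title: 'Core Rulebook', page: 1 },
  { universe: 'Pathfinder', collection: 'Pathfinder RPG', title: 'Core Rulebook', page: 2 },
  { universe: 'Pathfinder', collection: 'Pathfinder RPG', title: 'Core Rulebook', page: 3 },
  { universe: 'Warhammer 40k', collection: 'Dark Heresy', title: 'Core Rulebook', page: 1 },
  { universe: 'Warhammer 40k', collection: 'Dark Heresy', title: 'Core Rulebook', page: 2 }
];

const suggestions = collapseSuggestions(pages);
console.log(suggestions.length); // 2
console.log(suggestions.map((s) => s.collection + ': ' + s.title));
```

Five page documents collapse to two suggestions, and the shared title stays distinguishable because the other fields are part of the key.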


In a future post, I will describe how I bound the results from the /suggest handler to Twitter's typeahead.js on the front end to produce what is seen in the example screenshots above.

Tags: Solr

Inmagic DB/Text for SQL runs on SQL Server 2012

by Peter Tyrrell Wednesday, June 05, 2013 11:56 AM

Dbtext for SQL tests out okay on SQL Server 2012. My specific test environment was:

  • Client
    • Windows 8 x64
    • Inmagic Dbtext for SQL 13, SQL authentication
  • Server
    • Windows Server 2008 R2
    • SQL Server 2012 SP1 Standard (11.0.2100)
    • Mixed mode enabled
    • TCP/IP protocol enabled
    • Firewall enabled, TCP 1433 port open
