Solr atomic updates as told by the Ancient Mariner

by Peter Tyrrell Thursday, October 30, 2014 1:40 PM

I just have to share this voyage of discovery, because I have wallowed in the doldrums of despair and defeat the last couple of days, only finding the way this morning, in 15 minutes, after sleeping on it. Isn't that always the way?

My Scylla and Charybdis were a client's oral history master and tracks textbases. The master record becomes the primary document in Solr, while the tracks atomically update that document. We've done this before: each track contributes an audio file to the document's list of media. No problem, it's easy to append something new to a primary document.

However, each track also has its own subjects, names and places, depending on the contents of the audio track. These also need to be appended to the primary document. Easy, right? Well, no. It is easy to blindly append something, but you start getting repeats in the primary document. For instance, if the name 'Blackbeard' is in the metadata for 8 out of 10 tracks, the primary document ends up with name=Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard. You get the picture.

Okay, so let's look in the existing primary record to see if Blackbeard already... oh, wait. You can't get at the existing values while doing an atomic update. Hm.

Ah, we can 'remove' values matching Blackbeard, then 'add' Blackbeard. That should work. And it does. But what about multiple entries coming out of Inmagic like 'Blackbeard|Kidd, William'? Dang it: that string doesn't match anything, so neither name gets removed, and we're back to multiples of each name. We'll need to script a split on the pipe before remove/add.
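Here is a rough in-memory model of blind append versus remove-then-add. It is only a sketch: `applyAtomicUpdate` is a hypothetical stand-in written for illustration, not Solr code, and it assumes 'remove' strips every matching value before 'add' appends.

```javascript
// Hypothetical stand-in for how a multi-valued field reacts to atomic update
// modifiers. This is NOT Solr code; it assumes 'remove' drops every matching
// value and 'add' appends, with 'remove' applied first.
function applyAtomicUpdate(existing, update) {
    let values = existing.slice();
    if (update.remove !== undefined) {
        const doomed = [].concat(update.remove);
        values = values.filter((v) => !doomed.includes(v));
    }
    if (update.add !== undefined) {
        values = values.concat(update.add);
    }
    return values;
}

// Blind append: every track that mentions Blackbeard adds another copy.
let blind = [];
for (let track = 0; track < 3; track++) {
    blind = applyAtomicUpdate(blind, { add: 'Blackbeard' });
}

// Remove-then-add: the old copy is cleared before the new one arrives.
let clean = [];
for (let track = 0; track < 3; track++) {
    clean = applyAtomicUpdate(clean, { remove: 'Blackbeard', add: 'Blackbeard' });
}

console.log(blind.length, clean.length);
```

In this model the blind list grows with every track, while the remove-then-add list stays at a single entry no matter how many tracks mention the name.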

Split happening: great, great. Now 'Blackbeard' and 'Kidd, William' are going in nicely without duplication. Oh. But wait, what about when multiple textbase fields map to the same Solr field? For example, HistoricNeighbourhood and PlanningArea => place?
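In plain JavaScript, outside Solr, the result we're after takes only a few lines: split every entry on the pipe, pool the values from all the textbase fields that map to one Solr field, and drop repeats. The helper name `collectValues` and the sample values are made up for illustration; this shows the target shape, not code that runs inside the DataImportHandler.

```javascript
// Hypothetical helper: pools pipe-delimited entries from several source
// fields into one de-duplicated list. Plain JavaScript arrays only; it does
// not touch the Java objects DIH passes to a script transformer.
function collectValues(...sources) {
    const seen = new Set();
    for (const entries of sources) {
        for (const entry of entries) {
            for (const part of entry.split('|')) {
                const value = part.trim();
                if (value !== '') seen.add(value);
            }
        }
    }
    return Array.from(seen);
}

// Sample values are invented; the field names come from the example above.
const historicNeighbourhood = ['Old Town|Harbourside'];
const planningArea = ['Downtown', 'Old Town'];
const place = collectValues(historicNeighbourhood, planningArea);

console.log(place);
```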

And here the tempest begins. It's relatively simple to deal with multiple mappings, or multiple Inmagic entries. But not both. The reason is that the object representing all the possible values is now a Java ArrayList, which doesn't translate cleanly to any JavaScript type. You can't treat it like an array and deal with the values separately, nor can you treat it like a string and split it to create an array. You can't enumerate it, you can't cast it; it's a black box that is elusive beyond imagining.

Everything I tried, failed. It was dismal. It was all the more maddening because it seemed like it should have been such a simple thing. "Appearances can be deceiving!" shouted the universe, putting its boot-heel to my backside again and again.

Finally this morning, a combination of transformers (including regex) saved my bacon and I am eating the bacon and now I want to lie down for a while, under a blanket made of bacon.

The Technical

I'm using a RegexTransformer to do the splits, THEN a ScriptTransformer to remove-and-append.

In Solr DataImportHandler config XML:


<entity 
    name="atomic-xml"
    processor="XPathEntityProcessor"
    dataSource="atomic"
    stream="true"
    transformer="RegexTransformer,script:atomicTransform"
    useSolrAddSchema="true"
    url="${atomic.fileAbsolutePath}"
    xsl="xslt/dih.xsl"
>
    <!--
        The order of the transformers matters: regex split first, THEN script transform.
        Handles multiple entries plus multiple mappings. E.g.
        <field name="name_ignored">Kidd, William|Teach, Edward</field>
        <field name="name_ignored">Rackham, John</field>
    -->
    <field column="name_ignored" sourceColName="name_ignored" splitBy="\|" />
    <field column="place_ignored" sourceColName="place_ignored" splitBy="\|" />
    <field column="topic_ignored" sourceColName="topic_ignored" splitBy="\|" />

</entity>


In Solr DIH script transformer:


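// Namespace for the atomic-update helpers.
var atomic = {};

// Turn the values collected in <field>_ignored into an atomic update on <field>:
// 'remove' clears copies already on the primary document, 'add' appends them again.
atomic.appendTo = function (field, row) {

    // The values for this field, already split by the RegexTransformer.
    var val = row.get(field + '_ignored');
    if (val === null) return; // this track has nothing for the field

    var hash = new java.util.HashMap();
    hash.put('remove', val);
    hash.put('add', val);
    row.put(field, hash);

};

// Entry point named in transformer="...,script:atomicTransform".
var atomicTransform = function (row) {
    atomic.appendTo('name', row);
    atomic.appendTo('topic', row);
    atomic.appendTo('place', row);
    return row;
};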
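
For a feel of what the script hands back, here is a sketch that runs outside Solr. A JavaScript `Map` stands in for both the DIH row and `java.util.HashMap`, and the row contents, including the `id` value, are invented. It assumes the RegexTransformer has already split and pooled the values into one list.

```javascript
// Stand-ins: a Map plays both the DIH row and the java.util.HashMap, so this
// runs in any JavaScript engine. Inside Solr the real objects are Java ones.
function appendTo(field, row) {
    const val = row.get(field + '_ignored');
    if (val === undefined || val === null) return; // nothing mapped for this track
    // Same values under both modifiers: clear old copies, then append.
    row.set(field, new Map([['remove', val], ['add', val]]));
}

function atomicTransform(row) {
    ['name', 'topic', 'place'].forEach((field) => appendTo(field, row));
    return row;
}

// Invented row, shaped as if the RegexTransformer had already done the splits.
const row = new Map([
    ['id', 'oral-history-1'],
    ['name_ignored', ['Kidd, William', 'Teach, Edward', 'Rackham, John']],
]);
const result = atomicTransform(row);

// Print the modifier map for the name field as JSON.
console.log(JSON.stringify(Object.fromEntries(result.get('name'))));
```

Fields with no `_ignored` values on a given track are left alone, so a track without topics or places does not touch those fields on the primary document.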


Tags: Inmagic | javascript | Solr

Join Andornot at the SLA Western Canada Chapter's Year End Event on November 18

by Jonathan Jacobsen Friday, October 03, 2014 9:12 AM

Andornot is delighted to be sponsoring the SLA Western Canada Chapter's Year End Event, on November 18. 

This year's event features a special encore presentation of "Finding Those Who Don’t Want to Be Found Using Social Media and Other Cyber Tools," one of the most talked-about sessions from the SLA 2014 Annual Conference in Vancouver.

Julie Clegg, principal of a leading investigative agency, will discuss how to use social media and other resources to track and locate people who do not want to be readily found. Emphasis will be placed on resources, tips and techniques that you can add to your research toolkit.

Here's what SLA conference attendees had to say about the session on Twitter:

@annenb Session on Finding Those who Don’t Want to Be Found was awesome & terrifying from privacy perspective. Best session in years.

@leighmonty Julie Clegg on SM forensics was high value. Unaware about crimes committed in virtual environments / online assault previously.

@KnowledgeLinking Julie Clegg session had WOW factor for me because she did it all in real time, showing us how to use Soc Med to find people.

This event is open to all. Tickets are now available via PayPal. 

Unable to attend in person? No problem! Live webcasting of the keynote presentation is provided by Langara College.

Date: Tuesday, November 18, 2014

Location: Langara College, 100 West 49th Avenue, Vancouver BC (room C509)

Agenda: 

4:45-5:30pm: Optional tour of Langara Library and Learning Commons (convene at Circulation Desk)

5:30-6:00pm: Check-in, canapes, and no-host cash bar

6:00-7:30pm: Welcome from Chapter President. Keynote presentation by Julie Clegg.

7:30-8:30pm: Closing remarks, door prizes, and refreshments


For details and to register, please visit http://wcanada.sla.org/2014/10/01/western-canada-chapter-year-end-event/

Andornot is sponsoring this event, and will be there in person, as in past years. We hope to see all of you there.

Tags: events
