Solr atomic updates as told by the Ancient Mariner

by Peter Tyrrell Thursday, October 30, 2014 1:40 PM

I just have to share this voyage of discovery, because I have wallowed in the doldrums of despair and defeat the last couple of days, only finding the way this morning, in 15 minutes, after sleeping on it. Isn't that always the way?

My Scylla and Charybdis were a client's oral history master and tracks textbases. The master record becomes the primary document in Solr, while the tracks atomically update that document. We've done this before: each track contributes an audio file to the document's list of media. No problem, it's easy to append something new to a primary document.

However, each track also has its own subjects, names and places, depending on the contents of the audio track. These also need to be appended to the primary document. Easy, right? Well, no. It is easy to blindly append something, but you start getting repeats in the primary document. For instance, if the name 'Blackbeard' is in the metadata for 8 out of 10 tracks, the primary document ends up with name=Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard. You get the picture.
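
To make that concrete, here is roughly the kind of atomic update each track ends up sending. The id and field name below are invented for illustration, not the client's actual schema; an 'add' operation simply appends whatever it is given:

<add>
    <!-- Illustrative only: made-up id, and 'name' stands in for the real multivalued field -->
    <doc>
        <field name="id">oralhistory-001</field>
        <field name="name" update="add">Blackbeard</field>
    </doc>
</add>

Eight tracks naming Blackbeard means eight adds, and eight Blackbeards in the primary document.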

Okay, so let's look in the existing primary record to see if Blackbeard already... oh, wait. You can't get at the existing values while doing an atomic update. Hm.

Ah, we can 'remove' values matching Blackbeard, then 'add' Blackbeard. That should work. And it does. But what about multiple entries coming out of Inmagic like 'Blackbeard|Kidd, William'? Dang it: that string doesn't match anything, so neither name gets removed, and we're back to multiples of each name. We'll need to script a split on the pipe before remove/add.
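
Before worrying about the split, the remove-then-add by itself looks something like this as an update document (id and field name invented for illustration; the real version is built inside the script transformer shown further down):

<add>
    <!-- Illustrative only: strip any existing copy of the value, then append it again -->
    <doc>
        <field name="id">oralhistory-001</field>
        <field name="name" update="remove">Blackbeard</field>
        <field name="name" update="add">Blackbeard</field>
    </doc>
</add>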

Split happening: great, great. Now 'Blackbeard' and 'Kidd, William' are going in nicely without duplication. Oh. But wait, what about when multiple textbase fields map to the same Solr field? For example, HistoricNeighbourhood and PlanningArea => place?

And here the tempest begins. It's relatively simple to deal with multiple mappings, or multiple Inmagic entries. But not both. The reason is that now the object representing all the possible values is a Java ArrayList, which doesn't translate perfectly to any JavaScript type. You can't treat it like an array and deal with the values separately, nor can you treat it like a string and split it to create an array. You can't enumerate it, you can't cast it: it's a black box, elusive beyond imagining.

Everything I tried, failed. It was dismal. It was all the more maddening because it seemed like it should have been such a simple thing. "Appearances can be deceiving!" shouted the universe, putting its boot-heel to my backside again and again.

Finally this morning, a combination of transformers (including regex) saved my bacon and I am eating the bacon and now I want to lie down for a while, under a blanket made of bacon.

The Technical

I'm using a RegexTransformer to do the splits, THEN a script transformer to remove-and-append.

In Solr DataImportHandler config XML:


<entity 
    name="atomic-xml"
    processor="XPathEntityProcessor"
    datasource="atomic"
    stream="true"
    transformer="RegexTransformer,script:atomicTransform"
    useSolrAddSchema="true"
    url="${atomic.fileAbsolutePath}"
    xsl="xslt/dih.xsl"
>
    <!--
        Sequential order of transformers important: regex split, THEN script transform.
        Handles multiple entries plus multiple mappings. E.g.
        <field name="name_ignored">Kyd, William|Teach, Edward</field>
        <field name="name_ignored">Rackham, John</field>
    -->
    <field column="name_ignored" sourceColName="name_ignored" splitBy="\|" />
    <field column="place_ignored" sourceColName="place_ignored" splitBy="\|" />
    <field column="topic_ignored" sourceColName="topic_ignored" splitBy="\|" />

</entity>


In Solr DIH script transformer:


var atomic = {};

// Wrap the value from the *_ignored source column in a remove-then-add map,
// so the atomic update strips any existing copies before appending them again.
atomic.appendTo = function (field, row) {

    var val = row.get(field + '_ignored');
    if (val === null) return;

    var hash = new java.util.HashMap();
    hash.put('remove', val);
    hash.put('add', val);
    row.put(field, hash);

};

// Entry point referenced by transformer="...,script:atomicTransform" in the entity config.
var atomicTransform = function (row) {
    atomic.appendTo('name', row);
    atomic.appendTo('topic', row);
    atomic.appendTo('place', row);
    return row;
};
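
Putting the two transformers together, the update that reaches Solr for a pipe-delimited value works out to something like this (ids and field names invented again; this shows the effect of the remove/add map rather than literal DIH output):

<add>
    <!-- Illustrative only: net effect for name_ignored = 'Kyd, William|Teach, Edward' -->
    <doc>
        <field name="id">oralhistory-001</field>
        <field name="name" update="remove">Kyd, William</field>
        <field name="name" update="remove">Teach, Edward</field>
        <field name="name" update="add">Kyd, William</field>
        <field name="name" update="add">Teach, Edward</field>
    </doc>
</add>

Because the RegexTransformer has already split the value by the time the script runs, the same HashMap wrapping covers single entries, pipe-delimited entries, and multiple textbase fields mapped to one Solr field, without piling up duplicates.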


Tags: Inmagic | javascript | Solr

How a Hospital Library and the Local Public Library Partnered to Share Patient Education Materials

by Kathy Bryce Wednesday, July 09, 2014 9:31 AM

If you search the Halifax, Nova Scotia public library catalogue for “physiotherapy”, the first record to appear is for an educational pamphlet on “Physiotherapy services in Nova Scotia”  with a link to view it online as a PDF.  Subsequent records in the search results are also patient education pamphlets covering such topics as a guide to going home after surgery, ankle injuries and shoulder-strengthening exercises.

[Images: Halifax Public Libraries and Capital Health]

The Health Sciences Library of Capital Health has recently partnered with Halifax Public Libraries to add hundreds of these hospital-produced patient education pamphlet records to the public library’s catalogue. The goal is to make locally produced current information about health promotion, medical conditions, diagnostic tests, and surgical procedures more accessible to the public. These materials are also freely available and searchable from the website of the Health Sciences Library of Capital Health.

The hospital uses Inmagic DB/TextWorks to maintain the pamphlet database in a non-MARC format. Lara Killian from the Health Sciences Library described the project at the recent CHLA conference in Montreal. Records are exported into MARC format from DB/TextWorks using a map created with the MARC Transformer available from Inmagic. These records are then massaged using the free MARCEdit software to create a file suitable for loading into the MARC-based AquaBrowser discovery software used by the Public Library. There were some challenges with the MARC formatting, such as the display of French diacritical marks. At the Public Libraries, Dave MacNeil worked with AquaBrowser to tweak the formatting of the search result display to ensure that when these pamphlets show up, the direct link to the free PDF is easily identifiable.

This new initiative launched in June 2014, with the goal of increasing visibility and usage of the pamphlets by adding this new public access point.

If you need help with a similar project, please contact us for assistance.

Andornot Newsletter – March 2014

by Kathy Bryce Thursday, March 27, 2014 8:13 PM

Please check out the latest issue of our newsletter.

In this Issue:
  • Archives Upgrades: The Ontario Jewish Archives, the Galt Museum and Archives, and The Elgin County Archives
  • Meet with Andornot in 2014: Our Conference Line-up
  • Inmagic News: DB/TextWorks and WebPublisher 14.5, Free Training Sessions
  • Tips and Tricks: Spring Cleanup for Inmagic Textbases
  • Tweets: Round-up of Library, Archive and Museum News

Please contact us for further information or to be added to our newsletter list.

Spring cleanup for your Inmagic databases. Part 4: Renaming fields

by Kathy Bryce Tuesday, March 25, 2014 9:26 AM

In the first post of this series we wrote about cleaning up the files associated with DB/TextWorks, and in the second we covered rationalizing your textbase elements. The third post discussed some steps you can take to protect and maintain your textbases in good health.

In this last part of our Spring cleanup series we will discuss renaming fields. This requires the most caution and forethought, but is also advisable to ensure that new users can understand your textbases. All too often we find clients who have maintained the same textbases for years and years and see no problem with fields named AU, TI etc. It’s pretty easy to guess that these stand for Author and Title in a library catalogue, but what about some of the other abbreviations that may date from much earlier versions of Inmagic, when there were limits on the field name length? We came across a client with an LCCN field, i.e. Library of Congress Control Number. A new non-library person started data entry and guessed that this field was an abbreviation for their shelf location, thus creating a horrendous mixture of entries. (We always recommend adding Automatic Date type fields for RecordCreated and RecordModified, which can make cleanup of this type of mistake a bit easier.)

Field names in the current version of DB/TextWorks have a 20 character limit, which is usually ample to describe the contents. We recommend not including any spaces, but visually separating words with caps or underscores, as in PublicationDate or Project_Number. If you have several databases with similar fields, you should consider giving them consistent names.

If you make changes to a field name, all DB/TextWorks query screens and form boxes simply use the new values and continue to function. Any box labels that were taken directly from the field names will, however, continue to show the old values.

As a precaution we always recommend making a backup or copy of the textbase before making any significant modifications.  Next, determine if there are any textbases linked to the one you wish to change. Linked fields in a Secondary textbase can be identified by viewing the Textbase Information under the Display tab, but the fields that are linked to are not shown in the primary textbase information, so you do need to understand if there are relationships between your textbases before renaming fields.

As mentioned in Part 2 on changing textbase elements, extra care must be taken if you have WebPublisher PRO, as query screens or canned searches will reference field names and will not update automatically if you edit these. Changing field names may also break forms or query screens with embedded scripts. Scripting capabilities were introduced in DB/TextWorks version 4, so pre-2001 textbases are not likely to include any. More recent textbases from Inmagic, such as those in the Library Module, and those provided by Andornot, will include some scripting.

If your textbases don’t have linked textbases, scripts or web access, then renaming fields can be straightforward and a great way to rationalize your textbase to make it easier for others to understand.

If you don’t feel comfortable doing this renaming cleanup yourself, contact us and we can help you on a consulting basis.

We hope you have enjoyed this four-part series on spring cleaning your databases – please let us know if there are other topics you would like us to cover!

Spring cleanup for your Inmagic databases. Part 3: Protecting and maintaining your textbases

by Kathy Bryce Monday, March 24, 2014 10:32 AM

In the first post of this series we wrote about cleaning up the files associated with DB/TextWorks and in the second we covered rationalizing your textbase elements.  In this post we’ll discuss some steps you can take to protect and maintain your textbases in good health.

Usually Inmagic DB/TextWorks textbases can function for many years without any intervention or problems. However, if you do ever see a “Stop: textbase is in an inconsistent state….” message, please do NOT keep working in it! We have had clients tell us that they just ignore that message, not realizing that the textbase might be corrupt. Frequently this message is just caused by a temporary loss of network connectivity while a record is being edited, and it can be fixed very quickly.

We recommend every so often running Check Textbase from Manage Textbases on a menu screen (i.e. without a textbase open). This will detect and repair problems in the textbase and your user file. The process generally takes just a few minutes for most textbases, but can take a while for very large ones. We suggest specifying Options to Repair Structural Problems and Rebuild 10 or more Damaged Indexes (depending on textbase size). If any problems are found, these will be listed in the .chk file with a recommendation for action. Running Check Textbase in this manner will clear the inconsistent state message if it was just caused by a network glitch.

As part of your regular maintenance we also recommend confirming that you have a backup routine for your textbases. We have heard some horror stories over the years. Two clients had fires, and two had floods in their buildings. One of these had no offsite backup and lost several years’ work. Another client had all their textbases deleted by an overzealous IT guy who didn’t know what they were and figured they weren’t important, and another client hit batch delete instead of batch modify! For our smaller clients without any IT support, a simple option is to copy your textbases to a USB stick and take it home with you.

The above information applies to the non-SQL version of DB/TextWorks. Clients with DB/Text for SQL versions should ensure their IT staff are aware of the recommendations in the Administrators Guide available from the Inmagic extranet.

For more information, check out the Help file built into DB/TextWorks, or the printable PDF for version 13. If you run Check Textbase and need help implementing the recommendations, please contact Inmagic Support if you have a maintenance contract, or we can help you on a consulting basis.
