Canadian Jewish Heritage Network Launches Enhanced Search and Mobile Interface

by Jonathan Jacobsen Monday, June 25, 2012 5:15 PM

In June 2011, Andornot helped launch a new website and database for the Canadian Jewish Heritage Network. This site uses the Umbraco CMS to provide information on Canadian Jewish history, and includes two databases of archival and genealogical information, powered by Inmagic WebPublisher PRO and DB/TextWorks.

This June we’ve helped upgrade the site with two new features: an enhanced, multi-database search, and a mobile interface.

Enhanced, Multi-Database Search

The enhanced search is available on the Explore page and uses the Apache Solr search engine to index and search the two separate Inmagic databases (which remain available on the site). Results from both databases are integrated and sorted by relevance, using the sophisticated algorithms in Solr to present the best matches to the user. This feature is particularly useful to genealogical researchers, as a “Did you mean” capability can alert them to misspellings or name variations. For example:

[Screenshot: CJHN OneSearch “Did you mean” results]

Mobile Interface

To access the mobile interface, simply open www.cjhn.ca in your mobile browser. You’ll be automatically redirected to the mobile view, with all the same content as the full site, formatted to fit the smaller screen of a mobile device.

The mobile view was created by extending the Umbraco CMS with additional templates and stylesheets. This allows CJHN to write content once, but display it in the format most suitable to the viewer.

Contact Andornot to upgrade your own web application to feature a mobile view or enhanced search options.

Pitfalls to avoid when importing XML files to Solr DataImportHandler

by Peter Tyrrell Tuesday, June 12, 2012 8:15 AM

Don’t send UTF-8 files with BOMs

Any UTF-8 encoded XML files you intend to import to Solr via the DataImportHandler (DIH) must not be saved with a leading BOM (byte order mark). The UTF-8 standard does not require a BOM, but many Windows apps (e.g. Notepad) include one anyway. (Byte order has no meaning in UTF-8; the standard permits the BOM but does not recommend its use.)

The BOM byte sequence is 0xEF,0xBB,0xBF. When the text is (incorrectly) interpreted as ISO-8859-1, it looks like this:

ï»¿

The Java XML interpreter used by Solr DIH does not want to see a BOM and chokes when it does. You might get an error like this:

ERROR:  'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Content is not allowed in prolog.'
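
If you can’t control how the files are produced, strip the BOM before the files reach Solr. Here is a minimal sketch in Python (not from the original post; the file name is hypothetical):

import codecs

def strip_bom(path):
    # Remove a leading UTF-8 BOM (0xEF,0xBB,0xBF) from a file, if present.
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        with open(path, "wb") as f:
            f.write(data[len(codecs.BOM_UTF8):])

strip_bom("records.xml")  # hypothetical import file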

Avoid out-of-memory problems caused by very large import files

Very large import files may lead to out-of-memory problems with Solr’s Java servlet container (we use Tomcat, currently). “Very large” is a judgment call, but anything over 30 MB is probably going to be trouble. It is possible to increase the amount of memory allocated to Tomcat, but that isn’t necessary if you can break the large import files into smaller ones. I force any tools I’m using to cap files at 1000 rows/records, which ends up around 2 MB in size with the kind of library and archives data we tend to deal with.
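
As a rough sketch of that kind of splitting, assuming the input is already in standard Solr update XML format (file names here are hypothetical):

import xml.etree.ElementTree as ET

CHUNK_SIZE = 1000  # records per output file

# Regroup the <doc> elements of an oversized <add> document into
# smaller <add> documents of at most CHUNK_SIZE records each.
docs = ET.parse("big-import.xml").getroot().findall("doc")
for i in range(0, len(docs), CHUNK_SIZE):
    add = ET.Element("add")
    add.extend(docs[i:i + CHUNK_SIZE])
    ET.ElementTree(add).write("import-%03d.xml" % (i // CHUNK_SIZE),
                              encoding="UTF-8", xml_declaration=True)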

Inmagic Genie 3.5 released

by Kathy Bryce Monday, June 11, 2012 2:25 PM

All clients with a current Inmagic maintenance subscription for Genie (part of the Inmagic Library Suite) should soon be receiving an email from Inmagic with the download information for version 3.5.

If you have a current maintenance subscription but do not receive a notification email within the next week, please email advantage@inmagic.com with your serial number and email address so it can be resent. Please also remember to let us know if your contact information has changed so we can update our records and pass this on to Inmagic.

Here are a few highlights of the Inmagic Genie v3.5 release. Please see the Genie v3.5 Readme and documentation for more detail:

  • Remember Search Criteria
  • Multiple Library Email Addresses Supported for Requests Sent from the Cart
  • View Book Covers on Summary or Detail Catalog Reports
  • Update Genie Configuration Files from a Remote Location

Please contact us if you would like assistance upgrading or would like to renew an expired maintenance subscription. We can also help you update your current interface to include the latest features available in the software itself, or with our add-on products.

Sample Solr DataImportHandler for XML Files

by Peter Tyrrell Thursday, June 07, 2012 11:51 AM

I spent a lot of time in trial and error getting the Solr DataImportHandler (DIH) set up and working the way I wanted, mainly due to a paucity of practical examples on the internet at large. So here is a short post on Solr DIH for XML, with a working sample, and may it save you many hours of nail-chewing.

This post assumes you are already somewhat familiar with Solr, but would like to know more about how to import XML data with the DataImportHandler.

DIH Overview

The DataImportHandler (DIH) is a mechanism for importing structured data from a data store into Solr. It is often used with relational databases, but can also handle XML with its XPathEntityProcessor. You can pass incoming XML to an XSL, as well as parse and transform the XML with built-in DIH transformers. You could translate your arbitrary XML to Solr's standard input XML format via XSL, or map/transform the arbitrary XML to the Solr schema fields right there in the DIH config file, or a combination of both. DIH is flexible.
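
For reference, Solr’s standard update XML wraps one or more <doc> elements in an <add> element, with one <field> per value. A minimal example (field names are illustrative only):

<add>
    <doc>
        <field name="id">rec-001</field>
        <field name="title_t">An example title</field>
        <field name="subject_t">First subject|Second subject</field>
    </doc>
</add>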

Sample 1: dih-config.xml with FileDataSource

Here's a sample dih-config.xml from an actual working site (no pseudo-samples here, my friend). Note that it picks up XML files from a local directory on the LAMP server. If you prefer to post XML files directly via HTTP, you would need to configure a ContentStreamDataSource instead.

It so happens that the incoming XML is already in standard Solr update XML format in this sample, and all the XSL does is remove empty field nodes; the real transforms, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume", and "ignored_seriesissue", are done with the DIH RegexTransformer. (The XSLT is applied first, and its output is then handed to the DIH transformers.) The attribute "useSolrAddSchema" tells DIH that the XML is already in standard Solr update format. If that were not the case, another attribute, "xpath", on the XPathEntityProcessor would be required to select content from the incoming XML document.

<dataConfig>
    <dataSource name="filesrc" encoding="UTF-8" type="FileDataSource" />
    <document>
        <!--
            Pickupdir fetches all files matching the filename regex in the supplied directory
            and passes them to other entities which parse the file contents. 
        -->
        <entity
            name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/xxx/import/"
            recursive="true"
            newerThan="${dataimporter.last_index_time}"
        >

        <!--
            The xml entity parses standard Solr update XML from each file found.
            Incoming values are split into multiple tokens when given a splitBy attribute;
            regex and template transformers then build composite fields from the ignored_* source columns.
        -->
        <entity 
            name="xml"
            processor="XPathEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer"
            datasource="pickupdir"
            stream="true"
            useSolrAddSchema="true"
            url="${pickupdir.fileAbsolutePath}"
            xsl="xslt/dih.xsl"
        >

            <field column="abstract_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="creator_facet" template="${xml.creator_t}" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
            <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
            <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
            <field column="ispartof_t" regex="\|" replaceWith=" " />
            <field column="language_t" splitBy="\|" />
            <field column="language_facet" template="${xml.language_t}" />
            <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
            <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
            <field column="location_display" regex="\|" replaceWith=" " />
            <field column="othertitles_display" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="responsibility_display" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
            <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
            <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
            <field column="src_facet" template="${xml.src}" />
            <field column="subject_t" splitBy="\|" />
            <field column="subject_facet" template="${xml.subject_t}" />
            <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
            <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
            <field column="title_sort" template="${xml.title_t}" />
            <field column="toc_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
            <field column="type_facet" template="${xml.type_t}" />
        </entity>
        </entity>
    </document>
</dataConfig>
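
The dih.xsl stylesheet referenced above isn’t reproduced in this post. Since all it does here is strip empty field nodes, a minimal version would be an identity transform plus one suppression rule, something like this sketch:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <!-- Identity transform: copy everything through unchanged -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Suppress field nodes that contain no text -->
    <xsl:template match="field[not(normalize-space())]"/>
</xsl:stylesheet>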

Sample 2: dih-config.xml with ContentStreamDataSource

This sample receives XML files posted directly to the DIH handler. Whereas Sample 1 can batch process any number of files, this one works on a single file per post.

<dataConfig>
    <dataSource name="streamsrc" encoding="UTF-8" type="ContentStreamDataSource" />
    
    <document>
        <!--
            Parses standard Solr update XML passed as stream from HTTP post.
            Strips empty nodes with dih.xsl, then applies transforms.
        -->
        <entity
            name="streamxml"
            datasource="streamsrc"
            processor="XPathEntityProcessor"
            rootEntity="true"
            transformer="RegexTransformer,DateFormatTransformer"
            useSolrAddSchema="true"
            xsl="xslt/dih.xsl"
        >
            <field column="contributor_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="date_t" splitBy="\|" />
            <field column="date_tdt" dateTimeFormat="M/d/yyyy h:m:s a" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="language_t" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="relation_t" splitBy="\|" />
            <field column="rights_t" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="subject_t" splitBy="\|" />
            <field column="title_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
        </entity>        
    </document>
</dataConfig>
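
To feed this handler, post the XML document as the body of the request. A sketch using Python and the requests library (the URL assumes the request handler registration shown below; the file name is hypothetical):

import requests

# Post a Solr update XML file to the DIH handler as a content stream.
with open("records.xml", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/update/dih",
        params={"command": "full-import", "clean": "false", "commit": "true"},
        headers={"Content-Type": "text/xml; charset=utf-8"},
        data=f,
    )
print(resp.status_code, resp.text)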

How to set up DIH

1. Ensure the DIH jars are referenced from solrconfig.xml, as they are not included in the Solr WAR file by default. One easy way is to create a lib folder in the Solr instance directory and put the DIH jars in it, since solrconfig.xml looks in the lib folder for references by default. Find the DIH jars in the apache-solr-x.x.x/dist folder of the Solr download package.

[Screenshot: the DIH jars copied from the dist folder]
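
Alternatively, solrconfig.xml can point at the dist folder directly with a lib directive, as the stock example solrconfig.xml does (adjust the relative path to your own layout):

<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />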

2. Create your dih-config.xml (as above) in the Solr "conf" directory.

3. Add a DIH request handler to solrconfig.xml if it's not there already.

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">dih-config.xml</str>
    </lst>
</requestHandler>

How to trigger DIH

There is a lot more info regarding full-import vs. delta-import, and whether to commit, optimize, etc., in the wiki description of Data Import Handler Commands, but the following would trigger the DIH operation without deleting the existing index first, and commit the changes after all the files had been processed. The sample given above would collect all the files found in the pickup directory, transform them, index them, and finally commit the update(s) to the index, making them searchable the instant the commit finished.

http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true
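
Import commands return immediately and run in the background; to check on a run’s progress, poll the same handler with the status command:

http://localhost:8983/solr/update/dih?command=status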
