Installing Inmagic DB/Textworks v13 on Windows 8 x64 RTM

by Peter Tyrrell Tuesday, September 11, 2012 3:00 PM

At time of writing (Sep 11 2012) Inmagic has not yet released info about DB/Textworks on Windows 8. Fortunately, I am willing to throw myself onto the grenade of a brand new operating system in the name of fun and adventure. So the short story is: you can do it, though I personally am still picking shrapnel out of my tender bits. Here’s what you need to know.

DB/TextWorks (can I just refer to it as “dbext” from here on in? thanks) has a dependency on the Microsoft Visual C++ 2008 Redistributable Package x86, specifically Microsoft.VC90.MFC version 9.0.21022.8. You will not find that DLL on a new install of Windows 8. Although the install process may go okay, when you attempt to launch dbtext32.exe, you’ll see an error like the following:




A look at the application event log garners an Event ID 33 SideBySide error:




And there you have it: a missing assembly. It so happens that the Webpublisher installer ensures this assembly is present, and goes and gets it if necessary. But not the dbtext installer. So one way to fix the problem, as it turns out, is to run the Webpublisher installer on the machine, if you happen to have it available. Or, you can download the redistributable and install it manually. Either way, you can get dbtext up and running on Windows 8. Now follows the short and snappy version.


Quick Fix

  1. Manually install the C++ 2008 x86 package at, before or after the dbtext install.

Canadian Conservation Institute Launches Andornot Discovery Interface

by Peter Tyrrell Tuesday, July 10, 2012 12:43 PM

The Canadian Conservation Institute (CCI) in Ottawa, a long-time Andornot client, required a new version of their bilingual online catalogue and staff bibliography that would pass the strict requirements of W3C’s Web Content Accessibility Guidelines (WCAG). Andornot helped CCI boost the requirement into an opportunity to add new features, including facets, multi-database search, spelling suggestions, and faster search performance.

The CCI Library has one of the largest conservation and museology collections in the world. The collections are regarded as an important source for conservation and museology literature on a wide variety of topics, such as preventive conservation, industrial collections, architectural heritage, fire and safety protection, museum planning, archaeological conservation, preservation in storage and display, exhibition design, disaster preparedness, and museum education. The holdings include a large selection of books on textiles, furniture, paintings, sculptures, prints and drawings, and archaeological and ethnological objects.

-- "CCI Library". Canadian Conservation Institute. Retrieved 4 July 2012.

cci-results-facetedThe upgraded website uses the Andornot Discovery Interface (AnDI for short), a modern and highly configurable web application that tempers cutting-edge open source search technology with many years of Andornot experience in search-focused design.

It was possible to meet WCAG compliance because AnDI provides complete control over every HTML tag and CSS statement. The HTML5 structure presents a clean cross-browser template that reads well on mobile devices and has backwards-compatible support for older browsers.

The CCI Library's French and English versions were created with AnDI's built-in multilingual support, and are triggered through the presence of "en" or "fr" in the URL. Moving from one to another is a smooth transition: a user can switch the page language at any time without interrupting their experience or being redirected to a start page. Even errors and page-not-found messages are bilingual.

Facets and spelling suggestions (and many other features) are made possible by AnDI's open source search technology: Apache Solr. Solr is blazing fast, optimized for full-text search on the web, and relied on by some of the biggest names on the internet.

Every page is bookmarkable because the URL always holds the information needed to reconstruct the page. This makes the site friendly to permalinks and Search Engine Optimization (SEO).

The CCI Library retains its catalogue and staff bibliography collections in separate Inmagic DB/TextWorks databases that staff continue to update through its familiar desktop interface. Updates are extracted and indexed by Solr automatically on a regular basis via Andornot's Data Extraction Utility (internally we nickname it 'Extract-o-matic') from a Powershell script. The index schema is a Dublin Core derived metadata element set that Andornot helped to map to both collections.


andi-element-lozenge-1.0_188x188AnDI can be configured to reflect any field set from any data store or database, as well as rich documents such as PDF and Word, images with EXIF metadata, etc. Contact Andornot about AnDI for your own collection.

Pitfalls to avoid when importing XML files to Solr DataImportHandler

by Peter Tyrrell Tuesday, June 12, 2012 8:15 AM

Don’t send UTF-8 files with BOMs

Any UTF-8 encoded XML files you intend to import to Solr via the DataImportHandler (DIH) must not be saved with a leading BOM (byte order mark).  The UTF-8 standard does not require a BOM, but many Windows apps (e.g. Notepad) include one anwyay. (Byte order has no meaning in UTF-8. The standard does permit the BOM, but doesn’t recommend its use.)

The BOM byte sequence is 0xEF,0xBB,0xBF. When the text is (incorrectly) interpreted as ISO-8859-1, it looks like this:


The Java XML interpreter used by Solr DIH does not want to see a BOM and chokes when it does. You might get an error like this:

ERROR:  ' Content is not allowed in prolog.'

Avoid out-of-memory problems caused by very large import files

Very large import files may lead to out-of-memory problems with Solr’s Java servlet container (we use Tomcat, currently). “Very large” is a judgment call, but anything over 30 MB is probably going to be trouble. It is possible to increase the amount of memory allocated to Tomcat, but not necessary if you can break the large import files into smaller ones. I force any tools I’m using to cap files to 1000 rows/records, which ends up around 2 MB in size with the kind of library and archives data we tend to deal with.

Sample Solr DataImportHandler for XML Files

by Peter Tyrrell Thursday, June 07, 2012 11:51 AM

I spent a lot of time in trial and error getting the Solr DataImportHandler (DIH) set up and working the way I wanted, mainly due to a paucity of practical examples on the internet at large, so here is a short post on Solr DIH for XML with a working sample, and may it save you many hours of nail-chewing.

This post assumes you are already somewhat familiar with Solr, but would like to know more about how to import XML data with the DataImportHandler.

DIH Overview

The DataImportHandler (DIH) is a mechanism for importing structured data from a data store into Solr. It is often used with relational databases, but can also handle XML with its XPathEntityProcessor. You can pass incoming XML to an XSL, as well as parse and transform the XML with built-in DIH transformers. You could translate your arbitrary XML to Solr's standard input XML format via XSL, or map/transform the arbitrary XML to the Solr schema fields right there in the DIH config file, or a combination of both. DIH is flexible.

Sample 1: dih-config.xml with FileDataSource

Here's a sample dih-config.xml from a an actual working site (no pseudo-samples here, my friend). Note that it picks up xml files from a local directory on the LAMP server. If you prefer to post xml files directly via HTTP you would need to configure a ContentStreamDataSource instead.

It so happens that the incoming xml is already in standard Solr update xml format in this sample, and all the XSL does is remove empty field nodes, while the real transforms, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume", and "ignored_seriesissue", are done with DIH Regex transformers. (The XSLT is performed first, and the output of that is then given to the DIH transformers.) The attribute "useSolrAddSchema" tells DIH that the xml is already in standard Solr xml format. If that were not the case, another attribute, "xpath", on the XPathEntityProcessor would be required to select content from the incoming xml document.

    <dataSource encoding="UTF-8" type="FileDataSource" />
            Pickupdir fetches all files matching the filename regex in the supplied directory
            and passes them to other entities which parse the file contents. 

            Pickupxmlfile parses standard Solr update XML.
            Incoming values are split into multiple tokens when given a splitBy attribute.
            Dates are transformed into valid Solr dates when given a dateTimeFormat to parse.

            <field column="abstract_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="creator_facet" template="${xml.creator_t}" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
            <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
            <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
            <field column="ispartof_t" regex="\|" replaceWith=" " />
            <field column="language_t" splitBy="\|" />
            <field column="language_facet" template="${xml.language_t}" />
            <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
            <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
            <field column="location_display" regex="\|" replaceWith=" " />
            <field column="othertitles_display" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="responsibility_display" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
            <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
            <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
            <field column="src_facet" template="${xml.src}" />
            <field column="subject_t" splitBy="\|" />
            <field column="subject_facet" template="${xml.subject_t}" />
            <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
            <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
            <field column="title_sort" template="${xml.title_t}" />
            <field column="toc_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
            <field column="type_facet" template="${xml.type_t}" />

Sample 2: dih-config.xml with ContentStreamDataSource

This sample receives XML files posted direct to the DIH handler. Whereas Sample 1 is able to batch process any number of files, this sample would work on one at a time as they were posted.

    <dataSource name="streamsrc" encoding="UTF-8" type="ContentStreamDataSource" />
            Parses standard Solr update XML passed as stream from HTTP post.
            Strips empty nodes with dih.xsl, then applies transforms.
            <field column="contributor_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="date_t" splitBy="\|" />
            <field column="date_tdt" dateTimeFormat="M/d/yyyy h:m:s a" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="language_t" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="relation_t" splitBy="\|" />
            <field column="rights_t" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="subject_t" splitBy="\|" />
            <field column="title_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />

How to set up DIH

1. Ensure the DIH jars are referenced from solrconfig.xml, as they are not included by default in the Solr WAR file. One easy way is to create a lib folder in the Solr instance directory that includes the DIH jars, as the solrconfig.xml looks in the lib folder for references by default. Find the DIH jars in the apache-solr-x.x.x/dist folder when you download the Solr package.


2. Create your dih-config.xml (as above) in the Solr "conf" directory.

3. Add a DIH request handler to solrconfig.xml if it's not there already.

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">dih-config.xml</str>

How to trigger DIH

There is a lot more info re full-import vs. delta-import and whether to commit, optimize, etc. in the wiki description on Data Import Handler Commands, but the following would trigger the DIH operation without deleting the existing index first, and commit the changes after all the files had been processed. The sample given above would collect all the files found in the pickup directory, transform them, index them, and finally, commit the update/s to the index (which would make them searchable the instant commit was finished).


Video: Slay Monster PDFs with pdfbox

by Peter Tyrrell Friday, October 21, 2011 3:38 PM

A screencast of the lightning talk presentation I gave at the Access 2011 Conference on October 21, 2011 in Vancouver, BC, entitled "Quick and Dirty's Guide to Slaying Monster PDFs," in which I show how to use pdfbox to slice up large PDFs for indexing to make search more meaningful.

Tags: Solr | video

Month List