How to Import Data from Inmagic DB/TextWorks into Omeka

by Jonathan Jacobsen Monday, July 03, 2017 7:40 AM

Last week we published a blog post on our favourite Omeka plugins. This week we focus on one in particular, the CSV Import plugin. This plugin is included in every site hosted through Digital History Hub, our low-cost Omeka hosting platform.

One of Omeka's many strengths is the built-in data entry screens, based on Dublin Core fields. While there's a small learning curve to understanding DC, once mastered, it provides just the right set of metadata to describe anything you might want to put in an Omeka site, whether an artifact, photograph, document, map, etc.

But what if you already have a database of this sort of information and want to publish most or all of it in an Omeka site? Perhaps you're using the ever-popular Inmagic DB/TextWorks database management system, but don't yet have your records searchable online, or want to use Omeka's Exhibit Builder plugin to mount an online virtual exhibit featuring a portion of your collection. Re-entering all that metadata into Omeka one record at a time would be onerous. This is where the CSV Import plugin comes in!

As the name implies, this plugin allows you to quickly import many records in a batch from a text file. You simply choose a suitable text file, map fields from your source into Omeka's Dublin Core schema, set a few other values, and very quickly your records are available in Omeka for review, further editing, or simply searching. The one notable feature missing from this plugin is the ability to import PDFs, documents, photos and other media files saved locally on your computer or network. To bulk import these files, they need to be accessible on a web server, with a URL to each file stored in your database. This may not be as challenging to set up as you might think; there are always ways to work around issues like this, so don't hesitate to contact us for help.

Here's a step-by-step guide to using this plugin with DB/TextWorks and Omeka. The procedure for exporting data from other databases will vary, of course, but the principles remain the same. As always, do contact us for help!

Mapping Fields

Start by reviewing Omeka's Dublin Core fields on the Item entry screen and think about where data from your database should go. 

You may want to prepare a simple two column list mapping fields from your data source into the Dublin Core fields, like this:

DB/TextWorks Field Name -> Omeka Dublin Core Field Name
Title -> Title
Material Type -> Format
Author -> Creator
Corporate Author -> Creator
Publication Date -> Date
ISBN -> Identifier

etc.

You don't need to populate every Omeka DC field of course, just the ones that make sense for your data. And you can merge multiple fields from your database into one Dublin Core field in Omeka. To learn more about each DC field, read the brief note on the Omeka data entry screen, or visit http://dublincore.org/documents/dces/ for more detailed information.

Note that there is also a plugin called Dublin Core Extended Fields which adds even more fields. If you have a particularly complex database and feel the need to preserve and fully represent all or most fields, this might be for you. In our view, though, keeping things simple is better, and that is precisely why DC was developed: to provide a brief, common set of fields that can be used to describe almost anything.

Choosing Data to Export

When you get to the step of importing records into Omeka, you have the option of assigning one Item Type to all incoming records, and only one. The Item Type determines which additional metadata elements are available when editing the record; for example, the "Still Image" Item Type adds fields for Original Format and Physical Dimensions. If your source data contains information that belongs in these extended fields and you wish to import it, or add it later by editing the imported records in Omeka, you may want to export records in groups by Item Type: all "Still Images", then all "Moving Images", and so on. You can then import these in batches and specify the correct Item Type for each, and the additional fields specific to that Item Type will be available for import from your source data.

Exporting From DB/TextWorks

If your data contains special characters like accented letters or letters from outside the Latin alphabet, the file must be encoded as UTF-8 for Omeka to import it correctly. DB/TextWorks offers several text encoding options, so before exporting data, choose Tools > Options > Text Encoding and under "Output file encoding", choose the UTF-8 option (applies to v15.0 or later of DB/TextWorks).
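
If you're using an older version of DB/TextWorks that lacks this setting, you can re-encode the exported file yourself before importing it. Here's a minimal PowerShell sketch of that step, with purely illustrative file paths; it assumes PowerShell reads the source file's encoding correctly by default, so adjust the -Encoding parameter on Get-Content if your export uses something unusual.

    # Re-encode a DB/TextWorks export as UTF-8 (paths are examples only)
    Get-Content -Path 'C:\exports\catalogue.csv' |
        Out-File -FilePath 'C:\exports\catalogue-utf8.csv' -Encoding UTF8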

To export a selection of records, search for them first, then select File > Export. 

Save the file somewhere handy, with a .txt or .csv extension. 

In the Export Options dialogue, make the following choices:

Export File Format: Delimited ASCII

Delimiter options:

Record Separator {CR}{LF}

Entry Separator |

Quote Character "

Field Separator , (only commas are supported for import)

Select the "Store Field Names in First Row" option

If any of your fields are of the type Rich Text, be sure to export those as HTML. That HTML can be preserved during the import to Omeka by selecting the HTML option for the field on Step 2 of the import (see below).

Records to Export: choose to export either the records you searched for with "Export Current Record Set" or the entire database with "Export Entire Textbase"

Fields to Export: select only those fields that you included in your field mapping

Optionally, you can save these options as a profile for re-use later.

Complete the export and note how many records were exported (so you can verify that the same number are imported into Omeka).
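
With the settings above, the first few lines of the exported file should look roughly like the sample below. The field names and values are purely illustrative; note the quote characters around each field, the commas between fields, and the | separating the two author entries destined for the Creator field.

    "Title","Material Type","Author","Publication Date","ISBN"
    "A Local History of Anytown","Book","Smith, Jane|Doe, John","1995","0-123456-78-9"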

Importing Data into Omeka

With the export to a comma-separated text file complete, log in to your Omeka site and select the CSV Import option in the menu. If that option isn't available, you'll need to install and activate the plugin first.

In Step 1 of the CSV Import, select your exported data file, then set the following options on this page:

If your database field names happen to be identical to those in Omeka and have “DublinCore” in their names (e.g. DublinCore:Title), you can select the Automap Column Names to Elements option. For all others (most of you!), deselect this option.

If importing different types of records in batches, select the Item Type appropriate to each batch.

Choose the following delimiters to match your export from DB/TextWorks:

Column Delimiter , (matches the Field Separator in the DB/TextWorks export)

Tag Delimiter | (matches the Entry Separator in the DB/TextWorks export)

File Delimiter | (matches the Entry Separator in the DB/TextWorks export)

Element Delimiter | (matches the Entry Separator in the DB/TextWorks export)

Optionally, choose to assign all items to a Collection or make all items Public. 

If you're importing a large number of records, you probably don't want to Feature all of them, as it's more common to select a small set of Items to feature on the home page of Omeka.

Continue to the next step.

In Step 2, you will select the Omeka DC fields into which your data source fields will be imported, using your field mapping as a guide. 

Click the Use HTML checkbox if the data includes HTML markup (e.g. if it's a Rich Text field in DB/TextWorks and you chose to export it as HTML during the export).

For source fields which contain tags, select the Tags option instead of selecting a field to import the data to.

For source fields which contain URLs to files, select the Files option instead of selecting a field to import the data to. This will cause the import to fetch those files and add them to Omeka. Fetching many large files will take quite a while, so if this is your very first import, it's best to try a small data set first, with or even without the files option, to work out any kinks in your procedure.
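
For example, a record with two attached files might include a column like the one below in the exported file, with the same | delimiter separating the two URLs (the field name and URLs are hypothetical).

    "Title","Files"
    "Annual Report 1923","https://example.org/files/report-1923.pdf|https://example.org/files/cover-1923.jpg"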

Reviewing Imported Data

If you imported a small number of records, you can review each one. If you imported a large number, you may wish to spot check a random sample, to make sure all the data ended up where you expected it, that records are public or not, featured or not, in a collection or not, etc.

If there are problems, the Undo Import feature is your new best friend. Find it back in the CSV Import plugin and use it to remove the records just imported.

Need Help?

Need help with any of this? Contact Andornot and we'll be glad to work with you on this.


Our Favourite Omeka Plugins

by Jonathan Jacobsen Tuesday, June 27, 2017 8:54 AM

At Andornot, we're big fans of the Omeka web publishing and content management platform as a low-cost, simple way to get historic, cultural or other content online. Why, we've even launched a whole website dedicated to it: Digital History Hub!

One of Omeka's many strengths is the selection of plugins that add all sorts of extra features. By our count, there are over 90 of them. Most are listed here and here, but we've found a few others around the web too. Some of the plugins are older and not as actively supported as others, or serve only a very specific purpose, or are not of use to very many Omeka users.

We've reviewed and tried almost all of them, though, and present here our most highly recommended ones. These are plugins that, in our view, should be added to almost every Omeka site, as each is so useful and so likely to appeal to a wide array of Omeka users. About half are helpful for Omeka site administrators, while the other half offer new features on the public side.

Learn more about each plugin by clicking its name here: http://omeka.org/add-ons/plugins/ and then the More Info link.

Plugin Name: Description and Andornot Comments

Admin Images: Allows administrators to upload images not attached to items for use in carousels and simple pages. Very handy.
Bulk Metadata Editor: Adds search and replace functionality, allowing administrators to update metadata fields over many records quickly and easily.
CSV Import: Imports items, tags, and files from CSV files. Great when you have data in another database, such as Inmagic DB/TextWorks, and don't want to re-key it into Omeka.
Derivative Images: Recreate (or create) derivative images (e.g. thumbnails). Handy when the initial size set proves to be too large or too small for the selected theme. Saves re-uploading each image.
Exhibit Builder: Build rich exhibits using Omeka. See jpl-presents.org for an Omeka site that uses exclusively exhibits to present content.
HTML5 Media: Enables HTML5 for media files using MediaElement.js, to allow streaming playback. Great for sites with audio and video recordings.
Google Analytics: A small plugin to include Google Analytics JavaScript code on pages. Everyone should want to know how much traffic their site gets!
Search By Metadata: Allows administrators to configure metadata fields to link to items with the same field value (e.g. click a Subject link to view all records with that same Subject).
Simple Contact Form: Adds a simple contact form for users to contact the administrator. Be sure to configure the reCAPTCHA anti-spam feature too. Requires mail sending ability on the server, but a nice alternative to just listing an email address.
Simple Pages: Allows administrators to create additional web pages for their public site. In our view, every site should have at least some sort of About page with more information about the site, who created it, etc.
Sitemap 2: This Omeka 2.0+ plugin provides a persistent URL for a dynamically generated XML sitemap, for SEO purposes. With this enabled, create a Google Webmaster account (and a similar one in Bing) to feed your site into these search engines.
Social Bookmarking: Uses AddThis to insert a customizable list of social bookmarking sites on each item page. Great for helping users share your items on Twitter, Pinterest, Facebook, Google+, etc.

All of the plugins above are installed and ready to use in every site built through our Digital History Hub.

The plugins in the next list are those we think are quite useful on a case-by-case basis. We make them available in every Digital History Hub Omeka site, for the site owner to install, configure and use if it suits their needs, their data and their audience.

Plugin Name: Description and Andornot Comments

Commenting: Allows commenting on Items, Collections, Exhibits, and more. Most useful for gathering feedback from other site administrators, in our view. Consider Disqus instead for public comments (Note: there is an older Disqus plugin, but it may need updating).
Contribution: Allows collecting items from visitors. Great for engaging the community and gathering additional contributions to a site. Requires the Guest User plugin.
Contributor Contact: Supplies administrators with tools to contact contributors in bulk. Complements the above Contribution plugin.
CSS Editor: Add public CSS styles through the admin interface. Useful when you don't have access to the theme's CSS files and want to make some minor adjustments.
Geolocation: Adds location info and maps to Omeka. Who doesn't love browsing a map as a way of discovering resources!
Getty Suggest: Enable an autosuggest feature for Omeka elements using the Getty Collection controlled vocabularies. Could be quite useful for art and architectural items, as well as place names.
Guest User: Adds a guest user role. Guest users can't access the backend administrative interface, but the role allows plugins such as Contribution to use an authenticated user.
Hide Elements: Hide admin-specified metadata elements. Great when you really don't need even the 15 Dublin Core elements and have, perhaps, volunteers performing data entry – makes it even simpler for them.
PDF Embed: Embeds PDF documents into item and file pages. Very useful if you have these in your Omeka collection.
Simple Vocab: A simple way to create controlled vocabularies, such as keywords or subjects, for consistent data entry. Works best with small-ish vocabularies.
Simple Vocab Plus: A fuller featured option for controlled vocabularies with auto suggest.

Visit our Digital History Hub site for more information on Omeka and low-cost hosting plans, or contact us for help getting an Omeka site up, or for adding these or other plugins to an existing one.

And watch this blog for more in-depth posts about select plugins. Next up is a step-by-step guide to exporting data from an Inmagic DB/TextWorks database, then batch importing it into Omeka.

Tags: Omeka

Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

by Peter Tyrrell Friday, June 23, 2017 1:21 PM

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnprior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer in the PDF and most of my efforts are directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf. Or, when the documents are pointed to with full paths like x:\path\to\document.pdf, but those paths have no meaning on the server where Solr resides. There are a variety of tricks which transform those file paths to something Solr can use, and I needn't get into them here. The problem I really want to talk about is the problem of scale.

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here but assume the database does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index-time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. I set up a script that creates a PDF to text file mirror.

The script queries the database for PDF file paths, checks file paths for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. Whenever the script is run after the first time, it checks to see if a matching text file already exists for a PDF. If yes, the PDF is only processed if its modify date is newer than its text file doppelgänger. It may take days for the initial run to finish, but once it has, only additional or modified PDFs have to be processed on subsequent runs.
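
Stripped to its essentials, the approach looks roughly like the following PowerShell sketch (not the actual gist). It assumes the PDFs sit under a single root folder and that a command-line extractor such as pdftotext is on the PATH; the real script also handles the database lookup and file path validation described above.

    # Minimal sketch of the PDF-to-text mirror (paths and tooling are assumptions)
    $sourceRoot = 'X:\pdfs'        # hypothetical root of the PDF collection
    $mirrorRoot = 'D:\textmirror'  # hypothetical root of the text-file mirror

    Get-ChildItem -Path $sourceRoot -Filter *.pdf -Recurse | ForEach-Object {
        # Build the matching path in the mirror, swapping .pdf for .txt
        $relative = $_.FullName.Substring($sourceRoot.Length).TrimStart('\')
        $txtPath  = Join-Path $mirrorRoot ([IO.Path]::ChangeExtension($relative, '.txt'))

        # Extract only if the text file is missing or older than its PDF
        if (-not (Test-Path $txtPath) -or ($_.LastWriteTime -gt (Get-Item $txtPath).LastWriteTime)) {
            New-Item -ItemType Directory -Path (Split-Path $txtPath) -Force | Out-Null
            & pdftotext $_.FullName $txtPath
        }
    }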

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.

The script is for Windows, is written in PowerShell, and is available as a Github gist.

Tags: PowerShell | Solr | Tika

Andornot's June 2017 Newsletter Available: News, Tips and Tricks for Libraries, Archives and Museums

by Jonathan Jacobsen Thursday, June 22, 2017 8:54 AM

Andornot's June 2017 Newsletter has been emailed to subscribers and is available to read here, with news, tips and tricks for libraries, archives and museums.


In This Issue

Andornot News

Andornot's Latest Projects

Tips, Tricks and Ideas

Other News

Tags: newsletters

Richmond Archives Adds Name Origins Resource to Online Search

by Jonathan Jacobsen Tuesday, June 06, 2017 9:51 AM

I live in Richmond, part of the Metro Vancouver Regional District, and have an interest in local history, so I was particularly interested when Andornot was asked by the City of Richmond Archives to help with a project on the origins of Richmond place names. 

The City of Richmond Archives is a long-time user of Inmagic DB/TextWorks for managing its collections, and was instrumental in developing the set of linked databases that became our Andornot Archives Starter Kit. Over the past couple of years we've helped the Archives upgrade their Inmagic WebPublisher-based online search system, which is available at http://archives.richmond.ca/archives/descriptions/

The new Name Origins search, available at http://archives.richmond.ca/archives/places/, features almost 500 records (and growing) that document and describe the history of Richmond streets, roads, bridges, neighbourhoods, and other landmarks. It's easy to search by keyword or by type of place, and whenever possible, a Google map of the named place is shown. The database is updated by the Friends of the Richmond Archives, volunteers with a passion for local history. Launching this new database online was made possible through the Richmond Canada 150 Community Celebration Grant Allocations.

As I worked on the web search interface to the database, I couldn't help but search for places in my neighbourhood and around Richmond, and became captivated by their history. Now community members can access this information 24/7 and learn the history behind the names of streets, areas, and landmarks in their community.

Contact Andornot for options for your Inmagic databases and for search engines and other software to make your collections accessible online.
