How To Import Excel Data into VuFind

by Jonathan Jacobsen Tuesday, January 08, 2019 4:52 PM

Recently we had a new client come to us looking for help with several subscription-based VuFind sites they manage, and ultimately to have us host them as part of our managed hosting service. This client had a unique challenge for us: 3 million records, available as tab-separated text files of up to 70,000 records each.

Most of the data sets we work with are relatively small: libraries with a few thousand records, archives with a few tens of thousands, and every so often, databases of a few hundred thousand, like those in the Arctic Health bibliography.

While VuFind and the Apache Solr search engine that powers it (and also powers our Andornot Discovery Interface) have no trouble with that volume of records, transforming the data from hundreds of tab-separated text files into something Solr can use, in an efficient manner, was a pleasant challenge.

VuFind has excellent tools for importing traditional library MARC records, using the SolrMarc tool to post data to Solr. For other types of data, such as records exported from DB/TextWorks databases, we’ve long used the PHP-based tools in VuFind that use XSLTs to transform XML into Solr's schema and post it to Solr. While this has worked well, XSLTs are especially difficult to debug, so we considered alternatives.

For this new project, we knew we needed to write some code to convert the 3 million records in tab-separated text files into XML, and we knew from our extensive experience with Solr that it's best to post small batches of records at a time, in separate files, rather than one large post of 3 million! So we wrote a Python script to split the source data into separate files of about 1,000 records each and to remove invalid characters that had crept into the data over time (this data set goes back decades and has likely been stored in many different character encodings on many different systems, so it's no surprise there were some gremlins).
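The script itself isn't reproduced here, but a minimal sketch of the idea might look like the following. It is an illustration under stated assumptions rather than the actual code: it assumes each source file has a header row whose column names match Solr field names, that the files can be read as UTF-8 (undecodable bytes are replaced), and that "invalid characters" means control characters not allowed in XML 1.0. The function names are hypothetical.

```python
import csv
import re
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr

# Assumption: "invalid characters" are the control characters XML 1.0 forbids
# (everything below U+0020 except tab, line feed and carriage return).
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def record_to_doc(record):
    """Turn one row (a dict of column name -> value) into a Solr <doc> element."""
    fields = []
    for name, value in record.items():
        if name is None:  # extra columns beyond the header row are skipped
            continue
        value = (value or "").strip()
        if value:
            fields.append(f"    <field name={quoteattr(name)}>{escape(value)}</field>")
    return "  <doc>\n" + "\n".join(fields) + "\n  </doc>"


def write_batch(docs, out_dir, stem, index):
    """Write one batch of <doc> elements wrapped in Solr's <add> update element."""
    path = out_dir / f"{stem}_{index:05d}.xml"
    path.write_text("<add>\n" + "\n".join(docs) + "\n</add>\n", encoding="utf-8")
    return path


def split_tsv_to_solr_xml(tsv_path, out_dir, batch_size=1000):
    """Split one tab-separated file into XML files of at most batch_size records."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(tsv_path).stem
    written, batch = [], []
    # errors="replace" keeps the run going when a file contains bytes that are
    # not valid UTF-8; the encoding itself is an assumption about the data.
    with open(tsv_path, encoding="utf-8", errors="replace", newline="") as handle:
        # Strip forbidden characters from each line before the csv module sees it.
        cleaned = (INVALID_XML_CHARS.sub("", line) for line in handle)
        reader = csv.DictReader(cleaned, delimiter="\t", quoting=csv.QUOTE_NONE)
        for record in reader:
            batch.append(record_to_doc(record))
            if len(batch) == batch_size:
                written.append(write_batch(batch, out_dir, stem, len(written)))
                batch = []
    if batch:  # whatever is left over after the last full batch
        written.append(write_batch(batch, out_dir, stem, len(written)))
    return written
```

Calling `split_tsv_to_solr_xml("export_001.txt", "solr_xml")` (both names are placeholders) would write a series of numbered XML files, each wrapping up to 1,000 `<doc>` elements in a single `<add>` element.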

Once the script was happily creating Solr-ready XML files, rather than use VuFind's PHP tools and an XSLT to index the data, it just seemed more straightforward to push the XML directly to Solr. For this, we wrote a bash shell script that uses the post tool that ships with Solr to iterate through the thousands of data files and push each to Solr, logging the results.
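The original was a bash script. To stay in one language, the same loop is sketched here in Python. Treat it as an illustration only: the path to Solr's `bin/post` tool, the core name `biblio` and the function names are assumptions, not details taken from the project.

```python
import logging
import subprocess
from pathlib import Path


def post_command(post_tool, core, xml_file):
    """Build the command line for Solr's post tool: bin/post -c <core> <file>."""
    return [str(post_tool), "-c", core, str(xml_file)]


def post_all(xml_dir, post_tool="/opt/solr/bin/post", core="biblio",
             run=subprocess.run, log=logging.getLogger("solr-post")):
    """Post every XML file in xml_dir to Solr, one at a time, logging each result.

    The post_tool path and core name are assumptions; adjust them for your
    installation. Returns the files whose post command exited with an error.
    """
    failed = []
    for xml_file in sorted(Path(xml_dir).glob("*.xml")):
        result = run(post_command(post_tool, core, xml_file),
                     capture_output=True, text=True)
        if result.returncode == 0:
            log.info("posted %s", xml_file.name)
        else:
            # The exit status may not reflect every indexing problem, so it is
            # worth reviewing the tool's own output in the log as well.
            log.error("failed %s: %s", xml_file.name, result.stderr.strip())
            failed.append(xml_file)
    return failed
```

Posting one small file per call means a bad batch shows up as a single logged failure that can be fixed and re-posted, without repeating the whole load.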

The combination of a Python script to convert the tab-separated text files into Solr-ready XML and a bash script to push it to Solr worked extremely well for this project. Python is lightning fast at processing text, and pushing data directly to Solr is definitely faster than invoking XSLT transformations.

This approach would work well for any data. Python is a very forgiving language to develop with, making it easy and quick to write scripts to process any data source. In fact, since this project, we've used Python to manipulate a FileMaker Pro database export for indexing in our Andornot Discovery Interface (also powered by Apache Solr) and to harvest data from the Internet Archive and Online Archive of California, for another Andornot Discovery Interface project (watch this blog for news of both when they launch).

We look forward to more challenges like this one! Contact us for help with your own VuFind, Solr and similar projects.

VuFind Version 5.0 Released

by Jonathan Jacobsen Monday, July 16, 2018 5:37 PM

Version 5.0 of VuFind, the popular open-source discovery interface, was released today, a year after the last major release (4.0). 

This version improves the software’s compatibility with recent language and operating system releases and adds several significant new features.

Some key additions:

  • New features to improve compliance with the General Data Protection Regulation (GDPR), including optional user-initiated account removal and support for encrypted session data.
  • Significant improvements to the "Channels" interface for serendipitous resource discovery, including a streamlined user interface and several new channel providers (such as "new items" and "trending items").
  • Improved support for rendering geographic data.
  • PHP 7.2 compatibility.
  • Optional user access to their own account history.
  • Upgrades to the latest Solr, SolrMarc and Zend Framework component versions.

Additionally, several bug fixes, new configuration options, performance enhancements and minor improvements have been incorporated.

Although VuFind was largely developed by and for academic libraries, we've found applications for it in other organizations, including smaller specialized libraries. Our blog has details of selected projects. In general, we recommend VuFind for organizations with purely bibliographic records and little or no need for customization, a custom graphic design, integration of other features or content, etc. For organizations with those requirements, our Andornot Discovery Interface (AnDI) is a perfect choice.

In contrast to VuFind's release schedule, our own Andornot Discovery Interface, which shares the same Apache Solr search engine as VuFind, is continuously upgraded with each project we use it for. Earlier sites built from AnDI can be upgraded as needed, and we’ve begun doing so upon request by clients. Upgrades include any new features and bug fixes added to AnDI since the initial build of the site, plus upgrades to key components, such as Solr, .NET versions, JavaScript libraries, and more.

Contact us to learn more about VuFind or AnDI and how either might offer your users an improved search experience for your collections and resources.


VuFind 4.0 released with new features and fixes

by Jonathan Jacobsen Monday, July 17, 2017 7:48 AM

Version 4.0 of VuFind, the popular open-source discovery interface, was released in early July 2017.

This version brings VuFind up to date with important PHP and Solr developments while also adding several new features and offering a straightforward upgrade path from the 2.x series of releases.

Some key additions and changes:

  • New channels feature. These are similar to the canned queries we include in almost all projects we work on, no matter which system, where pre-created search parameters or groups of records are offered to users through a simple link, as a guide to interesting aspects of the collection. See a demo at https://vufind.org/demo/Channels/Home.
  • New ability to create and host static content pages. This feature is especially welcome as in previous versions, additional content (e.g. About Us, Contact Us) was most easily placed on the home page, which could make for a bit of a crowded space.
  • Improved ability to load cover images from local files. We added this ourselves as custom development in a previous VuFind project, so are happy to see it appear in the core VuFind system.
  • A new theme, called Sandals. As with several previous themes, it's based on the responsive Bootstrap framework, so it works well on mobile devices. This new theme has a somewhat more modern look to it.

Additionally, several bug fixes, new configuration options and minor improvements have been incorporated.

Although VuFind was largely developed by and for academic libraries, we've found applications for it in other organizations, including smaller specialized libraries. Our blog has details of selected projects. In general, we recommend VuFind for organizations with purely bibliographic records and little or no need for customization, a custom graphic design, integration of other features or content, etc. For organizations with those requirements, our Andornot Discovery Interface is a perfect choice.

Contact us to learn more about the VuFind discovery interface and how it might suit your organization.

ARLIS Launches Susitna Doc Finder VuFind Catalog

by Jonathan Jacobsen Monday, October 10, 2016 1:09 PM

Over the past couple of years, Andornot has helped the Alaska Resources Library & Information Services (ARLIS) launch, then upgrade, a VuFind-powered catalog of Alaska North Slope natural gas pipeline work from the past 40 years. 

A second VuFind catalog has recently been added to the ARLIS site: the Susitna Doc Finder.

The Susitna Doc Finder is a comprehensive catalog of documents that have resulted from every phase of the historic 1980s Susitna Hydroelectric Project (SuHydro Project), as well as those documents continually being produced since 2010 under the current Susitna-Watana Hydroelectric Project (SuWa Project).

Records for this catalog are managed in both a MARC cataloguing ILS and a local Inmagic DB/TextWorks database. Exports from both are indexed nightly by VuFind, using heavily customized import mappings and additional fields and browse indexes.

Almost all records link to PDF reports from the project. Text is extracted from these and indexed, to complement the excellent initial metadata. 

Cover images of these PDF reports are generated during indexing and appear in search results, in several sizes, both for visual interest, and to give a glimpse of a report before clicking to download it.

The web interface uses a VuFind theme built from the ever-popular Twitter Bootstrap responsive web framework. Almost all of Andornot's web projects use this or a similar responsive framework to provide the same level of access on devices of all sizes and shapes, from full-size desktop browsers down to tablets and phones.

Records in this catalog can also be found through Google, which has crawled and indexed the VuFind site.


Contact us to discuss options for a discovery interface style of search for your catalogue or other collection, using VuFind or the Andornot Discovery Interface.

VuFind 3.0 Released

by Jonathan Jacobsen Wednesday, April 27, 2016 8:21 PM

Version 3.0 of VuFind, the popular open-source discovery interface, was released April 25, 2016.

This version brings VuFind up to date with important PHP and Solr developments while also adding several new features and offering a straightforward upgrade path from the 2.x series of releases.

Some key additions and changes:

  • Improved support for indexing multiple authors (and other types of creators).
  • New filtering options in “combined search” mode to make your "bento box" search even more flexible.
  • A database-driven record cache to improve performance and permanence when working with third-party APIs.
  • Compatibility with PHP 7 and Ubuntu 16.04.
  • Inclusion of Solr 5.5.0, which adds new indexing features and better Windows support.
  • A significantly rewritten front-end theme offering greater stability, improved ease of customization and a more consistent user experience.
  • New recommendation modules to help guide users to better search results.

Additionally, several bug fixes, new configuration options and minor improvements have been incorporated.

Although VuFind was largely developed by and for academic libraries, we've found applications for it in other organizations, including smaller specialized libraries. Our blog has details of selected projects. In general, we recommend VuFind for organizations with purely bibliographic records and little or no need for customization, a custom graphic design, integration of other features or content, etc. For organizations with those requirements, our Andornot Discovery Interface is a perfect choice.

Contact us to learn more about the VuFind discovery interface and how it might suit your organization.
