New ThinkWood Research Library Launches

by Jonathan Jacobsen Sunday, July 01, 2018 8:42 AM

The ThinkWood Research Library is a central resource for research on designing and building with wood. An enhanced search engine for this collection has just been launched at https://research.thinkwood.com.

The library links to research publications from around the world about structural systems composed of mass timber, heavy timber, and light-frame construction (for buildings five stories and up). Research topics include design and systems, connections, mechanical properties, acoustics and vibration, energy performance, fire, seismic, moisture, wind, serviceability, environmental impact, cost and market adoption.

The library is managed by Forestry Innovation Investment Ltd, a provincial crown corporation, which approached Andornot for assistance with improving the management and searching of this library.

Andornot recommended and then implemented a system using Inmagic DB/TextWorks as the back-end database and our Andornot Discovery Interface as the public search system. Data was converted and de-duplicated from two sources: MS Access and a WordPress site.

The result works well both for FII staff, who catalog new resources, and for architects and engineers, who now have an easier means of searching for them.

In the back-end DB/TextWorks database, a few features have proven to be particularly useful in this project, including:

  • validation lists to ensure consistent application of names, keywords, topics, product types, etc.;
  • dead-URL link checking to find and edit links to resources that have moved; and
  • batch modification to clean up older data.

In AnDI, features such as spelling corrections, relevancy-ranked results, and facets to help narrow a search combine to make for a simple and enjoyable search process. For this project in particular, we made use of AnDI's synonyms feature to equate terms with their acronyms and variations, such as:

  • GLT, glulam, glued laminated timber, glue laminated timber
  • CLT, cross laminated timber, xlam, x-lam, cross-lam 

Whenever any term in a comma-separated set of terms is searched, all the others in the set are also searched for, resulting in broader discovery of resources, especially where different terms have been used.
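The expansion behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not AnDI's or Solr's actual implementation; the function names `build_synonym_index` and `expand` are hypothetical, and the sketch assumes whole-query, case-insensitive matching.

```python
# Hypothetical sketch of synonym-set expansion; not AnDI's actual code.
SYNONYM_LINES = [
    "GLT, glulam, glued laminated timber, glue laminated timber",
    "CLT, cross laminated timber, xlam, x-lam, cross-lam",
]


def build_synonym_index(lines):
    """Map each lower-cased term to the full list of terms on its line."""
    index = {}
    for line in lines:
        terms = [t.strip().lower() for t in line.split(",") if t.strip()]
        for term in terms:
            index[term] = terms
    return index


def expand(query, index):
    """Return every term in the query's synonym set, or just the query."""
    key = query.strip().lower()
    return index.get(key, [key])


index = build_synonym_index(SYNONYM_LINES)
print(expand("xlam", index))
```

A search for "xlam" is expanded to the whole CLT set, while a term that appears in no set is searched on its own.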

To improve the visual appeal of the site, we took a small screenshot of each resource (PDFs and web pages) and included it as a thumbnail in the search results.

Andornot was delighted at the positive feedback we received, such as:

"Thank you very much for all the hard work and for all of your expertise. The whole team is very happy with the aesthetics and functionalities of the database and website."

-- Antje Wahl, Manager, Industry Innovation, Forestry Innovation Investment Ltd.

"This is very exciting! Overall, this was one of FII's smoothest web refits/redesigns! Well done to all that were involved :-)"

-- Lindsay Bridgman, Manager, IT, Forestry Innovation Investment Ltd.

Contact us to discuss projects to better manage your resources and library collections.

Addition of digitized newspapers to the Arnprior Archives’ search interface

by Kathy Bryce Friday, June 22, 2018 8:54 AM

Andornot has recently completed work for the Arnprior & McNab/Braeside Archives to add the newly digitized versions of their newspapers up to 1937 to their searchable collections. The majority of issues are from the Arnprior Chronicle starting in 1885.  We also created a Finding Aid allowing researchers to see what issues are available for each of the 16 newspapers with the ability to browse each individually. 

Funding for this project was provided by the Ottawa Branch of the Ontario Genealogical Society. The searchable newspapers will be a wonderful new option for genealogical research and provide a window into the coverage of historical events. Individual names can be searched, and search words or parts of words are highlighted on the newspaper pages, as in the screenshot below:

image

A search on a general term such as “sawmill” pulls results from several data sources and allows users to easily narrow down their results.

image

As well as providing new search capabilities for this important set of documents, this initiative removes the need to consult the now very fragile originals.

The digitization itself was handled by a local vendor, and Andornot scripted the OCR processing to create a searchable layer in the PDFs. When funding permits, the aim is to enhance the search option further by matching up the newspaper issues with an index to births, marriages and deaths created by the Archives.

If you are considering a similar digitization project, or have databases or other material that you would like to make searchable, contact us for a chat to discuss options!

Arctic Health Upgrades Search Engine for Easier Access by Researchers

by Jonathan Jacobsen Monday, June 11, 2018 7:46 AM

Arctic Health, intended for students, researchers, and anyone with an interest in health aspects of the Arctic, is a central source for information on diverse aspects of the Arctic environment and the health of northern peoples. The Arctic Health website provides access to a database of over 280,000 evaluated publications and resources on these topics. To improve access to this collection, a new search engine has just been launched at https://arctichealth.org.

Search results in Arctic Health include published and unpublished articles, reports, data, and links to organizations pertinent to Arctic health, as well as out-of-print publications and information from special collections at the University of Alaska. Resources come from hundreds of local, state, national, and international agencies, as well as from professional societies, tribal groups, and universities.

Arctic Health is managed by the Alaska Medical Library at the University of Alaska Anchorage, by Prof. Kathy Murray and a team of staff. Andornot has worked with this group since 2005 and designed several previous search interfaces using Inmagic WebPublisher PRO and dtSearch.

Prof. Murray approached Andornot last year with several updates in mind, such as to ensure the search results are accessible on mobile devices, not just desktops. Rather than simply adjust the existing site, this precipitated a complete review of the current system, including data entry workflow and the actual content to be included, as well as discussions on a more modern search engine.  

As we do with many projects, Andornot began this challenge by separating out the user groups and functions. Library staff need a system to manage and upload records, with features for adding, editing, converting and validating data. Researchers and health care practitioners, on the other hand, need an easy-to-use, robust system for searching the vast archive of resources. With such a large number of records, a sophisticated search engine is needed to float the most relevant results to the top of any search.

For the back-end, Andornot developed a web application that uses Inmagic DB/TextWorks for data storage, and Inmagic WebPublisher PRO as a middle layer. We were able to update and re-use an XSLT we'd previously developed that UAA uses to import records in XML format from PubMed. This hybrid approach of using existing commercial software and a custom-developed web application provided the features needed by library staff at a more economical cost than a completely custom-written system.

For the public search interface, we used our Andornot Discovery Interface (AnDI). AnDI is a modern search engine based on the popular Apache Solr system, with features such as:

  • Excellent keyword search engine and relevancy-ranked search results.
  • Automatic spelling corrections and “did you mean?” search suggestions.
  • Full text indexing of linked documents.
  • Facets, such as subjects, authors, places, dates, and material types, to allow users to quickly and simply refine their search.
  • A selection list allows users to mark items of interest as they search, then view, print or email the list.

AnDI helps users quickly find relevant materials from the large collection at Arctic Health and is a significant improvement over the previous search options.

Both systems in this solution are hosted by Andornot as part of our Managed Hosting Service.

Check out the new iteration of the Arctic Health resource database at https://arctichealth.org, and contact Andornot for help with your project.

Automated Sitemap Generator Added to Andornot Discovery Interface

by Jonathan Jacobsen Friday, June 08, 2018 11:36 AM

Andornot believes strongly that it’s not enough for an archive or museum to simply have a fascinating collection and excellent software for managing it and making it publicly accessible. Drawing the public to these resources is equally important, and it is something larger museums and some archives already do well. For smaller organizations, that means the curator or archivist has to put on a marketing hat from time to time. However, this need not be a painful experience.

For example, a couple of months ago we wrote a blog post about using Wikipedia as a means of increasing the exposure of your organizations and your collections. This can be a quick, easy and fun afternoon task.

And today we're announcing a new feature in our Andornot Discovery Interface (AnDI) to also help attract the public: an automatic site map generator.

A site map is an XML file placed within your website, listing all available pages or resources, to help search engines such as Google and Bing index as much of your content as possible. While search engines will crawl links they find, such as on your home page, to help them discover records, this site map file can be provided to guide them to the full set.

For example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://search.yoursite.org/Permalink/descriptions281616</loc><changefreq>weekly</changefreq></url>
<url><loc>http://search.yoursite.org/Permalink/descriptions281617</loc><changefreq>weekly</changefreq></url>
<url><loc>http://search.yoursite.org/Permalink/descriptions281618</loc><changefreq>weekly</changefreq></url>
<url><loc>http://search.yoursite.org/Permalink/descriptions281619</loc><changefreq>weekly</changefreq></url>
<url><loc>http://search.yoursite.org/Permalink/descriptions281620</loc><changefreq>weekly</changefreq></url>
</urlset>

Within AnDI, the sitemap lists all available records that can be found in the search engine, using the permalink URL.
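As a rough illustration of how such a file can be produced, here is a minimal Python sketch that builds a sitemap from a list of permalinks. It is not AnDI's actual generator; the function name `build_sitemap` and the record IDs are hypothetical, and the permalink pattern is simply copied from the sample above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(permalinks, changefreq="weekly"):
    """Return sitemap XML (as a string) with one <url> entry per permalink."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for link in permalinks:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = link
        ET.SubElement(url, "changefreq").text = changefreq
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical record IDs, following the permalink pattern shown above.
record_ids = [281616, 281617, 281618]
links = [
    f"http://search.yoursite.org/Permalink/descriptions{i}" for i in record_ids
]
sitemap_xml = build_sitemap(links)
print(sitemap_xml)
```

In practice the list of permalinks would come from the search index itself, and the sitemap protocol caps a single file at 50,000 URLs, so a very large collection would need to be split across several files.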

This file is not seen by the public and has no impact on the site, but will be used by Google and others to index more of the records in an AnDI site. And thus, when people search by keyword in Google for records that happen to be in that collection, especially ones with unique names, places and words, these records are more likely to appear in their Google search results, drawing more traffic to the site.

This feature has been rolled out to all the clients who participate in our Managed Hosting service, and is available to our other AnDI clients (just send us an email to request it).

There are many ways to spread the word online about your collections and resources, some requiring very little effort. Stay tuned to our blog and newsletter for more!

Adjusting Solr relevancy ranking for good metadata in the Andornot Discovery Interface

by Peter Tyrrell Thursday, January 18, 2018 4:00 PM

I learned an interesting lesson about Solr relevancy tuning due to a request from a client to improve their search results. A search for chest tube was ranking a record titled "Heimlich Valve" over a record titled "Understanding Chest Tube Management," and a search for diabetes put "Novolin-Pen Quick Guide" above "My Diabetes Toolkit Booklet," for example.

Solr was using the usual default AnDI (Andornot Discovery Interface) boosts, so what was going wrong?

AnDI default boosts (pf is phrase matching):
qf=title^10 name^7 place^7 topic^7 text
pf=title^10 name^7 place^7 topic^7 text

The high-scoring records without terms in their titles had topic = "chest tube" or topic = "diabetes", yes, but so did the second-place records with the terms in their titles! Looking at the boosts, you would think that the total relevancy score would be a sum of (title score) plus (topic score) plus the others.

Well, you'd be wrong.

In Solr DisMax queries, the total relevancy score is not the sum of contributing field scores. Instead, the highest individual contributing field score takes precedence. It’s a winner-takes-all situation. Oh.

In the samples above, the boost on the incidence of “chest tube” or “diabetes” in the topic field was enough to overcome the title field's contribution, in the context of Solr’s TF-IDF scoring algorithm. That is, it’s not just a matter of “the term is there” versus “the term is not there”. Instead, the score is proportional to how often the query terms appear in the field and inversely proportional to how common those terms are across the whole collection of documents. Field and document length matter too, as does whether the term appears nearer the front of the text.

So I could just ratchet up the boost on the title field and be done with it, right? Well, maybe.

As someone else* has said: DisMax is great for finding a needle in a haystack. It’s just not that good at searching for hay in a haystack.

The client’s collection has a small number of records, and the records themselves are quite short, consisting of a handful of highly focused metadata. The title and topic fields are pithy and the titles are particularly good at summarizing the “aboutness” of the record, so I focused on those aspects when re-arranging relevancy boosts.

New Solr field type: *_notf, a text field for title and topic that does not retain term frequencies or term positions. This means a term hit will not be correlated to term frequency in the field. It is not necessary to take term frequency into account in a title because the title’s “aboutness” isn’t related to the number of times a term appears in it. The logic of term frequency makes sense in the long text of an article, say, but not in the brief phrase that is a title. Or topic.
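To make the effect concrete, here is a toy Python sketch. It assumes a deliberately simplified score of tf × idf with a square-root tf, loosely modelled on classic TF-IDF scoring. It ignores length normalization and is not Solr's real implementation; the function name `toy_score` and the second title are hypothetical.

```python
import math


def toy_score(field_text, term, idf=1.0, use_term_freq=True):
    """Score one term against one field with a simplified tf * idf.

    With use_term_freq=False, any number of occurrences counts as one,
    which is the effect the *_notf field type is meant to achieve.
    """
    freq = field_text.lower().split().count(term.lower())
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq) if use_term_freq else 1.0
    return tf * idf


title_once = "My Diabetes Toolkit Booklet"
title_twice = "Diabetes facts for diabetes patients"  # hypothetical title

# With term frequency, the title that repeats the term scores higher.
print(toy_score(title_once, "diabetes"), toy_score(title_twice, "diabetes"))

# Without it, both titles score the same for a hit.
print(
    toy_score(title_once, "diabetes", use_term_freq=False),
    toy_score(title_twice, "diabetes", use_term_freq=False),
)
```

The point of the sketch is only that, once frequency is ignored, a title either matches or it does not, which suits short, carefully written metadata.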

New Solr fields: title_notf, topic_notf

Updated boosts (pf is phrase matching):
qf=title_notf^10 topic_notf^7 text
pf=title^10 topic^7

Note that phrase matching still uses the original version of the title and topic fields, because they index term positions. Thus they can score higher when the terms chest and tube appear together as the phrase “chest tube”.
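For readers who want to try similar boosts, the sketch below shows one way the updated parameters might be passed on a DisMax request from Python. `defType`, `qf` and `pf` are standard Solr query parameters; the function name `build_dismax_params` is hypothetical, and the sketch assumes the boosts are sent per request rather than set as defaults in the request handler configuration.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query):
    """Return the URL-encoded query string for a DisMax search."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # Term matching on the frequency-free copies of title and topic.
        "qf": "title_notf^10 topic_notf^7 text",
        # Phrase matching on the originals, which keep term positions.
        "pf": "title^10 topic^7",
    }
    return urlencode(params)


print(build_dismax_params("chest tube"))
```

The resulting string would be appended to the select URL of whichever Solr core holds the collection.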

Also, I added a tie=1.0 parameter to the DisMax scoring, so that the total relevancy score of any given record will be the sum of contributing field scores, like I expected in the first place.

total score = max(field scores) + tie * sum(other field scores)
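The formula is easy to check with a small Python sketch. The per-field scores here are made up for illustration, and `dismax_score` is a hypothetical helper, not part of Solr.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores as max + tie * (sum of the other scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Hypothetical per-field scores for one record: title, topic, text.
scores = [4.0, 3.0, 0.5]

print(dismax_score(scores))           # tie=0.0: winner takes all
print(dismax_score(scores, tie=1.0))  # tie=1.0: plain sum of all fields
```

With the default tie of 0.0 only the best field counts, which is the winner-takes-all behaviour described earlier. Values between 0 and 1 let the other fields contribute partially.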

So, lesson learned. Probably. And the lesson has particular importance to me because the vast majority of our clients are libraries, archives or museums who spend time honing their metadata rather than relying on keyword search across masses of undifferentiated text. Must. Respect. Cataloguer.

Further Reading

Getting Dissed by Dismax – Why your incorrect assumptions about dismax/edismax are hurting search relevancy

Title Search: when relevancy is only skin deep

* Doug Turnbull, author of both articles above.
