Evaluation of Automated Tagging Solutions

04 February

As we at Silverchair and Semedica see more and more interest in automated tagging solutions (such as our Tagmaster system), we are more frequently encountering questions about how to evaluate their results. Here are a few ideas on the subject:

Evaluation: Humans Required!

There is no getting around it: you will need human editors (or professional indexers), along with your technology team (who will use the tags to build interesting new features), to verify that an automated system is working correctly and that the tagging is accurate and useful.

Recently, someone asked our CEO Thane Kerner if we had an automated system to verify the accuracy of our automated tagging. Thane replied (rather cheekily, I must say): “If we had an automated review system that could measure tagging accuracy more precisely than the current tagging system, we wouldn’t use it to verify tags, we’d use it to tag the content to begin with!” The lesson: Once you’ve deployed your best automated system to do the tagging, humans are the next logical reviewers.

Here are four factors your humans should consider in their review:

[caption id="attachment_372" align="alignright" width="300"]View inside Semedica's Tagmaster, showing tags automatically inserted at the paragraph level[/caption]

1.  Expert/Editorial Accuracy Confidence

One key evaluation target is how much confidence your key stakeholders (journal boards, editors, etc.) express in the output of the system. But confidence is not a linear equation. I posit the following values:

The first thing you’ll notice is the weight of positive relative to negative. In high-stakes fields (including science and medicine), humans are naturally biased to weigh negative experiences more heavily. (Of course, this bias has served us well in survival: “Don’t eat that type of berry again, it made you sick last time!”) In terms of confidence, that means stakeholders will need a disproportionate amount of positive reassurance to get over negative outcomes. And the impact of a particularly egregious negative outcome (resulting from a particularly poorly placed tag) can be devastating to your stakeholders’ impression of a tagging system. (This is why Silverchair’s system defaults to conservative methods with very little “guessing,” to avoid obviously irrelevant tag placement.)
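That asymmetry can be illustrated with a toy scoring model. The weights below are illustrative assumptions for the sake of the example, not measured values from any real evaluation:

```python
# Toy model of stakeholder confidence with a negativity bias.
# The weights are illustrative assumptions, not measured values.
POSITIVE_WEIGHT = 1.0   # credit earned by a well-placed tag
NEGATIVE_WEIGHT = -5.0  # penalty for a misplaced tag (weighted far more heavily)

def reviewer_confidence(good_tags: int, bad_tags: int) -> float:
    """Net confidence score: each bad tag outweighs several good ones."""
    return good_tags * POSITIVE_WEIGHT + bad_tags * NEGATIVE_WEIGHT

# Under these assumed weights, one misplaced tag cancels the goodwill
# from five accurate ones.
print(reviewer_confidence(good_tags=5, bad_tags=1))  # 0.0
```

The point of the sketch is simply that avoiding a single bad tag is worth more than adding several good ones, which is why a conservative tagging strategy tends to score better with human reviewers.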

2.  Usefulness!

The next key target, for both editorial and technical stakeholders, is the usefulness of the tagging applied. Tags should be highly relevant in a domain-specific context, and they should drive better discoverability and linking. Primary care, genetics, surgery, and emergency care all take very different approaches to the same topics, and their tagging should reflect those uses.

The tagging system you are evaluating may have added tagged concepts that are tangential or irrelevant to the content's use model, and such tags cannot drive innovative site features (in many cases, tangential tagging actually inhibits the ability of new systems to work effectively). For example, it is a nice-to-have if your tagging system can recognize place names and person names, but if it misses or miscategorizes important topics like clinical trial names, it doesn't matter how many people or places it can tag. (Clinical trial acronyms can be particularly tricky to tag―see our post about them.)

3.  Granularity

Does the system still work with “documents” or can it identify topics down to the section/paragraph/figure/table/equation level? At Silverchair we work with many dense medical chapters that may cover more than 200 distinct topics, so we see it as a necessity for our tagging system to break those documents down into smaller parts in order to deliver precise packets of highly relevant information to our users.
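The idea of paragraph-level granularity can be sketched in a few lines: instead of attaching tags to a whole document, the document is first broken into units that each carry their own tags. The data shape and function name below are hypothetical, for illustration only:

```python
# Minimal sketch of paragraph-level tagging units.
# The structure and names here are hypothetical illustrations,
# not the actual Tagmaster data model.

def split_into_units(document: str) -> list[dict]:
    """Break a document into paragraph-level units, each of which
    can carry its own set of tags."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [{"unit_id": i, "text": p, "tags": []}
            for i, p in enumerate(paragraphs)]

chapter = ("Aortic stenosis overview.\n\n"
           "Diagnosis via echocardiography.\n\n"
           "Treatment options and timing of intervention.")
units = split_into_units(chapter)
print(len(units))  # 3 units, each independently taggable
```

A dense chapter covering 200 topics then becomes a few hundred small, precisely tagged packets rather than one document with 200 undifferentiated tags.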

4.  Control and Ongoing Improvement

Any system you select is not going to be extremely accurate “out-of-the-box.” (I write that as a realist, not as a pessimist!) So during evaluation you must ask, “How easy is it to make impactful positive changes to the system?” Improvement can take a variety of forms: some systems require manually selecting training documents for each topic or category (which gets onerous when you have 20,000 topics), some allow your software developers to go in and tinker with the code (you have data-classification-expert software developers, right?!?), and some allow you to load and use a taxonomy or thesaurus to aid in topic identification and tagging (assuming a taxonomy/thesaurus exists or can be created for your domain).

At Silverchair, we work primarily in medicine, which is a taxonomy-rich domain with an ever-growing list of topics. For that reason, we’ve chosen the last method as our control and improvement strategy. Our editors update our Cortex medical taxonomy and its related thesaurus every day to keep pace with the topics being written about and searched for.
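The taxonomy/thesaurus-driven approach can be sketched as follows. A thesaurus maps synonyms to preferred concept labels, and tagging is conservative exact-phrase matching with no guessing. The tiny thesaurus below is invented for illustration; it is not the Cortex taxonomy:

```python
# Hedged sketch of thesaurus-driven tagging: synonyms map to a
# preferred concept label, and tagging is conservative phrase matching.
# This toy thesaurus is invented for illustration only.
THESAURUS = {
    "myocardial infarction": "Myocardial Infarction",
    "heart attack": "Myocardial Infarction",   # synonym -> preferred term
    "hypertension": "Hypertension",
    "high blood pressure": "Hypertension",     # synonym -> preferred term
}

def tag_text(text: str) -> set[str]:
    """Return preferred concept labels whose terms appear in the text."""
    lowered = text.lower()
    return {concept for term, concept in THESAURUS.items() if term in lowered}

print(sorted(tag_text("Heart attack risk rises with high blood pressure.")))
# ['Hypertension', 'Myocardial Infarction']
```

One appeal of this design is that improving coverage means editing thesaurus entries, which editors can do daily, rather than retraining models or changing code.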


If you choose a system that 1) is accurate enough to instill confidence in your editorial team, 2) is useful enough to drive meaningful new features and improvements, 3) classifies your data at a granular level, and 4) is flexible enough to allow explicit control and ongoing improvements―you’ve made a wise purchase!
