Finding Hidden Text With a Specialized Thesaurus

14 September

Map created by Robert Louis Stevenson in Treasure Island (image from Wikipedia)

When good authors write, they choose the terminology they want to describe the topics they are addressing and use that terminology consistently throughout the text. This, of course, is good for readers in terms of internal clarity and consistency.

But this authoring strategy is distinctly disadvantageous to discovery (search) and integration (linking) in modern web applications. Why? Because every time an author makes a terminology choice, they exclude other equivalent options. These excluded options could include terminology that other authors have chosen or that potential readers prefer. I’m not blaming the authors, of course; their writing would be nonsense if they included all equivalent choices in their text.

So how do you deal with these missing options? Thesauri to the rescue! Every web search, linking, and categorization system should employ some form of thesaurus behind the scenes. And in specialized areas like medicine, you’ll need a specialized thesaurus rather than a basic broad one. This thesaurus should include synonyms, acronyms, abbreviations, and jargon, and should be based on real-world authoring and searching behavior (rather than academic nit-picking).

In essence, a thesaurus expands the author’s original text into much richer data for automated searching and linking algorithms. Let’s look at an example:

ACTUAL TEXT: Chemoreceptors in the carotid bodies and medulla are activated by hypoxemia, acute hypercapnia, and acidemia.

EXPANDED TEXT1: Chemoreceptors in the carotid bodies (carotid glomus, glomera carotica, glomus caroticum, glomus caroticus) and medulla (adrenal medulla, medulla oblongata, glandula suprarenalis, suprarenal medulla, adm, metepencephalon, medullary, myelencephalon) are activated by hypoxemia (hypoxaemia, arterial hypoxemia), acute hypercapnia (blood carbon dioxide increased, blood co2 increased, carbon dioxide retention, carbon dioxide, increased level, hypercapnemia, hypercapnaemia, hypercarbia, pco2 increased on arterial blood gas, elevated pco2, retention carbon dioxide, serum carbon dioxide increased), and acidemia (acidaemia).

The actual text as written was 15 words. The expanded text was 71 words, or approximately 4.7 times as long. Humans read the first sentence, and machines read the second.
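To make the expansion step concrete, here is a minimal Python sketch. It is not how Cortex or any particular product works; the dictionary layout, the `THESAURUS` name, and the `expand` function are assumptions made for illustration, and the thesaurus holds only a handful of the terms from the example above.

```python
import re

# Tiny illustrative thesaurus. The terms come from the example above;
# the dictionary structure and the names used here are assumptions.
THESAURUS = {
    "carotid bodies": ["carotid glomus", "glomus caroticum"],
    "hypoxemia": ["hypoxaemia", "arterial hypoxemia"],
    "hypercapnia": ["hypercarbia", "hypercapnemia", "elevated pco2"],
    "acidemia": ["acidaemia"],
}


def expand(text, thesaurus):
    """Append each matched term's synonyms in parentheses after the term."""
    # Try longer terms first so multi-word entries win over shorter ones.
    terms = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def add_synonyms(match):
        term = match.group(0)
        synonyms = ", ".join(thesaurus[term.lower()])
        return f"{term} ({synonyms})"

    return pattern.sub(add_synonyms, text)


actual = ("Chemoreceptors in the carotid bodies and medulla are activated "
          "by hypoxemia, acute hypercapnia, and acidemia.")
expanded = expand(actual, THESAURUS)

print(expanded)
print(len(actual.split()), "words ->", len(expanded.split()), "words")
```

The expanded string is what a search index would store; the reader still sees only the original sentence.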

With a good thesaurus, a user’s search will match this text no matter which equivalent term they type (“hypercapnia” vs. “hypercarbia,” for example).
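The same idea can also be applied on the query side. The sketch below, again only an assumption about how one might do it, looks up the searcher's term in a set of synonym groups and checks whether any equivalent appears in the original text. `SYNONYM_GROUPS`, `equivalents`, and `matches` are hypothetical names, and the plain substring test stands in for a real search index.

```python
# Hypothetical synonym groups; the terms come from the example above.
SYNONYM_GROUPS = [
    {"hypercapnia", "hypercarbia", "hypercapnemia", "hypercapnaemia"},
    {"hypoxemia", "hypoxaemia", "arterial hypoxemia"},
    {"acidemia", "acidaemia"},
]

TEXT = ("Chemoreceptors in the carotid bodies and medulla are activated "
        "by hypoxemia, acute hypercapnia, and acidemia.")


def equivalents(query):
    """Return every term the thesaurus treats as equivalent to the query."""
    q = query.strip().lower()
    for group in SYNONYM_GROUPS:
        if q in group:
            return group
    # Unknown terms are searched for as typed.
    return {q}


def matches(query, text):
    """True if the query or any of its equivalents occurs in the text."""
    lowered = text.lower()
    return any(term in lowered for term in equivalents(query))


for query in ("hypercapnia", "hypercarbia", "hyperkalemia"):
    print(query, "->", matches(query, TEXT))
```

Whether to expand the stored text, the query, or both is a design choice; expanding the stored text, as in the example above, keeps the work out of the search path.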

Are readers finding what they want on your web site this easily?

1 Thesaurus source: Silverchair’s Cortex taxonomy, with references to SNOMED, Read Codes, MeSH, Digital Anatomist, NCI Thesaurus, NeuroNames Brain Hierarchy, MedDRA, WHO Adverse Reaction Terminology, OMIM, DXplain, CRISP Thesaurus, Clinical Problem Statements, and COSTART.

