For scholarly and educational publishers there inevitably comes a time when content must move to a different hosting platform. While the business drivers necessitating a move of legacy and current content are positive, the overall project and details can be daunting. However, just as moving to a new house presents a chance to sort out years of accumulation that no longer serves you, so too can platform migrations be an opportunity for your content and your business.
Data Conversion Laboratory (DCL) has a long history of helping Silverchair customers prepare content for ingestion and ensuring the seamless flow of that content into the new platform.
Selecting the best platform for your organization and partnering with service providers who have deep experience lead to a healthy digital transformation. This post explores the positive side effects publishers can experience during a platform migration.
Content Structure
Content structure is the foundation upon which search and discovery are based. At DCL, we’ve witnessed how content structure issues have a profound impact on scholarly work, and these issues tend to be buried deep in content.
Following are common issues DCL resolves when converting publishers’ content for migration.
- Math: Legacy content might embed mathematical equations as images rather than encoding them in a structured format like MathML. Converting those images into MathML provides meaningful, searchable content. It also improves accessibility for visually impaired readers.
- References: We’ve seen large-scale conversion projects in which bibliographic references are structured with a <reference> tag and little else. Restructuring references and validating them against a third-party database improves search capabilities and increases content coverage and link density across decades of bibliographic content.
- External links and validation: Ensuring the validity of URN links (links to external documents) is complicated because URNs follow a distinct structure and rules. URNs are short and do not contain unusual characters that can break URL-handling software. However, the links can break if documents are deleted or replaced with another document of the same name. Quality checking links is a critical step in every conversion project.
- Missing DOIs: It’s important to insert valid DOIs by either looking them up in Crossref or creating them based on an agreed format.
- Funding information: The growth of open science also means that funding and relevant grant information must be accurately identified in the published research. Often, grant information is buried in free-form content with no standardized wording or unique funding institution identifiers. At DCL, we’ve applied a combination of machine learning/natural language processing, pattern detection, and statistical techniques to build supervised learning datasets and automatically identify and extract grant content from both free-form and structured article text, namely granting organizations (sponsors), grant numbers, and grant recipients.
- Callouts: Often text callouts to assets such as PDFs, images, and supplementary content do not match the filename that they link to. Correcting callout links ensures a good user experience.
- Accessibility: Ensuring that content is accessible for the visually impaired is a requirement for U.S. federal agencies. But following accessibility guidelines also makes good business sense for publishers, because the underlying content structure improves search and discovery. Relevant standards include WCAG (AA) and Section 508 for accessibility, alongside broader compliance standards such as PCI, EU Cookie Policies, OWASP Application Security Verification Standards, and more.
- Content normalization: As content collections grow over time, inconsistencies creep in. We often normalize subjects/categories (such as consistent casing or plural/singular) and article-types, ensuring consistent categorization across the collection.
- Whitespace: “Nothing” can become quite complex in XML. Legacy conversions can collect issues that have been embedded in the content from previous conversion projects (e.g., SGML to XML). For example, in SGML white space was suppressed across the board, so if a few extra spaces were inserted in an element, applications would eliminate extra spacing. In XML, this space will get passed to the application, which can result in weird formatting and in some cases muddle the meaning.
- Special characters: Legacy content often mixes up en and em dashes, hyphens, and minus signs. Systematically correcting these characters clears up confusion and ambiguity, which is critical in scholarly and scientific literature.
- Technology: Moving 10-year-old content onto a new platform with better navigation, design, usability, and improved code supports an optimal user experience and the ultimate goal of knowledge dissemination.
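Two of the checks above, whitespace and callouts, lend themselves to a short illustration. The following Python is a minimal sketch, not DCL's actual tooling. It assumes JATS-style XML in which asset links are carried in `xlink:href` attributes, and the function names (`normalize_whitespace`, `find_broken_callouts`) and the sample markup are hypothetical. It collapses runs of whitespace in element text and reports link targets that do not match any delivered asset filename.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: links to assets use the XLink href attribute, as in JATS-style XML.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample: extra spacing in the paragraph and two asset callouts.
SAMPLE = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<p>See   the\n   results in <graphic xlink:href="fig1.tif"/> and '
    '<supplementary-material xlink:href="Table_S1.pdf"/>.</p>'
    '</article>'
)


def collapse_whitespace(text):
    """Collapse each run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", text)


def normalize_whitespace(root):
    """Collapse whitespace in the text and tail of every element, in place."""
    for elem in root.iter():
        if elem.text:
            elem.text = collapse_whitespace(elem.text)
        if elem.tail:
            elem.tail = collapse_whitespace(elem.tail)


def find_broken_callouts(root, asset_filenames):
    """Return link targets that match none of the delivered asset filenames."""
    available = set(asset_filenames)
    broken = []
    for elem in root.iter():
        target = elem.get(XLINK_HREF)
        # The comparison is exact, so differences in case or punctuation count.
        if target is not None and target not in available:
            broken.append(target)
    return broken


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    normalize_whitespace(root)
    print(repr(root.find("p").text))
    # The delivered file is named differently from the callout, so it is flagged.
    print(find_broken_callouts(root, ["fig1.tif", "table-s1.pdf"]))
```

A production pass would need to skip elements where spacing is significant (preformatted text or code listings, for example) and would typically compare callouts against the actual delivery package rather than a hand-built list.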
If we could recommend just one thing…
Plan early for updating content that’s going to a new platform. Don’t wait until you’ve selected your platform vendor and then start thinking about conversion. Discuss the conversion process with your platform vendor when you first engage with them. When going through the RFP process, give vendors as much information about your content as possible. Make sure your conversion vendor offers a programmatic deep dive to learn what’s going on with your library of content.