Having been in the industry for over 30 years, the client had amassed an enormous collection of content. This resource was invaluable, and they wanted to optimise it for their new digital strategy.
50,000 records of varying complexities
The dataset consisted of over 50,000 records, spread over multiple databases and structured in conflicting schemas. The first step was to migrate everything to a single location. Working with the client, we defined one consistent structure (schema) for the records and started processing them into a single database.
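As a rough illustration of this step, the sketch below maps records from two differently structured sources onto one unified schema. The source names (`legacy_a`, `legacy_b`), the field names, and the `to_unified` helper are all hypothetical, chosen only for the example; the real databases and schema are not described here.

```python
from typing import Any

# Hypothetical mappings: field name in each source -> field name in the unified schema.
FIELD_MAPS = {
    "legacy_a": {"headline": "title", "summary": "excerpt", "content": "body", "pub_date": "published"},
    "legacy_b": {"title": "title", "teaser": "excerpt", "text": "body", "date": "published"},
}

# Assumed unified schema for the example.
UNIFIED_FIELDS = ("title", "excerpt", "body", "published")


def to_unified(record: dict[str, Any], source: str) -> dict[str, Any]:
    """Return a copy of `record` renamed into the unified schema."""
    mapping = FIELD_MAPS[source]
    # Start with every unified field present so all records share one shape.
    unified: dict[str, Any] = {field: None for field in UNIFIED_FIELDS}
    for src_field, value in record.items():
        target = mapping.get(src_field)
        if target is not None:  # fields without a mapping are dropped
            unified[target] = value
    # Remember where the record came from, which helps when comparing copies later.
    unified["source"] = source
    return unified
```

Keeping the mappings in plain data rather than in code means a new source can be added by writing one more dictionary.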
Duplicates of duplicates of duplicates
With all the data in one location, we noticed another problem: the different databases sometimes held several versions of the same record. We needed to work out which version was the most accurate one to keep before deleting the duplicates. We did this by grouping the duplicates based on common elements such as titles, excerpts, and body text, and then finding the one with the most information or the newest date. The rest were marked for deletion.
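The grouping and selection described above could be sketched roughly as follows. This is a simplified illustration, not the process actually used: it groups on a normalised title only, measures "most information" as total text length, and uses the newest date as a tie-breaker. The field names and the `status` values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def normalise(text):
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join((text or "").lower().split())


def completeness(record):
    """Crude proxy for 'most information': total length of the text fields."""
    return sum(len(str(record.get(field) or "")) for field in ("title", "excerpt", "body"))


def pick_keeper(group):
    """Prefer the fullest record; fall back to the newest date on a tie."""
    return max(group, key=lambda r: (completeness(r), r.get("published") or date.min))


def mark_duplicates(records):
    """Group records by normalised title and mark all but one per group for deletion."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise(record.get("title"))].append(record)
    for group in groups.values():
        keeper = pick_keeper(group)
        for record in group:
            record["status"] = "keep" if record is keeper else "delete"
    return records
```

Marking records rather than deleting them straight away leaves room for a manual check before anything is removed.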
The end goal
It was important for the editors at The Art Newspaper to go through the content to make sure that the resource was clear of any dummy content and that the remaining records were valid. We created a simple Content Management System that let them review each record, mark it with a status, and clean up anything that needed manual improvement.
The hard work accomplished by the editors meant that The Art Newspaper had a rich resource to offer their readers, spanning 30 years of successful reporting on the art world.
Kind words from the client
We are very happy to have wearegoat as our long-term partner: they are stellar, when it comes to research-based solutions, solving complex problems in ever-changing environments, and creating solutions that actually work.