Organising 29 years of data
And creating a single valuable resource
The Art Newspaper held a database of its article archive dating back to 1989. The content originated from a mixture of sources, contained numerous duplicates, had missing information, and was stored in multiple conflicting data structures.
The resource had the potential to become immensely valuable if restructured and organised. Unfortunately, there was no practical way for the editors at The Art Newspaper to clean up the dataset by hand, which also meant that access couldn't be given to the public.
With the dataset containing more than 50,000 records, we understood that The Art Newspaper's staff did not have the time or capacity to manually go through, analyse, and organise it all. It was therefore vital to automate as much as possible before giving the editors access for any remaining manual cleaning.
The first, and most important, step was to restructure the data into a single, comprehensive format. To do this, we worked closely with the editors to understand the different structures in the data and the industry-standard terminology in use. The data was restructured around a shared vocabulary that made working with the content much easier for everyone involved in the project.
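To give a flavour of what that restructuring step looks like, here is a minimal, hypothetical sketch: several legacy record shapes use different keys for the same concepts, and each record is mapped onto one shared schema. The field names and aliases are illustrative assumptions, not The Art Newspaper's actual data model.

```python
# Hypothetical mapping from legacy field names to one shared schema.
# Each legacy source used its own keys for the same concepts.
FIELD_ALIASES = {
    "headline": ["headline", "title", "head"],
    "standfirst": ["standfirst", "subtitle", "deck"],
    "body": ["body", "content", "text"],
    "published": ["published", "date", "pub_date"],
}

def normalise(record: dict) -> dict:
    """Map a legacy record onto the shared schema, leaving gaps as None."""
    out = {}
    for field, aliases in FIELD_ALIASES.items():
        # Take the first alias that holds a non-empty value.
        out[field] = next((record[a] for a in aliases if record.get(a)), None)
    return out

legacy = {"title": "Vermeer rediscovered", "pub_date": "1996-03-01"}
print(normalise(legacy))
# {'headline': 'Vermeer rediscovered', 'standfirst': None,
#  'body': None, 'published': '1996-03-01'}
```

Because every record ends up with the same keys, later steps (gap detection, deduplication) only ever deal with one shape of data.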
With the new data structure in place, we could see which data were missing from each record and where valuable information could be derived from other fields. This meant the editors did not need to manually input missing data that could be deduced from the rest of the record.
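A deduction pass of this kind might look like the following sketch. The rules shown (recovering a publication year from an ISO date string, or a headline from the first line of the body) are invented examples of the idea, not the rules used in the project.

```python
# Hypothetical examples of deducing missing fields from other values.
def fill_deducible(record: dict) -> dict:
    """Return a copy of the record with deducible gaps filled in."""
    filled = dict(record)
    # Year can be read straight off an ISO "YYYY-MM-DD" date string.
    if not filled.get("year") and filled.get("published"):
        filled["year"] = filled["published"][:4]
    # A missing headline can fall back to the first line of the body.
    if not filled.get("headline") and filled.get("body"):
        filled["headline"] = filled["body"].splitlines()[0].strip()
    return filled

record = {"published": "1992-06-15", "body": "Biennale opens\nVenice hosts..."}
result = fill_deducible(record)
print(result["year"])      # 1992
print(result["headline"])  # Biennale opens
```

Every field filled this way is one less field an editor has to type by hand.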
The final step was to deduplicate content based on similar headlines, standfirsts, body content, and dates, reducing the remaining content down to only the useful articles.
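A deduplication pass along those lines can be sketched with Python's standard-library fuzzy matcher. The similarity threshold, the fields compared, and the "same date plus similar headline" rule are all assumptions for illustration, not the production logic.

```python
from difflib import SequenceMatcher

def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Treat two records as duplicates if dates match and headlines are near-identical."""
    same_date = a.get("published") == b.get("published")
    similarity = SequenceMatcher(
        None, a["headline"].lower(), b["headline"].lower()
    ).ratio()
    return same_date and similarity >= threshold

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each article, dropping later duplicates."""
    kept: list[dict] = []
    for record in records:
        if not any(is_duplicate(record, k) for k in kept):
            kept.append(record)
    return kept

records = [
    {"headline": "Louvre expands", "published": "1993-02-01"},
    {"headline": "Louvre Expands", "published": "1993-02-01"},  # near-duplicate
    {"headline": "Louvre expands", "published": "2001-05-01"},  # different date
]
print(len(deduplicate(records)))  # 2
```

Requiring the dates to match as well as the headlines guards against collapsing two genuinely different articles that happen to share a title.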
The finalised dataset was reduced to around 30,000 entries. Any records that still contained missing information were automatically flagged for the editors to address through our custom-built CMS.
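The flagging step itself can be as simple as the sketch below: check each record against a list of required fields and mark it for editorial review if anything is absent. The required fields and the flag names are illustrative assumptions.

```python
# Hypothetical flagging step: mark incomplete records so editors can
# filter for them in the CMS. The required fields are assumptions.
REQUIRED = ("headline", "body", "published")

def flag_missing(record: dict) -> dict:
    """Annotate a record with which required fields it lacks."""
    missing = [f for f in REQUIRED if not record.get(f)]
    return {**record, "needs_review": bool(missing), "missing_fields": missing}

print(flag_missing({"headline": "Tate reopens"})["missing_fields"])
# ['body', 'published']
```

Surfacing the exact missing fields, rather than a bare yes/no flag, lets editors see at a glance what each record still needs.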
Before the project started, The Art Newspaper had estimated the time required to manually clean the dataset: four months with a team of three people. By restructuring and processing the data automatically, we reduced the manual work to reviewing the flagged records, turning the clean-up into a feasible and genuinely useful exercise.