Government
Data Extraction - Metadata Extraction for Singapore Government Library
Tags: #Government #SQL #Excel #Library #Singapore #DataExtraction #DataEngineering #MySQL #ETL #Database
Background
The Government Library has a legacy database software that holds the Library’s digital assets. There are about 50 assets where the metadata are to be extracted from each of them. These materials are stored in a variety of formats e.g. text, photos, video, audio etc. and are digitized and stored for preservation and access.
The current challenge was that the attributes of each asset spread out across multiple tables within the MySQL database making it very difficult for the staff to consolidate all the attributes of all the assets within a single .csv file.
Furthermore, Analytico Asia was required to load a database which was separated by .MYI, .MDI & .SDI files into MySQL Workbench. This is a technically complex challenge since database files are required to be in .sql format to be loaded into MySQL database.
Since the Library did not have direct access to the application environment due to missing credentials, it was not possible to invoke a password reset to dump out (export) the entire databse in .sql format.
Problem Statement
How can the Library extract the raw metadata from a set of database files that are not in a singular .sql format and transfer the metadata to a CSV file?
Solution
Analytico Asia develop a robust framework involving 3 phases that served as a work around to loading non-sql files into MySQL server. We then load SQL queries to join the multiple tables and created a consolidated view of all the attributes of all the library assets.
Phase 1: Extraction Scripts - Retrieve metadata fields from DB
Phase 1 involved developing, testing and applying SQL scripts to query the various database file to extract the required metadata fields for the assets listed in the database files. The outputs of these queries were processed in Phase 2.
Phase 2: Transformation Scripts - Data cleaning and encoding
Phase 2 involved mapping the metadata attributes found in the different database files to their corresponding assets. As far as possible, we showed the actual values of metadata attributes as opposed to the IDs. For example, instead of showing raw country ID as the value for country, we will show actual country name for a particular asset.
Additionally, all NULL and & values will be displayed as “” and “&” respectively. HTML code version of certain punctuations present in the metadata will be transformed to its original non-HTML form. Separately, date fields in the metadata will be standardized as DD/MM/YYYY.
Once the required data cleaning involving data granularity, standardization, and removal of html code had been completed, we processed these outputs in Phase 3 to make it ready for submission.
Phase 3: Consolidation Scripts - Export formatted metadata to Excel
Phase 3 involved creating 2 worksheets in an Excel file where one worksheet will display each metadata field in a separate column whereas the other worksheet will display the combined metadata fields within one column.