May 18, 2020 — As the research community is generating data from novel research and looking to leverage prior research (e.g., MERS, SARS), there is significant opportunity to integrate these diverse data sources to create an even more valuable data asset for scientific insight generation. This opportunity requires addressing very real data wrangling challenges.
Specifically, data are typically processed at the study or experiment level to answer targeted questions, and seamless integration can be hindered by differences in:
- Secondary analysis pipelines with different workflows, settings and reportables, e.g., raw counts vs. RPKM/FPKM
- Ontology engines using differing gene nomenclatures
- Quality control pipelines with differences in stringency and metrics used
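To make the "reportables" mismatch above concrete, here is a minimal sketch of how RPKM is derived from raw counts (the gene counts, lengths, and library size are toy values, not from any real study). Once a study publishes only RPKM, the original counts cannot be recovered without also knowing the library size and gene lengths, which is one reason differently processed studies resist naive merging:

```python
def rpkm(counts, lengths_bp, total_mapped_reads):
    """Reads Per Kilobase of exon per Million mapped reads.

    counts: raw read counts per gene
    lengths_bp: exon-model length of each gene in base pairs
    total_mapped_reads: library size (total mapped reads)
    """
    per_kb = [c / (length / 1_000) for c, length in zip(counts, lengths_bp)]
    per_million = total_mapped_reads / 1_000_000
    return [x / per_million for x in per_kb]

# Two hypothetical genes: the second gene has 4x the reads only because it is
# 4x as long, so after normalization its RPKM equals the first gene's.
values = rpkm(counts=[100, 400], lengths_bp=[1_000, 4_000], total_mapped_reads=500)
# → [200000.0, 200000.0]
```

The same raw counts run through a different secondary pipeline (e.g., one reporting FPKM or TPM) would yield a different matrix, so the choice of reportable must be harmonized before any cross-study comparison.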
A multi-layered and unified approach to data processing, such as the one developed by Tobias Guennel et al., is critical to fully leverage the information and content across these studies and experiments for scientific insight generation. Our experience has shown that this upfront investment in comprehensive data alignment delivers outsized value downstream by enabling rapid and flexible analyses across the entire data set, rather than an ad hoc approach with piecemeal alignment that can stifle momentum.
Key Steps of Data Integration
An example of how the data processing engine is deployed in the SARS-CoV-2 & COVID-19 data aggregation initiative is the integration of data set GSE147507 (transcriptional response to SARS-CoV-2 infection) and data set GSE75699 (transcriptome profiling of influenza virus-infected human bronchial epithelial cells). GSE147507 reported expression levels as raw counts annotated with HGNC gene symbols, while GSE75699 reported expression levels as RPKM (reads per kilobase of exon per million reads mapped) annotated with RefSeq IDs. Given this variation in both reportables and annotations, the datasets are not directly comparable without integration.
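A minimal sketch of the annotation half of this problem (the expression values and the RefSeq-to-HGNC mapping below are illustrative, not taken from the actual datasets): even after RefSeq accessions are translated to HGNC symbols so the two tables share keys, raw counts and RPKM remain in incommensurable units, which is why we reprocess from raw reads rather than merge the published matrices:

```python
# GSE147507-style table: raw counts keyed by HGNC gene symbol (toy values)
counts_hgnc = {"ACE2": 150, "TMPRSS2": 87, "IFNB1": 12}

# GSE75699-style table: RPKM keyed by RefSeq accession (toy values)
rpkm_refseq = {"NM_021804": 12.4, "NM_005656": 3.1}

# Illustrative RefSeq -> HGNC lookup; in practice this comes from a
# curated annotation database shared by both pipelines.
refseq_to_hgnc = {"NM_021804": "ACE2", "NM_005656": "TMPRSS2"}

# Re-key the RPKM table on HGNC symbols, dropping unmapped accessions.
rpkm_hgnc = {refseq_to_hgnc[rid]: v
             for rid, v in rpkm_refseq.items() if rid in refseq_to_hgnc}

# The tables now join on gene symbol, but counts_hgnc["ACE2"] and
# rpkm_hgnc["ACE2"] are still on different scales.
shared = sorted(counts_hgnc.keys() & rpkm_hgnc.keys())
```

Harmonizing identifiers solves only the annotation mismatch; the unit mismatch between raw counts and RPKM still requires reprocessing both studies through one pipeline, as described below.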
Our approach in this case was the following:
- Begin with raw sequence data (in this case, terabytes of data) for both experiments and run one consistent pipeline for both (rather than relying on the two secondary NGS pipelines executed independently by the original researchers)
- Quantify transcript expression using TPM (transcripts per million), as it is one of the recommended metrics for cross-dataset comparisons (see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1526-y and https://www.ncbi.nlm.nih.gov/pubmed/22872506)
- Use a common reference transcriptome and annotation database for both
- Evaluate and filter data based on the same QC metrics for both prior to differential expression analysis
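The TPM quantification in the steps above can be sketched as follows (the counts and gene lengths are toy values): unlike RPKM, TPM normalizes by gene length first and then scales so every sample sums to one million, which makes relative abundances comparable across libraries of different depths:

```python
def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    denom = sum(rates)
    return [r / denom * 1_000_000 for r in rates]

# Three hypothetical genes; genes 1 and 2 have identical length-normalized
# rates, so they receive identical TPM values.
sample = tpm(counts=[100, 400, 500], lengths_bp=[1_000, 4_000, 2_000])

# Every sample's TPM values sum to 1,000,000 by construction, so the
# proportions are directly comparable across samples and studies.
```

This fixed per-sample total is what the cited references identify as the property that makes TPM better suited than RPKM/FPKM for cross-dataset comparison.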
This consistent approach, both within and across assay technologies, establishes a common structure that permits analyses across the combined data set. Especially valuable, the data processing engines allow additional data sets to be mapped efficiently into the unified data asset on an ongoing basis. We can pull in a large volume of information that enables flexible exploration to support insight generation (which we know is likely to take a few unexpected turns) while minimizing the time and frustration of going back to the drawing board to work out how to add each new data set, again and again. This enabled us to perform causal inference to elucidate the mechanistic influences of SARS-CoV-2 infection/COVID-19, as will be detailed in upcoming posts.
If you are interested in reading more about our SARS-CoV-2 & COVID-19 data aggregation initiative, you can see our previous articles in the series below.
- Contextualizing SARS-CoV-2 Infection With Host Biology
- QuartzBio: COVID-19-Related Data Aggregation Initiative for Drug Development
Author: Scott Marshall | Managing Director | QuartzBio, part of Precision for Medicine