February 26, 2021 — Public data assets, such as TCGA, are rich resources for enabling pre-clinical discovery, translational, and clinical biomarker teams to make faster drug development decisions that are informed by human disease data.

Historically, real-world human disease data played little role at these stages of the drug life cycle, simply because it did not exist. Now that it does, researchers need to determine how best to integrate findings from public data.

In our last article on derisking drug development using TCGA, we discussed specific applications as well as some of the inherent challenges facing researchers who seek to use TCGA. In particular, significant data QC, processing, and normalizing steps are required to align and format the data as “analysis-ready” before translational researchers can begin interrogating TCGA. These procedures need to be robust in order to guarantee reproducibility, but flexible in order to integrate potential changes in TCGA annotations.

Given the challenges of using TCGA (including keeping pace as additional data are added), translational researchers increasingly find that pre-aligned and pre-processed TCGA data accelerate their ability to generate reliable insights to inform critical drug development decisions.

Technology Architecture to Support Multi-Omic Biomarker Interrogation

To address this core need, the QuartzBio team has invested in infrastructure (Figure 1) to process public data and integrate it with project-specific sponsor data (preclinical or clinical) in order to discover actionable insights. For example, insights could reveal answers to questions like, “Does tumor type A or B exhibit more drug MOA target pathway activity?” or “Which cell line or mouse model best reflects my disease of interest at the mechanistic level?”

Before it can be integrated, ‘omics data often needs to be processed (e.g., by normalization, examples of which will be discussed below), then cross-referenced between data sets (e.g., ensuring that gene and protein labels can be mapped between data types).
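
As a rough illustration of the cross-referencing step, the Python sketch below joins an mRNA table keyed by versioned Ensembl gene IDs to a protein table keyed by gene symbols. The function name `harmonize_labels`, the toy lookup table, the version suffixes, and all values are assumptions made for this example; they are not taken from the QuartzBio pipeline.

```python
def harmonize_labels(mrna_by_ensembl, protein_by_symbol, ensembl_to_symbol):
    """Join mRNA and protein values on gene symbol.

    Returns (merged, unmapped):
      merged   -- {symbol: (mrna_value, protein_value)} for genes found in both tables
      unmapped -- Ensembl IDs that have no entry in the lookup table
    """
    merged = {}
    unmapped = []
    for ensembl_id, mrna_value in mrna_by_ensembl.items():
        # Drop the version suffix, e.g. "ENSG00000136244.12" -> "ENSG00000136244".
        base_id = ensembl_id.split(".")[0]
        symbol = ensembl_to_symbol.get(base_id)
        if symbol is None:
            unmapped.append(ensembl_id)
            continue
        if symbol in protein_by_symbol:
            merged[symbol] = (mrna_value, protein_by_symbol[symbol])
    return merged, unmapped


# Toy inputs (illustrative values; "ENSG_UNKNOWN.1" is a made-up ID).
ensembl_to_symbol = {"ENSG00000136244": "IL6", "ENSG00000141510": "TP53"}
mrna_by_ensembl = {
    "ENSG00000136244.12": 5.2,
    "ENSG00000141510.18": 7.9,
    "ENSG_UNKNOWN.1": 0.3,
}
protein_by_symbol = {"IL6": 1.1, "EGFR": 2.4}

merged, unmapped = harmonize_labels(mrna_by_ensembl, protein_by_symbol, ensembl_to_symbol)
print(merged)    # genes measured in both data types
print(unmapped)  # IDs that could not be cross-referenced
```

Reporting the unmapped IDs alongside the merged table makes it easier to spot labels that silently fall out of an analysis when annotations change.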

Figure 1. A multiomic data processing engine generates the ready-to-query QuartzBio-TCGA data set and a fully documented processing package.

Multi-Omic Processing Engine Creates Analysis-Ready Public Data and Fully Documented Data Processing Package

QuartzBio’s multiomic processing engine has been used to generate “QB-TCGA,” a fully processed public dataset accompanied by an API that can be used to reliably query the data and extract subsets for more specific analyses (e.g., based on disease type or clinical characteristics).
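
The QB-TCGA API itself is not described in this article, so the following is only a generic Python sketch of what a subset query by disease type or clinical characteristic could look like on in-memory records. The helper `select_samples` and the field names are hypothetical.

```python
def select_samples(records, **criteria):
    """Return the records whose fields equal every requested criterion."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Toy sample annotations (hypothetical field names and values).
samples = [
    {"sample_id": "S1", "disease": "BRCA", "stage": "II"},
    {"sample_id": "S2", "disease": "LUAD", "stage": "III"},
    {"sample_id": "S3", "disease": "BRCA", "stage": "III"},
]

brca_stage_iii = select_samples(samples, disease="BRCA", stage="III")
print([s["sample_id"] for s in brca_stage_iii])
```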

This processing engine has enhanced TCGA in six areas to enable deeper insights:

1. Mapping across multiple data modalities:

  • miRNA
  • mRNA

2. Enhanced quality control (QC):

  • Batch/sample effect analysis resulting in the separation of FFPE sample data from data generated from fresh-frozen samples
  • Removal of confounding samples with multiple replicates
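
A minimal Python sketch of these two QC steps might look like the following. The record layout, the `split_and_filter` name, and the replicate rule are assumptions: the text does not say whether all replicates are dropped or one is retained, so this example simply drops every specimen that has more than one fresh-frozen aliquot.

```python
from collections import Counter


def split_and_filter(aliquots):
    """Separate FFPE from fresh-frozen aliquots and drop replicated specimens."""
    ffpe = [a for a in aliquots if a["preservation"] == "FFPE"]
    frozen = [a for a in aliquots if a["preservation"] == "fresh_frozen"]
    # Count aliquots per specimen within the fresh-frozen group.
    replicate_counts = Counter(a["specimen_id"] for a in frozen)
    # Assumed rule: a specimen with more than one aliquot is removed entirely.
    frozen_unique = [a for a in frozen if replicate_counts[a["specimen_id"]] == 1]
    return ffpe, frozen_unique


# Toy aliquot table (hypothetical IDs).
aliquots = [
    {"aliquot_id": "A1", "specimen_id": "P1-T", "preservation": "fresh_frozen"},
    {"aliquot_id": "A2", "specimen_id": "P1-T", "preservation": "fresh_frozen"},
    {"aliquot_id": "A3", "specimen_id": "P2-T", "preservation": "fresh_frozen"},
    {"aliquot_id": "A4", "specimen_id": "P3-T", "preservation": "FFPE"},
]

ffpe, frozen_unique = split_and_filter(aliquots)
print([a["aliquot_id"] for a in ffpe])
print([a["aliquot_id"] for a in frozen_unique])
```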

3. Enhanced flexibility through multiple normalization methods

  • Variance-stabilizing transformation (VST) normalization across all samples/tumor types
  • TPM normalization for comparison of genes within a sample
  • Z-score creation across all samples/tumor types using VST-normalized counts (vst-norm-counts)
  • Raw counts as a substrate for additional normalization methods that may be required
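
To make two of these transforms concrete, here is a small Python sketch of TPM (transcripts per million) and per-gene z-scores, starting from raw counts. The gene names, counts, and lengths are toy values, and the z-scores are computed on arbitrary example numbers rather than on real VST output, since VST is normally produced by a dedicated tool and is not reimplemented here.

```python
from statistics import mean, stdev


def tpm(counts, lengths_bp):
    """TPM for one sample: divide counts by gene length in kb, then scale to sum to 1e6."""
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def zscores(values):
    """Z-scores for one gene across samples, using the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


# Toy raw counts and gene lengths for a single sample.
counts = {"geneA": 100, "geneB": 200, "geneC": 300}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}
sample_tpm = tpm(counts, lengths_bp)
print(sample_tpm)

# Toy normalized values for one gene across three samples.
gene_z = zscores([1.0, 2.0, 3.0])
print(gene_z)
```

Because TPM values sum to one million within each sample, they suit within-sample comparisons of genes, while z-scores describe how one gene varies across samples. A gene with identical values in every sample has zero standard deviation and would need separate handling.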

4. Annotation with a proprietary set of network signatures that we maintain, enabling patient profiling and stratification. These signatures represent:

  • Molecular activities (e.g., the kinase activity of MAPK13)
  • Pathway activities (e.g., PI3K signaling)
  • Drug MOA (e.g., prednisolone treatment)
  • Biological processes (e.g., cell cycle)
  • Drug response (e.g., response to IO therapies)
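
The proprietary network signatures and their scoring method are not described here. Purely as a generic illustration of how a gene signature can be turned into a per-sample score for stratification, the Python sketch below averages the z-scores of a signature's genes. The `signature_score` function, the gene list, and the values are assumptions for this example only.

```python
from statistics import mean


def signature_score(zscores_by_gene, signature_genes, sample_id):
    """Mean z-score of the signature genes that are present in the data."""
    values = [
        zscores_by_gene[gene][sample_id]
        for gene in signature_genes
        if gene in zscores_by_gene
    ]
    if not values:
        raise ValueError("no signature genes found in the expression data")
    return mean(values)


# Toy z-scores per gene and sample (illustrative values only).
zscores_by_gene = {
    "CDK1": {"S1": 1.0, "S2": -1.0},
    "CCNB1": {"S1": 2.0, "S2": -0.5},
    "MKI67": {"S1": 3.0, "S2": 0.0},
}
# Illustrative gene list; TOP2A is absent from the toy data and is skipped.
cell_cycle_genes = ["CDK1", "CCNB1", "MKI67", "TOP2A"]

score_s1 = signature_score(zscores_by_gene, cell_cycle_genes, "S1")
score_s2 = signature_score(zscores_by_gene, cell_cycle_genes, "S2")
print(score_s1, score_s2)
```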

5. Annotation with associated clinical data

  • Treatment information
  • Overall survival (OS)
  • Progression-free survival (PFS)
  • Demographic information

6. Annotation with immune cell population inferences derived from deconvolution methods

  • ABIS (Figure 2) [https://pubmed.ncbi.nlm.nih.gov/30726743/]
  • xCell [https://pubmed.ncbi.nlm.nih.gov/29141660/]

Figure 2. QB-TCGA Provides Context on Tumor Microenvironment / Immune Cell Populations

Figure 2. By integrating deconvolution methods into QB-TCGA, this processed dataset enables correlation between cell type populations and biomarkers. For example, the question could be asked, “Do patients with higher mRNA expression levels of IL6 also exhibit higher proportions of MDSCs or memory B cells?” The data could reveal how cell populations change between tumor and adjacent tissue, and additional correlations could be made with related signatures, such as IO response, PD-L1, and drug signatures (e.g., VEGF inhibitors, other chemotherapies).
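
One simple way to pose the IL6 question above is a rank correlation between expression values and inferred cell proportions. The Python sketch below computes Spearman's rho with the standard library only. The numbers are invented for illustration, and nothing here is taken from the QB-TCGA implementation.

```python
import math
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: IL6 expression and inferred MDSC proportion in five patients.
il6_expression = [1.2, 3.4, 2.2, 5.0, 4.1]
mdsc_proportion = [0.05, 0.12, 0.08, 0.30, 0.20]

rho = spearman(il6_expression, mdsc_proportion)
print(rho)
```

A rank-based measure is used here because inferred cell proportions are often skewed. A real analysis would also need a significance test and correction for multiple comparisons.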

QB-TCGA Fully Documented Data Processing Package Accelerates Reproducible, Collaborative Insight Generation

Experience working with multiple sponsors to extract translational insights led the QuartzBio team to build a fully documented R package for reproducibly processing TCGA, an approach that can now be adapted to other public datasets.

The pipelines used to process TCGA are built on the foundations of the QuartzBio team’s knowledge of:

  • What information TCGA can deliver
  • The provenance of this information (for example, “What platform was used to run each RNA sequencing experiment? Were samples run in batch?”)
  • How to integrate TCGA data together
  • How to derive reliable annotations

The result is QB-TCGA, a documented, dockerized application that can be used by multiple scientists simultaneously and still generate the same answers every time. We are excited to apply this unique asset towards the toughest challenges in oncology.

Next: The Knowledge Engine Turns Data into the QuartzBio Knowledgebase

Once public data has been appropriately processed, the QuartzBio team applies knowledge-driven analysis to generate the QuartzBio Knowledgebase.

Stay tuned for our next article to learn how this knowledgebase was built, how it is maintained, and how it has been applied to prioritize therapeutic indications.
