High Performance Data Analytics in Precision Medicine
Until recently, most medical treatment models have been designed for the “average patient.” This “one-size-fits-all” approach results in treatments that are successful for some patients but unsuccessful for others.
Precision Medicine is an innovative medical treatment model that takes into account individual differences in patients’ genomes, environments, and lifestyles. It gives doctors and medical researchers the ability to customize healthcare, including medical decisions, practices, and products, for individual patients.
Advances in Precision Medicine have already led to powerful new discoveries and several new treatments tailored to specific characteristics, such as a person’s genetic makeup, or the genetic profile of an individual’s tumor. This improves chances of survival and reduces exposure to adverse effects.
The Science Division is developing in-memory graph analytics and machine learning technologies to help medical professionals extract knowledge and insights from large, complex genomic databases to improve Precision Medicine for a number of diseases. For example, in-memory graph analytics is a very powerful tool for cancer researchers to determine relationships between genetic variants and specific types of cancer.
Traditional analytical approaches currently used by researchers are too slow to reveal useful information. The Science Division’s in-memory graph analytics and machine learning technologies enable medical professionals to visualize and mine connected data to develop Precision Medicine in ways that are orders of magnitude faster and more intuitive than traditional methods.
The Science Division’s Genomic Pipeline Team has achieved significant speed improvements (10x-100x) relative to recent benchmarks in the genomic sequence alignment and variant discovery pipeline using the SGI UV300 scale-up architecture. Because most algorithms and code were written for scale-out configurations, optimizing them for scale-up has been a challenge. There is additional progress to be made, however, by optimizing specific steps that we expect to perform better in a single, large-memory instance, such as de novo assembly and joint variant discovery, and by targeting large sequence files over 100GB. Based on benchmark results from scale-out and scale-up systems, the Science Division is developing a hybrid supercomputing system that will route each step in the pipeline to the hardware best able to process the output data from the previous step.
Aligning hundreds of thousands to millions of individuals’ genome sequencing output to a reference genome is a highly parallelized process that is optimized for scale-out supercomputing using CPU and FPGA processors. Subsequent cleaning and other pre-processing steps also run better on scale-out architectures. However, de novo assembly of sequencing reads utilizes De Bruijn graph technologies to improve assembly results, and sharding graphs across a cluster of scale-out machines significantly decreases speed and performance. De novo assembly may be much faster on a scale-up machine using GP-GPU processors. Joint genotyping in the variant discovery part of the pipeline is improved by comparing multiple genomes together against the reference genome.
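To make the graph structure behind de novo assembly concrete, here is a minimal Python sketch of building a De Bruijn graph from sequencing reads. The reads and k-mer size are toy values; production assemblers hold billions of k-mers in memory and add error correction, which is what makes sharding across scale-out nodes so costly.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read contributes one directed edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy reads with k=3; contigs are later recovered by walking paths in this graph.
graph = de_bruijn_graph(["ACGTG", "CGTGA"], k=3)
```

Assembly then reduces to finding paths (ideally Eulerian paths) through this graph, which is a traversal workload that benefits from keeping the whole graph in one large memory space.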
The larger the volume of data that must be analyzed together, the better the fit for a single, large-memory scale-up machine. The hybrid supercomputing approach will greatly improve speed and performance throughout the genomic pipeline and in downstream analysis of pipeline output data.
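The routing idea behind the hybrid system can be sketched as a simple dispatch rule. The step names and the 100GB threshold follow the description above, but the routing table and function are a hypothetical illustration, not the actual scheduler:

```python
# Steps whose access patterns favor a single large-memory machine,
# per the pipeline discussion above (illustrative, not exhaustive).
SCALE_UP_STEPS = {"de_novo_assembly", "joint_genotyping"}
LARGE_FILE_BYTES = 100 * 1024**3  # inputs over ~100GB favor large memory

def route_step(step, input_bytes):
    """Pick the target architecture for one pipeline step."""
    if step in SCALE_UP_STEPS or input_bytes > LARGE_FILE_BYTES:
        return "scale-up"
    return "scale-out"
```

Under this rule, embarrassingly parallel steps such as alignment stay on the scale-out cluster unless their inputs grow past the memory threshold.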
The Cancer Analytics team is developing a large graph database of heterogeneous cancer data with advanced machine and deep learning algorithms to quickly find unique, potentially important relationships in complex data. The past several years have seen a significant increase in high-throughput experimental studies that catalog variant datasets using massively parallel sequencing. New insights of biological significance can be gained by combining this information with multiple genomic-location-based annotations. However, efforts to obtain these insights by integrating and mining variant data have had limited success so far, and no efficient method has yet been developed that is scalable, practical, and applicable to millions of variants and their related annotations. Relational databases have proven capable of handling these tasks, but they are an inflexible tool for analysis of variant data.
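To make the contrast with relational tables concrete, here is a toy Python sketch of variant data held as a property graph of adjacency sets, where a relationship query becomes a short traversal rather than a chain of joins. The node labels, including the variant coordinate, are hypothetical examples, not records from the actual database:

```python
from collections import defaultdict

# Undirected property graph stored as adjacency sets; node labels encode type.
edges = defaultdict(set)

def link(a, b):
    edges[a].add(b)
    edges[b].add(a)

# Hypothetical patients sharing one annotated variant.
link("patient:P1", "variant:chr17:41245466:G>A")
link("patient:P2", "variant:chr17:41245466:G>A")
link("variant:chr17:41245466:G>A", "gene:BRCA1")

def patients_with_variant_in(gene):
    """Two-hop traversal: gene -> variants -> patients."""
    return {n for v in edges[gene] if v.startswith("variant:")
              for n in edges[v] if n.startswith("patient:")}
```

Adding a new annotation type is just a new node label and edge, with no schema migration, which is the flexibility relational tables lack here.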
Representing the data within graph data structures shows promise for determining which specific genomic locations are highly correlated with phenotypic outcomes or even specific variables. The team is expanding the graph database to contain cancer data from The Cancer Genome Atlas (TCGA). We incorporated a variety of new data types, including variant data, gene expression, miRNA, DNA methylation, clinical data, and copy number variation from over 45,000 cancer patients. Cancer researchers can quickly query variants and individuals, and apply spectral clustering (machine learning) algorithms to group people based on their genetic variation. The result is not only descriptive analytics that yield information about populations, individuals, genes, and variants; through a layered combination of graph analytics and machine learning, we can also deliver predictive analytics to better inform Precision Medicine decisions.
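A minimal sketch of the spectral clustering step, assuming a precomputed patient-similarity matrix; the toy weights below stand in for similarity derived from shared genetic variants, and the two-way split via the Fiedler vector is the simplest form of the technique:

```python
import numpy as np

def spectral_bipartition(similarity):
    """Split samples into two clusters using the Fiedler vector of the
    unnormalized graph Laplacian L = D - W."""
    W = np.asarray(similarity, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # eigh returns eigenvectors sorted by ascending eigenvalue.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]          # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# Toy similarity: two tight groups (samples 0-1 and 2-3), weak cross links.
W = [[0.00, 1.00, 0.01, 0.01],
     [1.00, 0.00, 0.01, 0.01],
     [0.01, 0.01, 0.00, 1.00],
     [0.01, 0.01, 1.00, 0.00]]
labels = spectral_bipartition(W)
```

Full spectral clustering takes k eigenvectors and runs k-means on them, but the sign split above already recovers the two groups in this toy example.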
The Science Division is applying HPDA and scale-up supercomputing to infectious disease data to develop a robust epidemiological tool that can be used to quickly identify, monitor, and contain outbreaks wherever they occur, from remote hot zones in Africa to a nosocomial infection in a local hospital. Constructing phylogenetic trees from pathogen genomes is computationally intensive; scale-up supercomputing allows researchers to significantly speed up this calculation, quickly producing accurate trees for real-time analysis and tracking of microbial infections.
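As an illustration of the kind of tree construction involved, here is a minimal UPGMA sketch over a toy pairwise distance matrix. UPGMA is one simple distance-based method; the sequence names and distances are hypothetical, and the division's actual phylogenetic algorithm is not specified here:

```python
def upgma(dist, names):
    """Minimal UPGMA: repeatedly merge the closest pair of clusters.
    `dist` maps frozenset pairs of cluster names to distances."""
    clusters = {n: 1 for n in names}  # cluster name -> number of leaves
    while len(clusters) > 1:
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda p: dist[frozenset(p)],
        )
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            dist[frozenset((merged, c))] = (
                size_a * dist[frozenset((a, c))]
                + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

# Toy distances over three hypothetical pathogen sequences.
dist = {frozenset(("A", "B")): 2.0,
        frozenset(("A", "C")): 6.0,
        frozenset(("B", "C")): 6.0}
tree = upgma(dist, ["A", "B", "C"])   # Newick-like: "((A,B),C)"
```

The quadratic pairwise-distance table is what grows quickly with outbreak size, which is why keeping it in a single large memory space speeds up real-time tracking.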