Data Science

Data Science

The Science Division is developing High Performance Data Analytics (HPDA) to significantly improve analytics in science-related fields. HPDA is increasingly recognized by data scientists and researchers in academia, government, and industry as a leading technology to mine and extract actionable knowledge and insights from Big Data in real time to accurately inform decision makers. The term Big Data is best defined as datasets whose size is beyond the ability of traditional hardware and software tools to capture, store, manage, and analyze.

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010 as an Apache project. Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data scientists to efficiently execute streaming, machine learning and other workloads that require fast iterative access to datasets. It works in scale-out, scale-up, and hybrid supercomputing environments. Performing compute (CPU) and data (I/O) intensive analytics on large, diverse biomedical data requires a data processing framework that optimizes efficient use of the underlying hardware capability. Apache Spark is that framework. For example, FedCentric’s SGI UV300 system has 256 cores that work with Spark to use all the available core. Spark essentially allows a scale-up system, with all of its advantages, to also exploit many of the benefits of a scale-out system by introducing a new data structure called the Resilient Distributed Dataset. This data structure allows large amounts of data to be operated on in parallel by Spark’s virtual executor nodes, under the management of a single fat virtual driver node. Since the nodes are virtual, internodal communication is instant. Each core acts as an executor with an independent java heap, amounting to 256 executors operating in parallel. This architecture gives us the ability to scale processes up to 256x faster than a single threaded system.

Graph analytics is the science of creating graph databases and algorithms that target granularity and weigh how individual data objects are related to each other. Determining how data objects relate to each other and their patterns of connections is more important than simply classifying and summarizing them. Graph models are gaining popularity as a better technology to analyze complex, heterogeneous data and could prove essential in the development and implementation of Precision Medicine.

Machine learning algorithms operate by developing models from example inputs to make data-driven predictions or decisions, rather than following strictly static program instructions. Deep learning is part of a broader family of machine learning methods based on learning representations of data. Neural networks are notable for being adaptive, which means they modify themselves as they learn from initial training and subsequent runs provide more information about the world.