Archived site. See the  latest event.
Mayur Khanna
Mayur Khanna

Nov 30, 2023, 12:00 PM CET

Watch on YouTube

Leveraging Nextflow in a distributed cloud infrastructure to analyze genetic data at scale

Community

Mayur Khanna & Benjamin Alexander


The quantity and quality of genetic data has been increasing at an exponential rate over the last few decades. These increases present new challenges for both the storage and analysis of genetic datasets. Systems such as High-Performance Computing (HPC) environments offer increased computational power for genetic analyses but cannot exceed the limitations of the deployed hardware. Further, computational algorithms are increasing in complexity, but parallelization can be difficult and time-consuming to implement. Variability in algorithm implementation, underlying libraries, and hardware stacks can lead to artifacts and unreproducible results. To address these challenges, the Genetic Informatics Platform (GIP) harnesses the power of cloud computing to provide a highly scalable framework for scientists to ingest genetic data and to build reproducible workflows. Nextflow managed workflows running within our cloud infrastructure are central to the GIP. Nextflow pipeline execution enables us to run dynamically scaling pipeline jobs at a magnitude not possible on a shared HPC environment. Utilizing this system, we have ingested 35,813 public genome wide association studies, standardized to a common human genome build and integrated into a distributed database cluster. These datasets reference 32,845 distinct traits, across a total of 72 populations. Further, we provide a user interface to query and visualize this database through interactive plotting. A batch processing system driven by Nextflow manages these ingestion pipelines in addition to user defined workflows. The Genetic Informatics Platform allows researchers to conduct reproducible research at a scale necessary to meet the demands of the future of genetic datasets and analysis.

Watch on YouTube
Mayur Khanna

Mayur Khanna

Data Engineer at Pfizer

Community
Speaker