Oct 30, 12:30 - 13:00

Unifying Nextflow Pipeline Outputs and Biological Metadata with SQL and Schema-On-Read Databases

Bioinformatics teams struggle to aggregate results and sample metadata across experiments and next-generation sequencing (NGS) runs, and lose time to record-keeping and data wrangling as a result. Disparate data sources, inconsistent naming conventions, and diverse file formats make it hard to locate NGS results and link them with metadata. Fragmented datasets can in turn obscure biological patterns and batch effects that become visible only when data is unified and analyzed at scale. Despite its widespread use in data science, SQL is underused by the bioinformatics community. Traditional relational databases require users to define tables up front, which slows pipeline development. Schema-on-read databases, such as Amazon Athena and Google BigQuery, let bioinformaticians query pipeline outputs directly in cloud storage, but only if the output files adhere to specific folder structures. In this session, we illustrate how SQL and schema-on-read databases can unify metadata with NGS results across runs and make the data easier to access. We address two main implementation bottlenecks experienced by the community: (1) a lack of familiarity with, and tools for, creating table definitions for NGS data, and (2) nf-core pipeline output folder structures that are typically incompatible with schema-on-read databases or inefficient to query. We provide examples of constructing table definitions and database views from the outputs of common nf-core pipelines (fetchngs, rnaseq), alongside queries that eliminate manual file wrangling. For example, we processed RNA-seq data and metadata from multiple runs in the Cancer Cell Line Encyclopedia (CCLE) and ran queries that rapidly revealed scientific insights, such as target gene expression across cancer types, with minimal code. These techniques integrate into existing Nextflow infrastructure, streamlining bioinformaticians' access to unified datasets.
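
As a rough illustration of the two bottlenecks named above, the Python sketch below relocates a Salmon quant.sf file into a Hive-style key=value folder layout and generates an Athena-style table definition over it. It is a minimal sketch under stated assumptions, not the approach presented in the session: the input layout (<outdir>/<sample>/quant.sf), the salmon_quant prefix, the bucket name, the table names, and the sample_metadata table with its cancer_type column are all hypothetical. Only the five quant.sf columns come from Salmon's documented output format.

```python
from pathlib import PurePosixPath

# Salmon quant.sf columns (tab-separated, one header line), renamed to
# SQL-friendly identifiers. The renaming is our own choice.
QUANT_COLUMNS = [
    ("transcript_id", "string"),
    ("transcript_length", "int"),
    ("effective_length", "double"),
    ("tpm", "double"),
    ("num_reads", "double"),
]


def partitioned_key(run_id: str, original_key: str) -> str:
    """Map '<outdir>/<sample>/quant.sf' (assumed layout) to a Hive-style key.

    The 'salmon_quant' prefix and partition names are hypothetical.
    """
    path = PurePosixPath(original_key)
    sample_id = path.parent.name  # the sample is the file's parent folder
    return f"salmon_quant/run_id={run_id}/sample_id={sample_id}/{path.name}"


def create_table_ddl(table: str, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement over the partitioned files."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in QUANT_COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "PARTITIONED BY (run_id string, sample_id string)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )


# Hypothetical query joining results to a hypothetical metadata table.
EXAMPLE_QUERY = """\
SELECT m.cancer_type, AVG(q.tpm) AS mean_tpm
FROM salmon_quant q
JOIN sample_metadata m ON q.sample_id = m.sample_id
WHERE q.transcript_id = 'TRANSCRIPT_OF_INTEREST'
GROUP BY m.cancer_type;
"""

if __name__ == "__main__":
    print(partitioned_key("run01", "results/star_salmon/SAMPLE_A/quant.sf"))
    print(create_table_ddl("salmon_quant", "s3://example-bucket/salmon_quant/"))
    print(EXAMPLE_QUERY)
```

With this layout, run_id and sample_id become queryable partition columns, so filtering on a single run can skip the files of every other run. In Athena, partitions written this way still need to be registered, for example with MSCK REPAIR TABLE, before queries return rows.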