Oct 31, 15:30 - 16:30

nf-core/variantsimulator: A pipeline to simulate variants with different genetic architectures

Despite advances in genetic research, a considerable portion of the genetic basis of complex traits remains unaccounted for, a gap known as missing heritability. This phenomenon limits our understanding of genetic influences on such traits and our ability to predict them. Contributing factors include rare variants and non-linear genetic interactions, which can collectively affect a single trait. Synthetic or simulated data that closely mimic real-world genetic architectures, e.g. by preserving linkage disequilibrium in human populations, play a crucial role in developing new statistical, bioinformatics, and deep learning methods to address genetic complexity. By injecting rare or common loci, as well as interacting variants, into simulated data, researchers aim to enhance the analytical power of newly developed methods. Several tools exist for generating synthetic datasets tailored to specific i) study designs, such as family-based or case/control models, and ii) genetic architectures, including rare or common single-locus associated variants, interacting loci with or without marginal effects, and different penetrance or trait prevalence. However, using these tools can be challenging because of the diversity of programming languages (e.g. Julia, Java, and Python), installation procedures, and usage specifications. The lack of standardized environments, such as Conda environments or containers, complicates installation across platforms and hinders widespread adoption. Additionally, each tool requires its own genetic model definition, with different parameter settings, input formats, and quality control steps, which adds further complexity. To address these challenges, we are developing nf-core/variantsimulator, a comprehensive Nextflow pipeline that integrates selected tools (Epigen [2], EpiReSim [3], Gametes [4], HAPNEST [5]) to generate ground-truth phenotype and genotype datasets. It is designed to accept a single standardized model definition file.
Based on the desired study design and genetic architecture, the model definition is automatically translated into the format required by the selected tool. The pipeline outputs genetic data in standard formats, such as VCF and PLINK, to ensure downstream compatibility. Additionally, linkage disequilibrium and GWAS analyses are included as downstream QC steps to check that the simulation meets the desired outcomes. The variantsimulator pipeline will streamline the development of statistical and deep-learning solutions in numerous research areas. Moreover, we will coordinate with other nf-core pipelines for simulation (readsimulator) and genetic analysis (sarek) to allow future integrations.
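To make the terms "penetrance", "trait prevalence", and "interacting loci without marginal effect" concrete, here is a minimal Python sketch for a two-locus model. It is an illustration only, not the pipeline's model definition format or the method of any integrated tool. All function names and both example tables are hypothetical, and genotype frequencies are assumed to follow Hardy-Weinberg proportions with the two loci independent.

```python
from itertools import product


def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    major = 1.0 - maf
    return [major * major, 2 * major * maf, maf * maf]


def prevalence(pen, freq_a, freq_b):
    """Trait prevalence: penetrance averaged over all two-locus genotypes."""
    return sum(
        freq_a[i] * freq_b[j] * pen[i][j]
        for i, j in product(range(3), repeat=2)
    )


def marginal_penetrance(pen, freq_other, axis):
    """Penetrance per genotype at one locus, averaged over the other locus.

    axis=0 gives the marginals for locus A (rows), axis=1 for locus B (columns).
    """
    if axis == 0:
        return [sum(freq_other[j] * pen[i][j] for j in range(3)) for i in range(3)]
    return [sum(freq_other[i] * pen[i][j] for i in range(3)) for j in range(3)]


def has_marginal_effect(pen, freq_a, freq_b, tol=1e-12):
    """True if either locus, taken alone, changes disease risk."""
    marg_a = marginal_penetrance(pen, freq_b, axis=0)
    marg_b = marginal_penetrance(pen, freq_a, axis=1)
    return (max(marg_a) - min(marg_a) > tol) or (max(marg_b) - min(marg_b) > tol)


# Hypothetical penetrance tables: rows = genotype at locus A, columns = locus B.
# PURE_EPISTASIS: risk depends only on the combination of genotypes.
PURE_EPISTASIS = [
    [0.0, 0.2, 0.0],
    [0.2, 0.0, 0.2],
    [0.0, 0.2, 0.0],
]
# SINGLE_LOCUS: only locus A matters, so it has a clear marginal effect.
SINGLE_LOCUS = [
    [0.05, 0.05, 0.05],
    [0.10, 0.10, 0.10],
    [0.20, 0.20, 0.20],
]

FREQS = genotype_freqs(0.5)  # minor allele frequency 0.5 at both loci
```

With a minor allele frequency of 0.5 at both loci, every row and column of `PURE_EPISTASIS` averages to the same risk, so neither locus is detectable on its own even though the pair determines the trait. `SINGLE_LOCUS` shows a marginal effect at locus A.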

Speaker

Co-authors

Francesco Lescai, Davide Bagordo, Simone Carpanzano, Eugenio Franzoso, Lorenzo Sola