Oct 28, 13:00 - 14:00

MGnifams - Workflows for the generation and annotation of metagenomics derived protein families

The MGnify resource (https://www.ebi.ac.uk/metagenomics) is a platform for the assembly, analysis, and archiving of microbiome-derived sequence data. MGnify has a repertoire of specialized pipelines (the majority written in Nextflow) that generate detailed taxonomic and functional annotations depending on the nature of the input data (metagenomic, metatranscriptomic, or metabarcoding). In late 2018, MGnify introduced metagenomic assembly as a service, providing greater access to complete proteins and their genomic context. Building upon these assemblies, the resource has shifted increasingly towards genome-resolved metagenomics. This is demonstrated by the MGnify Genomes resource, which hosts 11 biome-specific MAG (metagenome-assembled genome) catalogs comprising hundreds of thousands of genomes, produced by the community and the MGnify team. At the time of writing, MGnify has identified ~2.5 billion unique protein sequences from its metagenomic assemblies, one of the largest sequence collections in the world; these are clustered at 90% sequence identity to produce ~720 million representative sequences. The MGnify protein database includes proteins that are members of known protein families as well as proteins that represent novel protein families. We have developed Nextflow pipelines that iteratively cluster these sequences to produce metagenomics protein families called MGnifams. Through these pipelines, we determine whether the new families represent expansions of known protein families or entirely novel protein families. MGnifams consists of several in-house Nextflow workflows covering data preprocessing, sequence clustering and family generation, redundancy checking, matching of family profile hidden Markov models to known Pfam domains, structural prediction and annotation, and data export to facilitate ingestion into a dedicated MGnifams database.
These workflows can be executed either individually or as a complete end-to-end Nextflow pipeline. Workflows are composed of subworkflows and modules, drawing on both in-house modules (https://github.com/EBI-Metagenomics/nf-modules) and nf-core modules. An alpha-version demo of MGnifams is available at http://mgnifams-demo.mgnify.org
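
To make the idea of clustering at a sequence identity threshold concrete, the following is a toy Python sketch of greedy representative-based clustering. It is an illustration only and not the MGnify or MGnifams implementation: the function names (identity, greedy_cluster), the crude ungapped identity measure, and the example sequences are all assumptions made here, and a production pipeline would rely on dedicated clustering software and proper alignment.

```python
def identity(a: str, b: str) -> float:
    """Crude ungapped identity (an assumption for this sketch):
    matching positions divided by the length of the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Assign each sequence to the first representative it matches at
    >= threshold; otherwise it becomes a new representative."""
    clusters: dict[str, list[str]] = {}
    # Visit the longest sequences first so representatives are the longest members.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if identity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences, 20 residues each.
toy = {
    "p1": "MKTAYIAKQRQISFVKSHFS",
    "p2": "MKTAYIAKQRQISFVKSHFA",  # one mismatch relative to p1
    "p3": "MSDNELLKKWWGGHHPPRRT",  # unrelated to p1
}
clusters = greedy_cluster(toy)
print(clusters)
```

In this sketch each dictionary key is a representative sequence and each value lists the members it represents; applying the same procedure again to the representatives at a lower threshold is one simple way to picture iterative clustering.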

Speaker