Oct 31 , 15:30 - 16:30

Optimization of nf-core/rnaseq pipeline using a large number of samples on the SevenBridges Platform

The nf-core/rnaseq pipeline is the most utilized nf-core workflow and has received a considerable amount of support from the community in terms of development. A number of benchmarks have been performed and published using the pipeline’s test_full profile, which consists of 8 samples. To perform a benchmark that better reflects the common use case, rnaseq was run on the SevenBridges platform using a dataset containing 78 samples of distinct human liver biopsies. The used dataset can be found under the accession number PRJNA542148 with an average input file size of ~4.2 GB (paired-end data) per sample. The pipeline optimization was focused on enhancing both computation and storage resources to reduce the analysis cost. To optimize performance, specific computational requirements and instances were assigned to each process. Additionally, to process multiple samples simultaneously and speed up task execution, up to 10 parallel AWS instances were used. Running the analysis with multiple parallel instances allowed for faster sample processing, reducing the execution time by a significant margin. Allocated instances were chosen to best fit each job’s computational requirements. To optimize storage, additional disks were attached as the instances approached their storage limit, instead of attaching a large disk for the whole duration of execution. The SevenBridges platform handled computational resource allocation and task orchestration. When the pipeline was run with the same optimization setup, a cost reduction of up to 45% and a decrease in wall time by as much as 41.5% were achieved, compared to the default setup.

Download poster

Speaker

Darko Cucin

Bioinformatician at Velsera

Co-authors

Pavle Marinkovic