HPC-enabled genomic variant discovery using ConsensuSV

Mateusz Chiliński

supervisor: Dariusz Plewczynski



Each individual in the human population is characterised by a genomic sequence that is the same in all cells of their body. The differences between our personal DNA sequences distinguish us from each other and provide the variability required for evolutionary processes. We distinguish three major types of genomic variants: Single Nucleotide Polymorphisms (SNPs), which are changes of one base pair (bp) in the sequence; Indels, which are small genomic rearrangements of up to 50 bp that result in the insertion of a novel sequence or the deletion of an existing one; and Structural Variants (SVs), which are large (>50 bp) rearrangements of various types, including insertions, deletions, inversions, duplications and many more. All of these variants are defined relative to an artificial reference genome, which was created to ease the comparative analysis of common variants between humans. As the cost of next-generation sequencing (e.g. Illumina, PacBio) decreases, precision medicine and the causes of many single-gene and complex genetic diseases are being studied extensively. Today we observe a fundamental shift toward whole-genome sequencing studies, which allow the regulatory regions of human DNA to be examined. However, given the amount of sequencing data at the whole-genome and population scale, the main bottleneck in processing the data and discovering novel, potentially pathogenic variants often lies not only in sample collection and wet-lab experiments but also in the bioinformatic analysis. In many research studies, such massive data analysis is not simple: the complex algorithms and software behind variant discovery are often hard to install because of their dependencies, hard to run, and produce results that are hard to interpret in terms of clinical and biological relevance.
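
The size thresholds given above can be illustrated with a small classifier. This is only a sketch of the definitions in the text, not part of ConsensuSV: the function name `classify_variant`, the `kind` labels and the "unclassified" fallback are assumptions made for illustration, and real tools work on VCF records rather than on a bare (kind, length) pair.

```python
# Illustrative only: encodes the three variant classes as defined in the text.
SV_MIN_SIZE_BP = 50  # variants longer than this are Structural Variants


def classify_variant(kind: str, length_bp: int) -> str:
    """Return 'SNP', 'Indel', 'SV' or 'unclassified' for one variant.

    kind      -- hypothetical label, e.g. 'substitution', 'insertion',
                 'deletion', 'inversion', 'duplication'
    length_bp -- number of base pairs affected by the variant
    """
    if length_bp < 1:
        raise ValueError("a variant must affect at least one base pair")
    # A change of exactly one base pair is a SNP.
    if kind == "substitution" and length_bp == 1:
        return "SNP"
    # Any rearrangement larger than 50 bp counts as a Structural Variant.
    if length_bp > SV_MIN_SIZE_BP:
        return "SV"
    # Small insertions and deletions (up to 50 bp) are Indels.
    if kind in ("insertion", "deletion"):
        return "Indel"
    # The text does not name a class for other small events.
    return "unclassified"
```

Under these assumptions, a 50 bp deletion is reported as an Indel, while a 51 bp deletion or a 2 kbp inversion is reported as an SV.
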
In this work, we present ConsensuSV, a highly automated, high-performance computing (HPC) enabled software package for the discovery of all types of genomic variants: Structural Variants, Indels, and SNPs. The software is divided into two main modules. The first, ConsensuSV-pipeline, is based on the Luigi framework and takes care of running the specific tasks in the appropriate order, as well as visualising and tracking the status of the individual tasks. The second, ConsensuSV-core, is machine learning (ML) enhanced software that provides a meta-caller for SVs. In its default configuration, it takes the output of 8 SV callers generated by ConsensuSV-pipeline and merges it using neural networks. The software is much easier to use than its competitors, FusorSV and MetaSV, while achieving high SV discovery sensitivity and selectivity in comparison to other state-of-the-art tools.
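
To make the idea of a meta-caller concrete, the sketch below merges SV calls from several callers. It deliberately replaces the neural-network merging described above with a much simpler rule (a support count over calls that overlap reciprocally), so it should be read as an illustration of consensus calling in general and not as the ConsensuSV-core algorithm. The `SVCall` type, the 50% reciprocal-overlap threshold and the `min_callers` parameter are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class SVCall:
    """One SV reported by one caller (hypothetical, simplified record)."""
    caller: str
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Smaller of the two overlap fractions; 0.0 if the intervals are disjoint."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def consensus_calls(calls, min_callers=2, min_overlap=0.5):
    """Group similar calls and keep groups backed by enough distinct callers.

    Returns a list of (chrom, start, end, svtype, n_callers) tuples, where
    start and end are the medians of the grouped breakpoints.
    """
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start)):
        for cluster in clusters:
            rep = cluster[0]  # compare against the first call in the cluster
            if (rep.chrom == call.chrom and rep.svtype == call.svtype
                    and reciprocal_overlap(rep, call) >= min_overlap):
                cluster.append(call)
                break
        else:
            clusters.append([call])  # no matching cluster: start a new one

    merged = []
    for cluster in clusters:
        support = len({c.caller for c in cluster})
        if support >= min_callers:
            merged.append((
                cluster[0].chrom,
                round(median(c.start for c in cluster)),
                round(median(c.end for c in cluster)),
                cluster[0].svtype,
                support,
            ))
    return merged
```

A fixed vote like this treats every caller equally. A learned model can instead weight callers by how reliable they are for a given SV type and size, which is the motivation for an ML-based merge. Zero-length events such as insertions reported at a single position never overlap under this rule and would need separate handling.
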