At the Institut für Informationsverarbeitung (TNT), we develop methods for processing and analyzing DNA sequencing data. High-throughput sequencing technologies have the potential to make genomic sequencing data a routine tool in many areas. However, the IT costs of storing, transferring, and processing large amounts of genomic sequencing data now significantly exceed the costs of the sequencing itself. Our work aims at democratizing access to genomic sequencing data, for example to enable its broad use in personalized medicine.
In DNA sequencing, the nucleotide sequence to be read out is first fragmented. The fragments are then amplified and read out by a sequencing machine. All known sequencing technologies are error-prone. For this reason, a quality value is assigned to each nucleotide. The read-out fragments are called reads and are stored together with their quality values in FASTQ files. Further processing steps are the alignment of the reads, with the aim of reconstructing the underlying DNA sequence, and the identification of structural variants in the sequenced material.
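To make the relationship between reads and quality values concrete, the following minimal sketch parses FASTQ records and converts the quality string into numeric values. It assumes the common four-lines-per-record layout and the Phred+33 character encoding, where a quality value Q corresponds to an estimated error probability of 10^(-Q/10). The function names `parse_fastq` and `error_probability` and the example record are hypothetical and chosen for illustration only.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) tuples from a FASTQ stream.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality_line):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: the quality value is the ASCII code minus 33.
        qualities = [ord(char) - 33 for char in quality_line]
        yield header[1:], sequence, qualities


def error_probability(quality):
    """Convert a Phred quality value into an estimated error probability."""
    return 10 ** (-quality / 10)


# Hypothetical single-record example.
example = "@read1\nACGT\n+\nI5+!\n"
records = list(parse_fastq(io.StringIO(example)))
for read_id, sequence, qualities in records:
    print(read_id, sequence, qualities)
```

In this example the nucleotide `A` carries quality value 40 (estimated error probability 0.0001), while the final `T` carries quality value 0, meaning the base call is essentially uninformative.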
Our work focuses in particular on compression methods for aligned reads and on transparent lossy compression of quality values.
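The general idea behind lossy compression of quality values can be sketched with a plain uniform quantizer: mapping quality values onto a smaller set of representatives reduces the number of distinct symbols, and with it the empirical entropy that an entropy coder has to spend. This is a generic stand-in under simple assumptions, not the method developed in the work described here; the function names, the step size of 10, and the maximum quality of 40 are hypothetical.

```python
import math
from collections import Counter


def quantize_qualities(qualities, step=10, max_quality=40):
    """Replace each quality value by the midpoint of its uniform bin."""
    quantized = []
    for quality in qualities:
        quality = min(max(quality, 0), max_quality)  # clamp to valid range
        low = (quality // step) * step
        high = min(low + step - 1, max_quality)
        quantized.append((low + high) // 2)
    return quantized


def empirical_entropy(symbols):
    """Zeroth-order entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical quality values of one short read.
original = [40, 37, 20, 22, 10, 0]
quantized = quantize_qualities(original)

print(quantized)
print(empirical_entropy(original), empirical_entropy(quantized))
```

Here the values 20 and 22 fall into the same bin and become indistinguishable, so the quantized sequence has a lower empirical entropy than the original. The price is that the original values can no longer be recovered exactly, which is why the effect of such a scheme on downstream analysis has to be evaluated.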
The MPEG-G standard series is the first ISO/IEC project for the storage and transmission of sequencing data. Large parts of our work have been incorporated into MPEG-G.
Sequence alignment, lossy compression, machine learning, entropy coding methods
[1] Ibrahim Numanagic, James K Bonfield, Faraz Hach, Jan Voges, Jörn Ostermann, Claudio Alberti, Marco Mattavelli, S Cenk Sahinalp. Comparison of high-throughput sequencing data compression tools. Nature Methods 13(12), pp. 1005–1008, 2016.
[2] Jan Voges, Jörn Ostermann, Mikel Hernaez. CALQ: compression of quality values of aligned sequencing data. Bioinformatics 34(10), pp. 1650–1658, 2018.
[3] Claudio Alberti, Noah Daniels, Mikel Hernaez, Jan Voges, Rachel L Goldfeder, Ana A Hernandez-Lopez, Marco Mattavelli, Bonnie Berger. An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values. 2016 Data Compression Conference (DCC), pp. 221–230, Snowbird, UT (US), 2016.
[4] Claudio Alberti, Tom Paridaens, Jan Voges, Daniel Naro, Junaid J. Ahmad, Massimo Ravasi, Daniele Renzi, Giorgio Zoia, Idoia Ochoa, Marco Mattavelli, Jaime Delgado, Mikel Hernaez. An introduction to MPEG-G, the new ISO standard for genomic information representation. bioRxiv preprint, 2018.