elPrep* is a high-performance tool for preparing SAM/BAM/CRAM files for variant calling in genomic sequencing pipelines.

Execution Time Cut to 15 Minutes¹

elPrep* is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools* and Picard* for preparation steps such as filtering, sorting, marking duplicates, and reordering contigs, while producing identical results. What sets elPrep* apart is its software architecture, which executes a preparation pipeline in a single pass through the data, no matter how many preparation steps the pipeline contains. elPrep* is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly reduce execution time.

Performance Results

For a preparation pipeline of five steps on a whole-exome BAM file (NA12878), elPrep* reduces the execution time from about 1 hour 40 minutes with a combination of SAMtools* and Picard* to about 15 minutes, using the same server resources (48 threads and 23 GB RAM)¹. Tested using picard-tools-1.229*, samtools-1.2*, elprep-2.2*.



Sequence analysis generally consists of a mapping phase followed by an analysis phase. In the mapping phase, an alignment tool maps the reads produced by the wet lab to a known reference genome. Afterwards, the mapped reads are processed by an analysis tool, for example for variant detection.

Alignment and analysis tools communicate via sequence alignment/map (SAM) files, a standardized format for storing mapped reads (Li et al., 2009), or the compressed variants thereof (BAM/CRAM). In practice, different alignment tools produce slightly different outputs, and different analysis tools depend on slightly different SAM structures to work properly.
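To make the format concrete, here is a minimal sketch of reading one SAM alignment line into its eleven mandatory fields plus optional tags. The `SamRecord` class, the `parse_sam_line` helper, and the example read are illustrative assumptions, not part of elPrep*, SAMtools*, or Picard*.

```python
from dataclasses import dataclass, field


@dataclass
class SamRecord:
    """The eleven mandatory SAM columns, plus optional TAG:TYPE:VALUE fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict = field(default_factory=dict)


def parse_sam_line(line: str) -> SamRecord:
    """Parse a single tab-separated alignment line (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for opt in cols[11:]:
        tag, typ, value = opt.split(":", 2)
        # Only integer tags are converted here; other types stay as strings.
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], tags,
    )


# Made-up example read, for illustration only.
example = "read1\t99\tchr1\t100\t60\t5M\t=\t150\t55\tACGTA\tIIIII\tNM:i:0\tRG:Z:grp1"
rec = parse_sam_line(example)
```

Tools that "depend on slightly different SAM structures" typically differ in what they expect in the header or in optional tags such as `RG`, which is why preparation steps often rewrite those parts.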

This is why there are typically a number of steps in between the alignment and analysis tools to rewrite the SAM files into a form that is accepted by the analysis tool. For example, the GATK best-practice pipeline (Van der Auwera et al., 2013) requires five preparation steps between alignment (BWA) and analysis (GATK). These steps take up roughly 30% of the runtime of the complete pipeline.

Figure: Pipeline Execution Without elPrep*

We developed elPrep*, a new tool designed as a high-performance alternative to existing tools for manipulating SAM, BAM, and CRAM files. elPrep* is multithreaded from the ground up: all preparation steps are executed in parallel. The application runs entirely in memory, which avoids repeated file I/O between the preparation steps, and it merges their computations so that they execute more efficiently.
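The idea of merging several preparation steps into one traversal can be sketched as follows. This is a simplified, single-threaded illustration under assumed names (`drop_unmapped`, `add_read_group`, `single_pass`) and a dictionary-based read representation; it is not elPrep*'s actual implementation or API.

```python
from typing import Callable, Iterable, List, Optional

Read = dict
# A filter returns the (possibly rewritten) read, or None to drop it.
Filter = Callable[[Read], Optional[Read]]


def drop_unmapped(read: Read) -> Optional[Read]:
    """Drop reads whose FLAG has the 'unmapped' bit (0x4) set."""
    return None if read["flag"] & 0x4 else read


def add_read_group(group: str) -> Filter:
    """Build a filter that tags every read with a read group."""
    def apply(read: Read) -> Optional[Read]:
        return {**read, "RG": group}
    return apply


def single_pass(reads: Iterable[Read], filters: List[Filter]) -> List[Read]:
    """Apply all filters to each read during one traversal of the input."""
    kept = []
    for read in reads:
        for f in filters:
            read = f(read)
            if read is None:
                break  # read was dropped, skip the remaining filters
        else:
            kept.append(read)  # no filter dropped the read
    return kept


reads = [
    {"qname": "r1", "flag": 0, "pos": 10},
    {"qname": "r2", "flag": 4, "pos": 0},
    {"qname": "r3", "flag": 16, "pos": 5},
]
prepared = single_pass(reads, [drop_unmapped, add_read_group("sample1")])
```

With separate tools, each step would read its input file and write an intermediate file; here the number of traversals stays at one regardless of how many filters are in the list.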

Figure: Hypothetical Execution with Parallelized Tools

We had to reformulate preparation steps as filters. In many cases this was straightforward, but some steps required alternative algorithms. For example, the algorithm for marking duplicates in Picard* is based on comparing the adapted mapping positions of all reads. Its implementation traverses the entire read set multiple times to compare the reads' mapping positions one by one. We reformulated this as a single-pass algorithm that uses memoization to keep track of the read with the best quality score at each mapping position. If a subsequent read maps to the same position as a previous one, but with a better quality score, it replaces the old one in the memoization table, and the old one is marked as a duplicate. Despite such algorithmic reformulations, the output of elPrep* is identical to the output produced by SAMtools* and Picard*.
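The single-pass duplicate marking described above can be sketched roughly like this. The key used here (reference name, position, strand) and the precomputed `score` field are simplifying assumptions; the real adapted mapping position and quality score are computed differently, and the function name is hypothetical.

```python
DUPLICATE = 0x400  # SAM FLAG bit for "PCR or optical duplicate"
REVERSE = 0x10     # SAM FLAG bit for "read mapped to the reverse strand"


def mark_duplicates(reads):
    """Mark duplicates in one pass, updating the reads' flags in place."""
    best = {}  # memoization table: mapping key -> best read seen so far
    for read in reads:
        key = (read["rname"], read["pos"], bool(read["flag"] & REVERSE))
        current = best.get(key)
        if current is None:
            best[key] = read                # first read at this position
        elif read["score"] > current["score"]:
            current["flag"] |= DUPLICATE    # old best becomes a duplicate
            best[key] = read
        else:
            read["flag"] |= DUPLICATE       # new read is the duplicate
    return reads


# Made-up reads: three share a position, one maps elsewhere.
reads = [
    {"qname": "a", "rname": "chr1", "pos": 100, "flag": 0, "score": 30},
    {"qname": "b", "rname": "chr1", "pos": 100, "flag": 0, "score": 40},
    {"qname": "c", "rname": "chr1", "pos": 100, "flag": 0, "score": 35},
    {"qname": "d", "rname": "chr1", "pos": 200, "flag": 0, "score": 10},
]
mark_duplicates(reads)
duplicates = [r["qname"] for r in reads if r["flag"] & DUPLICATE]
```

Each read is inspected once, and the table holds at most one entry per distinct mapping key, which is what allows the step to run as a filter while the data is streamed in.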

Figure: Pipeline Execution with elPrep*

Once all data is streamed into memory and all filters are applied, the operations that work on the whole data set, such as sorting, are executed. elPrep* implements this phase using fork-join patterns, which are executed on a work-stealing scheduler for load balancing. After the sorting phase, the worker threads transform the data back into SAM file entries in parallel, while possibly applying additional filters, to write the result to the output file.
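As a rough illustration of the fork-join pattern in the sorting phase, the sketch below forks one sorting task per chunk and joins the results with a merge. It only shows the shape of the pattern: Python's standard thread pool is not a work-stealing scheduler, and the function names, the dictionary-based reads, and the lexicographic ordering of reference names are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def coordinate_key(read):
    """Order by reference name, then position (simplified coordinate order)."""
    return (read["rname"], read["pos"])


def fork_join_sort(reads, key, workers=4):
    """Fork: sort chunks in worker threads. Join: merge the sorted chunks."""
    if not reads:
        return []
    size = -(-len(reads) // workers)  # ceiling division, at least 1
    chunks = [reads[i:i + size] for i in range(0, len(reads), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(
            pool.map(lambda chunk: sorted(chunk, key=key), chunks)
        )
    return list(merge(*sorted_chunks, key=key))


# Made-up in-memory reads, for illustration only.
reads = [
    {"qname": "r1", "rname": "chr2", "pos": 50},
    {"qname": "r2", "rname": "chr1", "pos": 300},
    {"qname": "r3", "rname": "chr1", "pos": 20},
    {"qname": "r4", "rname": "chr2", "pos": 10},
    {"qname": "r5", "rname": "chr1", "pos": 150},
]
ordered = fork_join_sort(reads, coordinate_key, workers=2)
```

A work-stealing scheduler improves on this fixed chunking by letting idle workers take pending subtasks from busy ones, which balances the load when chunks take unequal time.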


Charlotte Herzeel, Pascal Costanza, Dries Decap, Jan Fostier, and Joke Reumers. "elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling." PLoS ONE 10, no. 7 (2015). doi:10.1371/journal.pone.0132868.

Configuration Table

System Overview

Software: picard-tools-1.229*, samtools-1.2*, elprep-2.32*, CentOS* release 7.0.1406 (Core), Python* 2.7.5, GCC* 4.8.2 (optional), GNU parallel* 20150222 (optional)

Processor: 2x 12-core Intel® Xeon® E5-2690 processor (2.6 GHz)

Memory: 256 GB

Storage: 2 TB Intel® P3700 SSD

Product and Performance Information


Benchmark results were obtained before the most recent software patches and firmware updates intended to address the "Spectre" and "Meltdown" exploits were applied. Once these updates are implemented, the results shown may no longer apply to your device or system.

Software and workloads used in performance tests have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of these factors may cause the results to vary. You should consult other information and performance tests to help you fully evaluate your contemplated purchases, including the performance of a given product when combined with other products. For more information, visit http://www.intel.es/benchmarks.