Assembly by Short Sequences* (ABySS*)

ABySS* is an open-source de novo genome assembler for short paired-end reads.

Wall Clock Time Sees 4X Improvement¹

The Michael Smith Genome Sciences Centre at the BC Cancer Agency faced two challenges: reducing the execution time of its parallel de novo genome assembler, ABySS, and reducing the memory requirements of general alignment tools such as BWA, Bowtie, Novoalign, and ABySS-map. Intel worked with the agency to improve parallelization in ABySS version 1.9.0.

ABySS stands out for its ability to scale to large genomes, which comes from its Message Passing Interface (MPI)-based implementation of the de Bruijn graph assembly algorithm. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The relevant code optimizations are included and enabled by default in ABySS* 1.9.0.
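To make the de Bruijn graph idea concrete, here is a minimal sketch in Python. It is not ABySS code: the function names (`kmers`, `build_de_bruijn`, `owner_rank`, `partition`) are hypothetical, reverse complements and sequencing errors are ignored, and assigning each k-mer to a process by hashing is assumed here as one plausible way to spread a graph across MPI ranks.

```python
import zlib


def kmers(read, k):
    """Return every length-k substring of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read.

    Consecutive k-mers overlap by k-1 bases, which is the edge rule of a
    de Bruijn graph whose nodes are k-mers.
    """
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph.setdefault(km, set())
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph


def owner_rank(kmer, n_ranks):
    """Pick the (hypothetical) process that stores a k-mer, using a stable hash."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks


def partition(graph, n_ranks):
    """Split the graph into one dictionary per rank; an assumption, not ABySS's scheme."""
    parts = [dict() for _ in range(n_ranks)]
    for km, successors in graph.items():
        parts[owner_rank(km, n_ranks)][km] = successors
    return parts


if __name__ == "__main__":
    g = build_de_bruijn(["ACGTAC", "GTACGG"], k=3)
    for rank, part in enumerate(partition(g, n_ranks=4)):
        print(rank, sorted(part))
```

Because every k-mer has exactly one owner, each process holds only a slice of the graph, which is what lets the memory footprint be spread over many nodes.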

Performance Results

ABySS 1.3.5, the baseline version, required 25 hours to perform a human genome assembly. The optimized version, ABySS 1.9.0, took only 6 hours of wall clock time to assemble the same genome when run on multiple processors with the input file split to take further advantage of that parallelism. This is roughly a 4X improvement over the baseline version on the same data set¹.
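The arithmetic behind the headline figure can be checked with a few lines of Python. The two run times are the ones quoted in the text; the variable names are only illustrative.

```python
baseline_hours = 25.0   # ABySS 1.3.5, as reported in the text
optimized_hours = 6.0   # ABySS 1.9.0 with split input, as reported in the text

# Speedup is the ratio of old to new wall clock time.
speedup = baseline_hours / optimized_hours

# Fraction of the original wall clock time that was eliminated.
time_saved = 1.0 - optimized_hours / baseline_hours

print(f"speedup: {speedup:.2f}x")
print(f"wall clock time reduced by {time_saved:.0%}")
```

The ratio comes out slightly above 4, which the text rounds to 4X.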

Wall clock times for the main genome assembly stage of the ABySS pipeline, using a human genome read dataset (NA12878), are shown in the accompanying figure. The leftmost bar is the baseline run time before optimization. The middle bar is the run time for the optimized version with all data contained in a single, monolithic input file. The rightmost bar shows the combined effect of the code optimizations and of splitting the input file into 10 equal-sized parts.
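Splitting the input can be sketched as follows. This is an illustrative Python helper, not the tool used in the benchmark: the function name `split_fastq`, the output file naming, and the round-robin strategy are all assumptions. It works on gzipped FASTQ rather than BAM, and `lines_per_unit=8` assumes an interleaved paired-end file in which mates must stay together.

```python
import gzip
from itertools import islice
from pathlib import Path


def split_fastq(src, out_dir, n_parts=10, lines_per_unit=4):
    """Deal FASTQ units round-robin into n_parts gzipped files.

    A FASTQ record is 4 lines, so lines_per_unit=4 treats each read as a
    unit, while lines_per_unit=8 keeps an interleaved read pair together.
    Returns the list of output paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"part_{i:02d}.fastq.gz" for i in range(n_parts)]
    outs = [gzip.open(p, "wt") for p in paths]
    try:
        with gzip.open(src, "rt") as fin:
            unit_index = 0
            while True:
                unit = list(islice(fin, lines_per_unit))
                if not unit:
                    break
                # Round-robin keeps the parts within one unit of each other in size.
                outs[unit_index % n_parts].writelines(unit)
                unit_index += 1
    finally:
        for handle in outs:
            handle.close()
    return paths
```

Round-robin dealing yields approximately equal-sized parts without needing to know the total number of reads in advance, so the input is read only once.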

Download the code ›

Reproduce these results with this optimization recipe ›

Related Codes

Distributed Indexing Dispatched Alignment* (DIDA*) ›


J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. "ABySS: A Parallel Assembler for Short Read Sequence Data." Genome Research 19, no. 6 (2009): 1117-1123. doi:10.1101/gr.089532.108.

İnanç Birol, Shaun D. Jackman, Cydney Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. "De Novo Transcriptome Assembly with ABySS." Bioinformatics 25, no. 21 (2009): 2872-2877. doi:10.1093/bioinformatics/btp367.

Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D. Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q. Qian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron S. Butterfield, Richard Newsome, Simon K. Chan, Rong She, Richard Varhol, Baljit Kamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard A. Moore, Martin Hirst, Marco A. Marra, Steven J. M. Jones, Pamela A. Hoodless, and İnanç Birol. "De Novo Assembly and Analysis of RNA-seq Data." Nature Methods. 10 October 2010.

Configuration Table

System Overview

Eight HPC nodes interconnected by 40 Gbps InfiniBand
Each node has two Intel® Xeon® X5650 processors (2.67 GHz)
Each node has 48 GB RAM

Operating System

CentOS 5.4

Intel® Cluster Studio 2013

ABySS version 1.3.5
ABySS version 1.9.0

Input dataset: Subset of the following BAM file (272GB)

Input data were split into 10 approximately equal-sized BAM files. Equivalent gzipped FASTQ files should perform equally well.

Data subset: The data subset corresponds to the following eight lane IDs:

1. 20FUKAAXX100202_1
2. 20FUKAAXX100202_2
3. 20FUKAAXX100202_3
4. 20FUKAAXX100202_4
5. 20FUKAAXX100202_5
6. 20FUKAAXX100202_6
7. 20FUKAAXX100202_7
8. 20FUKAAXX100202_8

Product and Performance Information

Benchmark results were obtained before the most recent software patches and firmware updates intended to address the "Spectre" and "Meltdown" exploits were applied. Once these updates are implemented, the results shown may not apply to your device or system.

The software and workloads used in the performance tests have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of these factors may cause the results to vary. You should consult other sources of information and other performance tests to help you fully evaluate your contemplated purchases, including the performance of a given product when combined with other products. For more information, visit