Workflow Wednesdays – Creating alignments – BWA

Drumroll! Hurrah! We are finally going to do some alignments today!

BWA is probably the most widely used short read aligner ever. By now, it’s even part of the official Illumina analysis pipeline. BWA actually contains three different aligner algorithms. We’ll try all of these today.

We need an indexed reference sequence for all BWA alignments, so let’s make this as a first step!

#We have already downloaded the reference sequence: NC_000913.2.fa
$bwa index NC_000913.2.fa
[bwa_index] Pack FASTA... 0.07 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.81 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.05 sec
[bwa_index] Construct SA from BWT and Occ... 0.54 sec
[main] Version: 0.6.2-r126
[main] CMD: bwa index NC_000913.2.fa
[main] Real time: 2.759 sec; CPU: 2.516 sec

The indexing step should finish in a few seconds.

Let’s try the original algorithm first, which is now called BWA backtrack. This was designed for the first generation of letter space data, which consisted of very short reads. Now, that the new BWA mem algorithm is public, the original algorithm is only suggested for “short short” (i.e. less than 70 bp long) reads. This “old” method fits our “old” Illumina example data like a glove, so let’s do a paired Illumina alignment. The alignment can be done in two steps: first, the “aln” command has to be run for both read files separately, then the “sampe” command should be run for paired data.

bwa aln NC_000913.2.fa reads/SRR022913_1.fastq > SRR022913_1.sai
bwa aln NC_000913.2.fa reads/SRR022913_2.fastq > SRR022913_2.sai
bwa sampe NC_000913.2.fa SRR022913_1.sai SRR022913_2.sai  reads/SRR022913_1.fastq reads/SRR022913_2.fastq > SRR022913_bwa_backtrack.sam

Note, that the sampe step can take a while and unfortunately, there’s no multithreading available for sampe or samse (i.e. the single end equivalent of sampe).

The old “long read aligner” BWA bwasw is now suggested for data with frequent alignment gaps. Let’s try this version of BWA with our IonTorrent and 454 example data sets!

bwa bwasw NC_000913.2.fa reads/SRR797242.fastq > SRR797242_bwa_bwasw.sam
bwa bwasw NC_000913.2.fa reads/SRR515927.fastq > SRR515927_bwa_bwasw.sam

The newest BWA algorithm is BWA mem. This should be used for “70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences”, so basically for any “non-historical” data. BWA mem is a fairly new product in the BWA package, which has its negative and positive consequences: it is not that well tested yet, as the other BWA methods, on the other hand, this algorithm was designed with much longer reads in mind, so it fits current NGS data much better. Let’s try BWA mem with all our example data!

bwa mem NC_000913.2.fa reads/SRR797242.fastq > SRR797242_bwa_mem.sam
bwa mem NC_000913.2.fa reads/SRR515927.fastq > SRR515927_bwa_mem.sam
bwa mem NC_000913.2.fa reads/SRR022913_1.fastq reads/SRR022913_2.fastq > SRR022913_bwa_mem.sam