Workflow Wednesdays – Extracting reads from an alignment file

If you work with sam/bam files regularly, sooner or later you’ll end up trying to get a subset of an alignment. There can be several reasons for doing this: you might want to remove off-target reads, investigate alignment problems or just get rid off parts that you’re not interested in to speed up the variant call.

When getting a subset of an alignment, you can have two different approaches: extract aligned reads for further analysis or get unaligned reads for realignment.

To get a subset of the reads in aligned format from a specific region, you can use the Samtools view command (before running the command, make sure, that your bam file is sorted by coordinate and indexed):

 samtools view SRR797242_bwa_bwasw_sorted.bam gi\|49175990\|ref\|NC_000913.2\|:500-1000 > subset.sam

Note, that the name of the reference is actually  “gi|49175990|ref|NC_000913.2|”, the backslashes are added to the name as an “escape character“, the backslash basically removes every special meaning associated to a character. In this case, I am working in a Bash terminal where the character ‘|’ (called pipe) has a special function: it can be used to send the output of a program directly into another program, without printing/saving it first (see this page for a very short tutorial on pipe).

You have several possible ways for getting the reads mapped to a specific region in fastq format. I’ll show two: a simple awk + grep based approach and a Samtools + Picard based method.

To get the reads in fastq from the subset alignment file we’ve created with Samtools view, you should use Picard’s SamToFastq command. As the subset sam file doesn’t have a header, we should add a header first. We can use the header of the original alignment file:

#First, take a look at the original header:
$samtools view -H SRR797242_bwa_bwasw_sorted.bam
@SQ     SN:gi|49175990|ref|NC_000913.2| LN:4639675
#Save the header to a file:
$ samtools view -H SRR797242_bwa_bwasw_sorted.bam > SRR797242_bwa_bwasw_sorted.header
#Add the header to the sam file
$ cat SRR797242_bwa_bwasw_sorted.header subset.sam > subset_with_header.sam

Now, we can run Picard:

$ java -jar SamToFastq.jar I=subset_with_header.sam FASTQ=subset.fastq
[Wed Sep 18 10:53:35 CEST 2013] net.sf.picard.sam.SamToFastq INPUT=subset_with_header.sam FASTQ=subset.fastq    OUTPUT_PER_RG=false RE_REVERSE=true INCLUDE_NON_PF_READS=false READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Sep 18 10:53:35 CEST 2013] Executing as rk@enceladus on Linux 3.8.0-30-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.6.0_38-b05; Picard version: 1.86(1363)
[Wed Sep 18 10:53:35 CEST 2013] net.sf.picard.sam.SamToFastq done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=125173760

Now, let’s see the quick and dirty solution with awk and grep! First get the read ids of reads mapped between two coordinates:

awk '{if ($4 <1000 && $4>500) print $1}' SRR797242_bwa_bwasw.sam > subset.ids

Now, get rid of possible duplicates:

sort subset.ids | uniq > subset_ids_uniq.txt

Tip: if you have paired data, you should remove pair specific parts of the read ids before getting the uniq ids. You can use for example sed substitute for this.

Now, you can simply grep the reads you’re interested in from the original fastq file:

grep -A3 -f subset_ids_uniq.txt SRR797242.fastq

Note, that this solution assumes, that the nucleotide sequence and quality of a single read are both represented as single lines of data.

Tip: If you’re not only interested in reads from a particular region, but you want to reduce the size of the alignment, you can randomly downsample a sam/bam file, using Picard’s DownsampleSam.