Workflow Wednesdays – Part 3. Read preprocessing – Subsampling

It”s often a good idea (especially for large datasets) to use a subset of the reads for some preliminary analyses. A smaller dataset can be useful for figuring out the alignment parameters or checking for alignment problems.

The easiest way to get a subset of a fastq file is to take the reads in the beginning of the file. Of course, this is not a random subset of the reads, but it can still be useful and it”s very easy to do.

Just open a linux terminal and use the “head” command to get the first N lines of a file. Make sure to use a number that is divisible by four, otherwise you”ll end up with an incomplete record for one of the reads.

head -n 400 SRR515927.fastq > SRR515927_subset.fastq

Now, if you check the length of the new file (remember the word count [wc] command?), you can see that it contains 400 lines (i.e. 100 reads).

$ wc -l SRR515927_subset.fastq
400 SRR515927_subset.fastq

Remember, that this is often a bad representation of your dataset!

There are quite a few possible methods for getting a random subsample from fastq files. I will show you a few:

Subsampling using the HTSeq Python package (source SeqAnswers):

After installing HTSeq, you have to create a python script file containing the following lines:

import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

for read1, read2 in itertools.izip( in1, in2 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )


To get approximately 10% of the reads, you can use the following command:

python 0.1 SRR022913_1.fastq SRR022913_2.fastq  SRR022913_1_htseq.fastq SRR022913_2_htseq.fastq

Let”s check the length of the original and new files!

$wc -l *_htseq.fastq
  2479600 SRR022913_1_htseq.fastq
  2479600 SRR022913_2_htseq.fastq
$wc -l reads/SRR022913_1.fastq
24818636 reads/SRR022913_1.fastq

You can also use the R BioConductor ShortRead package (source):

f1 <- FastqSampler("SRR022913_1.fastq", n=1000)
f2 <- FastqSampler("SRR022913_2.fastq", n=1000)
set.seed(123L); p1 <- yield(f1)
set.seed(123L); p2 <- yield(f2)

Other solutions:

If you”re interested in other solutions, check out this SeqAnswer and this BioStar threads. You can also try the Seqtk toolkit, there”s also a Perl solution available here.