Workflow Wednesdays – Part 3. Read preprocessing – Filtering

After last weeks post about randomly subsampling fastq files, let’s take a look at a different method for making read files smaller.

A very easy method, that can be used for 454 or IonTorrent data (but usually not for Illumina) is getting rid of very short reads. If you take a look at the quality control results of the IonTorrent example file, you can see, that read length range is between 5 and 399, while most reads are around 200 bases long. As an exercise, let’s get rid of reads shorter than 50 bases.

A possible solution for this is the “fastx_clipper” command of the FastX toolkit:

fastx_clipper -Q33 -l 50 -n -v -i SRR515927.fastq -o SRR515927_mt50len.fastq

It’s always a goof policy to give descriptive filenames (hence the mt50len part). Believe me, two months from now, you’ll have absolutely no idea what SRR515921_filtered.fastq contains exactly.

You can also filter by length using for example Prinseq. To mix it up a little, let’s filter out all reads that are shorter than 50 or longer than 350 bases:

perl prinseq-lite.pl -fastq SRR515927.fastq -min_len 50 -max_len 350 -out_good SRR515927_mt50_lt350len

Let’s try some quality filtering as well! You might remember, that the Illumina example data set has very low quality:

Workflow Wednesdays - Part 3. Read preprocessing - Filtering

Per sequence quality diagram for one of the Illumina read files, generated by FastQC.

Let’s filter out all reads with less than 10 average quality!

I will use Prinseq again, for two reasons: it has surprisingly diverse filtering options (e.g. GC-content based filtering) and it can deal with paired reads in a single step, so it saves us the additional step of fixing read pairs (you can use e.g. Biopieces for that).

 perl prinseq-lite.pl -fastq SRR022913_1.fastq -fastq2 SRR022913_2.fastq -min_qual_score 10 -out_good SRR022913_mt10qual

If we take a look at the produced files, we can see that two filtered and two singleton files were produced and the “non-singleton” files have the exact same length:

$ wc -l SRR022913*
  24818636 SRR022913_1.fastq
  24818636 SRR022913_2.fastq
    362100 SRR022913_mt10qual_1.fastq
   1410188 SRR022913_mt10qual_1_singletons.fastq
    362100 SRR022913_mt10qual_2.fastq
    770604 SRR022913_mt10qual_2_singletons.fastq