Workflow Wednesdays – Part 3. Read preprocessing – Read quality control 1. – Running QC tools

Read quality control tools can provide very useful information about the success of your sequencing experiment, without the need to run time-consuming alignments and variant calls. Based on read length, per-base quality, base content and other basic statistics, you can find out a lot about your data. You can decide whether pre-alignment processing (e.g. adaptor trimming, quality-based trimming) is needed for a particular read file. You can sometimes find interesting clues that point to problems with basically any step of the sequencing workflow, from sample collection to the actual sequencing step. For example, an unusually high GC content in a subset of the samples can point to bacterial plasmid contamination in the library preparation step.
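To make the GC content example a bit more concrete, here is a minimal sketch that prints the overall GC percentage for every fastq file in the current folder, so that outlier samples stand out. It is not part of any QC tool: the gc_percent function name is made up for this illustration, and it assumes plain 4-line fastq records (no wrapped sequence lines).

```shell
# Overall GC percentage of one fastq file.
# Assumes 4 lines per record, so every 2nd line of a record is the sequence.
gc_percent() {
    awk 'NR % 4 == 2 {
             seq = toupper($0)
             total += length(seq)
             gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
         }
         END {
             if (total > 0) printf "%.2f\n", 100 * gc / total
             else print "NA"
         }' "$1"
}

# One line per file: file name, tab, GC percentage.
for fq in *.fastq; do
    [ -e "$fq" ] || continue              # skip if the glob matched nothing
    printf '%s\t%s\n' "$fq" "$(gc_percent "$fq")"
done
```

The dedicated QC tools report much more than this (for example per-sequence GC distributions), so treat it only as a quick first look.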

There are a few great read quality control tools around; the most widely used is probably FastQC. Another popular choice is PRINSEQ. Other tools are available as well, for example FASTX-Toolkit and NGS QC Toolkit.

FastQC

Supported operating systems: Linux, Mac, Windows. A GUI is available, but I usually just use the command line version, because it’s much easier to use for multiple files. To run the command line version of the tool, you just have to unpack the downloaded archive and make the fastqc command executable:

chmod +x fastqc

#Alternatively:

chmod 755 fastqc

After this, you can either create a link to the application, or run the software using the full path to the executable. I suggest the first option, as it’s much easier in the long run, but for just trying out the app, you can go without the link.

To create the link to FastQC, run the following command in the terminal (don’t forget to change the path):

sudo ln -s /tasks/blog/workflow/reads/QC/FastQC/fastqc /usr/local/bin/fastqc
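A quick way to check that the link works is to ask the shell whether it can find the command. The sketch below is only a convenience: check_tool is a helper name made up for this example, and it relies on the --version option of FastQC.

```shell
# Returns 0 if the given command can be found on the PATH, non-zero otherwise.
check_tool() {
    command -v "$1" >/dev/null 2>&1
}

if check_tool fastqc; then
    fastqc --version                      # prints the installed FastQC version
else
    echo "fastqc is not on the PATH yet, check the link" >&2
fi
```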

After this, you can navigate to the folder containing the example fastq files, and run FastQC for all of them using a single command:

fastqc *.fastq
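If you prefer to keep the reports separate from the read files, something like the following sketch may help. It uses the -o (output folder) and -t (number of files processed in parallel) options of FastQC; the folder name fastqc_reports and the thread count of 4 are just assumptions for this example.

```shell
outdir="fastqc_reports"      # hypothetical folder name, change it to your liking
threads=4                    # assumed value, adjust to your machine

mkdir -p "$outdir"           # create the folder first, FastQC expects it to exist

if command -v fastqc >/dev/null 2>&1; then
    fastqc -o "$outdir" -t "$threads" *.fastq
else
    echo "fastqc not found on the PATH" >&2
fi
```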

For each file, some text files and figures, plus an HTML report, are generated. There are some example results for Illumina, 454 and PacBio data on the FastQC website.

PRINSEQ

Supported operating systems: browser based, but a lite standalone version is available for Linux and Mac. This toolkit offers not only read quality statistics, but also some read processing options (e.g. trimming, filtering, format conversion).

To create all basic statistics for a read file, run:

perl prinseq-lite-0.20.3/prinseq-lite.pl -fastq ../SRR515927.fastq -stats_all > ../SRR515927.fastq_prinseq.stats

To generate the data file needed by the graph-creating part of the tool, use the following command:

perl prinseq-lite-0.20.3/prinseq-lite.pl -fastq ../SRR515927.fastq -graph_data

To create all available graphs in png format, run the following command:

perl prinseq-graphs.pl -i SRR515927.fastq.gd -png_all -o SRR515927

You can generate an html report very similarly:

perl prinseq-graphs.pl -i SRR515927.fastq.gd -html_all -o SRR515927

Note: if you get an error message about a missing Perl module, just use “prinseq-graphs-noPCA.pl” instead of “prinseq-graphs.pl”.
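Unlike FastQC, the PRINSEQ commands above work on one file at a time. A small loop like the sketch below could run the graph data and HTML report steps for every fastq file in the current folder. It only uses the options shown above; the prinseq_report and run helpers are names made up for this example, and it assumes that both PRINSEQ scripts sit in the prinseq-lite-0.20.3 folder. Setting DRY_RUN=1 only prints the commands, which is handy for checking the paths first.

```shell
prinseq_dir="prinseq-lite-0.20.3"   # assumed location of both PRINSEQ scripts

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Create the graph data file and the HTML report for one fastq file.
prinseq_report() {
    fq="$1"
    sample="$(basename "$fq" .fastq)"     # e.g. SRR515927.fastq -> SRR515927
    run perl "$prinseq_dir/prinseq-lite.pl" -fastq "$fq" -graph_data "$fq.gd"
    run perl "$prinseq_dir/prinseq-graphs.pl" -i "$fq.gd" -html_all -o "$sample"
}

for fq in *.fastq; do
    [ -e "$fq" ] || continue              # skip if the glob matched nothing
    prinseq_report "$fq"
done
```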