Workflow Wednesdays – Part 2. Read preprocessing – File format conversion

Possibly the easiest way to give these Linux-based tools a fair chance is to set up a secondary Linux installation alongside your current operating system, so you can play around and learn without struggling with OS-related problems. In my experience, a secondary (dual-boot) Linux installation takes about an hour, tops. Check out, for example, the Ubuntu website (of course, other Linux/Unix versions would do just fine).

I will mostly present command-line tools, which can be run from the Linux terminal (search for “Terminal” among your applications or press ALT+CTRL+T to open a terminal window). By default, you start out in your user’s home folder (in my case, this is /home/rk/). You can enter a folder using the “cd” command and go back to the parent folder using “cd ..”.

(To download files from the command line, you can use “wget”. You might want to take a look at this very useful wget tutorial on LifeHacker.)

To open a folder, use the following command (you will have to change the PATH to the location of the folder you downloaded the reads to):

cd /blog/workflow/reads/

To list the contents of the folder you can use for example:

ls -lah

This lists all the files and folders in the current folder, including hidden ones (-a), in a long list format (-l), with human-readable file sizes (-h). If you downloaded the short read files mentioned in my last post, the output of the “ls -lah” command should look very similar to this:

total 6.8G
drwxrwxr-x 3 rk rk 4.0K Jun 12 13:53 .
drwxrwxr-x 4 rk rk 4.0K Jun 10 14:19 ..
-rw-rw-r-- 1 rk rk 866M May 14 16:30 SRR022913_1.fastq
-rw-rw-r-- 1 rk rk 866M May 14 16:35 SRR022913_2.fastq
-rw-rw-r-- 1 rk rk 916M May 14 16:26 SRR022913.sra
-rw-rw-r-- 1 rk rk 1.1G May 30 12:33 SRR515927.fastq
-rw-rw-r-- 1 rk rk 426M May 30 12:39 SRR515927.sra
-rw-rw-r-- 1 rk rk 1.2G Jun 12 13:52 SRR797242.fastq
-rw-rw-r-- 1 rk rk 1.6G Jun 12 14:12 SRR797242.sra


Extracting gzipped fastq files

You can unzip the gzipped files easily using the gunzip command (note that several bioinformatics tools take gzipped files as input, so this step is not always necessary):

gunzip SRR022913_1.fastq.gz
# You can unzip all gzipped files in a single step; the *.gz pattern leaves the sra files alone:
gunzip *.gz
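
Since many tools read gzipped input directly, it can be handy to inspect a gzipped fastq file without unzipping it at all. The lines below are a minimal sketch of that idea, not part of the original workflow: demo_reads.fastq.gz is a made-up two-read file created on the spot so the example runs on its own, and the read count assumes the usual four-lines-per-record fastq layout.

```shell
#!/usr/bin/env bash
set -eu

# Build a tiny two-read gzipped fastq file so the example is self-contained.
# demo_reads.fastq.gz is a hypothetical name, not one of the SRA downloads.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo_reads.fastq.gz"

# Show the first record (4 lines) without writing an uncompressed copy.
gzip -dc "$demo_dir/demo_reads.fastq.gz" | head -n 4

# Count reads, assuming every record spans exactly 4 lines.
line_count=$(gzip -dc "$demo_dir/demo_reads.fastq.gz" | wc -l)
read_count=$(( line_count / 4 ))
echo "reads: $read_count"

# Decompress into a new file while keeping the .gz original in place.
gzip -dc "$demo_dir/demo_reads.fastq.gz" > "$demo_dir/demo_reads.fastq"
```

Unlike plain gunzip, which replaces the .gz file with the extracted one, “gzip -dc” writes to standard output, so the compressed original stays untouched.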

Converting sra to fastq

For detailed information about data download, read the NCBI Large Data Download Best Practices guide.

I downloaded all files in fastq format already, but I will do an sra conversion anyway. First, I make a folder for the sra files and copy them into the new folder:

mkdir sra
cp *.sra sra/

You can download the SRA Toolkit from here and you can read the documentation here.

After downloading the SRA Toolkit, you have to extract the tar archive first. You can do this with the following command:

tar xvzf sratoolkit.current-centos_linux64.tar.gz
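
The name of the extracted folder does not have to match the name of the archive, so it can help to list the archive contents first and then put the bin subfolder on your PATH. The following is only a sketch on a stand-in archive: demo_toolkit.tar.gz and demo-tool are made-up names created here so the commands can be tried safely, and you would substitute the real toolkit archive.

```shell
#!/usr/bin/env bash
set -eu

# Create a stand-in archive with a <folder>/bin/<tool> layout.
# All names below are hypothetical and only serve the demonstration.
work_dir=$(mktemp -d)
cd "$work_dir"
mkdir -p demo_toolkit/bin
printf '#!/bin/sh\necho demo\n' > demo_toolkit/bin/demo-tool
chmod +x demo_toolkit/bin/demo-tool
tar czf demo_toolkit.tar.gz demo_toolkit
rm -r demo_toolkit

# List the archive contents without extracting anything (t = list).
tar tzf demo_toolkit.tar.gz

# Extract the archive, then read the top-level folder name from the listing.
tar xzf demo_toolkit.tar.gz
top_dir=$(tar tzf demo_toolkit.tar.gz | head -n 1 | cut -d/ -f1)

# Put the bin subfolder on PATH so the tools can be called by name.
export PATH="$work_dir/$top_dir/bin:$PATH"
demo-tool
```

The PATH change only lasts for the current terminal session; calling the executable by its full path works just as well.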

You can find the executables within the sratoolkit folder, in the bin subfolder. To create fastq files from sra files, you have to use the fastq-dump command. First, run the jar file to configure the toolkit:

java -jar sratoolkit.jar

Alternatively, you can run the configuration assistant script:

perl configuration-assistant.perl

If the configuration step finished without any errors, you can execute the fastq-dump command, along the lines of the command below:

sratoolkit.2.3.2-5-ubuntu64/bin/fastq-dump /home/rk/tasks/blog/workflow/reads/sra/SRR515927.sra

Make sure to specify the whole path to your sra file, otherwise the conversion won’t work (for some reason, the sra toolkit doesn’t “see” the current folder). If you’re having problems, take a look at this SeqAnswers thread.
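
If you have several sra files, the full-path advice can be wrapped in a small loop. This is a sketch under assumptions, not an official recipe: convert_sra_dir is a hypothetical helper name, the FASTQ_DUMP variable is my own convention for pointing at the executable, and I am assuming the --outdir option of fastq-dump is available in your toolkit version.

```shell
#!/usr/bin/env bash
set -eu

# convert_sra_dir is a hypothetical helper, not part of the SRA Toolkit.
# Usage: convert_sra_dir <folder with .sra files> <output folder>
# Set FASTQ_DUMP to the full path of fastq-dump if it is not on your PATH.
convert_sra_dir() {
    local sra_dir out_dir sra_file
    # Resolve both folders to absolute paths, following the advice above.
    sra_dir=$(cd "$1" && pwd)
    mkdir -p "$2"
    out_dir=$(cd "$2" && pwd)

    for sra_file in "$sra_dir"/*.sra; do
        # Skip the unexpanded pattern when the folder holds no .sra files.
        [ -e "$sra_file" ] || continue
        "${FASTQ_DUMP:-fastq-dump}" --outdir "$out_dir" "$sra_file"
    done
}
```

You could then call it as “convert_sra_dir sra fastq” from the reads folder. For paired-end runs, adding the --split-files option to the fastq-dump call should write the two mates into separate _1 and _2 files.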

SFF formatted read files

Another frequently used short-read file format is sff, a binary format used for Ion Torrent and 454 reads. Both Ion Torrent and Roche provide sff conversion tools (SFF Read and SFF Info, respectively). There are also several open source methods around for sff to fastq (or fasta+qual) conversion.

A short list of a few sff conversion methods:

A few relevant SeqAnswers threads: 1, 2.