Workflow Wednesdays – Part 2. Read preprocessing – File format conversion

A little bit about the Linux/Unix command line

First of all, let me tell you about the working environment I will use for this data analysis series. I have a laptop with a 4-core Intel processor and 2 × 4 GB of RAM. I use 64-bit Ubuntu (version 13.04, to be exact).

Several open-source bioinformatics tools are Linux/Unix only. If you’re not a Linux user, you can try a Linux terminal-like environment within your current operating system: check out Cygwin if you’re a Windows user. Mac supposedly has a native bash environment, but to tell you the truth, I don’t really have any experience with that.


Bioinformatics for Beginners – How to get NGS data? Part 1. Short reads

I recently ran into an email on one of the numerous mailing lists I’m subscribed to. It was written by a student who was desperately looking for NGS data to test a pipeline. This email made me realise that finding short read data is probably not an easy task for people who are just starting to use next-generation sequencing techniques. If you know the right search phrases, you can easily find the short read databases, but if you google “NGS data” you’ll get basically no relevant hits. So here is a collection of NGS data resources to make life a little easier for newbies.


Flashcard Fridays – Part 1. Comparison of next generation sequencing technologies

Recommended articles:

Article 1. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers – Quail et al. 2012

This is an incredibly useful review of the Ion Torrent, Illumina and PacBio technologies. The paper was published in 2012 and contains valuable information about error rates and throughput. It also contains some information about pricing, which is hard to come by, especially if you’re interested in more than one sequencing technology.


Workflow Wednesdays – Part 1. Introduction, datasets, plans

A while ago we created in-house training material for new employees, which contained a very short, single-page workflow for the bioinformatics analysis tasks we usually do. I thought it would be a great idea to go through this workflow step by step, collect some tools that can be used for each task and show some examples using data from all the main sequencing platforms. Note that we mostly do sequence-based alignments, so de novo assemblies will only be briefly mentioned.

I will use open-access datasets, so you can reproduce each step if you want to. Strain K12 substrain MG1655 of E. coli was selected as an example because I could find Illumina, 454 and Ion Torrent reads for this substrain. I know that there are a few very useful blog posts around (e.g. here and here) that use the same (or very similar) datasets. Unlike those blog posts, this series does not aim to compare the different sequencing platforms.
