Workflow Wednesdays – Part 3. Read preprocessing – Read quality control 1. – Running QC tools

Read quality control tools can provide very useful information about the success of your sequencing experiment, without the need to run time-consuming alignments and variant calls. Based on read length, per-base quality, base content and other basic statistics, you can find out a lot about your data. You can decide whether pre-alignment processing (e.g. adaptor trimming, quality-based trimming) is needed for a particular read file. You can sometimes find interesting clues that point to problems with basically any step of the sequencing workflow, from sample collection to the actual sequencing step. For example, an unusually high GC content in a subset of the samples can point to bacterial plasmid contamination in the library preparation step.
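To make those basic statistics a bit more concrete, here is a minimal Python sketch that computes read count, mean read length, GC percentage and mean quality per position from a FASTQ stream. It is not one of the QC tools this series runs, just an illustration. It assumes the simple four-lines-per-record FASTQ layout and Phred+33 quality encoding; the function names and the tiny in-memory example are made up for this sketch.

```python
import io
from collections import defaultdict


def fastq_records(handle):
    """Yield (sequence, quality) pairs, assuming 4 lines per FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def summarize(handle, phred_offset=33):
    """Collect a few basic per-file statistics (Phred+33 assumed by default)."""
    n_reads = 0
    total_bases = 0
    gc_bases = 0
    qual_sum = defaultdict(int)
    qual_count = defaultdict(int)
    for seq, qual in fastq_records(handle):
        n_reads += 1
        total_bases += len(seq)
        gc_bases += sum(1 for base in seq.upper() if base in "GC")
        for pos, char in enumerate(qual):
            qual_sum[pos] += ord(char) - phred_offset
            qual_count[pos] += 1
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "gc_percent": 100.0 * gc_bases / total_bases if total_bases else 0.0,
        "mean_quality_per_position": [
            qual_sum[pos] / qual_count[pos] for pos in sorted(qual_count)
        ],
    }


# Hypothetical two-read example, held in memory instead of a real file.
demo = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\n!!!!\n")
stats = summarize(demo)
print(stats)
```

For a real file you would pass an open file handle instead of the in-memory example. A GC percentage that stands out in only some samples is the kind of clue described above.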


Bioinformatics for Beginners – How to get NGS data? Part 2. Reference sequences

The most obvious (and probably the most used) source of reference sequences (or any kind of sequences) is the NCBI Nucleotide database and its “sister” sites: the EBI European Nucleotide Archive, and the DNA Data Bank of Japan.

All three sites provide some kind of search functionality, and a few (shorter) sequences can be downloaded directly from the result pages in multiple formats. For larger reference sequences (e.g. full human or mouse chromosomes or full genomes) or a long list of references, the FTP sites or batch query tools should be used.
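As an illustration of a batch query, the sketch below builds a request for NCBI's E-utilities `efetch` service, which can return several Nucleotide records as FASTA in one call. E-utilities is my choice for the example and not necessarily one of the tools this post has in mind. The helper names are hypothetical and the two accessions are only examples.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions):
    """Build an efetch URL that requests the given accessions as FASTA text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the records (needs network access; not called here)."""
    with urlopen(efetch_fasta_url(accessions)) as response:
        return response.read().decode("utf-8")


# Example accessions: human mitochondrial genome and phage lambda.
url = efetch_fasta_url(["NC_012920.1", "NC_001416.1"])
print(url)
```

For whole chromosomes or genomes the FTP route is usually still the more practical one; a query like this suits a modest list of shorter records.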

FTP sites:

Batch search/Download tools:

There are some other pages with a more limited focus, that can be very useful for retrieving reference sequences:

Tip: the Genome Analysis Toolkit (GATK) needs the human chromosomes in a special order (karyotypic, to be precise). Different versions of the karyotypically sorted human genome are provided by the GATK team as a resource bundle, which can be downloaded from their FTP site.
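To show what karyotypic order means in practice, here is a small Python sketch that sorts contig names that way: autosomes in numeric order, then X and Y. The function name is made up for this example. Placing the mitochondrial contig after Y is an assumption, since its position differs between builds, so check the index of the reference you actually use.

```python
def karyotypic_key(name):
    """Sort key: autosomes numerically, then X, Y, mitochondrial, then the rest.

    Putting the mitochondrial contig after Y is an assumption of this sketch.
    """
    core = name[3:] if name.lower().startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    if core.upper() in special:
        return (special[core.upper()], 0, "")
    return (4, 0, name)  # unplaced or other contigs, alphabetically at the end


contigs = ["chr10", "chrX", "chr2", "chrM", "chr1", "chrY"]
ordered = sorted(contigs, key=karyotypic_key)
print(ordered)
```

A plain alphabetical sort would put chr10 before chr2, which is exactly the ordering mismatch this tip warns about.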

Workflow Wednesdays – Part 2. Read preprocessing – File format conversion

A little bit about the Linux/Unix command line

First of all, let me tell you about the working environment I will use for this data analysis series. I have a laptop with a 4-core Intel processor and 2×4 GB of RAM. I use 64-bit Ubuntu (version 13.04 to be exact).

Several open source bioinformatics tools are Linux/Unix only, and if you’re not a Linux user, you can try a Linux terminal-like environment within your current operating system: check out Cygwin if you’re a Windows user. Mac supposedly has a native bash environment, but to tell you the truth, I don’t really have any experience with that.
