Workflow Wednesdays – Reference sequence – Part 2. Subsampling and statistics

10/07/2013

Extracting regions from a fasta file

If you are doing targeted sequencing, it’s usually a good idea to use a relatively large reference sequence (e.g. a whole chromosome) to avoid problems caused by mismappings. It can still be useful to align against a “subset” or “subsample” of the reference sequence, for example to save computing time or to investigate alignment problems. To get a specific region from a fasta file, you can use the “getfasta” function of the BEDTools suite. This function needs a bed file containing the coordinates of the required region(s). As this is just a test run, let’s select the first gene from the NCBI record of the reference genome and create a bed file by hand! The first gene is “thrL”, which lies between positions 190 and 255.
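Writing the bed file by hand is where off-by-one mistakes tend to creep in, so here is a minimal Python sketch of the coordinate conversion. It assumes the positions 190 and 255 are 1-based and inclusive, as in an NCBI feature table, while BED uses a 0-based start and an exclusive end. The function names and the contig name "chr" are hypothetical placeholders; in a real bed file the first column has to match the header of your fasta file.

```python
def bed_line(chrom, start_1based, end_1based, name):
    """Convert 1-based inclusive coordinates to a tab-separated BED line."""
    # BED starts are 0-based and BED ends are exclusive,
    # so only the start coordinate shifts by one.
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


def extract_region(sequence, bed_start, bed_end):
    """Slice a sequence string using BED-style (0-based, half-open) coordinates."""
    return sequence[bed_start:bed_end]


# "chr" is a placeholder contig name, not the real fasta header.
line = bed_line("chr", 190, 255, "thrL")
print(line)

# Toy sequence only, to show the length of the extracted region.
toy_sequence = "A" * 300
print(len(extract_region(toy_sequence, 189, 255)))  # 66 bases
```

With the line saved to a bed file, the extraction itself would look something like `bedtools getfasta -fi reference.fasta -bed thrL.bed -fo thrL.fasta`, where all three file names are placeholders.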


Illumina Applies CE Mark to MiSeqDx™ Cystic Fibrosis System: Omixon Target is Featured!

09/07/2013

We have amazing news today: the first CE mark for next-generation sequencing-based molecular diagnostics was just announced. Illumina has applied the CE mark to the MiSeqDx™ Cystic Fibrosis System, and Omixon Target is used for analyzing the CFTR gene in prenatal labs!

An excerpt from the announcement relevant to Omixon Target:

Cystic fibrosis is a life-threatening, inherited single-gene disorder that affects more than 70,000 people worldwide. Caused by mutations in the CFTR gene, the disease has a wide clinical presentation depending on which CFTR gene mutations are present. Most CFTR mutations are rare, and their distribution and frequency vary among different world populations.

Bioinformatics for Beginners – File Formats Part 3. – Alignments

08/07/2013

The most commonly used file formats for sequence alignments are SAM and BAM. These files can store both mapped and unmapped reads, the contigs of the reference sequence that was used, and more.

SAM

You can find the SAM format specification here and the article about the SAM format and SAMtools here.

The SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
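To make the format a bit more concrete, here is a minimal Python sketch that reads SAM text by hand. It relies on two facts from the SAM specification: every alignment line starts with 11 mandatory tab-separated fields, and bit 0x4 of the FLAG field marks an unmapped read. The function names and the example lines are made up for illustration. For real work you would normally use SAMtools or a dedicated library.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one alignment line into its 11 mandatory fields plus optional tags."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # Bit 0x4 of FLAG is set when the read is unmapped.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["tags"] = parts[11:]
    return record


def reference_contigs(header_lines):
    """Collect contig names and lengths from the @SQ header lines."""
    contigs = {}
    for line in header_lines:
        if line.startswith("@SQ"):
            tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
            contigs[tags["SN"]] = int(tags["LN"])
    return contigs


# Made-up example lines, not taken from a real alignment.
header = ["@HD\tVN:1.4\tSO:coordinate", "@SQ\tSN:chr\tLN:1000"]
mapped = "read1\t0\tchr\t190\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
unmapped = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII"

print(reference_contigs(header))
print(parse_sam_line(mapped)["is_unmapped"])
print(parse_sam_line(unmapped)["is_unmapped"])
```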


Flashcard Fridays – The 1000 Genomes Project

05/07/2013

After the “whole” genome sequences of Craig Venter and James Watson were published, and newly developed sequencing technologies brought the runtime and price of whole genome and whole exome sequencing down to an acceptable level, 1000 Genomes was the first public sequencing project to sequence a large number of individuals. The aim of the project was “to provide a comprehensive resource on human genetic variation.” Altogether, 2500 individuals of diverse ancestry were selected for sequencing.

If you are interested in the 1000 Genomes project, you might want to read the pilot paper of the project:

A map of human genome variation from population-scale sequencing

You should also check out the project’s website.

Short reads, alignments, variant calls and other data can be downloaded from the 1000 Genomes FTP site via NCBI or EBI.

Workflow Wednesdays – Reference sequence – Part 1. Indexing

03/07/2013

I’ve already written some posts about fasta files and reference sequences (about file formats and where to get reference sequences).

You might remember from my first “Workflow Wednesdays” post that I selected short reads created with the three major sequencing technologies from the same E. coli strain: strain K-12 substrain MG1655.
