Flashcard Fridays – The 1000 Genomes Project

05/07/2013

After the “whole” genome sequences of Craig Venter and James Watson were published, and newly developed sequencing technologies had brought the runtime and price of whole genome and whole exome sequencing down to acceptable levels, the 1000 Genomes Project became the first public sequencing project to sequence a large number of individuals. The aim of the project was “to provide a comprehensive resource on human genetic variation.” Altogether, 2500 individuals of diverse ancestry were selected for sequencing.

If you are interested in the 1000 Genomes project, you might want to read the pilot paper of the project:

A map of human genome variation from population-scale sequencing

You should also check out the project’s website.

Short reads, alignments, variant calls and other data can be downloaded from the 1000 Genomes FTP site via NCBI or EBI.

Workflow Wednesdays – Reference sequence – Part 1. Indexing

03/07/2013

I’ve already written some posts about fasta files and reference sequences (about file formats and where to get reference sequences).

You might remember from my first “Workflow Wednesdays” post that I selected short reads created with the three major sequencing technologies, all from the same E. coli strain: K-12 substrain MG1655.

Continue reading →

Bioinformatics for Beginners – File Formats Part 2. – Short reads

01/07/2013

Short reads can be stored in several different formats. The best known (and most used) of these is the fastq format, which contains both the base calls and the quality values for each read within a single file. Note that there are different fastq formats around. The main difference between these is how the quality values are calculated and how they are represented.
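
To make the difference in quality value representation concrete, here is a minimal Python sketch. It is not taken from any particular tool, and the function names are made up for illustration. It relies on the commonly documented conventions that Sanger-style fastq files store Phred scores as ASCII characters with an offset of 33, while older Illumina files (versions 1.3 to 1.7) use an offset of 64. Old Solexa files also calculate the score itself differently, which this sketch does not cover.

```python
def decode_quality(quality_string, offset=33):
    """Convert a fastq quality string into a list of integer quality scores.

    offset=33 assumes Sanger-style (Phred+33) encoding;
    offset=64 assumes old Illumina 1.3-1.7 (Phred+64) encoding.
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


# The same four quality scores written in the two encodings.
sanger_scores = decode_quality("I5+!", offset=33)
illumina_scores = decode_quality("hTJ@", offset=64)

print(sanger_scores)          # [40, 20, 10, 0]
print(illumina_scores)        # [40, 20, 10, 0]
print(error_probability(20))  # a Phred score of 20 means a 1 in 100 error chance
```

Decoding a file with the wrong offset shifts every score by 31, so it is worth checking which fastq variant you have before filtering reads by quality.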

Continue reading →

Flashcard Fridays – SEQanswers

28/06/2013

If you are looking for an open access, interactive knowledge base and forum about basically anything NGS related, check out SEQanswers!

With more than 35,000 registered members (and most likely even more “lurkers”), it’s possibly the largest and fastest growing NGS community around.

Continue reading →

Workflow Wednesdays – Part 3. Read preprocessing – Read quality control 2. – QC results

26/06/2013

Text output

Both FastQC and PRINSEQ can generate some basic statistics for each fastq file. For FastQC these are the following: “Total Sequences” (i.e. the number of reads), “Filtered Sequences”, “Sequence Length” (read length range) and “%GC” (average GC-content). Additionally, tables for “Overrepresented sequences” and “Kmer content” are generated. PRINSEQ calculates the following measures:

Continue reading →
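
If you want to see where numbers like “Total Sequences”, “Sequence Length” and “%GC” come from, the rough Python sketch below computes them directly from fastq text. This is not how FastQC or PRINSEQ are implemented. The function name `basic_fastq_stats` is made up for illustration, and the sketch assumes plain four-line fastq records (no wrapped sequence lines) and reports %GC over all bases in the file.

```python
import io


def basic_fastq_stats(handle):
    """Compute read count, read length range and overall %GC from fastq text.

    Assumes every record is exactly four lines: header, sequence, '+', quality.
    """
    total = 0
    min_len = None
    max_len = 0
    gc_bases = 0
    all_bases = 0
    for index, line in enumerate(handle):
        if index % 4 != 1:  # only the second line of each record is sequence
            continue
        seq = line.strip().upper()
        total += 1
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
        gc_bases += seq.count("G") + seq.count("C")
        all_bases += len(seq)
    return {
        "total_sequences": total,
        "sequence_length": (min_len, max_len),
        "percent_gc": 100 * gc_bases / all_bases if all_bases else 0.0,
    }


# Two toy reads standing in for a real fastq file.
example = io.StringIO(
    "@read1\nGATTACA\n+\nIIIIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
)
stats = basic_fastq_stats(example)
print(stats["total_sequences"])       # 2
print(stats["sequence_length"])       # (4, 7)
print(round(stats["percent_gc"], 1))  # 54.5
```

For a real file you would pass an open file handle instead of the `io.StringIO` object.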