Flashcard Fridays – The 1000 Genomes Project

After the “whole” genome sequences of Craig Venter and James Watson were published, and newly developed sequencing technologies had made the runtime and price of whole-genome and whole-exome sequencing acceptable, the 1000 Genomes Project became the first public sequencing project to sequence a large number of individuals. The aim of the project was “to provide a comprehensive resource on human genetic variation.” Altogether, 2500 individuals of diverse ancestry were selected for sequencing.

If you are interested in the 1000 Genomes project, you might want to read the pilot paper of the project:

A map of human genome variation from population-scale sequencing

You should also check out the project’s website.

Short reads, alignments, variant calls and other data can be downloaded from the 1000 Genomes FTP site via NCBI or EBI.

Workflow Wednesdays – Reference sequence – Part 1. Indexing

I’ve already written some posts about FASTA files and reference sequences (about file formats and where to get reference sequences).

You might remember from my first “Workflow Wednesdays” post that I selected short reads, created with the three major sequencing technologies, from the same E. coli strain: strain K-12 substrain MG1655.

Continue reading →

Workflow Wednesdays – Part 3. Read preprocessing – Read quality control 2. – QC results

Text output

Both FastQC and PRINSEQ can generate some basic statistics for each FASTQ file. For FastQC these are the following: “Total Sequences” (i.e. the number of reads), “Filtered Sequences”, “Sequence Length” (read length range) and “%GC” (average GC-content). Additionally, tables for “Overrepresented sequences” and “Kmer content” are generated. PRINSEQ calculates the following measures:

Continue reading →
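To make the FastQC basic statistics mentioned above a bit more concrete, here is a minimal Python sketch that computes “Total Sequences”, “Sequence Length” and “%GC” from a FASTQ stream. It is not FastQC’s own code, only an illustration. The function name `fastq_basic_stats` and the dictionary keys are made up for this example, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines) and computes %GC over all bases of all reads.

```python
import io


def fastq_basic_stats(handle):
    """Return read count, read length range and overall %GC for a FASTQ stream.

    Hypothetical helper for illustration; assumes four lines per record.
    """
    total = 0
    min_len = None
    max_len = 0
    gc_bases = 0
    all_bases = 0
    for i, line in enumerate(handle):
        # The sequence is the second line of every four-line record.
        if i % 4 != 1:
            continue
        seq = line.strip().upper()
        total += 1
        n = len(seq)
        min_len = n if min_len is None else min(min_len, n)
        max_len = max(max_len, n)
        gc_bases += seq.count("G") + seq.count("C")
        all_bases += n
    return {
        "total_sequences": total,
        "sequence_length": (min_len or 0, max_len),
        "percent_gc": 100.0 * gc_bases / all_bases if all_bases else 0.0,
    }


# Two toy reads: GGCC (4 bases, all G/C) and ATATAT (6 bases, no G/C).
example = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATATAT\n+\nIIIIII\n")
stats = fastq_basic_stats(example)
print(stats)
```

For a real file you would pass an open file handle instead of the `io.StringIO` toy input, for example `open("reads.fastq")`, where the file name is a placeholder.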