A survey of tools for variant analysis of next-generation genome sequencing data. Pabinger et al. 2013
Following last week’s variant-caller-themed FF post, I would like to present an excellent article about variant detection and annotation. The paper gives a thorough review of nearly every aspect of NGS-based variant detection, from the overall NGS analysis workflow and the available tools to specific problems (e.g. detection of somatic mutations).
My favourite part is Figure 2, which illustrates perfectly that there is no perfect variant caller, even for germline mutations. Each tool finds (and misses) a different set of variants.
Figure 2 from Pabinger et al. 2013: Venn diagrams showing the number of identified variants for tested germline (A), somatic (B), CNV (C) and exome CNV (D) tools. The depicted numbers in (A) and (B) report identified SNPs and INDELs.
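To get a feel for how such Venn counts come about, here is a minimal sketch that compares the call sets of two callers. It is not the method used in the paper: the function names (`load_variants`, `venn_counts`) and the toy records are made up for illustration, and it assumes plain tab-separated VCF lines in which a variant is identified by chromosome, position, REF and ALT.

```python
def load_variants(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-style text lines."""
    variants = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
        # A record may list several ALT alleles; count each one separately.
        for alt in alts.split(","):
            variants.add((chrom, int(pos), ref, alt))
    return variants


def venn_counts(a, b):
    """Return the three regions of a two-set Venn diagram."""
    return {"only_a": len(a - b), "shared": len(a & b), "only_b": len(b - a)}


# Toy records (hypothetical), standing in for the output of two callers.
caller_a = [
    "##fileformat=VCFv4.1",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT,G\t40\tPASS\t.",
    "chr2\t300\t.\tG\tGA\t30\tPASS\t.",
]
caller_b = [
    "chr1\t100\t.\tA\tG\t60\tPASS\t.",
    "chr1\t200\t.\tC\tT\t45\tPASS\t.",
    "chr3\t400\t.\tT\tC\t35\tPASS\t.",
]

counts = venn_counts(load_variants(caller_a), load_variants(caller_b))
print(counts)
```

Keep in mind that the same INDEL can be written in more than one way, so a real comparison should normalise the variants (left-align and trim) before matching them exactly like this.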
Come and join our CEO, Attila Berces, online for a presentation and demo of how Omixon Target HLA typing will bring you value during the analysis of NGS data.
Our guest speaker is Dr. Dimitri Monos, University of Pennsylvania and The Children’s Hospital of Philadelphia, who will talk about his experiences with HLA typing protocol development on an NGS platform.
Currently, there are 69 ongoing clinical trials investigating HLA as a potential biomarker for safety or efficacy. It appears that, besides being the most important marker in transplantation, HLA is becoming an increasingly important biomarker for cancer therapeutic development as well. The analytical performance of NGS-based genetic tests depends heavily on the bioinformatics software. Although the current false variant rate can be acceptable for the research market, it is simply unsuitable for making clinical decisions. Omixon tailors the analysis to the sequencer, amplification method, primer kit, and the characteristics of the gene target itself. This approach results in a robust and highly accurate method to identify genetic variants.
Read quality control tools can provide very useful information about the success of your sequencing experiment, without the need to run time-consuming alignments and variant calls. Based on read length, per-base quality, base content and other basic statistics, you can find out a lot about your data. You can decide whether pre-alignment processing is needed for a particular read file (e.g. adaptor trimming, quality-based trimming). You can sometimes find interesting clues that point to problems with basically any step of the sequencing workflow, from sample collection to the actual sequencing step. For example, an unusually high GC content in a subset of the samples can point to bacterial plasmid contamination in the library preparation step.
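As a rough illustration of the statistics mentioned above, the sketch below computes read length, GC fraction and mean base quality for each read. It is not a replacement for a dedicated QC tool: the function names are made up for this example, and it assumes four-line FASTQ records with Phred+33 encoded qualities.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    # Read the lines four at a time; an incomplete trailing record is dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip()[1:], seq.strip(), qual.strip()


def read_stats(seq, qual, offset=33):
    """Basic per-read statistics; offset=33 assumes Phred+33 qualities."""
    gc = sum(1 for base in seq.upper() if base in "GC")
    scores = [ord(ch) - offset for ch in qual]
    return {
        "length": len(seq),
        "gc_fraction": gc / len(seq) if seq else 0.0,
        "mean_quality": sum(scores) / len(scores) if scores else 0.0,
    }


# Two toy reads (hypothetical data).
fastq_lines = [
    "@read1", "GGCC", "+", "IIII",
    "@read2", "ATAT", "+", "!!!!",
]

stats = {name: read_stats(seq, qual) for name, seq, qual in parse_fastq(fastq_lines)}
for name, values in stats.items():
    print(name, values)
```

Aggregating `gc_fraction` per sample and comparing the samples with each other is one simple way to spot the kind of GC outlier described above.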
All three sites provide some kind of search functionality, and a few (shorter) sequences can be downloaded directly from the result pages in multiple formats. For larger reference sequences (e.g. full human or mouse chromosomes or full genomes) or a long list of references, the FTP sites or batch query tools should be used.
Tip: the Genome Analysis Toolkit (GATK) needs the human chromosomes in a special order (karyotypic, to be precise). Different versions of the karyotypically sorted human genome are provided by the GATK team as a resource bundle, which can be downloaded from their ftp site.
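If you want to check whether a list of contig names is in karyotypic order, a small sort key is enough. The sketch below is only an illustration and is not taken from GATK: the function name `karyotypic_key` is made up, and placing the mitochondrial contig after X and Y is an assumption, since its position differs between reference builds. Always compare against the index or dictionary file of the reference you actually use.

```python
def karyotypic_key(name):
    """Sort key: autosomes numerically, then X, Y, M/MT, then anything else."""
    core = name[3:] if name.lower().startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    # Assumption: the mitochondrial contig comes after the sex chromosomes.
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    if core.upper() in special:
        return (special[core.upper()], 0, "")
    # Unplaced or alternative contigs go last, in alphabetical order.
    return (4, 0, core)


contigs = ["chr10", "chr2", "chrX", "chr1", "chrM", "chrY", "chrUn_gl000211"]
ordered = sorted(contigs, key=karyotypic_key)
print(ordered)
```

A plain alphabetical sort would put chr10 before chr2, which is exactly the kind of ordering that causes trouble.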
If you’re interested in the underlying algorithm of variant callers, it’s always a safe bet to check out the paper(s) about the variant caller. So here are the papers for the two most commonly used variant callers: