Workflow Wednesdays – Coverage analysis 2. and other alignment statistics

As I promised last week, I’ll show you some command line based alignment analysis tools this week. As there are basically an endless number of methods for generating alignment statistics, feel free to share additional tips and tricks in the comment section!

Picard

Picard offers a set of different alignment statistics tools and a “master” tool for running some of these with a single command. The tools are the following:

    • CollectInsertSizeMetrics
      • Collects information about the insert size of read pairs in the alignment.
      • It can generate a table and/or histogram.
    • CollectGcBiasMetrics
      • Generates GC-content statistics for the aligned reads throughout the genome using “windows”.
      • It can create a table and/or chart.
    • CollectAlignmentSummaryMetrics
      • Generates basic statistics from the alignment.
      • Can be run for all reads, a specific readgroup, sample or library.
    • QualityScoreDistribution
      • Creates a quality score distribution chart for different subsets of reads.
    • MeanQualityByCycle
      • Generates statistics per sequencing cycle.
    • CollectMultipleMetrics
      • This is the master tool.
      • Can run CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, QualityScoreDistribution and MeanQualityByCycle in a single step.
Workflow Wednesdays – Coverage analysis 2. and other alignment statistics

Quality score distribution figure generated by Picard.

Samtools

Samtools has a few tools that can be used for generating different statistics, eventhough some of these have been developed for totally different purposes.

view

Samtools view combined with some Linux commands is one of the best tools for creating alignment statistics.

A few tasks that can be easily done with samtools view:

  • Output/count reads from a specific region.
  • Output/count reads with a mapping quality above a user defined threshold.
  • Output/count reads from a specific library or libraries.
  • Output/count/skip reads with a specific bitwise flags.

mpileup

Samtools mpileup can do the following tasks:

  • Create pileup for an alignment or a specific region of an alignment using all aligned reads or a filtered subset of the reads. (Note, that only coordinates with non-zero coverage are reported).
  • Calculate per-sample read depth.
  • Calculate p-value for strand bias.

idxstats

Unlike the aforementioned tools, idxstats was actually designed for generating statistics based on a bam index. It outputs contig length and number of mapped and unmapped reads.

Bedtools

Although it’s not obvious from the name of the toolset, Bedtools offers a handful of tools that can be used with alignment files. Here’s a list of functions you might want to check out:

  • coverage
    • Can calculates the average coverage depth and the coverage % for overlapping features of a bed file and a BAM file.
    • Can report per-base coverage depth and unlike mpileup, it reports positions with zero coverage as well.
    • Can generate a histogram.
  • genomecov
    • Very similar to coverage, but works for the whole of a single input file, not just overlapping features of two files.
  • multicov
    • Reports coverage for specified regions of multiple alignment files.

GATK

GATK offers a whole suite of alignment diagnostic and quality control tools. A few examples for tools that can be used for creating alignment statistics:

  • DepthOfCoverage
    • Can generate a long list of coverage statistics. Allows RefSeq genes based statistics as well.
  • CountReadEvents
    • Counts read events based on CIGAR strings in the alignment file.
  • BaseCoverageDistribution
    • Per base coverage analysis tool.
  • DiagnoseTargets
    • Calculates coverage and analyses mate information in specified intervals.
  • Pileup
    • Creates an output similar to the Samtools pileup output.
  • FlagStat
    • Reimpementation of the (deprecated) Samtools flagstat command.
  • FindCoveredIntervals
    • Outputs a list of intervals that are covered above threshold.