Bioinformatics for Beginners – File Formats Part 3. – Alignments

The most commonly used file formats for sequence alignments are SAM and BAM. These files can store mapped and unmapped reads, the contigs of the reference sequence that was used, and further metadata such as read groups and the programs that produced the alignment.

SAM

You can find the SAM format specification here and the article about the SAM format and SAMtools here.

The SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

A SAM file has two sections:

  1. Header section:
    • The header section is not mandatory, but most NGS software requires it.
    • It contains information about five main topics:
      • alignment file: format version, sorting;
      • reference sequence(s): e.g. name, length, species, url;
      • read group: sequencing lane, sample, sequencing center, library etc.;
      • program: aligner name and version, parameters used for the alignment;
      • custom comment(s).
    • Each line of the header section starts with ‘@’ followed by a two-letter record type code.
  2. Alignment section:
    • Each aligned read (and sometimes each unmapped read) is represented by one row consisting of tab-delimited fields (basically columns).
    • If a read is mapped to more than one location, every reported mapping has its own row in the SAM file.
    • There are 11 mandatory fields in each row:
      • read name
      • bitwise flag (it codes information about the read e.g. mapped/unmapped, paired/not paired, mapped to forward/reverse strand etc.) -> for a “flag decoder”, see here
      • reference sequence name
      • starting position of the mapped read on the reference sequence (1-based, leftmost mapped base)
      • mapping quality
      • CIGAR string (this is basically a short description of the alignment)
      • reference name for the mate (for paired data)
      • position of the mate (for paired data)
      • distance between paired reads (for paired data)
      • nucleotide sequence of the read
      • per base quality of the read
    • In addition to the mandatory fields, there are several optional fields; for these, see the format specification.
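
The eleven mandatory fields listed above map naturally onto a small record type. The following is a minimal sketch in Python, assuming tab-delimited input. The names `SamRecord` and `parse_alignment_line` are made up for this illustration; the attribute names follow the column names used in the SAM specification (QNAME, FLAG, RNAME and so on). The example row is the read used in this article.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """One row of the alignment section (hypothetical helper type)."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # distance between paired reads (template length)
    seq: str     # nucleotide sequence of the read
    qual: str    # per-base quality of the read
    optional: List[str]  # any optional fields, kept as raw strings


def parse_alignment_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment row into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        optional=fields[11:],
    )


example = "\t".join([
    "1:497:R:-272+13M17D24M", "113", "chr1", "497", "37", "37M",
    "chr15", "100338662", "0",
    "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG",
    "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>",
])
record = parse_alignment_line(example)
print(record.rname, record.pos, record.cigar)  # → chr1 497 37M
```

For real data, an established library is usually a better choice than hand-written parsing; this sketch is only meant to make the column layout concrete.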

For a short(ish) introduction with some examples, see here.

Very simple header example:

@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

The same example with explanations:

@HD <- this just means that we have a header
    VN:1.3 <- file format version is 1.3
    SO:coordinate <- reads are sorted by mapping coordinate
@SQ <- in this row, we have information about (one of) the reference sequence(s)
    SN:ref <- reference sequence is named ‘ref’
    LN:45 <- reference sequence is 45 bp long
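
The header layout described above (an ‘@’, a two-letter record type, then TAG:VALUE pairs) can be read with a few lines of Python. This is a rough sketch, not a full parser: the function name `parse_header_line` is hypothetical, and it assumes that the fields are separated by tabs, as they are in a real SAM file (the example above is shown with spaces for readability). Comment lines (@CO) hold free text rather than TAG:VALUE pairs, so they are handled separately.

```python
from typing import Dict, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Return the record type and the TAG:VALUE pairs of one header line."""
    if not line.startswith("@"):
        raise ValueError("header lines must start with '@'")
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # drop the leading '@'
    if record_type == "CO":
        # Comment lines carry free text, not TAG:VALUE pairs.
        return record_type, {"comment": "\t".join(fields[1:])}
    # Split only on the first ':' because values may contain colons.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags


header_text = "@HD\tVN:1.3\tSO:coordinate\n@SQ\tSN:ref\tLN:45\n"
for header_line in header_text.splitlines():
    print(parse_header_line(header_line))
# → ('HD', {'VN': '1.3', 'SO': 'coordinate'})
# → ('SQ', {'SN': 'ref', 'LN': '45'})
```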

Very simple read example:

1:497:R:-272+13M17D24M   113    chr1    497    37    37M    chr15    100338662    0    CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG    0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>

read name: 1:497:R:-272+13M17D24M
bitwise flag: 113 (= 1 + 16 + 32 + 64) -> this means that the read is paired (1), mapped on the reverse strand (16), the mate read is mapped on the reverse strand as well (32), and the read is the first in the pair (64)
reference name: chr1 (as in chromosome 1)
mapping position: 497 (the first mapped nucleotide of the read is at chr1:497)
mapping quality: 37
CIGAR string: 37M (all 37 nucleotides of the read are aligned to the reference without insertions or deletions; note that M covers both matching and mismatching bases)
reference for the mate: chr15 -> the other read of the pair is mapped to chr15
position of the mate: 100338662 (the first mapped nucleotide of the mate is at chr15:100338662)
distance between paired reads: 0 -> this value is only defined if the two reads map to the same reference; here they map to different chromosomes, so it’s 0
read sequence: CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG
read quality: 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
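
The flag and the CIGAR string are the two fields that need decoding before they can be read by a human. Below is a small sketch of how this could be done in Python. The function names (`decode_flag`, `parse_cigar`, `reference_span`) and the wording of the bit descriptions are chosen for this example; the bit values and the CIGAR operation letters are the ones defined in the SAM specification.

```python
import re

# Bit values as defined in the SAM specification; descriptions are paraphrased.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the description of every bit that is set in the flag."""
    return [text for bit, text in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Turn a CIGAR string such as '13M17D24M' into (length, operation) pairs."""
    if cigar == "*":  # '*' means that no CIGAR is available
        return []
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(operations):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in operations if op in "MDN=X")


print(decode_flag(113))
# → ['read paired', 'read on reverse strand', 'mate on reverse strand', 'first in pair']
print(parse_cigar("37M"))                  # → [(37, 'M')]
print(reference_span(parse_cigar("37M")))  # → 37
```

With the mapping position 497 and a reference span of 37, the read in the example covers chr1:497 to chr1:533.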

File extension: .sam

BAM

A BAM file contains the same information as the corresponding SAM file, but stores it in a compressed, binary form.

File extension: .bam
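
Because a BAM file is binary and compressed, it cannot be opened in a text editor like a SAM file. As a rough illustration, the sketch below checks whether a file looks like a BAM file. It relies on two details of the format: the compression used by BAM (BGZF) is compatible with gzip, and the decompressed data starts with the four bytes `BAM\1`. The function name `looks_like_bam` and the file name in the usage comment are made up for this example, and the check is only a quick sanity test, not a validation of the whole file.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file is gzip-compatible and starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Raised when the file is not gzip-compressed or is truncated.
        return False


# Hypothetical usage:
# looks_like_bam("alignments.bam")
```

For actually reading the alignments in a BAM file, dedicated tools such as SAMtools are the usual route.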