Bioinformatics for Beginners – File formats Part 4. – Variants and annotations

VCF

The VCF format specification is available here.

VCF (Variant Call Format) is a text file format, usually stored in compressed (for example gzipped) form. It contains meta-information lines, a header line, and data lines, each of which holds information about one position in the genome.

Meta-information lines:

  • lines start with “##” and must be key=value pairs (e.g. ##fileDate=20110705)
  • mandatory field: fileformat
  • optional fields: INFO, FILTER, FORMAT etc. (for details, see specification)

Header line:

  • starts with “#”
  • 8 mandatory fields:
  • CHROM: reference sequence name
  • POS: reference sequence position
  • ID: list of unique variant identifiers if available (usually rs IDs from dbSNP)
  • REF: reference base(s) at the specified position
  • ALT: comma separated list of alternate non-reference alleles
  • QUAL: variant quality
  • FILTER: results of filtering (usually based on quality, allele frequency etc.)
  • INFO: additional information, see format specs for more

Data lines: one record per line, with the columns named in the header line; fields are tab-delimited

A simplified example:
##fileformat=VCFv4.0
##fileDate=20110705
##FILTER=
#CHROM POS ID REF ALT QUAL FILTER INFO
2 4370 rs6057 G A 29 . NS=2;DP=13;AF=0.5;DB;H2
2 7330 . T A 3 q10 NS=5;DP=12;AF=0.017
2 110696 rs6055 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB
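As a rough illustration of how the eight mandatory columns fit together, the following minimal Python sketch splits one data line on tabs and unpacks the ALT and INFO fields. The function names (parse_info, parse_vcf_line) are hypothetical and chosen only for this example; for real data, a dedicated VCF library is usually the better choice.

```python
# Names of the eight mandatory VCF columns, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Turn an INFO string such as 'NS=2;DP=13;DB' into a dictionary."""
    result = {}
    if info == ".":  # a dot means "no information"
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Entries without '=' are flags (e.g. DB); store them as True.
        result[key] = value if sep else True
    return result


def parse_vcf_line(line):
    """Parse one tab-delimited data line into a dict of the mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # comma-separated alternate alleles
    record["INFO"] = parse_info(record["INFO"])
    return record


# Third data line of the example above, written with real tab characters.
line = "2\t110696\trs6055\tA\tG,T\t67\tPASS\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB"
record = parse_vcf_line(line)
print(record["POS"], record["ALT"], record["INFO"]["AF"])
```

The sketch keeps INFO values as strings, because their types are declared in the meta-information lines, which this example does not read.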

BED

More information about the BED format is available here.

BED format provides a flexible way to define the data lines that are displayed in an annotation track.

There are three mandatory and nine optional fields. Fields are delimited by tabs.

The three mandatory fields are:

  • chrom – The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  • chromStart – The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  • chromEnd – The ending position of the feature in the chromosome or scaffold. The chromEnd base itself is not included in the feature.

Example:

chr1 213941196 213942363
chr2 12342 12500

The nine optional fields contain additional information about the annotation (e.g. name, strand, colour, line thickness).

File extension: .bed
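
The three mandatory BED fields can be read with a few lines of Python. This is a minimal sketch, not a complete BED reader: parse_bed_line is a hypothetical helper name, and any optional fields are returned unparsed. Because chromStart is 0-based and chromEnd is not part of the feature, the feature length is simply chromEnd minus chromStart.

```python
def parse_bed_line(line):
    """Split one tab-delimited BED line into its mandatory fields.

    Returns (chrom, chrom_start, chrom_end, optional_fields), where
    optional_fields is the list of any remaining columns as strings.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least three fields")
    chrom = fields[0]
    chrom_start = int(fields[1])  # 0-based, included in the feature
    chrom_end = int(fields[2])    # not included in the feature
    return chrom, chrom_start, chrom_end, fields[3:]


# Second line of the example above, written with real tab characters.
chrom, chrom_start, chrom_end, optional = parse_bed_line("chr2\t12342\t12500")

# With 0-based, end-exclusive coordinates the length needs no "+ 1".
length = chrom_end - chrom_start
print(chrom, chrom_start, chrom_end, length)
```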

GFF

Format specification is available here.

There are nine tab-separated fields. Each field must contain a value, but “empty” columns can be replaced with a dot (“.”):

  • seqname – name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix.
  • source – name of the program that generated this feature, or the data source (database or project name).
  • feature – feature type name, e.g. Gene, Variation, Similarity.
  • start – start position of the feature, with sequence numbering starting at 1.
  • end – end position of the feature, with sequence numbering starting at 1.
  • score – a floating point value.
  • strand – defined as + (forward) or - (reverse).
  • frame – One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.
  • attribute – A semicolon-separated list of tag-value pairs, providing additional information about each feature.

Example:

X Ensembl Repeat 2419108 2419128 42 . . hid=trf; hstart=1; hend=21
X Ensembl Repeat 2419108 2419410 2502 - . hid=AluSx; hstart=1; hend=303

File extension: .gff, .gtf
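
The nine GFF fields can be unpacked in the same way. The sketch below is only an illustration: the helper names (parse_attributes, parse_gff_line) are hypothetical, and the attribute parser assumes the tag=value style used in the example above, which is not the only style found in GFF and GTF files.

```python
# Names of the nine GFF columns, in order.
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_attributes(text):
    """Parse 'hid=trf; hstart=1; hend=21' into a dictionary of strings."""
    attributes = {}
    if text == ".":  # a dot marks an empty column
        return attributes
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attributes[tag.strip()] = value.strip()
    return attributes


def parse_gff_line(line):
    """Parse one tab-delimited GFF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF line needs exactly nine fields")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, included
    record["end"] = int(record["end"])      # 1-based, included
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["attribute"] = parse_attributes(record["attribute"])
    return record


# Second line of the example above, written with real tab characters.
line = "X\tEnsembl\tRepeat\t2419108\t2419410\t2502\t-\t.\thid=AluSx; hstart=1; hend=303"
record = parse_gff_line(line)

# Both ends are included, so the length needs "+ 1".
length = record["end"] - record["start"] + 1
print(record["feature"], record["strand"], length, record["attribute"]["hid"])
```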

Tip: Note the difference in the starting coordinates! BED uses 0-based start positions (and the chromEnd base is not part of the feature), while VCF and GFF use 1-based positions.
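
To make the coordinate difference concrete, here is a small Python sketch that converts an interval between the two conventions. The function names are hypothetical. Only the start value changes, because the end-exclusive BED chromEnd and the end-inclusive GFF end happen to be the same number.

```python
def bed_to_gff(chrom_start, chrom_end):
    """Convert a 0-based, end-exclusive BED interval to 1-based, end-inclusive."""
    return chrom_start + 1, chrom_end


def gff_to_bed(start, end):
    """Convert a 1-based, end-inclusive GFF interval to 0-based, end-exclusive."""
    return start - 1, end


# The BED example interval chr2 12342 12500 in GFF-style coordinates.
gff_start, gff_end = bed_to_gff(12342, 12500)
print(gff_start, gff_end)

# Converting back gives the original BED interval.
print(gff_to_bed(gff_start, gff_end))
```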