Bioinformatics for Beginners – File formats: Part 1. Reference sequences

The most widely used file format for reference sequences is the fasta format. Both nucleotide and protein sequences can be represented in fasta format.

A fasta formatted file begins with a single-line description, followed by the sequence data. The description line starts with a greater-than (“>”) symbol. The nucleotide or protein sequence starts on the next line. The sequence can be written on a single line, but it is usually broken into shorter lines of uniform length. Sequences are encoded with the IUPAC single-letter codes.

Example:
>seq1
CTGAAGGGTGACATGGANNTGGATCCTGARYCCCTTAGTCATAG
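
To make the IUPAC coding concrete, here is a minimal Python sketch that checks a nucleotide sequence against the IUPAC alphabet and counts the ambiguity codes in it. The function names are made up for this illustration, and the sketch assumes that gap characters such as "-" are not allowed.

```python
# IUPAC nucleotide codes: the four DNA bases, U for RNA,
# and eleven ambiguity codes (for example R = A or G, N = any base).
CANONICAL = set("ACGTU")
AMBIGUOUS = set("RYSWKMBDHVN")
IUPAC_NUCLEOTIDES = CANONICAL | AMBIGUOUS


def invalid_characters(sequence):
    """Return the sorted characters that are not IUPAC nucleotide codes."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)


def count_ambiguous(sequence):
    """Count the positions that hold an ambiguity code."""
    return sum(1 for base in sequence.upper() if base in AMBIGUOUS)


seq1 = "CTGAAGGGTGACATGGANNTGGATCCTGARYCCCTTAGTCATAG"
print(invalid_characters(seq1))  # [] because every character is a valid code
print(count_ambiguous(seq1))     # 4 (two N, one R, one Y)
```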

A multifasta is a file that contains multiple descriptions and sequences.
Example:
>seq1
CTGAAGGGTGACATGGANNTGGATCCTGARYCCCTTAGTCATAG
>seq2
AAAAANNTGCTGRAAGTTTCGTA
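
The structure described above (a description line followed by one or more sequence lines, repeated for every record) can be read with a few lines of Python. This is only a minimal sketch for illustration; the name read_fasta is made up here, and real projects usually rely on an established parser instead.

```python
import io


def read_fasta(handle):
    """Yield (description, sequence) pairs from an open fasta text handle."""
    description = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)
    # Emit the last record of the file.
    if description is not None:
        yield description, "".join(chunks)


example = io.StringIO(
    ">seq1\n"
    "CTGAAGGGTGACATGGANNTGGATCCTGARYCCCTTAGTCATAG\n"
    ">seq2\n"
    "AAAAANNTGCTGRAAGTTTCGTA\n"
)
records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

Because the sequence lines are joined, the same function reads single-line and wrapped records identically.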

Common file extensions are the following: .fa, .fasta, .fna, .fsa, .mfa, .mpfa

Before doing any kind of manipulation or analysis using a reference sequence, the fasta file usually has to be indexed. Most NGS-related tools either have their own indexing method or accept fasta indexes created by other tools.

A fasta index usually contains basic information about the reference file. For example, a fasta index created by the “samtools faidx” command is stored in a .fai file, which contains the following information about each contig of the multifasta:

  • name of the contig,

  • length of the contig,

  • offset of the first base in the file (that is, the byte position at which the contig's sequence begins within the file),

  • number of bases on each fasta line and number of bytes on each fasta line (the byte count includes the line terminator).
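
To show why these fields are enough for random access, here is a minimal Python sketch that collects the same five values for each contig and uses them to jump straight to a single base. It illustrates the idea only and is not how samtools is implemented; the names build_index and base_at are made up, and the sketch assumes that all lines of a contig except the last have the same length.

```python
import io


def build_index(handle):
    """Return (name, length, offset, line_bases, line_bytes) per contig.

    `handle` must be opened in binary mode so that offsets are in bytes.
    """
    entries = []
    current = None
    position = 0  # byte position just after the line being processed
    for raw in handle:
        position += len(raw)
        if raw.startswith(b">"):
            if current is not None:
                entries.append(tuple(current))
            fields = raw[1:].split()
            name = fields[0].decode("ascii") if fields else ""
            # The first base sits right after the description line.
            current = [name, 0, position, 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if bases == 0:
                continue  # ignore blank lines
            if current[3] == 0:
                current[3] = bases     # bases per line
                current[4] = len(raw)  # bytes per line, newline included
            current[1] += bases
    if current is not None:
        entries.append(tuple(current))
    return entries


def base_at(handle, entry, position):
    """Return the base at a 0-based position of a contig using its index entry."""
    _, length, offset, line_bases, line_bytes = entry
    if not 0 <= position < length:
        raise IndexError("position outside the contig")
    # Full lines before the target contribute line_bytes each,
    # the remainder is the column within the target line.
    byte_position = offset + (position // line_bases) * line_bytes + position % line_bases
    handle.seek(byte_position)
    return handle.read(1).decode("ascii")


data = io.BytesIO(b">seq1\nACGTACGT\nACGTAC\n>seq2\nGGGGCCCC\nTT\n")
index = build_index(data)
for entry in index:
    print(*entry, sep="\t")
print(base_at(data, index[1], 8))  # first base on the second line of seq2
```

The offset arithmetic is the reason the line length must be uniform within a contig: otherwise the number of bytes to skip could not be computed from the position alone.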

Note that contig “officially” means a consensus sequence defined by a set of overlapping DNA segments. However, it’s often used in a slightly broader sense, basically as a synonym of “continuous piece of sequence” or “continuous piece of DNA”.

Tip1: Aligners and assemblers can handle non-ACGT IUPAC characters differently. For example, BWA replaces each of them with a randomly chosen canonical base.

Tip2: If you have a fasta with the nucleotide sequence in a single line, or a multifasta with different line lengths for the different contigs, you can easily make the line lengths uniform using Picard’s NormalizeFasta tool.
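
For a feel of what such line-length normalization involves, here is a minimal Python sketch that re-wraps every sequence to a fixed width. It is not Picard's implementation; the function names and the default width of 60 are assumptions chosen for this illustration.

```python
def wrap_sequence(sequence, width):
    """Split a sequence into pieces of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def normalize_fasta(lines, width=60):
    """Yield fasta lines with every sequence re-wrapped to `width` characters."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Flush the sequence collected for the previous record.
            yield from wrap_sequence("".join(chunks), width)
            chunks = []
            yield line
        else:
            chunks.append(line)
    # Flush the sequence of the last record.
    yield from wrap_sequence("".join(chunks), width)


uneven = [">a", "ACGTACGTAC", ">b", "AC", "GT", "A"]
normalized = list(normalize_fasta(uneven, width=4))
print("\n".join(normalized))
```

The sketch keeps each whole sequence in memory, which is fine for small files; for large genomes a dedicated tool is the better choice.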

There are also several other sequence formats in use, for example the GenBank flatfile format, the EMBL flatfile format, the raw sequence format and many others.