Bioinformatics for Beginners – How to get NGS data? Part 2. Reference sequences

The most obvious (and probably the most widely used) source of reference sequences, or of sequences of any kind, is the NCBI Nucleotide database together with its "sister" sites: the European Nucleotide Archive (ENA) at EBI and the DNA Data Bank of Japan (DDBJ).

All three sites provide search functionality, and a few shorter sequences can be downloaded directly from the result pages in multiple formats. For larger reference sequences (e.g. full human or mouse chromosomes, or whole genomes) or for a long list of references, use the FTP sites or the batch query tools instead.
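As a rough illustration of the batch route, the sketch below builds a request for NCBI's E-utilities `efetch` endpoint and downloads several records as FASTA in one call. The function names `build_efetch_url` and `fetch_fasta` and the accession numbers are chosen only for this example. For anything beyond a handful of records, check NCBI's usage guidelines for rate limits first.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Return an efetch URL requesting the given accessions as plain text."""
    params = {
        "db": db,                      # nucleotide database
        "id": ",".join(accessions),    # comma-separated list of accessions
        "rettype": rettype,            # record format, here FASTA
        "retmode": "text",             # plain text rather than XML
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the records over HTTPS and return them as one FASTA string."""
    with urlopen(build_efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


# Example accessions (illustrative only); building the URL needs no network.
example_url = build_efetch_url(["NC_001416.1", "NC_012920.1"])
print(example_url)

# Uncomment to actually download (requires network access):
# fasta_text = fetch_fasta(["NC_001416.1", "NC_012920.1"])
```

For whole chromosomes or genomes, the FTP sites are usually the better choice, since a single pre-built file is faster to fetch than many individual records.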

FTP sites:

Batch search/Download tools:

Some other pages with a more limited focus can also be very useful for retrieving reference sequences:

Tip: the Genome Analysis Toolkit (GATK) needs the human chromosomes in a specific order (karyotypic, to be precise). The GATK team provides different versions of the karyotypically sorted human genome as a resource bundle, which can be downloaded from their FTP site.
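To make "karyotypic order" concrete, here is a minimal sketch of a sort key that puts the autosomes in numeric order, followed by X, Y and the mitochondrial contig, with any remaining contigs at the end. The helper `karyotypic_key` and the contig list are made up for this example, and placing the mitochondrial contig after Y is an assumption: some reference builds put it elsewhere. Always compare the result with the sequence dictionary of the reference you actually use.

```python
def karyotypic_key(name):
    """Sort key: autosomes numerically, then X, Y, M/MT, then everything else."""
    # Strip an optional "chr" prefix so "chr2" and "2" are treated alike.
    label = name[3:] if name.lower().startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}  # assumed placement of M/MT
    if label.upper() in special:
        return (special[label.upper()], 0, "")
    # Unplaced or alternative contigs go last, in alphabetical order.
    return (4, 0, label)


contigs = ["chr10", "chrM", "chr2", "chrX", "chr1", "chrY", "chr1_gl000191_random"]
ordered = sorted(contigs, key=karyotypic_key)
print(ordered)
```

A plain alphabetical sort would put `chr10` before `chr2`, which is exactly the kind of ordering that causes trouble. Downloading the pre-sorted bundle remains the simpler and safer option.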