Workflow Wednesdays – Reference sequence – Part 2. Subsampling and statistics

Extracting regions from a fasta file

If you are doing targeted sequencing, it’s usually a good idea to use a relatively large reference sequence (e.g. a whole chromosome) to avoid problems caused by mismappings. It can still be very useful sometimes to use a “subset” or “subsample” of the reference sequence for an alignment to save computing time, investigate alignment problems or other reasons. To get a specific region from a fasta file, you can use the Bedtools suite’s “getfasta” function. To use this function, you’ll need a bed file, containing the coordinates of the required region(s). As this is just a test run, let’s select the first gene from the NCBI record of the reference genome and create a bed file by hand! The first gene is  the “thrL” which is between positions 190 and 255.

To create a bed file from the terminal, you can use for example vim:

vim example.bed

To add some text to the file, you have to press the “Insert” button first. After that, you type (or copy) the the following line:

gi|49175990|ref|NC_000913.2|    190     255

To save and quit the file, press “Escape” then “x”+”Enter”. you can check the first lines of the file using the head command.

$head example.bed
gi|49175990|ref|NC_000913.2|    190     255

Now, that we have the bed file, we can get the fasta sequence for that region with the following command:

bedtools getfasta -fi NC_000913.2.fa -bed example.bed -fo region.fa

The result file looks like this:

$head region.fa 
>gi|49175990|ref|NC_000913.2|:190-255
TGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA

If you specify multiple regions in the bed file, the output will be a multifasta file.

Of course, the same task can be done with several different tools (e.g. FastaHack).

Sequence statistics

Knowing the nucleotide content of a region can be very useful sometimes. For example, high or low GC-content can cause problems during sequencing. When I find low coverage or uncovered targeted regions in a reference based alignment, the first thing I check is the GC-content of the problematic region (another thing worth checking is the uniqueness of the region).

Bedtools has a command for generating nucleotide statistics as well:

$bedtools nuc -fi NC_000913.2.fa -bed example.bed 
#1_usercol      2_usercol       3_usercol       4_pct_at        5_pct_gc        6_num_A 7_num_C 8_num_G 9_num_T 10_num_N        11_num_oth   12_seq_len
gi|49175990|ref|NC_000913.2|    190     255     0.476923        0.523077        20      22      12      11      0       0       65

The output contains the AT and GC frequencies, a count for all nucleotides and N and the length for all the contigs.

For a few shorter sequences, web-based tools can be used as well (e.g. see here).

A visual representation of the GC-content (and some more useful information) can be viewed in the UCSC Genome Browser.

Another very useful bioinformatics tool which can be used for generating sequence statistics is R (check out the Bioconductor and seqinR package collections). You can find instructions for the installation of R on Ubuntu here and for Bioconductor here. SeqinR can be installed within R, using the “install.packages” command:

install.packages("seqinr")

There is an excellent tutorial on using R for bioinformatics here.

To start R, just write “R” in the Linux terminal and hit “Enter”. An R terminal will be opened. It should look something like this:

$ R

R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

We will use the seqinR library, to read the fasta file and generate some basic statistics. First, seqinR package has to be loaded. You can do this using the “library” command:

library("seqinr")

To get some information about the seqinr package (and in general, to get help in R) use the following syntax:

?seqinr

To import the fasta file into R, you should use the “read.fasta” command:

ecoli=read.fasta("NC_000913.2.fa")

After this, you have to create a variable that only contains the nucleotide sequence:

ecoliseq=ecoli[[1]]

Now, we are ready for the statistics!
First, check the length of the whole sequence:

> length(ecoliseq)
[1] 4639675

Then count the different nucleotides:

> table(ecoliseq)
ecoliseq
      a       c       g       t 
1142228 1179554 1176923 1140970

You can calculate the GC-content, by hand, using the output of the “table” command, but a function is also available for this:

> GC(ecoliseq)
[1] 0.507897

To calculate the GC-content of a specific region, as we did above, you can use the following method:

> GC(ecoliseq[191:255])
[1] 0.5230769

Note the difference in the coordinates! The bed format has 0-based numbering, while seqinR uses a 1-based system.
You can also count kmers in your sequence. For example, to count all possible 4-mers, use the “count” command like this:

> count(ecoliseq,4)

 aaaa  aaac  aaag  aaat  aaca  aacc  aacg  aact  aaga  aagc  aagg  aagt  aata 
35134 25256 22794 25740 21875 20441 24407 15859 17049 19789 13633 12898 20906 
 aatc  aatg  aatt  acaa  acac  acag  acat  acca  accc  accg  acct  acga  acgc 
20877 21559 19653 16656 10966 16746 14269 23978 12200 25278 13441 14218 26409 
 acgg  acgt  acta  actc  actg  actt  agaa  agac  agag  agat  agca  agcc  agcg 
18091 14545  6525  9754 20435 13151 18469 10650 10843 16659 23235 17658 26634 
 agct  agga  aggc  aggg  aggt  agta  agtc  agtg  agtt  ataa.... 
13333 11215 16774  9141 13494 10597  9377 13594 16204 22051....