Read quality control tools can provide very useful information about the success of your sequencing experiment without the need to run time-consuming alignments and variant calls. Read length, per-base quality, base content and other basic statistics can tell you a lot about your data. You can decide whether pre-alignment processing (e.g. adaptor trimming or quality-based trimming) is needed for a particular read file. You can sometimes also find interesting clues that point to problems with basically any step of the sequencing workflow, from sample collection to the actual sequencing step. For example, an unusually high GC content in a subset of the samples can point to bacterial plasmid contamination in the library preparation step.
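To make the statistics mentioned above a bit more concrete, here is a minimal Python sketch that computes read count, mean read length, GC content and mean base quality from a FASTQ stream. It is only an illustration, not how any particular QC tool works: the function name `fastq_stats` is made up for this example, and it assumes simple four-line FASTQ records with Phred+33 quality encoding.

```python
import io


def fastq_stats(handle, phred_offset=33):
    """Basic statistics for a FASTQ stream.

    Assumes four-line records and Phred+33 qualities (both are
    assumptions of this sketch, not guaranteed for every file).
    """
    n_reads = 0
    total_bases = 0
    gc_bases = 0
    qual_sum = 0
    while True:
        header = handle.readline()
        if not header:           # end of file
            break
        if not header.strip():   # skip stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()        # '+' separator line, ignored
        qual = handle.readline().strip()
        n_reads += 1
        total_bases += len(seq)
        gc_bases += sum(1 for base in seq.upper() if base in "GC")
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
    if total_bases == 0:
        return {"reads": n_reads, "mean_length": 0.0,
                "gc_percent": 0.0, "mean_quality": 0.0}
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads,
        "gc_percent": 100.0 * gc_bases / total_bases,
        "mean_quality": qual_sum / total_bases,
    }


# Tiny made-up example: two reads, one GC-rich with high quality,
# one AT-only with the lowest possible quality.
demo = "@read1\nGGCC\n+\nIIII\n@read2\nATATAT\n+\n!!!!!!\n"
stats = fastq_stats(io.StringIO(demo))
print(stats)
```

For a real file you would pass an open file handle instead of the `io.StringIO` demo. Comparing `gc_percent` across samples is one simple way to spot the kind of outlier described above.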
Year: 2013
Bioinformatics for Beginners – How to get NGS data? Part 2. Reference sequences
The most obvious (and probably the most used) source of reference sequences (or any kind of sequences) is the NCBI Nucleotide database and its “sister” sites: the EBI European Nucleotide Archive, and the DNA Data Bank of Japan.
All three sites provide some kind of search functionality and a few (shorter) sequences can be downloaded from the result pages directly, in multiple formats. For larger reference sequences (e.g. full human or mouse chromosomes or full genomes) or a long list of references the ftp sites or batch query tools should be used.
Ftp sites:
Batch search/Download tools:
There are some other pages with a more limited focus that can be very useful for retrieving reference sequences:
Tip: the Genome Analysis Toolkit (GATK) needs the human chromosomes in a special order (karyotypic, to be precise). Different versions of the karyotypically sorted human genome are provided by the GATK team as a resource bundle, which can be downloaded from their ftp site.
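If you want to see what karyotypic ordering means in practice, the following Python sketch sorts chromosome names that way. The helper `karyotypic_key` is hypothetical, and the order it uses (autosomes numerically, then X, Y, and the mitochondrial sequence last) is an assumption of this example; always check the sequence dictionary of the reference you actually use, since some builds place the mitochondrial sequence elsewhere.

```python
def karyotypic_key(name):
    """Sort key: autosomes by number, then X, Y, mitochondrial, then the rest.

    The placement of the mitochondrial sequence is an assumption of this
    sketch; verify it against your reference's sequence dictionary.
    """
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    if label in special:
        return (special[label], 0, "")
    # Unplaced contigs and anything else go last, in alphabetical order.
    return (4, 0, label)


names = ["chr10", "chr2", "chrX", "chr1", "chrM", "chrY"]
print(sorted(names))                      # plain lexical sort puts chr10 before chr2
print(sorted(names, key=karyotypic_key))  # karyotypic-style order
```

The plain lexical sort is what you get from most default sorting tools, which is why a reference sorted that way may not match the order expected by downstream tools.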
Flashcard Fridays – Part 2. Let’s talk about variant callers!
If you’re interested in the underlying algorithm of variant callers, it’s always a safe bet to check out the paper(s) about the variant caller. So here are the papers for the two most commonly used variant callers:
US Supreme Court Rules on DNA Patents
A long debate comes to an end today as the US Supreme Court upends the patentability of genes! The Supreme Court decision opens opportunities for BRCA testing by next-generation sequencing, but how do you find those long indels? This is where we help!
Here is how our method was tested and its quality demonstrated: Considerations for clinical read alignment and mutational profiling using next-generation sequencing
Workflow Wednesdays – Part 2. Read preprocessing – File format conversion
A little bit about the Linux/Unix command line
First of all, let me tell you about the working environment I will use for this data analysis series. I have a laptop with a 4-core Intel processor and 2×4 GB of RAM. I use 64-bit Ubuntu (version 13.04 to be exact).
Several open-source bioinformatics tools are Linux/Unix only. If you’re not a Linux user, you can try a Linux terminal-like environment within your current operating system: check out Cygwin if you’re a Windows user. Mac supposedly has a native bash environment, but to tell you the truth, I don’t really have any experience with that.