Monday, September 16, 2013

Detecting mutations from RNA-Seq (with a demo)

Detecting mutations from RNA-Seq is not a typical approach to detect mutations, but has drawn much attention in recent days. There are several reasons for that:

First, many RNA-Seq experiments are done mainly to detect expression changes of genes, transcripts, alternative splicings, and it has no extra costs to use computational approach to identify mutations. Second, several inverstigations show that mutations detectable from RNA-Seq accounts for up to 40% of those detected by DNA sequencing, and for some highly expression genes, this percentage may reach 80% (for example, see this paper). Third, it is much easier to do the RNA-Seq than whole genome sequencing or whole exome sequencing in terms of the sequencing cost, the amount of samples required, etc.

There are currently many different tools to detect mutations from high-throughput sequencing, for example, GATK. However, these algorithms are specifically designed for mutation detection in DNA sequencing, not RNA sequencing. The statistical evaluation of mutations in these softwares may not be suitable for RNA-Seq.

Here we present rnaseqmut, an easy-to-use program to detection variants in RNA-Seq read alignments. rnaseqmut is a light-weight C++ program to detect variants (or mutations, including SNPs, indels, RNA editing events) from a single or multiple RNA-Seq BAM files. It offers the following features:
 
  • Perform de-novo mutation discovery from a given BAM file;
  • For a user-defined mutation list, calculate the read coverage, including reads that support reference allele (reference reads) and alternative allele (alternative reads), from a given BAM file; 
  • For a series of RNA-Seq samples, filter interesting mutations based on user-defined criteria. 

This software package includes a "core" C++ program, rnaseqmut, to call variants from a single BAM file, and a series of optional scripts, written in Python3 and bash, to identify putative interesting mutations in a group of RNA-Seq samples. Besides mutation detection from RNA-Seq, the "core" program (rnaseqmut) can also be used to call mutations from other high-throughput sequencing platforms, including ChIP-seq, DNA-Seq, etc.

For more information, check out in GitHub:

https://github.com/davidliwei/rnaseqmut



Demo: detecting mutations from a series of RNA-Seq BAM files



A demo script is provided in the demo directory, together with 4 sample RNA-Seq BAM files. In this demo, we provide two normal samples and two tumor samples, and would like to search for mutations that only occurs in one or more tumor samples, but not (or have low frequency) in any of the normal samples. We split this job into five steps:

  • Step 1, scan the samples individually and get the mutation list for each sample. In this demo we output all possible mutations (those even occur in only 1 RNA-Seq read), but in reality you may just need those with enough read support (controlled by -i/--min_read option).
  • Step 2, merge the mutation lists from individual samples. This will be the potentially interesting mutations we would like to investigate.
  • Step 3, scan the samples again, using the provided mutation list in Step 2.
  • Step 4, merge the mutations in step 3 into a big table. In this table, each row represents a mutation, and columns record the number of supporting reads (and the number of reference reads that do not support this mutation) in all samples. Based on this table, you can use your own criteria to search interesting mutations.
  • Step 5 illustrates an example of filtering interesting mutations using a python3 script "filtermut.py". In this demo, we define "control" samples as two normal samples, and would like to look for mutations that satisfy the ALL of the following criteria:
    1. occur in at least 1 non-control sample (controlled by -t/--min-recurrent option in filtermut.py) with at least 20% frequency (controlled by -f/--min-recfrac option) and 10 supporting reads (-d/--min-recread);
    2. do not have any supporting reads in control samples (-b/--max-alt); 
    3. have at least 4 non-supporting reads (reference reads) in control samples (-a/--min-ref). This requirement will exclude mutations with 0 read coverage in control samples thus we have no idea whether these mutations occur in control samples.

The final output is a VCF file recording mutations and read coverages in all 4 samples. You can also output tab-delimited file instead of VCF file for further downstream analysis (use -z/--no-vcf option in Step 5). For interpreting results, see the next section.

For more details, refer to the demo script and the usage of each programs/scripts.




Monday, February 11, 2013

Which tool is used for RNA-seq analysis?

Which tool is popular to perform RNA-seq analysis, including read mapping, gene/transcript expression level estimation and differential analysis? Here is the survey of which tool to used for RNA-seq analysis.

In summary, the majority will:


  • aligns the reads to the genome (or genome+transcriptome) using Tophat;
  • count reads using Cufflinks (the second choice is HTSeq-count, which is becoming popular);
  • perform differential expression using DESeq/DEXSeq (followed by CuffDiff);
  • use Ensemble (followed by Refseq/UCSC) as the annotation resource;
  • use GOSeq (followed by IPA and Genego Metacore) for downstream analysis.






Thursday, January 24, 2013

XML_C14N error when compiling libxmlsec


OK, as an bioinformatician I always work with compiling and running with many different tools. Here comes with the problem: I need to install the genefilter package in R, which is a great package for performing microarray analysis. In our institute cluster, it works well in the head node but not in other nodes. Further inspection shows that a static library, libxmlsec, is missing in these nodes. Compiling this library is painful as you have to deal with many different errors...

After a few days of struggle, I successfully compiled libxmlsec and now genefilter works fine for all nodes in the cluster. Here I just write down some error messages and my solutions, in cases others have the same problem.

XMLSEC compiling errors: XML_C14N_1_0 undeclared

The error message is something like:

c14n.c: In function 'xmlSecTransformC14NExecute':
c14n.c:423: error: 'XML_C14N_1_0' undeclared (first use in this function)
c14n.c:423: error: (Each undeclared identifier is reported only once
c14n.c:423: error: for each function it appears in.)
c14n.c:431: error: 'XML_C14N_1_1' undeclared (first use in this function)
c14n.c:439: error: 'XML_C14N_EXCLUSIVE_1_0' undeclared (first use in this function)

Solution:

XML_C14N_1_0 is defined in libxml2 header files, so check whether your libxml2 has some problems. More specifically, XML_C14N_1_0 is defined in "c14n.h", and older libxml2 headers this is not defined. For me this error occurs since I have an older version of libxml2 in the system, which the compiler overrides a newer version in my home directory. Here is the following command the MAKE used when error occurs:

gcc -DHAVE_CONFIG_H -I. -I.. -DPACKAGE=\"xmlsec1\" -I../include -I../include -D__XMLSEC_FUNCTION__=__FUNCTION__ -DXMLSEC_NO_SIZE_T -DXMLSEC_NO_GOST=1 -DXMLSEC_NO_XKMS=1 -DXMLSEC_DL_LIBLTDL=1, -I/usr/include/libxml2 -I/home/include/libxml2 -g -O2 -MT c14n.lo -MD -MP -MF .deps/c14n.Tpo -c c14n.c  -fPIC -DPIC -o .libs/c14n.o

Look at the include option (-I) in the above example: the system libxml2 headers are in front of my own libxml2 headers, so the system libxml2 headers (which are the older ones) are used instead of the newer libxml2 headers. My solution is to add a symbolic link in the "include" directory of libxml2 pointing to the new libxml2 headers in my home directory:

wl948@header:~/prog/libxmlsec/xmlsec1-1.2.18/include$ ln -s $HOME/include/libxml2/libxml libxml

which solve this issue because right now the newer libxml2 headers are in front of the system libxml2 files.

I'm sure there should be alternative solutions in the configure program by providing a proper option (probably by setting some environment variables) but I didn't check. Anyone has any ideas?