All About Bioinformatics: Detecting mutations from RNA-Seq (with a demo)

Monday, September 16, 2013

Detecting mutations from RNA-Seq (with a demo)

Detecting mutations from RNA-Seq is not a typical approach to detect mutations, but has drawn much attention in recent days. There are several reasons for that:

First, many RNA-Seq experiments are done mainly to detect expression changes of genes, transcripts, alternative splicings, and it has no extra costs to use computational approach to identify mutations. Second, several inverstigations show that mutations detectable from RNA-Seq accounts for up to 40% of those detected by DNA sequencing, and for some highly expression genes, this percentage may reach 80% (for example, see this paper). Third, it is much easier to do the RNA-Seq than whole genome sequencing or whole exome sequencing in terms of the sequencing cost, the amount of samples required, etc.

There are currently many different tools to detect mutations from high-throughput sequencing, for example, GATK. However, these algorithms are specifically designed for mutation detection in DNA sequencing, not RNA sequencing. The statistical evaluation of mutations in these softwares may not be suitable for RNA-Seq.

Here we present rnaseqmut, an easy-to-use program to detection variants in RNA-Seq read alignments. rnaseqmut is a light-weight C++ program to detect variants (or mutations, including SNPs, indels, RNA editing events) from a single or multiple RNA-Seq BAM files. It offers the following features:

Perform de-novo mutation discovery from a given BAM file;
For a user-defined mutation list, calculate the read coverage, including reads that support reference allele (reference reads) and alternative allele (alternative reads), from a given BAM file;
For a series of RNA-Seq samples, filter interesting mutations based on user-defined criteria.

This software package includes a "core" C++ program, rnaseqmut, to call variants from a single BAM file, and a series of optional scripts, written in Python3 and bash, to identify putative interesting mutations in a group of RNA-Seq samples. Besides mutation detection from RNA-Seq, the "core" program (rnaseqmut) can also be used to call mutations from other high-throughput sequencing platforms, including ChIP-seq, DNA-Seq, etc.

For more information, check out in GitHub:

https://github.com/davidliwei/rnaseqmut

Demo: detecting mutations from a series of RNA-Seq BAM files

A demo script is provided in the demo directory, together with 4 sample RNA-Seq BAM files. In this demo, we provide two normal samples and two tumor samples, and would like to search for mutations that only occurs in one or more tumor samples, but not (or have low frequency) in any of the normal samples. We split this job into five steps:

Step 1, scan the samples individually and get the mutation list for each sample. In this demo we output all possible mutations (those even occur in only 1 RNA-Seq read), but in reality you may just need those with enough read support (controlled by -i/--min_read option).

Step 2, merge the mutation lists from individual samples. This will be the potentially interesting mutations we would like to investigate.

Step 3, scan the samples again, using the provided mutation list in Step 2.

Step 4, merge the mutations in step 3 into a big table. In this table, each row represents a mutation, and columns record the number of supporting reads (and the number of reference reads that do not support this mutation) in all samples. Based on this table, you can use your own criteria to search interesting mutations.

Step 5 illustrates an example of filtering interesting mutations using a python3 script "filtermut.py". In this demo, we define "control" samples as two normal samples, and would like to look for mutations that satisfy the ALL of the following criteria:

occur in at least 1 non-control sample (controlled by -t/--min-recurrent option in filtermut.py) with at least 20% frequency (controlled by -f/--min-recfrac option) and 10 supporting reads (-d/--min-recread);
do not have any supporting reads in control samples (-b/--max-alt);
have at least 4 non-supporting reads (reference reads) in control samples (-a/--min-ref). This requirement will exclude mutations with 0 read coverage in control samples thus we have no idea whether these mutations occur in control samples.

The final output is a VCF file recording mutations and read coverages in all 4 samples. You can also output tab-delimited file instead of VCF file for further downstream analysis (use -z/--no-vcf option in Step 5). For interpreting results, see the next section.

For more details, refer to the demo script and the usage of each programs/scripts.

36 comments:

tommySeptember 17, 2013 at 6:11 AM
Thanks! it looks very promising!
I am analyzing some RNA-seq data for differential expressed genes. So, why not try to find some mutations at the same time! one stone two birds!
ReplyDelete
Replies
AnonymousOctober 8, 2013 at 3:53 PM
Look promising! But, I'm getting the following error:

rnaseqmut: error while loading shared libraries: libbamtools.so.2.3.0: cannot open shared object file: No such file or directory

Would you have any idea why? Thanks for the code!
ReplyDelete
Replies
RyanNovember 6, 2013 at 6:13 AM
Nice. Has this been submitted for publication?
ReplyDelete
Replies
MichaelAngeloJuly 30, 2014 at 6:25 AM
Hi Li,

I am using rnaseqmut version 0.6

I am using following command

rnaseqmut -r hg19.fa -i 5 K_RNAseq.bam

Now, how can I specify output file?

Thank you
ReplyDelete
Replies
UnknownOctober 31, 2014 at 2:57 PM
Hi Li!
I have a question for you.. The input bam for your program has to be the result of aligning my rnaseq reads against my reference genome or my reference transcriptome? or it could work with either of them?
Thanks in advance!
ReplyDelete
Replies
UnknownMay 11, 2015 at 1:14 PM
Hi Wei Li

Can you please provide python script for analysis of cuffdiff results. I did it with R using cummeRbund. But, now I want to do this with python can you help me. Thanks in Advance.
ReplyDelete
Replies
UnknownJuly 16, 2015 at 6:54 AM
Hi Li,
I have a question for you , how to track the genomic coordinates of amino acids
ReplyDelete
Replies
UnknownAugust 4, 2015 at 8:29 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownAugust 5, 2015 at 3:23 AM
hi Li and others,
This tools looks really promising. For my analysis, I dont have genome and I use only reference transcriptome. I dont have Control samples but only treated samples where mutation was induced. All I want to find point mutation for every transcript in treated sample against my reference transcriptome. For an example I want find number of times nucleotide "A" in reference transcript has been mutated into "G" in treated transcript and vice versa. Is it possible to obtain this information using rnaseqmut ? if it is possible should I do all the 5 steps?

Kindly guide me
ReplyDelete
Replies
Candy SwiftOctober 21, 2015 at 1:44 AM
RNA sequencing is indeed a powerful tool in detecting mutations. But in terms of conducting it, there are many details we have to pay attention to.
ReplyDelete
Replies
UnknownJanuary 16, 2016 at 5:47 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJanuary 16, 2016 at 5:56 PM
Hi Li,

Thanks. I was really looking for this kind of software.
I have one question.

I compared the output data by rnaseqmut with IGV. I tried to remove all the filters as I can, but there were some discrepancies.
Could you tell us the details of the built-in filters of rnaseqmut?

Many thanks, again.

Tak
ReplyDelete
Replies
UnknownMarch 21, 2016 at 3:52 AM
Hi Li,

I'm also interested in the answer of Tak's question and is the tool now published?

Thanks for the great tool.

Best, Danny.
ReplyDelete
Replies
YifangtJune 3, 2016 at 2:42 PM
This comment has been removed by the author.
ReplyDelete
Replies
YifangtJune 3, 2016 at 2:51 PM
This comment has been removed by the author.
ReplyDelete
Replies
YifangtJune 3, 2016 at 2:53 PM
Hello Wei!
While I was trying this tool for my RNAseq data, I got bunch of error messages such as:
Error while processing read 1101827 Read mapped position: 1AL:17223038, read ID:7001343F:49:C8DNJANXX:7:1204:12146:101020.
Error: reading MD tags in the read alignments, skipping the reads.
Error: Cannot obtain mutation information
......
The error messages came from step 3 for the second calling with the mutation list, and my command line is:
rnaseqmut -l ${OUTPUT_DIR}/ALL_MUTATION_LIST.tab ${INPUT_DIR}/HPI48_BB.Aligned.sortedByCoord.out.bam > ${OUTPUT_DIR}/BB.2nd.tab

My bam files were created with STAR and sorted & indexed with samtools 1.3.

Do you have any clue of what this means, or what I have missed?
Thank you!
ReplyDelete
Replies
YifangtJune 8, 2016 at 8:43 AM
Sorry for the late response.
$ samtools view -B HPI48_BB.Aligned.sortedByCoord.out.bam | head
7001343F:52:C8GPHANXX:8:1112:7727:41602 419 1AS 4958 3 126M = 5008 176 CTGCAATGTCAGTGTTTGAGTCGAAGTACAGAGAAGGCCTCACAGTAAGTGAACAAACGCAACAGTTCATGCACTTGACTATGTGGATATCTTAGTATTGCACGGGTGGTGACTTTGCTTTAGCAG BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2 HI:i:2 AS:i:250 nM:i:0
7001343F:52:C8GPHANXX:8:1112:7727:41602 339 1AS 5008 3 126M = 4958 -176 GAACAAACGCAACAGTTCATGCACTTGACTATGTGGATATCTTAGTATTGCACGGGTGGTGACTTTGCTTTAGCAGATCAGTGCTTGTTCTTTGTGTATCATTTTTTGTAACAATATGTACTAACC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NH:i:2 HI:i:2 AS:i:250 nM:i:0
7001343F:52:C8GPHANXX:8:1315:7390:40843 163 1AS 58926 1 126M = 59053 253 CTCGCATCGACTCGACGCTCTGGCCTGAGGAGACACTTCAGAATGACCTGGAGTCGCTGATGGCCCGCTTGAACACGATCCCTGGTCAAGTGCAGGAGTGGAAGAAGTCCTCGGCTCGGTGTGGTG BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF7BBFFFF NH:i:3 HI:i:1 AS:i:246 nM:i:2
7001343F:52:C8GPHANXX:8:2212:15939:9932 163 1AS 58980 1 126M = 59039 185 CGCTGATGGCCCGCTTGAACACGATCCCTGGTCGAGTGCAGGAGTGGAAGAAGTCCTCGGCTCGGTGTGGTGCAGATGTCGCTCTGTGTCTGGCCCGAGTCCACTGCAAAGATGCGCGAGAGGAGA BBBBBBFFFFBFFFFFFFFFFFFFFFFFFFFFFFBFB/F<FFBFFFFFFFFFFFFFFFFBF/FFFFFFBFFBFFFFFFFFFFFFBFFBFFFF/BF<B/FFFFBFFFF/FFFFFFFFFFFFFFFFFB NH:i:3 HI:i:1 AS:i:250 nM:i:0
7001343F:52:C8GPHANXX:8:1106:12532:6240 163 1AS 59034 1 126M = 59035 127 CCTCGGCTCGGTGTGGTGCAGATGTCGCTCTGTGTCTGGCCCGAGTCCACTGCAAAGATGCGCGAGAGGAGAAGCTGGCGGCCCTTCGGGTGGCCAATACCAAGAAGCACGACTTCAGGTCCTTTA BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB NH:i:3 HI:i:1 AS:i:250 nM:i:0
7001343F:52:C8GPHANXX:8:1106:12532:6240 83 1AS 59035 1 126M = 59034 -127 CTCGGCTCGGTGTGGTGCAGATGTCGCTCTGTGTCTGGCCCGAGTCCACTGCAAAGATGCGCGAGAGGAGAAGCTGGCGGCCCTTCGGGTGGCCAATACCAAGAAGCACGACTTCAGGTCCTTTAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NH:i:3 HI:i:1 AS:i:250 nM:i:0
7001343F:52:C8GPHANXX:8:2212:15939:9932 83 1AS 59039 1 126M = 58980 -185 GCTCGGTGTGGTGCAGATGTCGCTCTGTGTCTGGCCCGAGTCCACTGCAAAGATGCGCGAGAGGAGAAGCTGGCGGCCCTTCGGGTGGCCAATAC

Thanks again!
ReplyDelete
Replies
YifangtJune 8, 2016 at 2:04 PM
I found out the problem is the missing MD tag!
After the MD tag is restored (samtools calmd -beAr), the error disappeared.
Thank you anyway!
ReplyDelete
Replies
RohanSeptember 28, 2016 at 2:02 PM
I have only tumor samples.So I ran my mutation analysis until step4 to get the mutations.
I have passed the output from Step4 to do annotations using snpEff but I am getting error:

VcfFileIterator.parseVcfLine(115): Fatal error reading file '/tmp/713110.1.standard.q/mutation_anno/top1000.txt' (line: 2):
chr1 13684 C T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '0'

How can I get rid of this,y modifiying the output of this file so that read by snpEFF?
ReplyDelete
Replies
nhóc trứng gàJuly 7, 2017 at 7:21 AM
Hi Li,

For step 2, usage: merge2ndvcf.py [-h] [-l LABEL] [-x REGION] [-m] [-v] [-r MIN_READ] ...I am not sure how to merge my control and treated sample mutation files (.txt file) obtained rnaseqmut together with this command. First, how could I specify the files to be merged here? is that -x REGION is where I should specific the directory of .txt file? Could you give me a specific example by using your demo files in this case? Thank you very much
ReplyDelete
Replies
UnknownJuly 11, 2017 at 12:50 AM
Hi,

I would like to use rnaseqmut to extract the read counts for mutant counts. I am using the following command:

bin/rnaseqmut -l AB1.list.txt AB1.RNA.mm10.bam

The list is a tab delimited text file that looks like the following:
chr10 4353031 A C
chr10 5318418 C A
chr10 10689642 G C
chr10 11079958 T A
chr10 11157228 T A

What the program shows is this:

BAM file:AB1.RNA.mm10.bam
Mut span:4
Mutation list:AB1.list.txt
Min read:1
Max mismatch:1
Reference genome:
Reading 1 lines, 0 chromosomes.

I am not sure what I am doing wrong. Can you help?
ReplyDelete
Replies
AnonymousSeptember 2, 2018 at 9:32 AM
I'm 15 years old. I was born with HIV my mother passed away because of the HIV infection And I regret why i never met Dr Itua he could have cured my mum for me because as a single mother it was very hard for my mother I came across Dr itua healing words online about how he cure different disease in different races diseases like HIV/Aids Herpes Copd Diabetes Hepatitis even Cancer I was so excited but frighten at same time because I haven't come across such thing article online then I contacted Dr Itua on Mail drituaherbalcenter@gmail.com I also chat with him on what's app +2348149277967 he tells me how it works then I tell him I want to proceed I paid him so swiftly Colorado post office I receive my herbal medicine within 4/5 working days he gave me guild lines to follow and here am I living healthy again can imagine how god use men to manifest his works am I writing in all articles online to spread the god work of Dr Itua Herbal Medicine Yes can ask me anything about this on my twitter @ericamilli or text 7205992850.He's a Great Man.
ReplyDelete
Replies
UnknownJanuary 7, 2019 at 10:24 PM
Hi,

I have an problem with processing RNA-seq data mapped to HG38. In the rnaseqmut results I got some strange chromosom positions, see example below.

chr10 4353031 A C
chr10 5318418 C A
chr10 -1 G T
chr10 10689642 G C
...

I hove some positions with -1, what can be the reason for that?

Would be nice if someone can help solving my confusion.

Best, Danny.
ReplyDelete
Replies
PrincessJune 4, 2019 at 8:32 AM
Hi,

Thank you so much for this tool. It works perfectly for me, however I am wondering why all of the QUAL score for the resulting VCF files are set to 1.0. Is there a way to calculate the actual QUAL scores?

I am doing a further analysis comparing results from WGS mutations to RNAseq mutations for the same samples.

Thanks, Princesca!
ReplyDelete
Replies

Add comment