tag:blogger.com,1999:blog-59675650478112984922024-03-27T00:21:38.819-07:00All About BioinformaticsHi, I'm maintaining this blog to post news, tips, ideas about bioinformatics, computational biology, next generation sequencing (NGS) and everything. Hope you will enjoy it!Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.comBlogger9125tag:blogger.com,1999:blog-5967565047811298492.post-11930190992017448012013-09-16T14:05:00.000-07:002013-09-17T08:32:35.932-07:00Detecting mutations from RNA-Seq (with a demo)<span style="font-family: inherit;">Detecting mutations from RNA-Seq is not a typical approach to detect mutations, but has drawn much attention in recent days. There are several reasons for that:</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">First, many RNA-Seq experiments are done mainly to detect expression changes of genes, transcripts, alternative splicings, and it has no extra costs to use computational approach to identify mutations.</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">Second, several inverstigations show that mutations detectable from RNA-Seq accounts for up to 40% of those detected by DNA sequencing, and for some highly expression genes, this percentage may reach 80% (for example, see </span><a href="http://genomebiology.com/content/11/5/R57" style="font-family: inherit;" target="_blank">this paper</a><span style="font-family: inherit;">). </span><span style="font-family: inherit;">Third, it is much easier to do the RNA-Seq than whole genome sequencing or whole exome sequencing in terms of the sequencing cost, the amount of samples required, etc.</span><br />
<br />
There are currently many different tools to detect mutations from high-throughput sequencing, for example, <a href="http://www.broadinstitute.org/gatk/" target="_blank">GATK</a>. However, these algorithms are specifically designed for mutation detection in DNA sequencing, not RNA sequencing. The statistical evaluation of mutations in these softwares may not be suitable for RNA-Seq.<br />
<pre style="color: #333333; font-size: 15px; white-space: pre-wrap;"></pre>
Here we present <a href="https://github.com/davidliwei/rnaseqmut" target="_blank"><span style="color: black; font-size: large;"><b>rnaseqmut</b></span></a>, an easy-to-use program to detection variants in RNA-Seq read alignments. rnaseqmut is a light-weight C++ program to detect variants (or mutations, including SNPs, indels, RNA editing events) from a single or multiple RNA-Seq BAM files. It offers the following features:<br />
<div>
<br />
<ul>
<li>Perform de-novo mutation discovery from a given BAM file;</li>
<li>For a user-defined mutation list, calculate the read coverage, including reads that support reference allele (reference reads) and alternative allele (alternative reads), from a given BAM file; </li>
<li>For a series of RNA-Seq samples, filter interesting mutations based on user-defined criteria. </li>
</ul>
</div>
<div>
<br /></div>
<div>
This software package includes a "core" C++ program, rnaseqmut, to call variants from a single BAM file, and a series of optional scripts, written in Python3 and bash, to identify putative interesting mutations in a group of RNA-Seq samples.
Besides mutation detection from RNA-Seq, the "core" program (rnaseqmut) can also be used to call mutations from other high-throughput sequencing platforms, including ChIP-seq, DNA-Seq, etc.<br />
<div>
<br /></div>
<span style="font-family: inherit; font-size: large;">For more information, check out in GitHub:</span><br />
<span style="font-family: inherit;"><br /></span>
<a href="https://github.com/davidliwei/rnaseqmut"><span style="font-family: inherit;">https://github.com/davidliwei/rnaseqmut</span></a><br />
<span style="font-family: inherit;"><br /></span>
<br />
<br />
<h3>
<span style="font-size: large;"><b>Demo: detecting mutations from a series of RNA-Seq BAM files</b></span></h3>
<br />
<br />
A demo script is provided in the demo directory, together with 4 sample RNA-Seq BAM files. In this demo, we provide two normal samples and two tumor samples, and would like to search for mutations that only occurs in one or more tumor samples, but not (or have low frequency) in any of the normal samples. We split this job into five steps:</div>
<div>
<br /></div>
<div>
<ul>
<li><span style="font-size: large;">Step 1</span>, scan the samples individually and get the mutation list for each sample. In this demo we output all possible mutations (those even occur in only 1 RNA-Seq read), but in reality you may just need those with enough read support (controlled by <span style="color: blue;">-i/--min_read</span> option).</li>
</ul>
<ul>
<li><span style="font-size: large;">Step 2</span>, merge the mutation lists from individual samples. This will be the potentially interesting mutations we would like to investigate.</li>
</ul>
<ul>
<li><span style="font-size: large;">Step 3</span>, scan the samples again, using the provided mutation list in Step 2.</li>
</ul>
<ul>
<li><span style="font-size: large;">Step 4,</span> merge the mutations in step 3 into a big table. In this table, each row represents a mutation, and columns record the number of supporting reads (and the number of reference reads that do not support this mutation) in all samples. Based on this table, you can use your own criteria to search interesting mutations.</li>
</ul>
<ul>
<li><span style="font-size: large;">Step 5</span> illustrates an example of filtering interesting mutations using a python3 script "filtermut.py". In this demo, we define "control" samples as two normal samples, and would like to look for mutations that satisfy the ALL of the following criteria:</li>
</ul>
<ol><ol>
<li>occur in at least 1 non-control sample (controlled by <span style="color: blue;">-t/--min-recurrent</span> option in filtermut.py) with at least 20% frequency (controlled by <span style="color: blue;">-f/--min-recfrac</span> option) and 10 supporting reads (<span style="color: blue;">-d/--min-recread</span>);</li>
<li>do not have any supporting reads in control samples (<span style="color: blue;">-b/--max-alt</span>); </li>
<li>have at least 4 non-supporting reads (reference reads) in control samples (<span style="color: blue;">-a/--min-ref</span>). This requirement will exclude mutations with 0 read coverage in control samples thus we have no idea whether these mutations occur in control samples.</li>
</ol>
</ol>
<br />
The final output is a VCF file recording mutations and read coverages in all 4 samples. You can also output tab-delimited file instead of VCF file for further downstream analysis (use <span style="color: blue;">-z/--no-vcf</span> option in Step 5). For interpreting results, see the next section. <br />
<br />
For more details, refer to the demo script and the usage of each programs/scripts.<br />
<br />
<br />
<br />
<span style="font-family: inherit;"><br /></span></div>
Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com36tag:blogger.com,1999:blog-5967565047811298492.post-15241830580714223342013-02-11T12:38:00.002-08:002013-03-05T08:24:25.152-08:00Which tool is used for RNA-seq analysis?Which tool is popular to perform RNA-seq analysis, including read mapping, gene/transcript expression level estimation and differential analysis? <a href="http://www.biostars.org/p/58299/" target="_blank">H</a><a href="http://www.biostars.org/p/58299/" target="_blank">ere</a> is the survey of which tool to used for RNA-seq analysis.<br />
<br />
In summary, the majority will:<br />
<br />
<br />
<ul>
<li>aligns the reads to the genome (or genome+transcriptome) using Tophat;</li>
<li>count reads using Cufflinks (the second choice is HTSeq-count, which is becoming popular);</li>
<li>perform differential expression using DESeq/DEXSeq (followed by CuffDiff);</li>
<li>use Ensemble (followed by Refseq/UCSC) as the annotation resource;</li>
<li>use GOSeq (followed by IPA and Genego Metacore) for downstream analysis.</li>
</ul>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com4tag:blogger.com,1999:blog-5967565047811298492.post-11117272118230201232013-01-24T14:22:00.000-08:002013-01-24T14:22:00.942-08:00XML_C14N error when compiling libxmlsec<br />
OK, as an bioinformatician I always work with compiling and running with many different tools. Here comes with the problem: I need to install the <a href="http://www.bioconductor.org/packages/2.12/bioc/html/genefilter.html" target="_blank">genefilter</a> package in R, which is a great package for performing microarray analysis. In our institute cluster, it works well in the head node but not in other nodes. Further inspection shows that a static library, <a href="http://www.aleksey.com/xmlsec/" target="_blank">libxmlsec</a>, is missing in these nodes. Compiling this library is painful as you have to deal with many different errors...<br />
<br />
After a few days of struggle, I successfully compiled libxmlsec and now genefilter works fine for all nodes in the cluster. Here I just write down some error messages and my solutions, in cases others have the same problem.<br />
<br />
<h3>
XMLSEC compiling errors: XML_C14N_1_0 undeclared</h3>
<div>
The error message is something like:</div>
<div>
<br /></div>
<div>
<pre style="white-space: pre-wrap;">c14n.c: In function 'xmlSecTransformC14NExecute':
c14n.c:423: error: 'XML_C14N_1_0' undeclared (first use in this function)
c14n.c:423: error: (Each undeclared identifier is reported only once
c14n.c:423: error: for each function it appears in.)
c14n.c:431: error: 'XML_C14N_1_1' undeclared (first use in this function)
c14n.c:439: error: 'XML_C14N_EXCLUSIVE_1_0' undeclared (first use in this function)
</pre>
</div>
<div>
<br /></div>
<div>
<b>Solution:</b></div>
<div>
<br /></div>
<div>
<span style="white-space: pre-wrap;">XML_C14N_1_0 is defined in libxml2 header files, so c</span>heck whether your libxml2 has some problems. More specifically, <span style="white-space: pre-wrap;">XML_C14N_1_0 is defined in "c14n.h", and</span> older libxml2 headers this is not defined. For me this error occurs since I have an older version of libxml2 in the system, which the compiler overrides a newer version in my home directory. Here is the following command the MAKE used when error occurs:</div>
<div>
<br /></div>
<div>
<i>gcc -DHAVE_CONFIG_H -I. -I.. -DPACKAGE=\"xmlsec1\" -I../include -I../include -D__XMLSEC_FUNCTION__=__FUNCTION__ -DXMLSEC_NO_SIZE_T -DXMLSEC_NO_GOST=1 -DXMLSEC_NO_XKMS=1 -DXMLSEC_DL_LIBLTDL=1, <b>-I/usr/include/libxml2 -I/home/include/libxml2</b> -g -O2 -MT c14n.lo -MD -MP -MF .deps/c14n.Tpo -c c14n.c -fPIC -DPIC -o .libs/c14n.o</i></div>
<div>
<br /></div>
<div>
Look at the include option (-I) in the above example: the system libxml2 headers are in front of my own libxml2 headers, so the system libxml2 headers (which are the older ones) are used instead of the newer libxml2 headers. My solution is to add a symbolic link in the "<span style="color: blue;">include</span>" directory of libxml2 pointing to the new libxml2 headers in my home directory:</div>
<div>
<br /></div>
<div>
<div>
wl948@header:~/prog/libxmlsec/xmlsec1-1.2.18/include$ ln -s $HOME/include/libxml2/libxml libxml</div>
</div>
<div>
<br /></div>
<div>
which solve this issue because right now the newer libxml2 headers are in front of the system libxml2 files.</div>
<div>
<br /></div>
<div>
I'm sure there should be alternative solutions in the configure program by providing a proper option (probably by setting some environment variables) but I didn't check. Anyone has any ideas?</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com3tag:blogger.com,1999:blog-5967565047811298492.post-43549816458227631082012-04-06T15:23:00.001-07:002013-03-05T08:18:08.953-08:00Estimating paired-end read insert length from SAM/BAM filesI wrote a single Python script to estimate the paired-end read insert length (or fragment length) from read mapping information (i.e., SAM/BAM files). The algorithm is simple: check the TLEN field in the SAM format, throw out pair-end reads whose pairs are too far away, and use them to estimate the mean and variance of the insert length.<br />
<br />
This script is also able to provide a detailed distribution of read length and read span for your convenience. Please refer to the detailed usage below.<br />
<br />
This script is distributed in GitHub now.<br />
<div>
<br /></div>
<div>
<b><span style="font-size: large;">Usage</span></b>:</div>
<div>
<br /></div>
<div>
<span style="font-family: Verdana, sans-serif;">getinsertsize.py [ SAM file | -]</span><br />
<span style="font-family: Verdana, sans-serif;"><sam file=""><br /></sam></span>
<sam file=""><span style="font-family: inherit;">or</span></sam><br />
<span style="font-family: Verdana, sans-serif;"><sam file=""><br /></sam></span>
<span style="font-family: Verdana, sans-serif;"><sam file="">samtools view [ BAM file ] <bam file="">| getinsertsize.py - </bam></sam></span><br />
<span style="font-family: Verdana, sans-serif;"><sam file=""><bam file=""><br /></bam></sam></span>
<span style="font-family: Verdana, sans-serif;"><sam file=""><bam file=""><br /></bam></sam></span>
<span style="font-family: Verdana, sans-serif;"><sam file=""><bam file=""><b>Detailed Usage:</b></bam></sam></span><br />
<span style="font-family: Verdana, sans-serif;"><sam file=""><bam file=""><br /></bam></sam></span>
<span style="font-family: Verdana, sans-serif;"><sam file=""><bam file=""></bam></sam></span><br />
<span style="font-family: Verdana, sans-serif;">usage: getinsertsize.py [-h] [--span-distribution-file SPAN_DISTRIBUTION_FILE]</span><br />
<span style="font-family: Verdana, sans-serif;"> [--read-distribution-file READ_DISTRIBUTION_FILE]</span><br />
<span style="font-family: Verdana, sans-serif;"> SAMFILE</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Automatically estimate the insert size of the paired-end reads for a given</span><br />
<span style="font-family: Verdana, sans-serif;">SAM/BAM file.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">positional arguments:</span><br />
<span style="font-family: Verdana, sans-serif;"> SAMFILE Input SAM file (use - from standard input)</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">optional arguments:</span><br />
<span style="font-family: Verdana, sans-serif;"> -h, --help show this help message and exit</span><br />
<span style="font-family: Verdana, sans-serif;"> --span-distribution-file SPAN_DISTRIBUTION_FILE, -s SPAN_DISTRIBUTION_FILE</span><br />
<span style="font-family: Verdana, sans-serif;"> Write the distribution of the paired-end read span</span><br />
<span style="font-family: Verdana, sans-serif;"> into a text file with name SPAN_DISTRIBUTION_FILE.</span><br />
<span style="font-family: Verdana, sans-serif;"> This text file is tab-delimited, each line containing</span><br />
<span style="font-family: Verdana, sans-serif;"> two numbers: the span and the number of such paired-</span><br />
<span style="font-family: Verdana, sans-serif;"> end reads.</span><br />
<span style="font-family: Verdana, sans-serif;"> --read-distribution-file READ_DISTRIBUTION_FILE, -r READ_DISTRIBUTION_FILE</span><br />
<span style="font-family: Verdana, sans-serif;"> Write the distribution of the paired-end read length</span><br />
<span style="font-family: Verdana, sans-serif;"> into a text file with name READ_DISTRIBUTION_FILE.</span><br />
<span style="font-family: Verdana, sans-serif;"> This text file is tab-delimited, each line containing</span><br />
<span style="font-family: Verdana, sans-serif;"> two numbers: the read length and the number of such</span><br />
<span style="font-family: Verdana, sans-serif;"> paired-end reads.</span><br />
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
</div>
<div>
<br /></div>
<div>
<b><span style="font-size: large;">Sample output:</span></b></div>
<div>
<div>
<br /></div>
<div>
<span style="color: blue;">Read length: mean 90.6697303194, STD=15.9446036414</span></div>
<div>
<span style="color: blue;">Possible read length and their counts:</span></div>
<div>
<span style="color: blue;">{108: 43070882, 76: 50882326}</span></div>
<div>
<span style="color: blue;">Read span: mean 165.217445903, STD=32.8914834802</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b><span style="font-size: large;">Note</span></b>: If the SAM/BAM file size is too large, it is accurate enough to estimate based on a few reads (like 1 millioin). In this case, you can run the script as follows:</div>
<div>
<br /></div>
<div>
<span style="font-family: Verdana, sans-serif;">head -n 1000000 [ SAM file ] <sam file="">| getinsertsize.py -</sam></span></div>
<div>
<br /></div>
<div>
or</div>
<div>
<br /></div>
<div>
<span style="font-family: Verdana, sans-serif;">samtools view [ BAM file ] <bam file="">| head -n 1000000 | getinsertsize.py -</bam></span></div>
<div>
<br />
<b><span style="font-size: large;">Note</span></b>: According to the SAM definition, the read span "equals the number of bases from the leftmost mapped base to the rightmost mapped base". This span is the distance between two reads in a paired-end read PLUS 2 times read length. Read span is different from the "mate-inner-distance" in Tophat (-r option), which measures only the distance between two reads in a paired-end read.<br />
<br /></div>
<script src="https://gist.github.com/2323462.js?file=getinsertsize.py">
</script>Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com47tag:blogger.com,1999:blog-5967565047811298492.post-20928443414312891042012-02-08T16:41:00.000-08:002012-02-16T14:58:26.464-08:00RNA-Seq Read SimulatorRNA-Seq is now a common protocol to study the expression of genes or transcripts. For research purposes, there are many simulators to simulate RNA-Seq reads, like <a href="http://flux.sammeth.net/simulator.html" target="_blank">Flux Simulator</a>. But many times I found it hard to use because: 1) there are complicated parameters, 2) it requires large memory and 3) it crushes frequently. So I wrote a few scripts to generate simulated RNA-Seq reads, and publish them in a package "RNASeqReadSimulator".<br />
<br />
RNASeqReadSimulator is a set of scripts generating simulated RNA-Seq reads. It provides users a simple tool to generate RNA-Seq reads for research purposes, and a framework to allow experienced users to expand functions. RNASeqReadSimulator offers the following features:<br />
<br />
<br />
<ol>
<li>It allows users to randomly assign expression levels of transcripts and generate simulated single-end or paired-end RNA-Seq reads. </li>
<li>It is able to generate RNA-Seq reads that have a specified positional bias profile. </li>
<li>It is able to simulate random read errors from sequencing platforms. </li>
<li>The simulator consists of a few simple Python scripts. All scripts are command line driven, allowing users to invoke and design more functions.</li>
</ol>
<div>
The webpage of RNASeqReadSimulator is <a href="http://www.cs.ucr.edu/~liw/rnaseqreadsimulator.html">here</a>. You can find the source code in <a href="https://github.com/davidliwei/RNASeqReadSimulator">GitHub</a>.</div>Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com3tag:blogger.com,1999:blog-5967565047811298492.post-88993441501073010222011-08-24T11:35:00.000-07:002011-08-24T11:35:00.758-07:00Extract insert size of paired-end reads from Cufflinks outputThe new version of <a href="http://cufflinks.cbcb.umd.edu/manual.html">Cufflinks </a>tries to estimate the mean and std value of paired-end read insert size automatically. Sometimes we need to extract such information to check our sequencing data. Here I wrote a simple script to search Cufflinks outputs.<br />
<br />
The script is written in Python. The implementation is easy: search for key words (like "Estimated Mean", etc) in log files. The usage is as follows:<br />
<br />
<br />
<span class="Apple-style-span" style="font-size: x-small;">This script extracts insert size information from Cufflinks logs.</span><br />
<span class="Apple-style-span" style="font-size: x-small;">Usage: getinsertsize [cufflinks log file]</span><br />
<span class="Apple-style-span" style="font-size: x-small;">Note: you may specify different log files using filename wildcards.</span><br />
<div><br />
</div><br />
<br />
Sample output:<br />
<br />
<span class="Apple-style-span" style="font-size: x-small;">File MapMass ReadLength Mean Std</span><br />
<span class="Apple-style-span" style="font-size: x-small;">cufflinksinvoke.sh.e2148442 34108402.45 85 150.35 19.18</span><br />
<span class="Apple-style-span" style="font-size: x-small;">cufflinksinvoke.sh.e2148443 31007287.59 85 155.19 17.45</span><br />
<span class="Apple-style-span" style="font-size: x-small;">cufflinksinvoke.sh.e2148444 37286038.02 85 123.87 25.60</span><br />
<br />
Source code:<br />
<br />
<script src="https://gist.github.com/1168809.js?file=getinsertsize.py">
</script><br />
<br />
Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com6tag:blogger.com,1999:blog-5967565047811298492.post-5762098218388510082011-08-19T17:45:00.000-07:002011-09-02T16:28:59.752-07:00Qseq and export file format of IlluminaMost Illumina NGS data files we face are <a href="http://en.wikipedia.org/wiki/FASTQ_format">FASTQ/FASTA</a> formats, which include the read sequence and (possible) quality scores. If reads are mapped to the chromosome or reference sequence, <a href="http://samtools.sourceforge.net/">SAM/BAM</a> file formats are common. However, sometimes we see other formats instead of these common file formats, for example .qseq or _export files. They are generated by earlier Illumina machines and it's best to convert them to commonly used FASTQ/SAM formats.<br />
<br />
QSEQ file is the raw read file; export file format records the mapping location of the read, but the read sequence and quality is retained (like SAM/BAM format).<br />
<span class="Apple-style-span" style="font-size: large;"><b><br />
</b></span><br />
<span class="Apple-style-span" style="font-size: large;"><b>QSEQ File Formats</b></span><br />
<br />
A sample qseq line is as follows:<br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><b>SOLEXA5</b> 1 <b>4</b> 1 <b>1137</b> 6698 <b>0</b> 1 <span class="Apple-style-span" style="color: red;">TAAATCAAAAGCACAATGAGATATCAATTTTCACCCACTGGAATGGCTATA</span> aa]a]WY]F]aWaZWa]a][a]^]aaaaa_]Y``QaaaUa\aa]]YU`^]P <b> 1</b></span><br />
<br />
According to the Illumina Documentation (<a href="http://bioinfo.cgrb.oregonstate.edu/docs/solexa/Pipeline_CASAVA_User_Guide_15003807_A1.pdf">Pipeline CASAVA</a>), here are the meanings of each field:<br />
<ol><li><b>Machine name</b>: (hopefully) unique identifier of the sequencer.</li>
<li>Run number: (hopefully) unique number to identify the run on the sequencer.</li>
<li><b>Lane number</b>: positive integer (currently 1-8).</li>
<li>Tile number: positive integer.</li>
<li><b>X</b>: x coordinate of the spot. Integer (can be negative).</li>
<li>Y: y coordinate of the spot. Integer (can be negative).</li>
<li><b>Index</b>: positive integer. No indexing should have a value of 1. </li>
<li>Read Number: 1 for single reads; 1 or 2 for paired ends.</li>
<li><span class="Apple-style-span" style="color: red;">Sequence </span></li>
<li>Quality: the calibrated quality string.</li>
<li><b>Filter</b>: Did the read pass filtering? 0 - No, 1 - Yes</li>
</ol><span class="Apple-style-span" style="font-size: large;"><b>EXPORT File Format</b></span><br />
<br />
For export files, there are even more fields. Here is two sample lines of _export.txt files (notice that "|" is inserted between fields for better visualization; they are not included in original fields):<br />
<br />
<span class="Apple-style-span" style="font-size: x-small;">SOLEXA7_25_12N32AACC | |5 |1 |<b>459 </b>|<b>1646 </b>| |1 |TGGGCCNACAACCCCGCACAGTCCCCNCCGCAACCCCCAGCGCTTGCCNC |ENXXXUCXXXEEXXDXEXXXXPCXEXCLEGSNASSNJNKNHKAEHA?N?A |<b>NM </b>| | | | | | | | | | |N</span><br />
<span class="Apple-style-span" style="font-size: x-small;">SOLEXA7_25_12N32AACC | |5 |1 |<b>560 </b>|<b>1611 </b>| |1 |GGCTTGGGAGCTGGTGCTTTCTTTTTTTCTTTTCTTTCTTTTTTTTTTTT |YYYYYYXYYYYYYXYYYYYYYYYYYYYYYYSSSSSOOOOOOOOOOOOOON |<b>chr19 </b>| | <span class="Apple-style-span" style="color: red;">2649567 </span> | R |<b><span class="Apple-style-span" style="color: red;">48C1 </span></b>|46 |0 | | |0 |N |Y</span><br />
<div><br />
</div><br />
The fields are (also from Illumina Documentation):<br />
<br />
<ol><li>Machine (Parsed from Run Folder name)</li>
<li>Run Number (Parsed from Run Folder name)</li>
<li>Lane</li>
<li>Tile</li>
<li><b>X Coordinate</b> of cluster</li>
<li><b>Y Coordinate</b> of cluster</li>
<li>Index string (Blank for a non-indexed run)</li>
<li>Read number (1 or 2 for paired-read analysis, blank for a single-read analysis)</li>
<li>Read</li>
<li>Quality string—In symbolic ASCII format (ASCII character code = quality value + 64)</li>
<li><b>Match chromosome</b>—Name of chromosome match OR code indicating why no match resulted</li>
<li>Match Contig—Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)</li>
<li><span class="Apple-style-span" style="color: red;">Match Position</span>—Always with respect to forward strand, numbering starts at 1 (Blank if no match found)</li>
<li>Match Strand—“F” for forward, “R” for reverse (Blank if no match found)</li>
<li><b><span class="Apple-style-span" style="color: red;">Match Descriptor</span></b>—Concise description of alignment (Blank if no match found)</li>
<ul><li>A numeral denotes a run of matching bases</li>
<li>A letter denotes substitution of a nucleotide: For a 35 base read, “35” denotes an exact match and “32C2” denotes substitution of a “C” at the 33rd position</li>
</ul><li>Single-Read Alignment Score—Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat</li>
<li>Paired-Read Alignment Score—Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat</li>
<li>Partner Chromosome—Name of the chromosome if the read is paired and its partner aligns to another chromosome (Blank for single-read analysis)</li>
<li>Partner Contig—Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs (Blank for single-read analysis)</li>
<li>Partner Offset—If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner (Blank for single-read analysis)</li>
<li>Partner Strand—To which strand did the partner of the paired read align? “F” for forward, “R” for reverse (Blank if no match found, blank for single-read analysis)</li>
<li>Filtering—Did the read pass quality filtering? “Y” for yes, “N” for no</li>
</ol><br />
<br />
<span class="Apple-style-span" style="font-size: large;"><b>Conversion </b></span><br />
<br />
Here I provide some useful tools and scripts for conversion of QSEQ and EXPORT files.<br />
<br />
Samtools provides a script "export2sam.pl" to convert export format to SAM format.<br />
<br />
For qseq to FASTQ conversion, Xi Wang (from <a href="http://seqanswers.com/forums/showthread.php?t=1801&page=2">SeqAnswer.com</a>) provides a very simple Perl script. I modified it a little bit to filter quality control reads (those with the 11th field is 0). If you want to keep these reads (which are not recommended), remove the if condition below.<br />
<br />
<pre class="alt2" dir="ltr" style="background-attachment: initial; background-clip: initial; background-color: #e1e4f2; background-image: initial; background-origin: initial; border-bottom-style: inset; border-bottom-width: 1px; border-color: initial; border-left-style: inset; border-left-width: 1px; border-right-style: inset; border-right-width: 1px; border-top-style: inset; border-top-width: 1px; color: black; height: 226px; overflow-x: auto; overflow-y: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: left; width: 640px;">#!/usr/bin/perl
use warnings;
use strict;
while (<>) {
chomp;
my @parts = split /\t/;
if($parts[10] == 1){ # remove this if you want to keep quality control reads
print "@","$parts[0]:$parts[2]:$parts[3]:$parts[4]:$parts[5]#$parts[6]/$parts[7]\n";
print "$parts[8]\n";
print "+\n";
print "$parts[9]\n";
}
}</pre><div><br />
</div>It works pretty well for export file formats.<br />
<br />
For export to FASTQ convert, the latestst <a href="http://maq.sourceforge.net/">MAQ </a>package (>0.6.6) provides a script fq_all2std.pl to do this:<br />
<br />
<span class="Apple-style-span" style="background-color: #f5f5ff; font-family: verdana, geneva, lucida, 'lucida grande', arial, helvetica, sans-serif; font-size: 13px;"><i> Usage: fq_all2std.pl <in.txt><br />
Command:</in.txt></i></span><br />
<span class="Apple-style-span" style="background-color: #f5f5ff; font-family: verdana, geneva, lucida, 'lucida grande', arial, helvetica, sans-serif; font-size: 13px;"><in.txt><i>scarf2std Convert SCARF format to the standard/Sanger FASTQ<br />
fqint2std Convert FASTQ-int format to the standard/Sanger FASTQ<br />
sol2std Convert Solexa/Illumina FASTQ to the standard FASTQ<br />
fa2std Convert FASTA to the standard FASTQ<br />
seqprb2std Convert .seq and .prb files to the standard FASTQ<br />
fq2fa Convert various FASTQ-like format to FASTA<br />
export2sol Convert Solexa export format to Solexa FASTQ<br />
export2std Convert Solexa export format to Sanger FASTQ<br />
csfa2std Convert AB SOLiD read format to Sanger FASTQ<br />
instruction Explanation to different format<br />
example Show examples of various formats</i></in.txt></span><br />
<br />
Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com3tag:blogger.com,1999:blog-5967565047811298492.post-55120622312321020582011-08-18T16:56:00.000-07:002011-08-18T22:45:19.798-07:00Converting Cufflinks .GTF predictions to .BED filesCufflinks writes its predictions as .GTF files. However, this file type is sometimes too large to process. For example, it may be too large to upload to UCSC gene browser (even you compress it). <a href="http://www.broadinstitute.org/igv/home">IGV</a> browser also takes more memory resources to process .GTF file than processing BED file. So I wrote a small script (in Python 3) to convert GTF formatted files to BED files. This helps to reduce the file size dramatically. For example, this script converts a 120M GTF file to only 9M BED file, reducing the size by more than 90%!<br />
<br />
Of course there are general tools to convert .GTF to .BED. For example, <a href="http://noble.gs.washington.edu/~shobhitg/proj/hs/bin/gtf2bed">here </a>is one Perl script to do this. However, my script is written specifically for Cufflinks .GTF files: it recognizes "gene_id", "transcript_id" and "FPKM" key word in attribute lists. "transcript_id" is converted as the name field in .BED file, and "FPKM" value is rounded as "score" field in .BED file.<br />
<br />
Here is one example of .GTF file predicted by Cufflinks:<br />
<br />
<br />
<span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;"><i>chr1 Cufflinks <span class="Apple-style-span" style="color: red;">transcript</span> 934797 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021"; full_read_support "yes";</i></span><br />
<span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;"><i>chr1 Cufflinks exon 934797 934812 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";</i></span><br />
<span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;"><i>chr1 Cufflinks exon 934906 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";</i></span><br />
<br />
The converted .BED file includes only one line as follows:<br />
<br />
<i><span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;">chr1 934796 935655 <span class="Apple-style-span" style="color: red;">CUFF.29.3</span> <span class="Apple-style-span" style="color: red;"> 24</span> - 934796 935655 0,0,255 2 16,750 0,109</span></i><br />
<br />
The record in .BED file doesnot contain information like "frac", "conf_low", "conf_hi", "cov". But it uses only one line instead of 4 lines. Notice that the transcript_id is kept and the FPKM value (23.5) is rounded to 24 in score field of .BED file.<br />
<br />
Feel free to comment or ask questions.<br />
<br />
<br />
<script src="https://gist.github.com/1155568.js?file=gtf2bed.py">
</script>Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com9tag:blogger.com,1999:blog-5967565047811298492.post-60668128780275371682011-08-18T10:06:00.000-07:002011-08-18T13:36:12.161-07:00Bases in sequence positions<div>There are two different coordinate base systems in different files: 0-base and 1-base. Different files use different base systems, and sometimes it causes confusions (especially when one tries to calculate the length of the region). Here I show the differences in two systems, and summarize several file formats that use both sytems.</div><div>
<br /></div><div><b>0-base system</b>: the first base is 0. You represent a region as [a, b). This is also called "<b>half-close-half-open</b>", "<b>0-base end exclusive</b>", or "<b>1-base end inclusive</b>". When calculating the length of the region, subtract a from b directly:</div><div>
<br /></div><div><i>L=b-a</i></div><div>
<br /></div><div><b>1-base system</b>: the first base is 1, and you represent a region as [a,b]. When calculating the length of the region, don't forget to add 1:</div><div>
<br /></div><div><i>L=b-a+1</i></div><div>
<br /></div><div>Here is an example. Suppose you want to represent a region of X in the following sequence:</div><div>
<br /></div><div><span class="Apple-style-span">
<br /></span>000<span class="Apple-style-span" ><b>XXX</b></span>0000</div><div>
<br /></div><div>
<br /></div><div><span class="Apple-style-span">In the 0-base system, this is represented as [3,6), and the length is 6-3=3. In the 1-base system, use [4, 6], and the length is 6-4+1=3.</span></div>
<br /><div>
<br /></div><div>0-base system files: <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format1">BED</a>, <a href="http://samtools.sourceforge.net/SAM1.pdf">BAM</a></div><div>1-base system files: <a href="http://samtools.sourceforge.net/SAM1.pdf">SAM</a>, <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format3">GFF</a>, <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format4">GTF</a>, <a href="http://genome.ucsc.edu/goldenPath/help/wiggle.html">Wig</a>, <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format2">PSL</a></div>Anonymoushttp://www.blogger.com/profile/15813086589943781474noreply@blogger.com6