Wednesday, August 24, 2011

Extract insert size of paired-end reads from Cufflinks output

The new version of Cufflinks tries to estimate the mean and std value of paired-end read insert size automatically. Sometimes we need to extract such information to check our sequencing data. Here I wrote a simple script to search Cufflinks outputs.

The script is written in Python. The implementation is easy: search for key words (like "Estimated Mean", etc) in log files. The usage is as follows:


This script extracts insert size information from Cufflinks logs.
Usage: getinsertsize [cufflinks log file]
Note: you may specify different log files using filename wildcards.



Sample output:

File    MapMass ReadLength      Mean    Std
cufflinksinvoke.sh.e2148442     34108402.45     85      150.35  19.18
cufflinksinvoke.sh.e2148443     31007287.59     85      155.19  17.45
cufflinksinvoke.sh.e2148444     37286038.02     85      123.87  25.60

Source code:



Friday, August 19, 2011

Qseq and export file format of Illumina

Most Illumina NGS data files we face are FASTQ/FASTA formats, which include the read sequence and (possible) quality scores. If reads are mapped to the chromosome or reference sequence, SAM/BAM file formats are common. However, sometimes we see other formats instead of these common file formats, for example .qseq or _export files. They are generated by earlier Illumina machines and it's best to convert them to commonly used FASTQ/SAM formats.

QSEQ file is the raw read file; export file format records the mapping location of the read, but the read sequence and quality is retained (like SAM/BAM format).


QSEQ File Formats

A sample qseq line is as follows:

SOLEXA5 1       4       1       1137    6698    0       1       TAAATCAAAAGCACAATGAGATATCAATTTTCACCCACTGGAATGGCTATA     aa]a]WY]F]aWaZWa]a][a]^]aaaaa_]Y``QaaaUa\aa]]YU`^]P     1

According to the Illumina Documentation (Pipeline CASAVA), here are the meanings of each field:
  1. Machine name: (hopefully) unique identifier of the sequencer.
  2. Run number: (hopefully) unique number to identify the run on the sequencer.
  3. Lane number: positive integer (currently 1-8).
  4. Tile number: positive integer.
  5. X: x coordinate of the spot. Integer (can be negative).
  6. Y: y coordinate of the spot. Integer (can be negative).
  7. Index: positive integer. No indexing should have a value of 1. 
  8. Read Number: 1 for single reads; 1 or 2 for paired ends.
  9. Sequence 
  10. Quality: the calibrated quality string.
  11. Filter: Did the read pass filtering? 0 - No, 1 - Yes
EXPORT File Format

For export files, there are even more fields. Here is two sample lines of _export.txt files (notice that "|" is inserted between fields for better visualization; they are not included in original fields):

SOLEXA7_25_12N32AACC    |       |5      |1      |459    |1646   |       |1  |TGGGCCNACAACCCCGCACAGTCCCCNCCGCAACCCCCAGCGCTTGCCNC     |ENXXXUCXXXEEXXDXEXXXXPCXEXCLEGSNASSNJNKNHKAEHA?N?A     |NM     |       |       |       |       |       |       |       |       |       |       |N
SOLEXA7_25_12N32AACC    |       |5      |1      |560    |1611   |       |1      |GGCTTGGGAGCTGGTGCTTTCTTTTTTTCTTTTCTTTCTTTTTTTTTTTT     |YYYYYYXYYYYYYXYYYYYYYYYYYYYYYYSSSSSOOOOOOOOOOOOOON     |chr19  |       | 2649567        | R      |48C1   |46     |0      |       |       |0      |N      |Y


The fields are (also from Illumina Documentation):

  1. Machine (Parsed from Run Folder name)
  2. Run Number (Parsed from Run Folder name)
  3. Lane
  4. Tile
  5. X Coordinate of cluster
  6. Y Coordinate of cluster
  7. Index string (Blank for a non-indexed run)
  8. Read number (1 or 2 for paired-read analysis, blank for a single-read analysis)
  9. Read
  10. Quality string—In symbolic ASCII format (ASCII character code = quality value + 64)
  11. Match chromosome—Name of chromosome match OR code indicating why no match resulted
  12. Match Contig—Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)
  13. Match Position—Always with respect to forward strand, numbering starts at 1 (Blank if no match found)
  14. Match Strand—“F” for forward, “R” for reverse (Blank if no match found)
  15. Match Descriptor—Concise description of alignment (Blank if no match found)
    • A numeral denotes a run of matching bases
    • A letter denotes substitution of a nucleotide: For a 35 base read, “35” denotes an exact match and “32C2” denotes substitution of a “C” at the 33rd position
  16. Single-Read Alignment Score—Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
  17. Paired-Read Alignment Score—Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
  18. Partner Chromosome—Name of the chromosome if the read is paired and its partner aligns to another chromosome (Blank for single-read analysis)
  19. Partner Contig—Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs (Blank for single-read analysis)
  20. Partner Offset—If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner (Blank for single-read analysis)
  21. Partner Strand—To which strand did the partner of the paired read align? “F” for forward, “R” for reverse (Blank if no match found, blank for single-read analysis)
  22. Filtering—Did the read pass quality filtering? “Y” for yes, “N” for no


Conversion 

Here I provide some useful tools and scripts for conversion of QSEQ and EXPORT files.

Samtools provides a script "export2sam.pl" to convert export format to SAM format.

For qseq to FASTQ conversion, Xi Wang (from SeqAnswer.com) provides a very simple Perl script. I modified it a little bit to filter quality control reads (those with the 11th field is 0). If you want to keep these reads (which are not recommended), remove the if condition below.

#!/usr/bin/perl

use warnings;
use strict;

while (<>) {
	chomp;
	my @parts = split /\t/;
	if($parts[10] == 1){ # remove this if you want to keep quality control reads
		print "@","$parts[0]:$parts[2]:$parts[3]:$parts[4]:$parts[5]#$parts[6]/$parts[7]\n";
		print "$parts[8]\n";
		print "+\n";
		print "$parts[9]\n";
	}
}

It works pretty well for export file formats.

For export to FASTQ convert, the latestst MAQ package (>0.6.6) provides a script fq_all2std.pl to do this:

Usage: fq_all2std.pl
Command:

scarf2std Convert SCARF format to the standard/Sanger FASTQ
fqint2std Convert FASTQ-int format to the standard/Sanger FASTQ
sol2std Convert Solexa/Illumina FASTQ to the standard FASTQ
fa2std Convert FASTA to the standard FASTQ
seqprb2std Convert .seq and .prb files to the standard FASTQ
fq2fa Convert various FASTQ-like format to FASTA
export2sol Convert Solexa export format to Solexa FASTQ
export2std Convert Solexa export format to Sanger FASTQ
csfa2std Convert AB SOLiD read format to Sanger FASTQ
instruction Explanation to different format
example Show examples of various formats


Thursday, August 18, 2011

Converting Cufflinks .GTF predictions to .BED files

Cufflinks writes its predictions as .GTF files. However, this file type is sometimes too large to process. For example, it may be too large to upload to UCSC gene browser (even you compress it). IGV browser also takes more memory resources to process .GTF file than processing BED file. So I wrote a small script (in Python 3) to convert GTF formatted files to BED files. This helps to reduce the file size dramatically. For example, this script converts a 120M GTF file to only 9M BED file, reducing the size by more than 90%!

Of course there are general tools to convert .GTF to .BED. For example, here is one Perl script to do this. However, my script is written specifically for Cufflinks .GTF files: it recognizes "gene_id", "transcript_id" and "FPKM" key word in attribute lists. "transcript_id" is converted as the name field in .BED file, and "FPKM" value is rounded as "score" field in .BED file.

Here is one example of .GTF file predicted by Cufflinks:


chr1    Cufflinks       transcript      934797  935655  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021"; full_read_support "yes";
chr1    Cufflinks       exon    934797  934812  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";
chr1    Cufflinks       exon    934906  935655  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";

The converted .BED file includes only one line as follows:

chr1    934796  935655  CUFF.29.3       24      -       934796  935655  0,0,255 2       16,750  0,109

The record in .BED file doesnot contain information like "frac", "conf_low", "conf_hi", "cov". But it uses only one line instead of 4 lines. Notice that the transcript_id is kept and the FPKM value (23.5) is rounded to 24 in score field of .BED file.

Feel free to comment or ask questions.


Bases in sequence positions

There are two different coordinate base systems in different files: 0-base and 1-base. Different files use different base systems, and sometimes it causes confusions (especially when one tries to calculate the length of the region). Here I show the differences in two systems, and summarize several file formats that use both sytems.

0-base system: the first base is 0. You represent a region as [a, b). This is also called "half-close-half-open", "0-base end exclusive", or "1-base end inclusive". When calculating the length of the region, subtract a from b directly:

L=b-a

1-base system: the first base is 1, and you represent a region as [a,b]. When calculating the length of the region, don't forget to add 1:

L=b-a+1

Here is an example. Suppose you want to represent a region of X in the following sequence:


000XXX0000


In the 0-base system, this is represented as [3,6), and the length is 6-3=3. In the 1-base system, use [4, 6], and the length is 6-4+1=3.


0-base system files: BED, BAM
1-base system files: SAM, GFF, GTF, Wig, PSL