Friday, August 19, 2011

Qseq and export file format of Illumina

Most Illumina NGS data files we deal with are in FASTQ or FASTA format, which hold the read sequences and (for FASTQ) the quality scores. If the reads have been mapped to a chromosome or reference sequence, the SAM/BAM formats are common. Sometimes, however, we run into other formats, for example .qseq or _export files. These are generated by earlier Illumina machines, and it is best to convert them to the commonly used FASTQ/SAM formats.

A QSEQ file holds the raw reads; an export file records the mapping location of each read while retaining the read sequence and quality (like the SAM/BAM formats).


QSEQ File Format

A sample qseq line is as follows:

SOLEXA5 1       4       1       1137    6698    0       1       TAAATCAAAAGCACAATGAGATATCAATTTTCACCCACTGGAATGGCTATA     aa]a]WY]F]aWaZWa]a][a]^]aaaaa_]Y``QaaaUa\aa]]YU`^]P     1

According to the Illumina documentation (Pipeline CASAVA), the fields have the following meanings:
  1. Machine name: (hopefully) unique identifier of the sequencer.
  2. Run number: (hopefully) unique number to identify the run on the sequencer.
  3. Lane number: positive integer (currently 1-8).
  4. Tile number: positive integer.
  5. X: x coordinate of the spot. Integer (can be negative).
  6. Y: y coordinate of the spot. Integer (can be negative).
  7. Index: positive integer. No indexing should have a value of 1. 
  8. Read Number: 1 for single reads; 1 or 2 for paired ends.
  9. Sequence 
  10. Quality: the calibrated quality string.
  11. Filter: Did the read pass filtering? 0 - No, 1 - Yes
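
The field list above maps directly onto a small parser. The following is a minimal Python sketch, not part of any Illumina tool: the names QseqRecord, parse_qseq_line and fastq_header are made up for illustration, and the sketch assumes every line has exactly eleven tab-separated fields.

```python
from collections import namedtuple

# Names mirror the eleven fields listed above; the Python identifiers themselves
# are illustrative and do not come from the Illumina documentation.
QseqRecord = namedtuple(
    "QseqRecord",
    "machine run lane tile x y index_id read_number sequence quality passed_filter",
)


def parse_qseq_line(line):
    """Split one tab-separated qseq line into a QseqRecord."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 11:
        raise ValueError("expected 11 tab-separated fields, got %d" % len(parts))
    return QseqRecord(
        machine=parts[0],
        run=parts[1],
        lane=int(parts[2]),
        tile=int(parts[3]),
        x=int(parts[4]),          # coordinates may be negative
        y=int(parts[5]),
        index_id=parts[6],
        read_number=int(parts[7]),
        sequence=parts[8],
        quality=parts[9],
        passed_filter=(parts[10] == "1"),  # field 11: 1 = passed, 0 = failed
    )


def fastq_header(record):
    """Build a read name of the form @machine:lane:tile:x:y#index/read_number."""
    return "@%s:%d:%d:%d:%d#%s/%d" % (
        record.machine, record.lane, record.tile,
        record.x, record.y, record.index_id, record.read_number,
    )
```

Calling parse_qseq_line on each line of a qseq file and keeping only the records where passed_filter is true gives the reads that passed filtering.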


EXPORT File Format

Export files have even more fields. Here are two sample lines from an _export.txt file (note that "|" has been inserted between fields for readability; it is not part of the original file):

SOLEXA7_25_12N32AACC    |       |5      |1      |459    |1646   |       |1  |TGGGCCNACAACCCCGCACAGTCCCCNCCGCAACCCCCAGCGCTTGCCNC     |ENXXXUCXXXEEXXDXEXXXXPCXEXCLEGSNASSNJNKNHKAEHA?N?A     |NM     |       |       |       |       |       |       |       |       |       |       |N
SOLEXA7_25_12N32AACC    |       |5      |1      |560    |1611   |       |1      |GGCTTGGGAGCTGGTGCTTTCTTTTTTTCTTTTCTTTCTTTTTTTTTTTT     |YYYYYYXYYYYYYXYYYYYYYYYYYYYYYYSSSSSOOOOOOOOOOOOOON     |chr19  |       | 2649567        | R      |48C1   |46     |0      |       |       |0      |N      |Y


The fields are (also from Illumina Documentation):

  1. Machine (Parsed from Run Folder name)
  2. Run Number (Parsed from Run Folder name)
  3. Lane
  4. Tile
  5. X Coordinate of cluster
  6. Y Coordinate of cluster
  7. Index string (Blank for a non-indexed run)
  8. Read number (1 or 2 for paired-read analysis, blank for a single-read analysis)
  9. Read
  10. Quality string—In symbolic ASCII format (ASCII character code = quality value + 64)
  11. Match chromosome—Name of chromosome match OR code indicating why no match resulted
  12. Match Contig—Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)
  13. Match Position—Always with respect to forward strand, numbering starts at 1 (Blank if no match found)
  14. Match Strand—“F” for forward, “R” for reverse (Blank if no match found)
  15. Match Descriptor—Concise description of alignment (Blank if no match found)
    • A numeral denotes a run of matching bases
    • A letter denotes substitution of a nucleotide: For a 35 base read, “35” denotes an exact match and “32C2” denotes substitution of a “C” at the 33rd position
  16. Single-Read Alignment Score—Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
  17. Paired-Read Alignment Score—Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
  18. Partner Chromosome—Name of the chromosome if the read is paired and its partner aligns to another chromosome (Blank for single-read analysis)
  19. Partner Contig—Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs (Blank for single-read analysis)
  20. Partner Offset—If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner (Blank for single-read analysis)
  21. Partner Strand—To which strand did the partner of the paired read align? “F” for forward, “R” for reverse (Blank if no match found, blank for single-read analysis)
  22. Filtering—Did the read pass quality filtering? “Y” for yes, “N” for no
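
As a rough illustration of the list above, here is a minimal Python sketch that names the 22 columns and decodes the match descriptor. The identifiers (EXPORT_FIELDS, parse_export_line, is_aligned, parse_match_descriptor) are hypothetical, and the descriptor parser only handles the two cases described above, numerals and single letters; anything else is rejected.

```python
import re

# One illustrative name per column, in the order of the field list above.
EXPORT_FIELDS = (
    "machine", "run", "lane", "tile", "x", "y", "index", "read_number",
    "read", "quality", "match_chromosome", "match_contig", "match_position",
    "match_strand", "match_descriptor", "single_read_score",
    "paired_read_score", "partner_chromosome", "partner_contig",
    "partner_offset", "partner_strand", "filtering",
)


def parse_export_line(line):
    """Return a dict mapping field name to the raw string value."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(EXPORT_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(EXPORT_FIELDS), len(values)))
    return dict(zip(EXPORT_FIELDS, values))


def is_aligned(record):
    """Match Position is blank when no match was found."""
    return record["match_position"] != ""


def parse_match_descriptor(descriptor):
    """Return (number of read bases covered, [(1-based position, letter), ...])."""
    position = 0
    substitutions = []
    for token in re.findall(r"[0-9]+|[A-Za-z]|.", descriptor):
        if token.isdigit():
            position += int(token)      # a run of matching bases
        elif token.isalpha():
            position += 1               # a single substituted base
            substitutions.append((position, token))
        else:
            raise ValueError("unsupported descriptor character: %r" % token)
    return position, substitutions
```

For the second sample line, the descriptor "48C1" covers 50 bases with one substitution at position 49, which agrees with the 50-base read.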


Conversion 

Here I provide some useful tools and scripts for conversion of QSEQ and EXPORT files.

Samtools provides a script "export2sam.pl" to convert export format to SAM format.

For qseq to FASTQ conversion, Xi Wang (from SEQanswers.com) provides a very simple Perl script. I modified it slightly to filter out reads that fail quality control (those whose 11th field is 0). If you want to keep these reads (which is not recommended), remove the if condition below.

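#!/usr/bin/perl
# Convert qseq lines (from files or STDIN) to FASTQ, keeping only reads that passed filtering.

use warnings;
use strict;

while (<>) {
	chomp;
	my @parts = split /\t/;
	next unless @parts >= 11; # skip blank or truncated lines
	if($parts[10] == 1){ # 11th field is the filter flag; drop this condition to keep reads that failed filtering
		$parts[8] =~ tr/./N/; # qseq marks uncalled bases with "."; FASTQ tools expect "N"
		# read name: @machine:lane:tile:x:y#index/read_number
		print "@","$parts[0]:$parts[2]:$parts[3]:$parts[4]:$parts[5]#$parts[6]/$parts[7]\n";
		print "$parts[8]\n";
		print "+\n";
		print "$parts[9]\n"; # quality string is copied unchanged
	}
}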

It works pretty well for qseq files.
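
The Perl script copies the quality string as it is. If a downstream tool expects Sanger-style qualities (offset 33), the string has to be re-encoded. The sketch below is only an illustration under two assumptions: that qseq qualities use the same "quality value + 64" encoding stated for export files, and that the values are non-negative Phred scores. The function name illumina64_to_sanger is made up.

```python
def illumina64_to_sanger(quality):
    """Re-encode a quality string from ASCII offset 64 to ASCII offset 33."""
    converted = []
    for char in quality:
        score = ord(char) - 64          # assumed: character code = quality + 64
        if score < 0:
            # Characters below '@' would mean a negative score, which this
            # sketch does not try to interpret.
            raise ValueError("character %r is below the offset-64 range" % char)
        converted.append(chr(score + 33))
    return "".join(converted)
```

The conversion is a fixed shift of 31 character codes, so the read length is unchanged and the sequence line needs no adjustment.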

For export to FASTQ conversion, the latest MAQ package (>0.6.6) provides a script, fq_all2std.pl, to do this:

Usage: fq_all2std.pl
Command:

  scarf2std     Convert SCARF format to the standard/Sanger FASTQ
  fqint2std     Convert FASTQ-int format to the standard/Sanger FASTQ
  sol2std       Convert Solexa/Illumina FASTQ to the standard FASTQ
  fa2std        Convert FASTA to the standard FASTQ
  seqprb2std    Convert .seq and .prb files to the standard FASTQ
  fq2fa         Convert various FASTQ-like format to FASTA
  export2sol    Convert Solexa export format to Solexa FASTQ
  export2std    Convert Solexa export format to Sanger FASTQ
  csfa2std      Convert AB SOLiD read format to Sanger FASTQ
  instruction   Explanation to different format
  example       Show examples of various formats

