Most Illumina NGS data files we encounter are in FASTQ/FASTA format, which contains the read sequence and (possibly) quality scores. If the reads have been mapped to a chromosome or reference sequence, SAM/BAM formats are common. However, we sometimes see other formats instead, for example .qseq or _export files. They are generated by earlier Illumina machines, and it is best to convert them to the commonly used FASTQ/SAM formats.
A QSEQ file is the raw read file; an export file records the mapping location of each read but also retains the read sequence and quality (like the SAM/BAM format).
QSEQ File Format
A sample qseq line is as follows:
SOLEXA5 1 4 1 1137 6698 0 1 TAAATCAAAAGCACAATGAGATATCAATTTTCACCCACTGGAATGGCTATA aa]a]WY]F]aWaZWa]a][a]^]aaaaa_]Y``QaaaUa\aa]]YU`^]P 1
According to the Illumina documentation (Pipeline CASAVA), here are the meanings of each field:
- Machine name: (hopefully) unique identifier of the sequencer.
- Run number: (hopefully) unique number to identify the run on the sequencer.
- Lane number: positive integer (currently 1-8).
- Tile number: positive integer.
- X: x coordinate of the spot. Integer (can be negative).
- Y: y coordinate of the spot. Integer (can be negative).
- Index: positive integer. A run without indexing should have a value of 1 (although the sample line above shows 0 in this field).
- Read Number: 1 for single reads; 1 or 2 for paired ends.
- Sequence
- Quality: the calibrated quality string.
- Filter: Did the read pass filtering? 0 - No, 1 - Yes
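The field list above maps directly onto a small parser. Here is a minimal Python sketch, assuming the fields are tab-separated (as the Perl script later in this post also assumes); the names QseqRecord and parse_qseq_line are made up for illustration and are not part of any Illumina tool.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    """One qseq line, with the eleven fields in documented order."""
    machine: str
    run: int
    lane: int
    tile: int
    x: int
    y: int
    index: str          # kept as text; its exact meaning varies between runs
    read_number: int
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one tab-separated qseq line into named fields."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    return QseqRecord(
        machine=parts[0],
        run=int(parts[1]),
        lane=int(parts[2]),
        tile=int(parts[3]),
        x=int(parts[4]),      # coordinates may be negative
        y=int(parts[5]),
        index=parts[6],
        read_number=int(parts[7]),
        sequence=parts[8],
        quality=parts[9],
        passed_filter=(parts[10] == "1"),
    )


# The sample line from above, rebuilt with tabs between the fields.
sample = "\t".join([
    "SOLEXA5", "1", "4", "1", "1137", "6698", "0", "1",
    "TAAATCAAAAGCACAATGAGATATCAATTTTCACCCACTGGAATGGCTATA",
    r"aa]a]WY]F]aWaZWa]a][a]^]aaaaa_]Y``QaaaUa\aa]]YU`^]P",
    "1",
])
record = parse_qseq_line(sample)
print(record.lane, record.x, record.y, record.passed_filter)
```

Raising an error on an unexpected field count is a deliberate choice in this sketch: a space-separated or truncated line then fails loudly instead of producing shifted fields.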
EXPORT File Format
For export files, there are even more fields. Here are two sample lines from an _export.txt file (note that "|" has been inserted between fields for readability; it is not part of the original file):
SOLEXA7_25_12N32AACC | |5 |1 |459 |1646 | |1 |TGGGCCNACAACCCCGCACAGTCCCCNCCGCAACCCCCAGCGCTTGCCNC |ENXXXUCXXXEEXXDXEXXXXPCXEXCLEGSNASSNJNKNHKAEHA?N?A |NM | | | | | | | | | | |N
SOLEXA7_25_12N32AACC | |5 |1 |560 |1611 | |1 |GGCTTGGGAGCTGGTGCTTTCTTTTTTTCTTTTCTTTCTTTTTTTTTTTT |YYYYYYXYYYYYYXYYYYYYYYYYYYYYYYSSSSSOOOOOOOOOOOOOON |chr19 | | 2649567 | R |48C1 |46 |0 | | |0 |N |Y
The fields are (also from Illumina Documentation):
- Machine (Parsed from Run Folder name)
- Run Number (Parsed from Run Folder name)
- Lane
- Tile
- X Coordinate of cluster
- Y Coordinate of cluster
- Index string (Blank for a non-indexed run)
- Read number (1 or 2 for paired-read analysis, blank for a single-read analysis)
- Read
- Quality string—In symbolic ASCII format (ASCII character code = quality value + 64)
- Match chromosome—Name of chromosome match OR code indicating why no match resulted
- Match Contig—Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)
- Match Position—Always with respect to forward strand, numbering starts at 1 (Blank if no match found)
- Match Strand—“F” for forward, “R” for reverse (Blank if no match found)
- Match Descriptor—Concise description of alignment (Blank if no match found)
  - A numeral denotes a run of matching bases
  - A letter denotes substitution of a nucleotide: For a 35 base read, “35” denotes an exact match and “32C2” denotes substitution of a “C” at the 33rd position
- Single-Read Alignment Score—Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
- Paired-Read Alignment Score—Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
- Partner Chromosome—Name of the chromosome if the read is paired and its partner aligns to another chromosome (Blank for single-read analysis)
- Partner Contig—Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs (Blank for single-read analysis)
- Partner Offset—If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner (Blank for single-read analysis)
- Partner Strand—To which strand did the partner of the paired read align? “F” for forward, “R” for reverse (Blank if no match found, blank for single-read analysis)
- Filtering—Did the read pass quality filtering? “Y” for yes, “N” for no
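Two of the fields above need a little decoding before they are useful: the quality string (ASCII code = quality value + 64) and the match descriptor. The following Python sketch shows one way to read them. It covers only the two descriptor tokens described above (numerals and letters) and rejects anything else; the function names are hypothetical.

```python
import re
from typing import List, Tuple


def decode_quality(quality: str, offset: int = 64) -> List[int]:
    """Turn a symbolic quality string into numeric quality values."""
    return [ord(ch) - offset for ch in quality]


def parse_match_descriptor(descriptor: str) -> List[Tuple[int, str]]:
    """Return (1-based read position, letter) for every substitution.

    Only numerals (runs of matching bases) and single letters
    (substitutions) are understood; other symbols raise ValueError.
    """
    if not re.fullmatch(r"(?:\d+|[A-Za-z])*", descriptor):
        raise ValueError(f"unsupported match descriptor: {descriptor!r}")
    substitutions = []
    position = 0
    for run, base in re.findall(r"(\d+)|([A-Za-z])", descriptor):
        if run:
            position += int(run)      # skip over a run of matching bases
        else:
            position += 1             # a letter occupies one read position
            substitutions.append((position, base))
    return substitutions


print(parse_match_descriptor("35"))    # [] (exact match)
print(parse_match_descriptor("32C2"))  # [(33, 'C')]
print(parse_match_descriptor("48C1"))  # [(49, 'C')], from the second sample line
print(decode_quality("YXN"))           # [25, 24, 14]
```

Note that decoded values can be negative: the "?" characters in the first sample line decode to -1 with an offset of 64.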
Conversion
Here I provide some useful tools and scripts for conversion of QSEQ and EXPORT files.
Samtools provides a script "export2sam.pl" to convert export format to SAM format.
For qseq to FASTQ conversion, Xi Wang (from SeqAnswer.com) provides a very simple Perl script. I modified it slightly to filter out reads that failed quality control (those whose 11th field is 0). If you want to keep these reads (which is not recommended), remove the if condition below.
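#!/usr/bin/perl
use warnings;
use strict;

# Read qseq lines from STDIN or from the files named on the command line.
while (<>) {
    chomp;
    my @parts = split /\t/;
    # Field 11 is the filter flag; remove this if you want to keep quality control reads.
    if ($parts[10] == 1) {
        # Header: @machine:lane:tile:x:y#index/read_number
        print "@", "$parts[0]:$parts[2]:$parts[3]:$parts[4]:$parts[5]#$parts[6]/$parts[7]\n";
        print "$parts[8]\n";    # read sequence
        print "+\n";
        print "$parts[9]\n";    # quality string
    }
}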
It works pretty well for qseq files.
For export to FASTQ conversion, the latest MAQ package (>0.6.6) provides a script, fq_all2std.pl, to do this:
Usage: fq_all2std.pl
Command:
  scarf2std    Convert SCARF format to the standard/Sanger FASTQ
  fqint2std    Convert FASTQ-int format to the standard/Sanger FASTQ
  sol2std      Convert Solexa/Illumina FASTQ to the standard FASTQ
  fa2std       Convert FASTA to the standard FASTQ
  seqprb2std   Convert .seq and .prb files to the standard FASTQ
  fq2fa        Convert various FASTQ-like format to FASTA
  export2sol   Convert Solexa export format to Solexa FASTQ
  export2std   Convert Solexa export format to Sanger FASTQ
  csfa2std     Convert AB SOLiD read format to Sanger FASTQ
  instruction  Explanation to different format
  example      Show examples of various formats
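To give a rough idea of what an export to Sanger FASTQ conversion involves, here is a short Python sketch. It is not the MAQ implementation: it assumes tab-separated fields in the order listed earlier, simply shifts the quality offset from 64 to 33, and clamps negative quality values to 0 rather than rescaling them properly. The function name and the read-name layout (borrowed from the Perl script above, with placeholder values for blank fields) are my own choices.

```python
def export_line_to_fastq(line: str) -> str:
    """Build a four-line Sanger-style FASTQ record from one export line."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 22:
        raise ValueError(f"expected 22 fields, got {len(parts)}")
    machine, _run, lane, tile, x, y, index, read_number = parts[:8]
    sequence, quality = parts[8], parts[9]

    # Export quality characters are (quality value + 64); Sanger FASTQ uses
    # (quality value + 33). Values below 0 are clamped to 0, which is a
    # simplification and not a proper Solexa-to-Phred rescaling.
    sanger = "".join(chr(max(ord(ch) - 64, 0) + 33) for ch in quality)

    # Hypothetical read name; a blank index or read number gets a placeholder.
    name = f"{machine}:{lane}:{tile}:{x}:{y}#{index or '0'}/{read_number or '1'}"
    return f"@{name}\n{sequence}\n+\n{sanger}\n"


# A shortened, made-up export line with 22 tab-separated fields.
example = "\t".join([
    "SOLEXA7", "", "5", "1", "560", "1611", "", "1",
    "ACGT", "YY?@",
    "chr19", "", "2649567", "R", "4", "46", "0", "", "", "0", "N", "Y",
])
print(export_line_to_fastq(example), end="")
```

The last field (parts[21]) is the filtering flag, so a caller that wants to drop failed reads can check it for "Y" before converting a line.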