Cufflinks writes its predictions as .GTF files. However, these files are sometimes too large to process. For example, a file may be too large to upload to the UCSC Genome Browser (even if you compress it). The IGV browser also needs more memory to process a .GTF file than a BED file. So I wrote a small script (in Python 3) to convert GTF files to BED files. This reduces the file size dramatically. For example, the script converts a 120M GTF file to a BED file of only 9M, reducing the size by more than 90%!
Of course, there are general tools to convert .GTF to .BED. For example, here is one Perl script to do this. However, my script is written specifically for Cufflinks .GTF files: it recognizes the "gene_id", "transcript_id" and "FPKM" keywords in the attribute list. The "transcript_id" value becomes the name field in the .BED file, and the "FPKM" value is rounded to give the score field.
Here is an example of a .GTF file predicted by Cufflinks:
chr1 Cufflinks transcript 934797 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021"; full_read_support "yes";
chr1 Cufflinks exon 934797 934812 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";
chr1 Cufflinks exon 934906 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";
The converted .BED file contains only one line:
chr1 934796 935655 CUFF.29.3 24 - 934796 935655 0,0,255 2 16,750 0,109
The record in the .BED file does not contain information such as "frac", "conf_lo", "conf_hi" or "cov", but it uses only one line instead of the three lines above. Notice that the transcript_id is kept and the FPKM value (23.5) is rounded to 24 in the score field of the .BED file.
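To make the mapping concrete, here is a minimal sketch of this kind of conversion. It is not the script itself: the function names (get_attr, gtf_to_bed12), the fixed colour 0,0,255 and the cap of the score at 1000 (the BED score range is 0-1000) are assumptions made for illustration. It groups exon lines by transcript_id, shifts the 1-based GTF start to a 0-based BED start, and sorts exons by start so that block offsets are computed relative to the leftmost exon on either strand.

```python
import re

def get_attr(attributes, key):
    """Return the quoted value of `key` in a GTF attribute string, or None."""
    match = re.search(r'\b' + re.escape(key) + r' "([^"]*)"', attributes)
    return match.group(1) if match else None

def gtf_to_bed12(gtf_lines, rgb="0,0,255"):
    """Convert tab-separated GTF lines to BED12 lines, one per transcript."""
    exons = {}  # transcript_id -> list of (start0, end) in BED coordinates
    info = {}   # transcript_id -> (chrom, strand, fpkm)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        chrom, _, feature, start, end, _, strand, _, attributes = fields[:9]
        tid = get_attr(attributes, "transcript_id")
        if tid is None:
            continue
        if feature == "transcript":
            fpkm = get_attr(attributes, "FPKM")
            info[tid] = (chrom, strand, float(fpkm) if fpkm is not None else 0.0)
        elif feature == "exon":
            # GTF is 1-based inclusive; BED is 0-based half-open.
            exons.setdefault(tid, []).append((int(start) - 1, int(end)))

    bed = []
    for tid, (chrom, strand, fpkm) in info.items():
        blocks = sorted(exons.get(tid, []))  # ascending start, any strand
        if not blocks:
            continue
        tx_start = blocks[0][0]
        tx_end = max(e for _, e in blocks)
        sizes = ",".join(str(e - s) for s, e in blocks)
        starts = ",".join(str(s - tx_start) for s, _ in blocks)
        score = min(1000, round(fpkm))  # assumption: cap at the BED maximum
        row = [chrom, tx_start, tx_end, tid, score, strand,
               tx_start, tx_end, rgb, len(blocks), sizes, starts]
        bed.append("\t".join(str(x) for x in row))
    return bed

# The Cufflinks example from this post, with the attribute list shortened.
ATTRS = 'gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622";'
EXAMPLE = [
    "\t".join(["chr1", "Cufflinks", "transcript", "934797", "935655", "324", "-", ".", ATTRS]),
    "\t".join(["chr1", "Cufflinks", "exon", "934797", "934812", "324", "-", ".", ATTRS]),
    "\t".join(["chr1", "Cufflinks", "exon", "934906", "935655", "324", "-", ".", ATTRS]),
]

for bed_line in gtf_to_bed12(EXAMPLE):
    print(bed_line)
```

Run on the three example lines, the sketch gives the same twelve fields as the BED line shown above (start 934796, two blocks of sizes 16 and 750 at offsets 0 and 109).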
Feel free to comment or ask questions.
Great script. However, I got this error:
gisnb111:datasets bogugk$ python gtf2bed.py cufflinks-TranscriptDeNovo.gtf
File "gtf2bed.py", line 36
print('Warning: no gene_id field ',file=sys.stderr);
^
SyntaxError: invalid syntax
This is due to the syntax differences between Python 2.7 and Python 3. I wrote this script for Python 3, so it will fail if you run it with Python 2.7.
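The simplest fix is to run the script with a Python 3 interpreter. If the script has to run under both versions, one option is to avoid the print(..., file=...) form for warnings altogether. The sketch below is only an illustration; warn is a hypothetical helper, not a function from the script.

```python
import sys

def warn(message, stream=None):
    """Write a warning line to stderr; valid syntax on Python 2.7 and 3."""
    if stream is None:
        stream = sys.stderr
    # stream.write() is an ordinary method call, so it parses the same way
    # in both versions, unlike print(..., file=sys.stderr) on Python 2.7.
    stream.write("Warning: " + message + "\n")
```

Calling warn("no gene_id field") would then replace the print call on line 36 of the traceback.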
Hi, thanks for this great script. However, I don't know how to save the output. When I ran the script to convert my GTF file, everything was printed to my screen rather than saved to a file. What should I do? Thanks in advance!
Use redirection (">"). See the following link: http://linuxcommand.org/lts0060.php
Hi, thanks for the script.
I am using GTF output from the StringTie tool. When I ran the script, it warned: 'Warning: no FPKM field'.
The FPKM value in field[8] is placed differently than in Cufflinks output,
e.g.
gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "transcript:Sb01g000200.1"; ref_gene_id "gene:Sb01g000200"; cov "0.064565"; FPKM "0.019950"; TPM "0.026536";
gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; reference_id "transcript:Sb01g000200.1"; ref_gene_id "gene:Sb01g000200"; cov "0.064565";
How do I alter the code to catch the FPKM value? The current regular expression is not working.
Thanks
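In the two StringTie lines quoted above, FPKM appears after "cov" on the transcript line and is absent from the exon line, so a pattern that expects the Cufflinks attribute order, or expects FPKM on every line, would not match. One way around this, sketched below under that assumption, is to parse the whole attribute list into a dictionary and look keys up by name. The names parse_attributes and fpkm_or_default are hypothetical and do not come from the script.

```python
import re

# One key followed by one double-quoted value and a semicolon.
ATTR_RE = re.compile(r'(\w+) "([^"]*)";')

def parse_attributes(attribute_field):
    """Turn a GTF attribute string into a dict, regardless of key order."""
    return dict(ATTR_RE.findall(attribute_field))

def fpkm_or_default(attribute_field, default=0.0):
    """Return FPKM as a float, or `default` when the key is missing."""
    value = parse_attributes(attribute_field).get("FPKM")
    return float(value) if value is not None else default

transcript_attrs = ('gene_id "STRG.1"; transcript_id "STRG.1.1"; '
                    'reference_id "transcript:Sb01g000200.1"; '
                    'ref_gene_id "gene:Sb01g000200"; cov "0.064565"; '
                    'FPKM "0.019950"; TPM "0.026536";')
exon_attrs = ('gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; '
              'reference_id "transcript:Sb01g000200.1"; '
              'ref_gene_id "gene:Sb01g000200"; cov "0.064565";')

print(fpkm_or_default(transcript_attrs))
print(fpkm_or_default(exon_attrs))
```

With this approach the FPKM would be taken from the transcript line only, and exon lines without it fall back to the default instead of triggering a warning.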
Hi, I tried the script. It works well for the plus strand ("+"); however, for the minus strand ("-") it does not seem to give the right result: the output contains negative numbers.