All About Bioinformatics: Converting Cufflinks .GTF predictions to .BED files

Thursday, August 18, 2011

Converting Cufflinks .GTF predictions to .BED files

Cufflinks writes its predictions as .GTF files. However, these files are sometimes too large to process. For example, a file may be too large to upload to the UCSC Genome Browser (even if you compress it). The IGV browser also needs more memory to process a .GTF file than a .BED file. So I wrote a small script (in Python 3) to convert GTF-formatted files to BED files. This reduces the file size dramatically. For example, the script converts a 120M GTF file to a BED file of only 9M, reducing the size by more than 90%!

Of course there are general tools to convert .GTF to .BED. For example, here is one Perl script to do this. However, my script is written specifically for Cufflinks .GTF files: it recognizes the "gene_id", "transcript_id" and "FPKM" keywords in the attribute list. The "transcript_id" value becomes the name field in the .BED file, and the "FPKM" value is rounded to an integer and used as the score field.

Here is one example of .GTF file predicted by Cufflinks:

chr1 Cufflinks transcript 934797 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021"; full_read_support "yes";
chr1 Cufflinks exon 934797 934812 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";
chr1 Cufflinks exon 934906 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";

The converted .BED file includes only one line as follows:

chr1 934796 935655 CUFF.29.3 24 - 934796 935655 0,0,255 2 16,750 0,109

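The record in the .BED file does not contain information like "frac", "conf_lo", "conf_hi" or "cov". But it uses only one line instead of the three lines shown above. Notice that the transcript_id is kept and the FPKM value (23.5) is rounded to 24 in the score field of the .BED file. The start coordinate also changes from 934797 to 934796, because .BED starts are 0-based while .GTF coordinates are 1-based.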
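For readers who want to see the idea in code, here is a minimal sketch of this kind of conversion. It is not the original gtf2bed.py: the function name `gtf_to_bed`, the regular expression, the fixed color "0,0,255" and the capping of the score at 1000 (the upper limit of the BED score field) are all assumptions made for illustration. It expects tab-separated GTF lines and collects the exons of each transcript into one 12-column BED line.

```python
import re

# Matches key "value" pairs in the GTF attribute column, in any order.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def gtf_to_bed(lines, rgb="0,0,255"):
    """Yield one BED12 line per transcript found in tab-separated GTF lines."""
    transcripts = {}  # transcript_id -> record (dicts keep insertion order)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue  # skip comment lines and malformed lines
        chrom, feature, strand = fields[0], fields[2], fields[6]
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        rec = transcripts.setdefault(
            tid, {"chrom": chrom, "strand": strand, "fpkm": 0.0, "exons": []}
        )
        if "FPKM" in attrs:
            rec["fpkm"] = float(attrs["FPKM"])
        if feature == "exon":
            # GTF is 1-based and inclusive; BED is 0-based and half-open.
            rec["exons"].append((int(fields[3]) - 1, int(fields[4])))

    for tid, rec in transcripts.items():
        # Sorting by start keeps block offsets non-negative on either strand.
        exons = sorted(rec["exons"])
        if not exons:
            continue
        tx_start = exons[0][0]
        tx_end = max(end for _, end in exons)
        # Assumption: clamp the rounded FPKM to the 0..1000 BED score range.
        score = max(0, min(1000, round(rec["fpkm"])))
        sizes = ",".join(str(end - start) for start, end in exons)
        starts = ",".join(str(start - tx_start) for start, _ in exons)
        yield "\t".join([
            rec["chrom"], str(tx_start), str(tx_end), tid, str(score),
            rec["strand"], str(tx_start), str(tx_end), rgb,
            str(len(exons)), sizes, starts,
        ])


# The two exon lines of the example above, rebuilt with tab separators.
attributes = 'gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622";'
sample = [
    "\t".join(["chr1", "Cufflinks", "exon", "934797", "934812",
               "324", "-", ".", attributes]),
    "\t".join(["chr1", "Cufflinks", "exon", "934906", "935655",
               "324", "-", ".", attributes]),
]
for bed_line in gtf_to_bed(sample):
    print(bed_line)
```

Run on the example exons, this prints the same record as the .BED line shown above, with tabs between the fields.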

Feel free to comment or ask questions.

9 comments:

Gireesh Kumar Bogu, December 9, 2011 at 7:50 AM
Great script. However, I got this error:
gisnb111:datasets bogugk$ python gtf2bed.py cufflinks-TranscriptDeNovo.gtf

File "gtf2bed.py", line 36
print('Warning: no gene_id field ',file=sys.stderr);
^
SyntaxError: invalid syntax
Unknown, December 23, 2016 at 9:03 PM
Hi, Thanks for this great script. However, I don't know how to save the output. When I run the script to convert my gtf file, everything was outputted to my screen rather than saving a file. What should I do? Thanks in advance!
Unknown, December 24, 2016 at 8:29 PM
Use redirection (">"). See the following link: http://linuxcommand.org/lts0060.php
Unknown, January 3, 2017 at 12:44 PM
Hi, thanks for the script.
I am using GTF output from the StringTie tool. When I ran the script, it warned 'Warning: no FPKM field'.

The FPKM value in field[8] is placed differently from Cufflinks,
e.g.:
gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "transcript:Sb01g000200.1"; ref_gene_id "gene:Sb01g000200"; cov "0.064565"; FPKM "0.019950"; TPM "0.026536";
gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; reference_id "transcript:Sb01g000200.1"; ref_gene_id "gene:Sb01g000200"; cov "0.064565";

How do I alter the code to catch the FPKM value? The current regular expression is not working.

Thanks
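One possible approach, sketched here without access to the script's actual regular expression: search for the FPKM key anywhere in the attribute column instead of expecting it at a fixed position, and treat a missing value as normal, since in the StringTie lines quoted above only the transcript line carries FPKM. The helper name `extract_fpkm` is hypothetical and is not part of gtf2bed.py.

```python
import re

# Find FPKM "value" wherever it appears in the attribute column.
FPKM_RE = re.compile(r'\bFPKM\s+"([^"]+)"')


def extract_fpkm(attribute_text):
    """Return the FPKM value as a float, or None if the key is absent."""
    match = FPKM_RE.search(attribute_text)
    return float(match.group(1)) if match else None


transcript_attrs = ('gene_id "STRG.1"; transcript_id "STRG.1.1"; '
                    'cov "0.064565"; FPKM "0.019950"; TPM "0.026536";')
exon_attrs = ('gene_id "STRG.1"; transcript_id "STRG.1.1"; '
              'exon_number "1"; cov "0.064565";')

print(extract_fpkm(transcript_attrs))  # value taken from the transcript line
print(extract_fpkm(exon_attrs))        # None: exon lines have no FPKM here
```

With this approach the value found on the transcript line would have to be remembered and reused for that transcript's exons, rather than warning on every exon line.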
Unknown, May 14, 2018 at 10:05 AM
Hi, I tried the script. It works well for the plus strand ("+"), but for the minus strand ("-") it does not seem to give the right result: the output contains negative numbers.