Thursday, August 18, 2011

Converting Cufflinks .GTF predictions to .BED files

Cufflinks writes its predictions as .GTF files. However, these files are sometimes too large to process. For example, a file may be too large to upload to the UCSC Genome Browser (even if you compress it). The IGV browser also needs more memory to process a .GTF file than a .BED file. So I wrote a small script (in Python 3) to convert GTF files to BED files, which reduces the file size dramatically. For example, the script converts a 120M GTF file into a BED file of only 9M, a reduction of more than 90%!

Of course there are general tools to convert .GTF to .BED. For example, here is one Perl script to do this. However, my script is written specifically for Cufflinks .GTF files: it recognizes the "gene_id", "transcript_id" and "FPKM" keywords in the attribute list. The "transcript_id" value becomes the name field in the .BED file, and the "FPKM" value is rounded to give the score field.
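To illustrate the attribute handling described above, here is a minimal sketch of one way to pull "transcript_id" and "FPKM" out of the attribute column. It is not the code from my script: the names ATTR_RE and parse_attributes are made up for this example, and it assumes every attribute has the form key "value"; with a trailing semicolon. Because it collects all key/value pairs into a dictionary, it does not depend on the order of the attributes.

```python
import re

# Matches key "value" pairs such as: FPKM "23.5107601622";
# (assumption: every attribute value is wrapped in double quotes)
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return the GTF attribute column as a dict, regardless of key order."""
    return dict(ATTR_RE.findall(attr_field))


attrs = parse_attributes(
    'gene_id "CUFF.29"; transcript_id "CUFF.29.3"; '
    'FPKM "23.5107601622"; frac "0.216036";'
)
name = attrs["transcript_id"]              # becomes the BED name field
score = round(float(attrs.get("FPKM", 0)))  # becomes the BED score field
print(name, score)
```

Using attrs.get("FPKM", 0) means a line without an FPKM attribute gets a score of 0 instead of raising an error; whether that is the right default depends on your data.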

Here is an example of one transcript predicted by Cufflinks, in .GTF format:

chr1    Cufflinks       transcript      934797  935655  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021"; full_read_support "yes";
chr1    Cufflinks       exon    934797  934812  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";
chr1    Cufflinks       exon    934906  935655  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";

The converted .BED file contains just one line for this transcript:

chr1    934796  935655  CUFF.29.3       24      -       934796  935655  0,0,255 2       16,750  0,109

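The .BED record does not contain information like "frac", "conf_lo", "conf_hi" or "cov", but it uses only one line instead of the three lines in the .GTF file. Notice that the transcript_id is kept and the FPKM value (23.5) is rounded to 24 in the score field of the .BED file. The start coordinate also changes from 934797 to 934796, because BED start positions are 0-based while GTF coordinates are 1-based.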
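The conversion above can be sketched in a few lines of Python 3. This is a simplified illustration rather than the script itself: the function name gtf_transcript_to_bed12 is hypothetical, it assumes tab-separated input lines that all belong to one transcript, and it caps the score at 1000 (the upper limit of the BED score field), which is my choice for this example. Exons are sorted by start position so the block offsets are never negative, whatever order the exon lines arrive in.

```python
import re

ATTR_RE = re.compile(r'(\w+) "([^"]*)"')  # key "value" attribute pairs


def gtf_transcript_to_bed12(gtf_lines, rgb="0,0,255"):
    """Build one BED12 line from the GTF lines of a single transcript."""
    exons, chrom, strand, name, score = [], None, ".", None, 0
    for line in gtf_lines:
        f = line.rstrip("\n").split("\t")
        attrs = dict(ATTR_RE.findall(f[8]))
        if "FPKM" in attrs:
            # Round FPKM and cap at 1000 (assumption for this sketch)
            score = min(1000, round(float(attrs["FPKM"])))
        if f[2] == "exon":
            chrom, strand, name = f[0], f[6], attrs["transcript_id"]
            # GTF is 1-based inclusive; BED is 0-based, end-exclusive
            exons.append((int(f[3]) - 1, int(f[4])))
    if not exons:
        raise ValueError("no exon lines found for this transcript")
    exons.sort()  # ascending by start, also for minus-strand transcripts
    start = exons[0][0]
    end = max(e for _, e in exons)
    sizes = ",".join(str(e - s) for s, e in exons)
    starts = ",".join(str(s - start) for s, _ in exons)
    fields = [chrom, start, end, name, score, strand,
              start, end, rgb, len(exons), sizes, starts]
    return "\t".join(str(x) for x in fields)


example = [
    'chr1\tCufflinks\ttranscript\t934797\t935655\t324\t-\t.\t'
    'gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622";',
    'chr1\tCufflinks\texon\t934797\t934812\t324\t-\t.\t'
    'gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1";',
    'chr1\tCufflinks\texon\t934906\t935655\t324\t-\t.\t'
    'gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2";',
]
print(gtf_transcript_to_bed12(example))
```

For the example transcript, the block sizes come out as 16,750 and the block starts as 0,109, matching the .BED line shown above. A real converter would additionally need to group lines by transcript_id before calling a function like this.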

Feel free to comment or ask questions.


  1. Great script. However, I got this error:
    gisnb111:datasets bogugk$ python cufflinks-TranscriptDeNovo.gtf

    File "", line 36
    print('Warning: no gene_id field ',file=sys.stderr);
    SyntaxError: invalid syntax

    1. This is due to the syntax differences between Python 2.7 and Python 3. I wrote this script for Python 3, so it will cause problems if you use Python 2.7.

  2. This comment has been removed by the author.

  3. This comment has been removed by the author.

  4. Hi, thanks for this great script. However, I don't know how to save the output. When I ran the script to convert my GTF file, everything was printed to my screen rather than saved to a file. What should I do? Thanks in advance!

  5. Use redirection (">"). See the following link:

  6. Hi, thanks for the script.
    I am using GTF output from the StringTie tool. When I ran the script, it warned: 'Warning: no FPKM field'

    The FPKM value on field[8] is different from cufflinks
    gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "transcript:Sb01g000200.1"; ref_gene_id "gene:Sb01g000200"; cov "0.064565"; FPKM "0.019950"; TPM "0.026536";
    gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; reference_id "transcript:Sb01g000200.1"; ref_gene_id "gene:Sb01g000200"; cov "0.064565";

    How do I alter the code to catch the FPKM value? The current regular expression is not working.


  7. Hi, I tried the script. It works well for the plus strand ("+"), but for the minus strand ("-") it does not seem to give the right result: the output contains negative numbers.
