Friday, April 6, 2012

Estimating paired-end read insert length from SAM/BAM files

I wrote a single Python script to estimate the paired-end read insert length (or fragment length) from read mapping information (i.e., SAM/BAM files). The algorithm is simple: check the TLEN field in the SAM format, throw out pair-end reads whose pairs are too far away, and use them to estimate the mean and variance of the insert length.

This script is also able to provide a detailed distribution of read length and read span for your convenience. Please refer to the detailed usage below.

This script is distributed in GitHub now.

Usage:

getinsertsize.py [ SAM file | -]

or

samtools view [ BAM file ] | getinsertsize.py - 


Detailed Usage:


usage: getinsertsize.py [-h] [--span-distribution-file SPAN_DISTRIBUTION_FILE]
                        [--read-distribution-file READ_DISTRIBUTION_FILE]
                        SAMFILE

Automatically estimate the insert size of the paired-end reads for a given
SAM/BAM file.

positional arguments:
  SAMFILE               Input SAM file (use - from standard input)

optional arguments:
  -h, --help            show this help message and exit
  --span-distribution-file SPAN_DISTRIBUTION_FILE, -s SPAN_DISTRIBUTION_FILE
                        Write the distribution of the paired-end read span
                        into a text file with name SPAN_DISTRIBUTION_FILE.
                        This text file is tab-delimited, each line containing
                        two numbers: the span and the number of such paired-
                        end reads.
  --read-distribution-file READ_DISTRIBUTION_FILE, -r READ_DISTRIBUTION_FILE
                        Write the distribution of the paired-end read length
                        into a text file with name READ_DISTRIBUTION_FILE.
                        This text file is tab-delimited, each line containing
                        two numbers: the read length and the number of such
                        paired-end reads.


Sample output:

Read length: mean 90.6697303194, STD=15.9446036414
Possible read length and their counts:
{108: 43070882, 76: 50882326}
Read span: mean 165.217445903, STD=32.8914834802


Note: If the SAM/BAM file size is too large, it is accurate enough to estimate based on a few reads (like 1 millioin). In this case, you can run the script as follows:

head -n 1000000 [ SAM file ] |  getinsertsize.py -

or

samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py -

Note: According to the SAM definition, the read span "equals the number of bases from the leftmost mapped base to the rightmost mapped base". This span is the distance between two reads in a paired-end read PLUS 2 times read length. Read span is different from the "mate-inner-distance" in Tophat (-r option), which measures only the distance between two reads in a paired-end read.

47 comments:

  1. Hi wei I ma using your script to calculate the insert size but it is saying this gaurav@gaurav-OptiPlex-980:~/Desktop$ samtools view bowtie_out.nameSorted.PropMapPairsForRSEM.bam | head -n 1000000 | python getinsertsize.py
    1M...
    Read length: mean 100.0, STD=0.0
    Possible read length and their counts:
    {100: 1000000}
    No qualified paired-end reads found. Are they single-end reads?

    Can you please tell me how to check whether my reads are single end or paired end.

    ReplyDelete
  2. Probably these reads are single-end reads. You can check a few lines of your BAM file, and look at the 7th and 8th field of each line. If they are single-end reads, you will see marks like "0 *", which means there is no corresponding paired information.

    ReplyDelete
  3. Hi Wei li,

    I checked my read are paired end. Actually I built a denovo transcriptome and mapped those reads back to the transcriptome and made the BAM file. So will this script will work in this case also or it is specifically written for a reference based assembly.

    ReplyDelete
  4. Hi Wei li,

    I also get a message saying:
    No qualified paired-end reads found. Are they single-end reads?

    but my reads are paired-ends. I can see the pairing with igv. Do you have any suggestion?

    ReplyDelete
    Replies
    1. it's possible that your bam format is older version. can you paste a few lines of your bam file? use "samtools view test.bam | head -n 10 " print the first 10 lines of your bam record (test.bam).

      Delete
  5. Hello. Thanks for sharing the script! btw, can I ask why you used PNEXT field(8th column in a SAM file) rather than TLEN field(9th column in a SAM file). From my understanding, TLEN indicates an inferred insert size.

    ReplyDelete
    Replies
    1. In most cases you can use either way to calculate the insert size.

      Delete
  6. Thanks for the script Wei.

    I had to modify it a little to get it to work, and I think this is the same problem the commenters above me were probably having. The NH:i: field is an optional field, and many short read mapper don't actually include it, including some popular mappers.

    Since your script required the optional field to be present and equal to 1 to count that read, all the reads were getting skipped (giving that message: no qualified paired-end reads). I commented out those lines pertaining to this NH:i: field, and then the script ran successfully and counted the reads.

    Cheers

    ReplyDelete
  7. Hi craig,

    Can you please pot the code here. I am still getting the error.

    ReplyDelete
    Replies
    1. Just comment line 63 to line 66 and it should be fine with the "NH:i" issues. I've also updated the script so you can try on the newest one.

      Delete
  8. I get the following error:

    $samtools view 10081.marked.realigned.recal.bam | head -n 1000000 | python getinsertsize.py
    File "getinsertsize.py", line 45
    print(str(nline/1000000)+'M...',file=sys.stderr);
    ^
    SyntaxError: invalid syntax

    Does any one know what the problem is?

    ReplyDelete
    Replies
    1. what's your python version? It works for 2.7, but I'm not sure if it works under 2.6 or lower.

      Delete
    2. Okay I updated my version and I got the following output:

      1.0M...
      Read length: mean 101.0, STD=0.0
      Possible read length and their counts:
      {101: 951678}
      Read span: mean 0.0, STD=0.0

      I am certain these are paired end reads, why do I get 0 span?

      Here is some of my bam file too:

      HWI-1KL153:29:D1N72ACXX:8:1209:12276:12090 1187 chrM 9 60 1 01M = 200 292 GTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA TTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAG ?@>B?>EBBACB@??@>@?@@BC? B>B?ADCABBCBCADCC?>@DC=<@A@?C<@DBCABBBC:=&<@>FFCB@CF9BFBE@G<=CA?BB9>>B X 0:i:1 X1:i:0 BD:Z:KKOLLKNLJLILLJJIJKLMJLLLJMMIMMONLKLMNNNONNKBMNMMKKBBLNMOLON JJJJMKJKOOKNNNONKNPOOLNPNPONOOPQQQQQTTMNNN MD:Z:101 RG:Z:Exome_10081 XG:i:0 BI:Z:NNRPQNPPMOKPPNMNLNORMPLPMQOLQPRQLOLORQRQQQNFPPQQONGGPRQRPRQMMMMRNOO RRORQQOQPRSRSOQSRPRRSTSUUTUUWUQPRQ AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 M Q:i:60 XT:A:U

      Delete
    3. Do you mind sending me the first 10000 lines of your bam file (use head -n 10000)? I can take a look at what happens. You can zip the text file and send to li.david.wei AT gmail.com. Thanks!

      Delete
    4. Hi David,I've sent it to you, please let me know if you got it.

      Thanks for your help!

      Delete
    5. I've looked at your data. The problem is there are too many reads with insert size 0, which confuses the program and filter all the reads. This is a bug in my program actually, and I've updated the source code and it should work well now. Thanks & give it a try!

      PS: you need to run the program like "samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py -" now instead of "samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py".

      Delete
    6. Hi Wei,

      it seems to work well thank you for your very quick responses!

      For my one sample I have the following

      Read span: mean 364.3316085392387, STD=123.087158684682

      The STD seems to be fairly high, but I'm not exactly sure. Does anyone have experience with what the expected STD should be?

      Thanks,

      Mike Chong

      Delete
    7. Probably using more reads will have a better estimation of the span. Also, it seems that many of your first 10M reads are from chrM, which may not be a good choice. You can try to use reads from other chromosomes?

      Delete
    8. Hi Wei,

      I tried using the whole bam and I got pretty much the same number (~100 million reads). So this seems to be a real result.

      I was also wondering can the script be modified so that the span-distribution-file be outputted so that read counts are for 10bp bins? I am new at python and help would be greatly appreciated.

      Thanks,

      MC

      Delete
    9. You can probably load the file into R and check the histogram. That should give you an intuitive overview of the span distributions..

      Delete
  9. Sorry the arrow should be pointing at the equal sign from "file= sys.stderr"

    ReplyDelete
  10. Hi! and thank you very much for your script.

    I would suggest to improve its performances to get rid of '.keys()' at lines 61 and 76.
    if readlen in plrdlen.keys() would become if readlen in plrdlen.

    By the way, could you define exactly what you mean by read span?

    Thanks again!

    ReplyDelete
    Replies
    1. Thanks for your reply. read span is the distance between two reads in a paired-end read plus 2* read length. It is equal to the insert size of the fragment.

      Delete
  11. Hi wei,

    Great script!! Many thanks from Malaysia...

    Wondering if you would know how to integrate the read span (assuming this read span is referring to inset length) with tophat2 alignment?

    More specifically, one of the tophat2 parameters is "-r" (also known as mate-inner-distance) http://tophat.cbcb.umd.edu/manual.html

    So if for example the read span value from your script is 200, would you know what to use for the tophat2 -r parameter??

    Would you need to consider the read length of the sample (example 100bp) to determine a correct tophat2 -r value???

    Many thanks

    ReplyDelete
    Replies
    1. the tophat2 -r value is the read span minus 2*read length, so it's read length dependent. Since it is the distance between two reads in a paired-end read, it's called "mate-inner-distance".

      Delete
    2. thanks for the reply wei :)

      Hmmmmmmm...following up from the question,

      To include the -r parameter seems straightforward thanks to your clarification....Now that I included the (read span - 2*readlength) value in Tophat..... To add a layer of complication, if I wish also to include the standard deviation value in the mapping parameters...I can not just use the value from your script am I correct?

      For example the output from your script
      read span : 242 SD :90

      lets say readlength = 100

      Therefore -r value would be : 42 (242-2*100). However the same cant be done with SD, we cant just minus a value..(if my understanding is correct)


      The question I am asking...would it be possible to get an adjusted SD value based modified read-span length?? I am thinking of simply subtracting 200 from each read length to get a new SD estimate..

      I been staring at the script and see were would be a place to insert the readlength*2 value and get a new correct SD...but I have cant pin point where...


      Kind regards

      Delete
    3. Hi, the SD is the same (90 in your sample). This is because for a random variable x and a constant a, SD(x)=SD(x-a). The read span comes directly from the definition of the field[8], which already includes the read length.

      Delete
    4. Thanks for the insight Wei!

      I am afraid I still don't fully understand.... I think I need to dive deeper into statistics as well as sam field :)

      I sill fail to see why if we adjusted the read span (by subtracting 200) the SD will remain the same??

      Would a alternate workflow be
      1)Run your script first, get an estimate for read span
      2)Subtract filed[8] with the value of readspan-2*read-length
      3)Rerun an adjusted script and get a more accurate estimate for SD

      Or is it just I need to read more on basic statistics and SD :)

      Cheers

      Delete
  12. Thanks, just used this, very useful!

    ReplyDelete
  13. Hi,

    thank you for this post.
    I have a question : why with different subsample of the same bam I can have so manydifferences in read span ?
    For example :

    1) samtools view accepted_hits_chr22.bam | head -n 100000 | python getinsertsize.py -
    Read length: mean 101.0, STD=0.0
    Read span: mean 236.150649351, STD=88.6332394609
    2) samtools view accepted_hits_chr22.bam | head -n 400000 | python getinsertsize.py -
    Read length: mean 101.0, STD=0.0
    Read span: mean 221.92447156, STD=79.0142612548
    3) samtools view accepted_hits_chr22.bam | head -n 800000 | python getinsertsize.py -
    Read length: mean 101.0, STD=0.0
    Read span: mean 3383.2321526, STD=3460.01998406
    4) samtools view accepted_hits_chr22.bam | head -n 1000000 | python getinsertsize.py -
    Read length: mean 101.0, STD=0.0
    Read span: mean 3111.18687862, STD=3442.69889313

    Thank you in advance

    ReplyDelete
    Replies
    1. So this means there're some really long paired-end reads that disrupt the calculation of read span. You can just use the results of the top 100k reads.

      Delete
  14. This comment has been removed by the author.

    ReplyDelete
  15. Hi Everyone,

    I'm trying your script on sam format. > igave follwoing command

    python getinsertsize.py xyz.sam | -

    it showed error at line 16 # no module argparse

    above command need any modification or whats command to run your code on sam format . I tried with view option in samtools but it's for bam format not for sam. Please can you guide me How can I run for sam format to calculate Mean, read length and SD from sam alignment file as input

    Thanks in advance

    Thanks

    ReplyDelete
    Replies
    1. Hi, this means there's no argparse module in your Python system. What python version did you use? Python 2.7 should have no problem running this script.

      Delete
    2. k thanks, I'll update python. My python version 2.6.6 Thanks for reply.

      Delete
    3. I tried on updated python 2.7 now. It run completely but at last rather to print output.It showed following error.

      close failed in file object destructor:
      sys.excepthook is missing
      lost sys.stderr

      Could you suggest me why this error appearing

      Thanks

      Delete
    4. Thanks a lot Wei Li, Now Its working perfectly

      Delete
  16. Hi Wei,
    Thanks a lot for this. would you know how to adapt the script so one could select or exclude specific reads based on the span length? Eg: say you wanted only read pairs with span <500.
    thanks!

    ReplyDelete
  17. I'm mapping some illumina reads to a closely related species using BWA and getting some odd results. My mean read span is non-zero but is less than the mean read length. Any idea what could cause that? I'm fairly new to this and don't know python.

    Read length: mean 94.7053421448, STD=15.8237819615
    Read span: mean 74.962962963, STD=48.8084808847

    ReplyDelete
    Replies
    1. Probably your bam file includes a mixture of reads with different lengths. This may make it complicated for calculating read span.

      Delete
    2. I have exactly the same issue.

      Delete
  18. Nice..blog you can also visit our website that is a estimating software.http://www.accurateestimator.com/

    ReplyDelete
  19. Any improvement for un-even read length reads?
    I have trouble with different (MP/PE) libraries, but the insert size are all ~150bp! Not sure if that's due to un-even read length.

    ReplyDelete
  20. Thank you for good working tools
    i got this output: Read length of paired-end reads : mean 296.22, STD=22.51
    Read span: mean 484.54, STD=108.36
    Does it mean my insert size is 484-296=184 ?

    ReplyDelete
  21. I'm 15 years old. I was born with HIV my mother passed away because of the HIV infection And I regret why i never met Dr Itua he could have cured my mum for me because as a single mother it was very hard for my mother I came across Dr itua healing words online about how he cure different disease in different races diseases like HIV/Aids Herpes Copd Diabetes Hepatitis even Cancer I was so excited but frighten at same time because I haven't come across such thing article online then I contacted Dr Itua on Mail drituaherbalcenter@gmail.com I also chat with him on what's app +2348149277967 he tells me how it works then I tell him I want to proceed I paid him so swiftly Colorado post office I receive my herbal medicine within 4/5 working days he gave me guild lines to follow and here am I living healthy again can imagine how god use men to manifest his works am I writing in all articles online to spread the god work of Dr Itua Herbal Medicine Yes can ask me anything about this on my twitter @ericamilli or text 7205992850.He's a Great Man.

    ReplyDelete