This script is also able to provide a detailed distribution of read length and read span for your convenience. Please refer to the detailed usage below.
This script is distributed in GitHub now.
Usage:
getinsertsize.py [ SAM file | -]
or
samtools view [ BAM file ] | getinsertsize.py -
Detailed Usage:
usage: getinsertsize.py [-h] [--span-distribution-file SPAN_DISTRIBUTION_FILE]
[--read-distribution-file READ_DISTRIBUTION_FILE]
SAMFILE
Automatically estimate the insert size of the paired-end reads for a given
SAM/BAM file.
positional arguments:
SAMFILE Input SAM file (use - from standard input)
optional arguments:
-h, --help show this help message and exit
--span-distribution-file SPAN_DISTRIBUTION_FILE, -s SPAN_DISTRIBUTION_FILE
Write the distribution of the paired-end read span
into a text file with name SPAN_DISTRIBUTION_FILE.
This text file is tab-delimited, each line containing
two numbers: the span and the number of such paired-
end reads.
--read-distribution-file READ_DISTRIBUTION_FILE, -r READ_DISTRIBUTION_FILE
Write the distribution of the paired-end read length
into a text file with name READ_DISTRIBUTION_FILE.
This text file is tab-delimited, each line containing
two numbers: the read length and the number of such
paired-end reads.
usage: getinsertsize.py [-h] [--span-distribution-file SPAN_DISTRIBUTION_FILE]
[--read-distribution-file READ_DISTRIBUTION_FILE]
SAMFILE
Automatically estimate the insert size of the paired-end reads for a given
SAM/BAM file.
positional arguments:
SAMFILE Input SAM file (use - from standard input)
optional arguments:
-h, --help show this help message and exit
--span-distribution-file SPAN_DISTRIBUTION_FILE, -s SPAN_DISTRIBUTION_FILE
Write the distribution of the paired-end read span
into a text file with name SPAN_DISTRIBUTION_FILE.
This text file is tab-delimited, each line containing
two numbers: the span and the number of such paired-
end reads.
--read-distribution-file READ_DISTRIBUTION_FILE, -r READ_DISTRIBUTION_FILE
Write the distribution of the paired-end read length
into a text file with name READ_DISTRIBUTION_FILE.
This text file is tab-delimited, each line containing
two numbers: the read length and the number of such
paired-end reads.
Sample output:
Read length: mean 90.6697303194, STD=15.9446036414
Possible read length and their counts:
{108: 43070882, 76: 50882326}
Read span: mean 165.217445903, STD=32.8914834802
Note: If the SAM/BAM file size is too large, it is accurate enough to estimate based on a few reads (like 1 millioin). In this case, you can run the script as follows:
head -n 1000000 [ SAM file ] | getinsertsize.py -
or
samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py -
Note: According to the SAM definition, the read span "equals the number of bases from the leftmost mapped base to the rightmost mapped base". This span is the distance between two reads in a paired-end read PLUS 2 times read length. Read span is different from the "mate-inner-distance" in Tophat (-r option), which measures only the distance between two reads in a paired-end read.
Hi wei I ma using your script to calculate the insert size but it is saying this gaurav@gaurav-OptiPlex-980:~/Desktop$ samtools view bowtie_out.nameSorted.PropMapPairsForRSEM.bam | head -n 1000000 | python getinsertsize.py
ReplyDelete1M...
Read length: mean 100.0, STD=0.0
Possible read length and their counts:
{100: 1000000}
No qualified paired-end reads found. Are they single-end reads?
Can you please tell me how to check whether my reads are single end or paired end.
Probably these reads are single-end reads. You can check a few lines of your BAM file, and look at the 7th and 8th field of each line. If they are single-end reads, you will see marks like "0 *", which means there is no corresponding paired information.
ReplyDeleteHi Wei li,
ReplyDeleteI checked my read are paired end. Actually I built a denovo transcriptome and mapped those reads back to the transcriptome and made the BAM file. So will this script will work in this case also or it is specifically written for a reference based assembly.
Hi Wei li,
ReplyDeleteI also get a message saying:
No qualified paired-end reads found. Are they single-end reads?
but my reads are paired-ends. I can see the pairing with igv. Do you have any suggestion?
it's possible that your bam format is older version. can you paste a few lines of your bam file? use "samtools view test.bam | head -n 10 " print the first 10 lines of your bam record (test.bam).
DeleteHello. Thanks for sharing the script! btw, can I ask why you used PNEXT field(8th column in a SAM file) rather than TLEN field(9th column in a SAM file). From my understanding, TLEN indicates an inferred insert size.
ReplyDeleteIn most cases you can use either way to calculate the insert size.
DeleteThanks for the script Wei.
ReplyDeleteI had to modify it a little to get it to work, and I think this is the same problem the commenters above me were probably having. The NH:i: field is an optional field, and many short read mapper don't actually include it, including some popular mappers.
Since your script required the optional field to be present and equal to 1 to count that read, all the reads were getting skipped (giving that message: no qualified paired-end reads). I commented out those lines pertaining to this NH:i: field, and then the script ran successfully and counted the reads.
Cheers
Hi craig,
ReplyDeleteCan you please pot the code here. I am still getting the error.
Just comment line 63 to line 66 and it should be fine with the "NH:i" issues. I've also updated the script so you can try on the newest one.
DeleteI get the following error:
ReplyDelete$samtools view 10081.marked.realigned.recal.bam | head -n 1000000 | python getinsertsize.py
File "getinsertsize.py", line 45
print(str(nline/1000000)+'M...',file=sys.stderr);
^
SyntaxError: invalid syntax
Does any one know what the problem is?
what's your python version? It works for 2.7, but I'm not sure if it works under 2.6 or lower.
DeleteOkay I updated my version and I got the following output:
Delete1.0M...
Read length: mean 101.0, STD=0.0
Possible read length and their counts:
{101: 951678}
Read span: mean 0.0, STD=0.0
I am certain these are paired end reads, why do I get 0 span?
Here is some of my bam file too:
HWI-1KL153:29:D1N72ACXX:8:1209:12276:12090 1187 chrM 9 60 1 01M = 200 292 GTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA TTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAG ?@>B?>EBBACB@??@>@?@@BC? B>B?ADCABBCBCADCC?>@DC=<@A@?C<@DBCABBBC:=&<@>FFCB@CF9BFBE@G<=CA?BB9>>B X 0:i:1 X1:i:0 BD:Z:KKOLLKNLJLILLJJIJKLMJLLLJMMIMMONLKLMNNNONNKBMNMMKKBBLNMOLON JJJJMKJKOOKNNNONKNPOOLNPNPONOOPQQQQQTTMNNN MD:Z:101 RG:Z:Exome_10081 XG:i:0 BI:Z:NNRPQNPPMOKPPNMNLNORMPLPMQOLQPRQLOLORQRQQQNFPPQQONGGPRQRPRQMMMMRNOO RRORQQOQPRSRSOQSRPRRSTSUUTUUWUQPRQ AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 M Q:i:60 XT:A:U
Do you mind sending me the first 10000 lines of your bam file (use head -n 10000)? I can take a look at what happens. You can zip the text file and send to li.david.wei AT gmail.com. Thanks!
DeleteHi David,I've sent it to you, please let me know if you got it.
DeleteThanks for your help!
I've looked at your data. The problem is there are too many reads with insert size 0, which confuses the program and filter all the reads. This is a bug in my program actually, and I've updated the source code and it should work well now. Thanks & give it a try!
DeletePS: you need to run the program like "samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py -" now instead of "samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py".
Hi Wei,
Deleteit seems to work well thank you for your very quick responses!
For my one sample I have the following
Read span: mean 364.3316085392387, STD=123.087158684682
The STD seems to be fairly high, but I'm not exactly sure. Does anyone have experience with what the expected STD should be?
Thanks,
Mike Chong
Probably using more reads will have a better estimation of the span. Also, it seems that many of your first 10M reads are from chrM, which may not be a good choice. You can try to use reads from other chromosomes?
DeleteHi Wei,
DeleteI tried using the whole bam and I got pretty much the same number (~100 million reads). So this seems to be a real result.
I was also wondering can the script be modified so that the span-distribution-file be outputted so that read counts are for 10bp bins? I am new at python and help would be greatly appreciated.
Thanks,
MC
You can probably load the file into R and check the histogram. That should give you an intuitive overview of the span distributions..
DeleteSorry the arrow should be pointing at the equal sign from "file= sys.stderr"
ReplyDeleteHi! and thank you very much for your script.
ReplyDeleteI would suggest to improve its performances to get rid of '.keys()' at lines 61 and 76.
if readlen in plrdlen.keys() would become if readlen in plrdlen.
By the way, could you define exactly what you mean by read span?
Thanks again!
Thanks for your reply. read span is the distance between two reads in a paired-end read plus 2* read length. It is equal to the insert size of the fragment.
DeleteHi wei,
ReplyDeleteGreat script!! Many thanks from Malaysia...
Wondering if you would know how to integrate the read span (assuming this read span is referring to inset length) with tophat2 alignment?
More specifically, one of the tophat2 parameters is "-r" (also known as mate-inner-distance) http://tophat.cbcb.umd.edu/manual.html
So if for example the read span value from your script is 200, would you know what to use for the tophat2 -r parameter??
Would you need to consider the read length of the sample (example 100bp) to determine a correct tophat2 -r value???
Many thanks
the tophat2 -r value is the read span minus 2*read length, so it's read length dependent. Since it is the distance between two reads in a paired-end read, it's called "mate-inner-distance".
Deletethanks for the reply wei :)
DeleteHmmmmmmm...following up from the question,
To include the -r parameter seems straightforward thanks to your clarification....Now that I included the (read span - 2*readlength) value in Tophat..... To add a layer of complication, if I wish also to include the standard deviation value in the mapping parameters...I can not just use the value from your script am I correct?
For example the output from your script
read span : 242 SD :90
lets say readlength = 100
Therefore -r value would be : 42 (242-2*100). However the same cant be done with SD, we cant just minus a value..(if my understanding is correct)
The question I am asking...would it be possible to get an adjusted SD value based modified read-span length?? I am thinking of simply subtracting 200 from each read length to get a new SD estimate..
I been staring at the script and see were would be a place to insert the readlength*2 value and get a new correct SD...but I have cant pin point where...
Kind regards
Hi, the SD is the same (90 in your sample). This is because for a random variable x and a constant a, SD(x)=SD(x-a). The read span comes directly from the definition of the field[8], which already includes the read length.
DeleteThanks for the insight Wei!
DeleteI am afraid I still don't fully understand.... I think I need to dive deeper into statistics as well as sam field :)
I sill fail to see why if we adjusted the read span (by subtracting 200) the SD will remain the same??
Would a alternate workflow be
1)Run your script first, get an estimate for read span
2)Subtract filed[8] with the value of readspan-2*read-length
3)Rerun an adjusted script and get a more accurate estimate for SD
Or is it just I need to read more on basic statistics and SD :)
Cheers
Thanks for this!
ReplyDeleteThanks, just used this, very useful!
ReplyDeleteHi,
ReplyDeletethank you for this post.
I have a question : why with different subsample of the same bam I can have so manydifferences in read span ?
For example :
1) samtools view accepted_hits_chr22.bam | head -n 100000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 236.150649351, STD=88.6332394609
2) samtools view accepted_hits_chr22.bam | head -n 400000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 221.92447156, STD=79.0142612548
3) samtools view accepted_hits_chr22.bam | head -n 800000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 3383.2321526, STD=3460.01998406
4) samtools view accepted_hits_chr22.bam | head -n 1000000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 3111.18687862, STD=3442.69889313
Thank you in advance
So this means there're some really long paired-end reads that disrupt the calculation of read span. You can just use the results of the top 100k reads.
DeleteOh, ok, thank you !
DeleteThis comment has been removed by the author.
ReplyDeleteHi Everyone,
ReplyDeleteI'm trying your script on sam format. > igave follwoing command
python getinsertsize.py xyz.sam | -
it showed error at line 16 # no module argparse
above command need any modification or whats command to run your code on sam format . I tried with view option in samtools but it's for bam format not for sam. Please can you guide me How can I run for sam format to calculate Mean, read length and SD from sam alignment file as input
Thanks in advance
Thanks
Hi, this means there's no argparse module in your Python system. What python version did you use? Python 2.7 should have no problem running this script.
Deletek thanks, I'll update python. My python version 2.6.6 Thanks for reply.
DeleteI tried on updated python 2.7 now. It run completely but at last rather to print output.It showed following error.
Deleteclose failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
Could you suggest me why this error appearing
Thanks
Thanks a lot Wei Li, Now Its working perfectly
DeleteHi Wei,
ReplyDeleteThanks a lot for this. would you know how to adapt the script so one could select or exclude specific reads based on the span length? Eg: say you wanted only read pairs with span <500.
thanks!
I'm mapping some illumina reads to a closely related species using BWA and getting some odd results. My mean read span is non-zero but is less than the mean read length. Any idea what could cause that? I'm fairly new to this and don't know python.
ReplyDeleteRead length: mean 94.7053421448, STD=15.8237819615
Read span: mean 74.962962963, STD=48.8084808847
Probably your bam file includes a mixture of reads with different lengths. This may make it complicated for calculating read span.
DeleteI have exactly the same issue.
DeleteNice..blog you can also visit our website that is a estimating software.http://www.accurateestimator.com/
ReplyDeleteAny improvement for un-even read length reads?
ReplyDeleteI have trouble with different (MP/PE) libraries, but the insert size are all ~150bp! Not sure if that's due to un-even read length.
Thank you for good working tools
ReplyDeletei got this output: Read length of paired-end reads : mean 296.22, STD=22.51
Read span: mean 484.54, STD=108.36
Does it mean my insert size is 484-296=184 ?
I'm 15 years old. I was born with HIV my mother passed away because of the HIV infection And I regret why i never met Dr Itua he could have cured my mum for me because as a single mother it was very hard for my mother I came across Dr itua healing words online about how he cure different disease in different races diseases like HIV/Aids Herpes Copd Diabetes Hepatitis even Cancer I was so excited but frighten at same time because I haven't come across such thing article online then I contacted Dr Itua on Mail drituaherbalcenter@gmail.com I also chat with him on what's app +2348149277967 he tells me how it works then I tell him I want to proceed I paid him so swiftly Colorado post office I receive my herbal medicine within 4/5 working days he gave me guild lines to follow and here am I living healthy again can imagine how god use men to manifest his works am I writing in all articles online to spread the god work of Dr Itua Herbal Medicine Yes can ask me anything about this on my twitter @ericamilli or text 7205992850.He's a Great Man.
ReplyDelete