All About Bioinformatics: Estimating paired-end read insert length from SAM/BAM files

Friday, April 6, 2012

Estimating paired-end read insert length from SAM/BAM files

I wrote a single Python script to estimate the paired-end read insert length (or fragment length) from read mapping information (i.e., SAM/BAM files). The algorithm is simple: check the TLEN field in the SAM format, discard read pairs whose mates map too far apart, and use the remaining pairs to estimate the mean and standard deviation of the insert length.
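The steps above can be sketched roughly as follows. This is only an illustration of the idea, not the code of getinsertsize.py: the function name `estimate_span`, the `max_span` cutoff of 10000, and the choice to count each pair once through the mate with a positive TLEN are all assumptions made for this example.

```python
import math


def estimate_span(sam_lines, max_span=10000):
    """Return (mean, std, n) of TLEN over read pairs that pass a simple filter.

    Assumptions for this sketch: each pair is counted once, through the mate
    with a positive TLEN, and pairs with TLEN above max_span are discarded.
    Returns None if no pair qualifies.
    """
    spans = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])  # TLEN is the 9th mandatory SAM field
        if not flag & 0x1:  # bit 0x1 is set only for paired reads
            continue
        if tlen <= 0 or tlen > max_span:
            continue
        spans.append(tlen)
    if not spans:
        return None
    mean = sum(spans) / float(len(spans))
    variance = sum((s - mean) ** 2 for s in spans) / float(len(spans))
    return mean, math.sqrt(variance), len(spans)
```

To try it on a stream, something like `estimate_span(sys.stdin)` (after `import sys`) should work with `samtools view [ BAM file ] |` in front of it.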

The script can also write the detailed distributions of read length and read span to text files. Please refer to the detailed usage below.

This script is now available on GitHub.

Usage:

getinsertsize.py [ SAM file | -]

or

samtools view [ BAM file ] | getinsertsize.py -

Detailed Usage:

usage: getinsertsize.py [-h] [--span-distribution-file SPAN_DISTRIBUTION_FILE]
                        [--read-distribution-file READ_DISTRIBUTION_FILE]
                        SAMFILE

Automatically estimate the insert size of the paired-end reads for a given
SAM/BAM file.

positional arguments:
  SAMFILE               Input SAM file (use - from standard input)

optional arguments:
  -h, --help            show this help message and exit
  --span-distribution-file SPAN_DISTRIBUTION_FILE, -s SPAN_DISTRIBUTION_FILE
                        Write the distribution of the paired-end read span
                        into a text file with name SPAN_DISTRIBUTION_FILE.
                        This text file is tab-delimited, each line containing
                        two numbers: the span and the number of such paired-
                        end reads.
  --read-distribution-file READ_DISTRIBUTION_FILE, -r READ_DISTRIBUTION_FILE
                        Write the distribution of the paired-end read length
                        into a text file with name READ_DISTRIBUTION_FILE.
                        This text file is tab-delimited, each line containing
                        two numbers: the read length and the number of such
                        paired-end reads.
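
The distribution files described above are plain two-column tables, so they are easy to post-process. Here is a minimal sketch of reading one back and summarizing it. The function name `summarize_distribution` is hypothetical, and the sketch assumes both columns are integers, which the help text does not state explicitly.

```python
def summarize_distribution(path):
    """Summarize a two-column file of (value, count) pairs.

    Returns (mean, median, total), or None if the file holds no counts.
    Assumes both columns are integers separated by whitespace (tab).
    """
    counts = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:  # skip blank or malformed lines
                continue
            value, count = int(parts[0]), int(parts[1])
            counts[value] = counts.get(value, 0) + count
    total = sum(counts.values())
    if total == 0:
        return None
    mean = sum(v * c for v, c in counts.items()) / float(total)
    # Weighted (lower) median: first value at which the cumulative
    # count reaches half of the total.
    running = 0
    median = None
    for value in sorted(counts):
        running += counts[value]
        if running * 2 >= total:
            median = value
            break
    return mean, median, total
```

The median is often a more robust summary than the mean when a few pairs have very large spans.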

Sample output:

Read length: mean 90.6697303194, STD=15.9446036414

Possible read length and their counts:

{108: 43070882, 76: 50882326}

Read span: mean 165.217445903, STD=32.8914834802

Note: If the SAM/BAM file is very large, an estimate based on a subset of the reads (e.g., 1 million) is usually accurate enough. In this case, you can run the script as follows:

head -n 1000000 [ SAM file ] | getinsertsize.py -

samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py -

Note: According to the SAM definition, the read span "equals the number of bases from the leftmost mapped base to the rightmost mapped base". This span is the distance between the two reads of a pair PLUS 2 times the read length. Read span is different from the "mate-inner-distance" in TopHat (-r option), which measures only the distance between the two reads of a pair.
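
The relationship in the note above can be written as a one-line conversion. This is a sketch only: the function name `mate_inner_distance` is made up for this example, and it assumes both reads of a pair have the same length and map without clipping or splicing.

```python
def mate_inner_distance(read_span, read_length):
    """Convert a read span into the distance between the two mates.

    Follows the note above: span = inner distance + 2 * read length.
    Assumes both mates have the same length. A negative result means
    the two mates overlap.
    """
    return read_span - 2 * read_length


# Example: a span of 200 with 100 bp reads leaves no gap between mates.
inner = mate_inner_distance(200, 100)  # 0
```

With uneven read lengths, plugging in the mean read length gives only an approximation.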

47 comments:

gaurav, May 18, 2012 at 9:47 AM
Hi Wei, I am using your script to calculate the insert size, but it gives this output:
gaurav@gaurav-OptiPlex-980:~/Desktop$ samtools view bowtie_out.nameSorted.PropMapPairsForRSEM.bam | head -n 1000000 | python getinsertsize.py
1M...
Read length: mean 100.0, STD=0.0
Possible read length and their counts:
{100: 1000000}
No qualified paired-end reads found. Are they single-end reads?

Can you please tell me how to check whether my reads are single-end or paired-end?

Unknown, May 18, 2012 at 12:29 PM
Probably these reads are single-end reads. You can check a few lines of your BAM file and look at the 7th and 8th fields of each line. If they are single-end reads, you will see marks like "* 0", which means there is no corresponding pairing information.

gaurav, May 21, 2012 at 1:01 AM
Hi Wei Li,

I checked, and my reads are paired-end. Actually, I built a de novo transcriptome, mapped the reads back to the transcriptome, and made the BAM file. Will this script work in this case too, or is it specifically written for a reference-based assembly?

guibar, June 29, 2012 at 1:32 PM
Hi Wei Li,

I also get a message saying:
No qualified paired-end reads found. Are they single-end reads?

but my reads are paired-end. I can see the pairing with IGV. Do you have any suggestions?

ALpoptosis, July 5, 2012 at 9:46 PM
Hello. Thanks for sharing the script! By the way, can I ask why you used the PNEXT field (8th column in a SAM file) rather than the TLEN field (9th column in a SAM file)? From my understanding, TLEN indicates an inferred insert size.

Unknown, December 11, 2012 at 11:56 AM
Thanks for the script Wei.

I had to modify it a little to get it to work, and I think this is the same problem the commenters above me were probably having. The NH:i: field is an optional field, and many short-read mappers, including some popular ones, don't actually include it.

Since your script required the optional field to be present and equal to 1 to count that read, all the reads were getting skipped (giving that message: no qualified paired-end reads). I commented out those lines pertaining to this NH:i: field, and then the script ran successfully and counted the reads.

Cheers

Anonymous, January 24, 2013 at 11:22 AM
Hi craig,

Can you please post the code here? I am still getting the error.

mikemasenko, February 5, 2013 at 1:09 PM
I get the following error:

$samtools view 10081.marked.realigned.recal.bam | head -n 1000000 | python getinsertsize.py
File "getinsertsize.py", line 45
print(str(nline/1000000)+'M...',file=sys.stderr);
^
SyntaxError: invalid syntax

Does anyone know what the problem is?

mikemasenko, February 5, 2013 at 1:11 PM
Sorry, the arrow should be pointing at the equals sign in "file=sys.stderr".

L., February 18, 2013 at 4:22 AM
Hi! and thank you very much for your script.

To improve its performance, I would suggest getting rid of '.keys()' at lines 61 and 76:
"if readlen in plrdlen.keys()" would become "if readlen in plrdlen".

By the way, could you define exactly what you mean by read span?

Thanks again!

helping432, March 5, 2013 at 12:49 AM
Hi wei,

Great script!! Many thanks from Malaysia...

Wondering if you would know how to integrate the read span (assuming this read span refers to the insert length) with a tophat2 alignment?

More specifically, one of the tophat2 parameters is "-r" (also known as mate-inner-distance) http://tophat.cbcb.umd.edu/manual.html

So if for example the read span value from your script is 200, would you know what to use for the tophat2 -r parameter??

Would you need to consider the read length of the sample (example 100bp) to determine a correct tophat2 -r value???

Many thanks

Unknown, April 22, 2013 at 1:24 AM
Thanks for this!

flashton, June 19, 2013 at 8:55 AM
Thanks, just used this, very useful!

Deloger, November 8, 2013 at 3:02 PM
Hi,

thank you for this post.
I have a question: why do different subsamples of the same BAM file give such different read spans?
For example :

1) samtools view accepted_hits_chr22.bam | head -n 100000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 236.150649351, STD=88.6332394609
2) samtools view accepted_hits_chr22.bam | head -n 400000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 221.92447156, STD=79.0142612548
3) samtools view accepted_hits_chr22.bam | head -n 800000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 3383.2321526, STD=3460.01998406
4) samtools view accepted_hits_chr22.bam | head -n 1000000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 3111.18687862, STD=3442.69889313

Thank you in advance

Unknown, November 13, 2013 at 11:43 AM
Hi Everyone,

I'm trying your script on a SAM file. I gave the following command:

python getinsertsize.py xyz.sam | -

It showed an error at line 16: no module named argparse.

Does the above command need any modification, or what is the command to run your code on SAM format? I tried the view option in samtools, but it's for BAM format, not SAM. Can you please guide me on how to run it on a SAM alignment file to calculate the mean, read length, and SD?

Thanks in advance

James Crick, November 25, 2013 at 7:22 AM
Hi Wei,
Thanks a lot for this. Would you know how to adapt the script so one could select or exclude specific reads based on the span length? E.g., say you wanted only read pairs with span < 500.
Thanks!

Unknown, August 20, 2014 at 5:30 AM
I'm mapping some Illumina reads to a closely related species using BWA and getting some odd results. My mean read span is non-zero but is less than the mean read length. Any idea what could cause that? I'm fairly new to this and don't know Python.

Read length: mean 94.7053421448, STD=15.8237819615
Read span: mean 74.962962963, STD=48.8084808847

Yifangt, December 2, 2015 at 3:00 PM
Any improvement for reads of uneven length?
I have trouble with different (MP/PE) libraries: the insert sizes are all ~150 bp! Not sure if that's due to the uneven read lengths.

Unknown, July 30, 2017 at 9:51 PM
Thank you for the good working tool.
I got this output: Read length of paired-end reads: mean 296.22, STD=22.51
Read span: mean 484.54, STD=108.36
Does it mean my insert size is 484-296=188?