Thursday, August 18, 2011

Bases in sequence positions

Sequence positions are written in one of two coordinate systems: 0-base and 1-base. Different file formats use different systems, and this often causes confusion, especially when calculating the length of a region. Here I show the differences between the two systems and list several file formats that use each system.

0-base system: the first base is 0, and you represent a region as [a, b). This is also called "half-closed-half-open" or "0-base start, end exclusive"; the exclusive end b is the same number as the 1-base inclusive end. When calculating the length of the region, subtract a from b directly:

    length = b - a
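
Python slices happen to follow the same 0-based, end-exclusive convention, so the rule is easy to check. The snippet below is only a small sketch; the sequence `seq` and the helper name `length_0based` are made up for illustration.

```python
def length_0based(a, b):
    """Length of the 0-based, half-open region [a, b)."""
    if not 0 <= a <= b:
        raise ValueError("expected 0 <= a <= b")
    return b - a

seq = "ACGTACGTAC"  # hypothetical 10-base sequence

# Python slices are 0-based and end-exclusive, just like [a, b).
region = seq[2:5]
assert len(region) == length_0based(2, 5)

# Adjacent half-open regions share a boundary without overlapping.
assert seq[0:4] + seq[4:10] == seq
```

A side benefit of this convention is that an empty region can be written as [a, a), which has length 0.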


1-base system: the first base is 1, and you represent a region as [a, b], with both ends included. When calculating the length of the region, don't forget to add 1:

    length = b - a + 1
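
To pull a 1-base region out of a Python string, only the start needs shifting, because the inclusive 1-base end is the same number as the exclusive 0-base end. A minimal sketch, with `length_1based`, `extract_1based`, and `seq` as hypothetical names chosen for this post:

```python
def length_1based(a, b):
    """Length of the 1-based, closed region [a, b]."""
    if not 1 <= a <= b:
        raise ValueError("expected 1 <= a <= b")
    return b - a + 1

def extract_1based(seq, a, b):
    """Return bases a..b of seq, counting from 1, both ends included."""
    # 1-based start a becomes 0-based a - 1; the inclusive end b
    # is already equal to the exclusive end of a Python slice.
    return seq[a - 1:b]

seq = "ACGTACGTAC"  # hypothetical 10-base sequence
assert len(extract_1based(seq, 3, 5)) == length_1based(3, 5)
```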


Here is an example. Suppose you want to represent the region of X's in the following sequence:

    sequence:  N N N X X X N N N
    0-base:    0 1 2 3 4 5 6 7 8
    1-base:    1 2 3 4 5 6 7 8 9


In the 0-base system, this region is [3, 6), and the length is 6 - 3 = 3. In the 1-base system, it is [4, 6], and the length is 6 - 4 + 1 = 3.
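
Converting between the two systems only changes the start coordinate. The sketch below checks this against the numbers in the example; the function names and the stand-in sequence `NNNXXXNNN` are assumptions made for illustration.

```python
def zero_to_one(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end

def one_to_zero(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end

# The region from the example, in both systems.
assert zero_to_one(3, 6) == (4, 6)
assert one_to_zero(4, 6) == (3, 6)

# Hypothetical sequence with the X region at 0-based [3, 6).
seq = "NNNXXXNNN"
start, end = one_to_zero(4, 6)
assert seq[start:end] == "XXX"
```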

0-base system files: BED, BAM, PSL
1-base system files: SAM, GFF, GTF, Wig (fixedStep and variableStep)
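
When mixing files, one option is to normalize every region to a single system as soon as it is read. The sketch below covers only a subset of the formats listed above and assumes the caller has already worked out a start and an end for each record; the names `ZERO_BASED_HALF_OPEN`, `ONE_BASED_CLOSED`, and `to_half_open` are hypothetical, not part of any library.

```python
# Subset of the formats above, grouped by coordinate system.
ZERO_BASED_HALF_OPEN = {"BED", "BAM"}
ONE_BASED_CLOSED = {"SAM", "GFF", "GTF"}

def to_half_open(fmt, start, end):
    """Return the region as 0-based half-open [start, end)."""
    fmt = fmt.upper()
    if fmt in ZERO_BASED_HALF_OPEN:
        return start, end
    if fmt in ONE_BASED_CLOSED:
        # Only the start shifts when leaving the 1-based system.
        return start - 1, end
    raise ValueError(f"unknown format: {fmt}")

# The same region as it would be written in a BED and a GFF file.
bed_region = to_half_open("BED", 3, 6)
gff_region = to_half_open("GFF", 4, 6)
assert bed_region == gff_region == (3, 6)
```

Raising an error for unlisted formats is a deliberate choice here: guessing the coordinate system silently is how off-by-one bugs get in.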