1 Introduction to PyRanges

PyRanges objects are collections of intervals that support comparison operations (such as overlap and intersect) and other methods useful for genomic analyses. The ranges can carry an arbitrary number of metadata fields, i.e. columns associated with each interval.

The data in PyRanges objects are stored in a pandas dataframe. This means the vast Python ecosystem for high-performance scientific computing is available to manipulate the data in PyRanges objects.

import pyranges as pr
from pyranges import PyRanges
import pandas as pd
from io import StringIO
f1 = """Chromosome Start End Score Strand
chr1 4 7 23.8 +
chr1 6 11 0.13 -
chr2 0 14 42.42 +"""
df1 = pd.read_csv(StringIO(f1), sep=r"\s+")  # raw string avoids an invalid-escape warning for \s
gr1 = PyRanges(df1)

Now we can print the PyRanges object and subset it in various ways:

print(gr1)
## +--------------+------------+------------+------------+------------+
## | Chromosome   |      Start |        End |      Score | Strand     |
## | (object)     |   (object) |   (object) |   (object) | (object)   |
## |--------------+------------+------------+------------+------------|
## | chr1         |          4 |          7 |      23.8  | +          |
## | chr1         |          6 |         11 |       0.13 | -          |
## | chr2         |          0 |         14 |      42.42 | +          |
## +--------------+------------+------------+------------+------------+
## Stranded PyRanges object has 3 rows and 5 columns from 2 chromosomes.
print(gr1["chr1", 0:5])
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int32) |   (int32) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         4 |         7 |        23.8 | +            |
## +--------------+-----------+-----------+-------------+--------------+
## Stranded PyRanges object has 1 rows and 5 columns from 1 chromosomes.
print(gr1["chr1", "-", 6:100])
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int32) |   (int32) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         6 |        11 |        0.13 | -            |
## +--------------+-----------+-----------+-------------+--------------+
## Stranded PyRanges object has 1 rows and 5 columns from 1 chromosomes.
print(gr1.Score)
## 0    23.80
## 1     0.13
## 2    42.42
## Name: Score, dtype: float64
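
Judging from the output above, the slice-based lookups select intervals that overlap the requested region, not only those fully contained in it (the interval 4-7 is returned for `0:5`). The following is a minimal pure-Python sketch of that selection rule, assuming half-open [Start, End) coordinates. The `Interval` tuple and the `subset` helper are hypothetical names used for illustration only; they are not part of PyRanges and do not reflect how the library implements subsetting.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    # Hypothetical record type mirroring the columns of gr1.
    chromosome: str
    start: int
    end: int
    score: float
    strand: str


GR1 = [
    Interval("chr1", 4, 7, 23.8, "+"),
    Interval("chr1", 6, 11, 0.13, "-"),
    Interval("chr2", 0, 14, 42.42, "+"),
]


def subset(rows, chromosome, start, end, strand=None):
    """Keep rows on `chromosome` (and `strand`, if given) that overlap [start, end)."""
    return [
        r for r in rows
        if r.chromosome == chromosome
        and (strand is None or r.strand == strand)
        # Two half-open intervals overlap when each starts before the other ends.
        and r.start < end and r.end > start
    ]


print(subset(GR1, "chr1", 0, 5))
print(subset(GR1, "chr1", 6, 100, strand="-"))
```

Under these assumptions the two calls select the same rows as `gr1["chr1", 0:5]` and `gr1["chr1", "-", 6:100]` do in the printed output above.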

We can also perform comparison operations between two PyRanges objects:

f2 = """Chromosome Start End Score Strand
chr1 5 6 -0.01 -
chr1 9 12 200 +
chr3 0 14 21.21 -"""
df2 = pd.read_csv(StringIO(f2), sep=r"\s+")  # raw string avoids an invalid-escape warning for \s
gr2 = PyRanges(df2)
print(gr2)
## +--------------+------------+------------+------------+------------+
## | Chromosome   |      Start |        End |      Score | Strand     |
## | (object)     |   (object) |   (object) |   (object) | (object)   |
## |--------------+------------+------------+------------+------------|
## | chr1         |          9 |         12 |     200    | +          |
## | chr1         |          5 |          6 |      -0.01 | -          |
## | chr3         |          0 |         14 |      21.21 | -          |
## +--------------+------------+------------+------------+------------+
## Stranded PyRanges object has 3 rows and 5 columns from 2 chromosomes.
print(gr1.intersect(gr2, strandedness="opposite"))
## +--------------+------------+------------+------------+------------+
## | Chromosome   |      Start |        End |      Score | Strand     |
## | (object)     |   (object) |   (object) |   (object) | (object)   |
## |--------------+------------+------------+------------+------------|
## | chr1         |          5 |          6 |      23.8  | +          |
## | chr1         |          9 |         11 |       0.13 | -          |
## +--------------+------------+------------+------------+------------+
## Stranded PyRanges object has 2 rows and 5 columns from 1 chromosomes.
print(gr1.intersect(gr2, strandedness=False))
## +--------------+------------+------------+------------+------------+
## | Chromosome   |      Start |        End |      Score | Strand     |
## | (object)     |   (object) |   (object) |   (object) | (object)   |
## |--------------+------------+------------+------------+------------|
## | chr1         |          5 |          6 |      23.8  | +          |
## | chr1         |          9 |         11 |       0.13 | -          |
## +--------------+------------+------------+------------+------------+
## Stranded PyRanges object has 2 rows and 5 columns from 1 chromosomes.
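
To make the coordinates in these results easier to verify by hand, here is a small pure-Python sketch of pairwise interval intersection. It assumes half-open [Start, End) intervals and a naive all-pairs comparison. The function name `intersect_sketch` and the tuple layout are illustrative assumptions, not the PyRanges implementation, which may treat multiple overlaps and metadata differently.

```python
def intersect_sketch(a, b, strandedness=None):
    """Intersect two lists of (chromosome, start, end, strand) tuples.

    strandedness: "same", "opposite", or any other value (None/False)
    to ignore strand. Coordinates are treated as half-open [start, end).
    """
    out = []
    for chrom_a, start_a, end_a, strand_a in a:
        for chrom_b, start_b, end_b, strand_b in b:
            if chrom_a != chrom_b:
                continue
            if strandedness == "same" and strand_a != strand_b:
                continue
            if strandedness == "opposite" and strand_a == strand_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # non-empty overlap
                # The result keeps the strand of the interval from `a`.
                out.append((chrom_a, start, end, strand_a))
    return out


gr1_rows = [("chr1", 4, 7, "+"), ("chr1", 6, 11, "-"), ("chr2", 0, 14, "+")]
gr2_rows = [("chr1", 5, 6, "-"), ("chr1", 9, 12, "+"), ("chr3", 0, 14, "-")]

print(intersect_sketch(gr1_rows, gr2_rows, strandedness="opposite"))
print(intersect_sketch(gr1_rows, gr2_rows, strandedness=False))
```

For this particular data both calls yield the same two regions, chr1 5-6 and chr1 9-11. The only same-strand candidate pairs (4-7 with 9-12, and 6-11 with 5-6) do not overlap under half-open coordinates, so ignoring strand adds nothing.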

There are also convenience methods for single PyRanges:

print(gr1.merge())
## +--------------+------------+------------+
## | Chromosome   |      Start |        End |
## | (object)     |   (object) |   (object) |
## |--------------+------------+------------|
## | chr1         |          4 |         11 |
## | chr2         |          0 |         14 |
## +--------------+------------+------------+
## Unstranded PyRanges object has 2 rows and 3 columns from 2 chromosomes.
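
The merge result above can be reproduced with a simple sort-and-sweep over each chromosome. The sketch below is an illustration under stated assumptions, not the library's algorithm: it ignores strand (as the unstranded output suggests), uses half-open coordinates, and joins only intervals that strictly overlap. Whether PyRanges also joins intervals that merely touch is not stated here, so that choice is an assumption of this sketch. `merge_sketch` is a hypothetical helper name.

```python
from collections import defaultdict


def merge_sketch(intervals):
    """Merge overlapping (chromosome, start, end) tuples per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current = None
        for start, end in sorted(by_chrom[chrom]):
            if current is not None and start < current[1]:
                # Overlaps the running interval: extend its end if needed.
                current[1] = max(current[1], end)
            else:
                # No overlap: emit the running interval and start a new one.
                if current is not None:
                    merged.append((chrom, current[0], current[1]))
                current = [start, end]
        if current is not None:
            merged.append((chrom, current[0], current[1]))
    return merged


print(merge_sketch([("chr1", 4, 7), ("chr1", 6, 11), ("chr2", 0, 14)]))
```

With the three intervals from `gr1`, the two chr1 intervals collapse into 4-11 and the chr2 interval is left unchanged, matching the printed `gr1.merge()` result.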

The underlying dataframe can always be accessed:

print(gr1.df)
##   Chromosome  Start  End  Score Strand
## 0       chr1      4    7  23.80      +
## 1       chr1      6   11   0.13      -
## 2       chr2      0   14  42.42      +