dplyr works well for operations like these:
# first, read in the data, with headers
depth <- read.table(header = TRUE, text =
"chr Pos Nucleotide Coverage
chr1 1 A 10
chr1 2 G 12
chr1 3 T 3
chr1 4 A 20
chr1 5 T 22
chr1 6 N 0
chr1 7 N 0
chr2 23 A 1
chr2 24 T 5
chr2 25 G 15")
intervals <- read.table(header = TRUE, text =
"chr start end
chr1 3 5
chr2 23 25
chr4 1 30")
Now you can get to work:
library(dplyr)
# create a new data.frame:
# link intervals with any rows from depth where the value of 'chr' matches
# (keeping all rows from intervals)
merged <-
  merge(intervals, depth, by = 'chr', all.x = TRUE) %>%
  mutate(
    # add a column to flag rows in the range spec'd by intervals
    # (NA where an interval has no matching rows in depth)
    in_range = Pos >= start & Pos <= end,
    # substitute 0 for any missing values in Coverage
    Coverage = coalesce(Coverage, 0L))
# now you can get your results:
result1 <-
  merged %>%
  # keep those in range or with no value from depth$Pos
  filter(in_range | is.na(Pos)) %>%
  group_by(chr, start, end) %>%
  summarise(sum_cov = sum(Coverage))
result2 <-
  merged %>%
  # keep those in range (filter() also drops rows where in_range is NA)
  filter(in_range) %>%
  # only get the columns that were in depth
  select(names(depth))
The results are as you expect:
> result1
chr start end sum_cov
1 chr1 3 5 45
2 chr2 23 25 21
3 chr4 1 30 0
> result2
chr Pos Nucleotide Coverage
1 chr1 3 T 3
2 chr1 4 A 20
3 chr1 5 T 22
4 chr2 23 A 1
5 chr2 24 T 5
6 chr2 25 G 15
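If you want to sanity-check the per-interval sums without dplyr, here is a minimal base-R sketch. It is not part of the approach above, just a cross-check on the same toy data; the names `sum_cov` and `check` are made up for illustration, and `stringsAsFactors = FALSE` is set on the assumption that you might be on an R version older than 4.0.

```r
# Same toy data as above; stringsAsFactors = FALSE keeps 'chr' as plain
# character on R versions before 4.0 as well.
depth <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"chr Pos Nucleotide Coverage
chr1 1 A 10
chr1 2 G 12
chr1 3 T 3
chr1 4 A 20
chr1 5 T 22
chr1 6 N 0
chr1 7 N 0
chr2 23 A 1
chr2 24 T 5
chr2 25 G 15")
intervals <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"chr start end
chr1 3 5
chr2 23 25
chr4 1 30")

# For each interval, sum Coverage over the depth rows on the same chromosome
# whose Pos lies within [start, end]. An interval with no matching rows sums
# an empty vector, which gives 0.
sum_cov <- mapply(
  function(chr, start, end) {
    hit <- depth$chr == chr & depth$Pos >= start & depth$Pos <= end
    sum(depth$Coverage[hit])
  },
  intervals$chr, intervals$start, intervals$end
)

# Attach the sums to the intervals (hypothetical name 'check')
check <- cbind(intervals, sum_cov = unname(sum_cov))
print(check)
```

This loops over the intervals one at a time, so it is fine for a handful of ranges but will be slow on genome-scale data, where the merge-based approach (or a dedicated interval-overlap package) is the better fit.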