Question

Учитывая набор интервалов, я хотел бы найти неперекрывающиеся отдельные интервалы из набора интервалов.

Например:

Ввод: [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]

Вывод: [[1,5], [21,22], [23,24], [25,30], [50,55], [60,70]]

Как я могу это сделать?

То, что я пробовал сейчас:

gene_bounds_list = [[1,10],[5,20], [6,21],[17,25],[22,23], [24,50],[30,55],[60,70]]
overlap_list = []
nonoverlap_list = []
nonoverlap_list.append(gene_bounds_list[0])

for i in range(1, len(gene_bounds_list)):
    curr_gene_bounds = gene_bounds_list[i]
    prev_gene_bounds = nonoverlap_list[-1]

    if curr_gene_bounds[0]<prev_gene_bounds[0]:
        if curr_gene_bounds[1]<prev_gene_bounds[0]: #case1
            continue
        if curr_gene_bounds[1] < prev_gene_bounds[1]:  #case2
            nonoverlap_list[-1][0] = curr_gene_bounds[1]
        if curr_gene_bounds[1]>prev_gene_bounds[1]:
            # previous gene was completely overlapping within current gene,
            # so replace previous gene by current (bigger) gene and put previous gene into overlap list
            overlap_list.append(nonoverlap_list[-1])
            new_bound = [gene_bounds_list[i][0], gene_bounds_list[i][1]]
            nonoverlap_list.pop()
            nonoverlap_list.append([new_bound[0], new_bound[1]])

    elif curr_gene_bounds[0] > prev_gene_bounds[0] and curr_gene_bounds[1] < prev_gene_bounds[1]:
        # completely within another gene
        overlap_list.append([curr_gene_bounds[0], curr_gene_bounds[1]])

    elif curr_gene_bounds[0] < prev_gene_bounds[1]:
        # partially overlapping with another gene
        new_bound = [nonoverlap_list[-1][1], curr_gene_bounds[1]]
        nonoverlap_list[-1][1] = curr_gene_bounds[0]
        nonoverlap_list.append([new_bound[0], new_bound[1]])

    else:
        # not overlapping with another gene
        nonoverlap_list.append([gene_bounds_list[i][0], gene_bounds_list[i][1]])

unique_data = [list(x) for x in set(tuple(x) for x in gene_bounds_list)]
within_overlapping_intervals = []

for small in overlap_list:
    for master in unique_data:
        if (small[0]==master[0] and small[1]==master[1]):
            continue
        if (small[0]>master[0] and small[1]<master[1]):
            if(small not in within_overlapping_intervals):
                within_overlapping_intervals.append([small[0], small[1]])

for o in within_overlapping_intervals:
    nonoverlap_list.append(o)  # append the overlapping intervals

nonoverlap_list.sort(key=lambda tup: tup[0])
flat_data = sorted([x for sublist in nonoverlap_list for x in sublist])
new_gene_intervals = [flat_data[i:i + 2] for i in range(0, len(flat_data), 2)]
print(new_gene_intervals)

Однако это дает мне результат: [[1, 5], [6, 10], [17, 20], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]

Есть идеи, как убрать нежелательные интервалы?

Vighnesh Raut · Answer 1 · 30 мая 2020

Нанести интервалы на временную шкалу. Поскольку диапазон равен 10**5, мы можем использовать память. Печать и сканирование могут выполняться за линейное время.

intervals = [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]

max_ = max(intervals, key=lambda x: x[1])[1]
timeline = [0] * (max_ + 1)

# mark plots
for start, end in intervals:
    timeline[start] += 1
    timeline[end] -= 1


# make the timeline
for i in range(1, len(timeline)):
    timeline[i] += timeline[i - 1]


# scan
result = []
for i, item in enumerate(timeline):
    if i == 0:
        continue
    prev = timeline[i - 1]
    if item == 1 and prev != 1:
        result.append([i, i + 1])
    elif item == 1 and prev == 1:
        result[-1][1] = i + 1
        end = i


print(result)

РЕДАКТИРОВАТЬ: Поскольку диапазон обновляется до ~ 10^8, это не сработает.

Thierry Lathuille · Answer 2 · 30 мая 2020

Вот как это сделать. Идея состоит в том, чтобы отслеживать количество слоев интервалов в любой точке. Для этого мы добавляем один слой при вводе интервала и удаляем один при выходе.

Мы начинаем с построения отсортированных списков начала и конца. Чтобы определить, является ли значение началом или концом, мы создаем кортежи (start, 1) или (end, -1).

Затем мы объединяем эти два списка, сортируя по значению, и перебираем полученный список (использование heapq.merge упрощает это). Каждый раз, когда количество слоев меняется на 1, у нас начинается неперекрывающийся интервал. Когда он снова изменится, это будет конец.

from heapq import merge


def non_overlapping(data):
    out = []
    starts = sorted([(i[0], 1) for i in data])  # start of interval adds a layer of overlap
    ends = sorted([(i[1], -1) for i in data])   # end removes one
    layers = 0
    current = []
    for value, event in merge(starts, ends):    # sorted by value, then ends (-1) before starts (1)
        layers += event
        if layers ==1:  # start of a new non-overlapping interval
            current.append(value)
        elif current:  # we either got out of an interval, or started an overlap
            current.append(value)
            out.append(current)
            current = []
    return out


data = [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]

non_overlapping(data)
# [[1, 5], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]

Обратите внимание, что ожидаемый ответ, который вы задали в своем вопросе, неверен (например, он содержит 45, который не является частью входных данных)

Получите неперекрывающиеся отдельные интервалы из набора интервалов Python

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 2 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Получите неперекрывающиеся отдельные интервалы из набора интервалов Python

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 2 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Похожие темы