How to extract lines after specific words?
6 votes
/ May 24, 2019

I want to extract the date and specific items from a text using a regular expression in Python 3. Here is an example:

text = '''
190219 7:05:30 line1 fail
               line1 this is the 1st fail
               line2 fail
               line2 this is the 2nd fail
               line3 success 
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line1 fail
               line1 this is the 1st fail
               line2 success 
               line2 this is the 1st success process
               line2 this process need 4sec
               line3 success 
               line3 this is the 2st success process
               line3 this process need 2sec

'''

In the example above, I would like to get all the lines that follow a 'success' line. Here is the desired output:

[('190219','7:05:30','line3 this is the 1st success process', 'line3 this process need 3sec'),
('200219', '9:10:10', 'line2 this is the 1st success process', 'line2 this process need 4sec', 'line3 this is the 2st success process','line3 this process need 2sec')]

I tried this:

>>> newLine = re.sub(r'\t|\n|\r|\s{2,}',' ', text)
>>> newLine
>>> Out[3]: ' 190219 7:05:30 line1 fail  line1 this is the 1st fail  line2 fail  line2 this is the 2nd fail  line3 success line3 this is the 1st success process  line3 this process need 3sec 200219 9:10:10 line1 fail  line1 this is the 1st fail  line2 success line2 this is the 1st success process  line2 this process need 4sec  line3 success line3 this is the 2st success process  line3 this process need 2sec  '

I don't know how to get the correct result from here. I also tried this pattern to match the date and time:

(\b\d{6}\b \d{1,}:\d{2}:\d{2})...

How can I solve this problem?

Answers [ 4 ]

1 vote
/ May 26, 2019

This is my solution using regular expressions:

text = '''
190219 7:05:30 line1 fail
               line1 this is the 1st fail
               line2 fail
               line2 this is the 2nd fail
               line3 success 
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line1 fail
               line1 this is the 1st fail
               line2 success 
               line2 this is the 1st success process
               line2 this process need 4sec
               line3 success 
               line3 this is the 2st success process
               line3 this process need 2sec
'''

import re

# find desired lines
count = 0
data = []
for item in text.splitlines():
    # find date and time at the start of a block
    match_date = re.search(r'\d+\s\d+:\d\d:\d\d', item)
    if match_date is not None:
        count = 1
        # store date and time as two separate elements
        # (a separate loop variable avoids overwriting the current line)
        for part in match_date.group().split(' '):
            data.append(part)
    # find a "lineN success" marker line
    match = re.search(r'\w+\d\ssuccess', item)
    # start collecting from the line after the marker
    if match is not None:
        count = 2

    if count > 2:
        data.append(item.strip())

    if count == 2:
        count += 1

# split list data
# find integers in list
numbers = []
for item in data:
    numbers.append(item.isdigit())

# get positions of integers (each date starts a new group)
indexes = [i for i, x in enumerate(numbers) if x]
number_of_elements = len(data)
indexes = indexes + [number_of_elements]

# create list of lists
result = []
for i in range(len(indexes) - 1):
    result.append(data[indexes[i]:indexes[i+1]])

Result:

[['190219', '7:05:30', 'line3 this is the 1st success process', 'line3 this process need 3sec'], ['200219', '9:10:10', 'line2 this is the 1st success process', 'line2 this process need 4sec', 'line3 this is the 2st success process', 'line3 this process need 2sec']]
1 vote
/ May 24, 2019

This is a similar solution using groupby from itertools:

import re
from itertools import groupby

def parse(lines):
    result = []
    buffer, success_block = [], False
    for date, block in groupby(lines, key=lambda l: re.match(r"(\d{6})\s(\d{1,}:\d{2}:\d{2})", l)):
        if date:
            buffer = list(date.groups())
            success_block = next(block).endswith('success')
            continue
        for success, b in groupby(block, key=lambda l: re.match(r".*line\d\ssuccess$", l)):
            if success:
                success_block = True
                continue
            if success_block:
                buffer.extend(b)

        result.append(tuple(buffer))
        buffer = []
    return result
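
Note that parse() expects lines that are already stripped, for example [line.strip() for line in text.strip().splitlines()]; otherwise the trailing space after 'success' in the question's text would stop the inner pattern from matching.

The outer groupby works because re.match returns a new Match object on every call, and Match objects only compare equal to themselves, so every date line ends up in its own group, while all the None keys in between are merged into one group. Here is a minimal sketch of just that behaviour; the sample lines (including the two consecutive date lines) are made up for illustration:

```python
import re
from itertools import groupby

DATE = re.compile(r"(\d{6})\s(\d{1,}:\d{2}:\d{2})")

# Hypothetical sample; two date lines in a row are added on purpose.
lines = [
    "190219 7:05:30 line1 fail",
    "line1 this is the 1st fail",
    "line3 success",
    "200219 9:10:10 line1 fail",
    "210219 8:00:00 line1 fail",
    "line2 success",
]

# Key each line on the Match object (or None), as parse() does.
groups = [(key, list(block)) for key, block in groupby(lines, key=DATE.match)]

# For each group: is it a date header, and how many lines does it hold?
summary = [(key is not None, len(block)) for key, block in groups]
print(summary)
```

The two consecutive date lines come out as two separate groups, which is why next(block) in parse() always sees exactly one header line.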
1 vote
/ May 24, 2019

If you prefer more functional and elegant code, the code below should work. I used a functional library for Python called toolz. You can install it by running pip install toolz. The code below does not use regular expressions; it only uses partitions and filters. Change input_file to the file that contains the text and try it.


from toolz import partitionby, partition
from itertools import dropwhile

input_file = r'input_file.txt'


def line_starts_empty(line):
    return line.startswith(' ')


def clean(line):
    return line.strip()


def contains_no_success(line):
    return 'success' not in line.lower()


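def parse(args):
    head_line, tail_lines = args
    # date and time are the first two fields of the unindented line
    result_head = head_line[0].split()[:2]
    # drop lines before the first 'success'; that marker line itself is kept
    result_tail = list(map(clean, dropwhile(contains_no_success, tail_lines)))
    return result_head + result_tail


# the file must start with a date line so that groups alternate head/tail
with open(input_file) as f:
    for item in map(parse, partition(2, partitionby(line_starts_empty, f))):
        print(item)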
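
If you don't want to install toolz, roughly the same partitioning can be sketched with the standard library only. This is an approximation under my own assumptions, not the toolz implementation: itertools.groupby stands in for partitionby, slicing plus zip stands in for partition(2, ...), and the shortened sample text is made up from the question's data.

```python
from itertools import dropwhile, groupby

# Shortened sample based on the question's text (assumed to start with a date line).
text = """190219 7:05:30 line1 fail
               line1 this is the 1st fail
               line3 success
               line3 this process need 3sec
200219 9:10:10 line1 fail
               line2 success
               line2 this process need 4sec
"""


def is_continuation(line):
    # Indented lines belong to the preceding date line.
    return line.startswith(" ")


# groupby yields alternating runs: a header run, then its indented run.
runs = [list(run) for _, run in groupby(text.splitlines(), key=is_continuation)]
pairs = list(zip(runs[0::2], runs[1::2]))

records = []
for head, tail in pairs:
    date, time = head[0].split()[:2]
    # dropwhile stops at the first line containing 'success' and keeps it.
    kept = [l.strip() for l in dropwhile(lambda l: "success" not in l.lower(), tail)]
    records.append([date, time] + kept)

print(records)
```

As with the toolz version, the 'lineN success' marker line stays in each record; filter it out afterwards if you need exactly the output from the question.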


1 vote
/ May 24, 2019

Here is a solution that uses a regular expression to get the date and plain Python for everything else.

Prepare the input:

text = '''
190219 7:05:30 line1 fail
               line1 this is the 1st fail
               line2 fail
               line2 this is the 2nd fail
               line3 success
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line1 fail
               line1 this is the 1st fail
               line2 success
               line2 this is the 1st success process
               line2 this process need 4sec
               line3 success
               line3 this is the 2st success process
               line3 this process need 2sec
'''

# Strip the multiline string, split into lines, then strip each line
lines = [line.strip() for line in text.strip().splitlines()]
result = parse(lines)

Solution:

import re

def parse(lines):
    result = []
    buffer = []

    success = False
    for line in lines:
        date = re.match(r"(\d{6})\s(\d{1,}:\d{2}:\d{2})", line)
        if date:
            # Store previous match and reset buffer
            if buffer:
                result.append(tuple(buffer))
                buffer.clear()
            # Split the date and time and add to buffer
            buffer.extend(date.groups())
        # Check for status change
        if line.endswith("success") or line.endswith("fail"):
            success = line.endswith("success")
        # Add current line to buffer if it's part of the succeeded process
        else:
            if success:
                buffer.append(line)
    # Store last match
    result.append(tuple(buffer))
    return result

Output:

result = [('190219', '7:05:30', 'line3 this is the 1st success process', 'line3 this process need 3sec'), ('200219', '9:10:10', 'line2 this is the 1st success process', 'line2 this process need 4sec', 'line3 this is the 2st success process', 'line3 this process need 2sec')]
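
One caveat: endswith("fail") also matches ordinary detail lines that happen to end with that word, which would switch collecting off in the middle of a success block. If that can occur in your logs, a stricter variant might match the whole status line instead. The following is only a sketch under my own assumptions: parse_strict, STATUS and DATE are hypothetical names, and the sample lines are made up.

```python
import re

# A status line is assumed to be exactly "lineN success" or "lineN fail".
STATUS = re.compile(r"line\d+\s+(success|fail)")
# Date, time, and whatever follows them on the header line.
DATE = re.compile(r"(\d{6})\s+(\d{1,2}:\d{2}:\d{2})\s*(.*)")


def parse_strict(lines):
    result, buffer, success = [], [], False
    for line in lines:
        date = DATE.match(line)
        if date:
            # Store the previous record and start a new one
            if buffer:
                result.append(tuple(buffer))
            buffer = [date.group(1), date.group(2)]
            line = date.group(3)  # rest of the header, e.g. "line1 fail"
        status = STATUS.fullmatch(line)
        if status:
            success = status.group(1) == "success"
        elif success:
            buffer.append(line)
    # Store the last record
    if buffer:
        result.append(tuple(buffer))
    return result


sample = [
    "190219 7:05:30 line1 fail",
    "line1 this is the 1st fail",
    "line2 success",
    "line2 step done without fail",
    "line2 took 4sec",
]
print(parse_strict(sample))
```

Like the solution above, this expects lines that have already been stripped.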
...