Question

I'm wondering how to separate blocks of text in a single text file. An example is below. Basically I have 2 items: one runs from "Channel 9" to the "Brief: .." line, the other starts with "Southern ..." and again runs to "Brief". How can I split them into 2 text files using Python? I think the common divider would be "(female 16+)". Thanks a lot!

Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left 
$1,100 out
hosted by Peter Hitchener
A woman selling her caravan near Bendigo has been left $1,100 out of 
pocket after an elderly couple made the purchase with counterfeit money. 
The wildlife worker tried to use the notes to pay for a house deposit, but an 
agent noticed the notes were missing the Coat of Arms on one side. 


Brief: Radio & TV
Demographics: 153,000 (male 16+) • 177,000 (female 
16+)

Southern Cross Victoria Bendigo (1 item)


Heathcote Police are warning the residents to be on the 
lookout a
hosted by Jo Hall
Heathcote Police are warning the residents to be on the lookout after a large 
dash of fake $50 note was discovered. Victim Marianne Thomas was given 
counterfeit notes from a caravan. The Heathcote resident tried to pay the 
house deposit and that's when the counterfeit notes were spotted. Thomas 
says the caravan is in town for the Spanish Festival.


Brief: Radio & TV
Demographics: 4,000 (male 16+) • 3,000 (female 16+)

harvey · Answer 1 · May 30, 2018

Here is a modified example of something similar I did recently. It walks through your text and copies it line by line. The main logic is based on appending to the current file name, which is reset once the end of a section is found. The first line of the next section is then used as the next file name.

#!/usr/bin/env python
import re

data = """
Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left $1,100 out hosted by
Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100
out of pocket after an elderly couple made the purchase with counterfeit money.
The wildlife worker tried to use the notes to pay for a house deposit, but an
agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo
Hall Heathcote Police are warning the residents to be on the lookout after a
large dash of fake $50 note was discovered. Victim Marianne Thomas was given
counterfeit notes from a caravan. The Heathcote resident tried to pay the house
deposit and that's when the counterfeit notes were spotted. Thomas says the
caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)
"""



current_file = None
for line in data.split('\n'):

    # Set initial filename
    if current_file is None and line != '':
        current_file = line + '.txt'

    # This is to handle the blank line after Brief
    if current_file is None:
        continue

    # Append the line to the current section's file
    with open(current_file, "a") as text_file:
        text_file.write(line + "\n")

    # Reset filename if we have finished this section
    # which is identified by:
    #    starts with Brief - ^Brief
    #    contains some random amount of text - .*
    #    ends with ) - \)$
    if re.match(r'^Brief:.*\)$', line) is not None:
        current_file = None

This will output the following files:

Channel 9 (1 item).txt
Southern Cross Victoria Bendigo (1 item).txt

snagpaul · Answer 2 · May 30, 2018

Here is something hard-coded that will do it:

s = """Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left $1,100 out hosted by Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100 out of pocket after an elderly couple made the purchase with counterfeit money. The wildlife worker tried to use the notes to pay for a house deposit, but an agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo Hall Heathcote Police are warning the residents to be on the lookout after a large dash of fake $50 note was discovered. Victim Marianne Thomas was given counterfeit notes from a caravan. The Heathcote resident tried to pay the house deposit and that's when the counterfeit notes were spotted. Thomas says the caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)"""

part_1 = s[s.index("Channel 9"):s.index("Southern Cross")]

part_2 = s[s.index("Southern Cross"):]

And then save them to files.
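A minimal sketch of that saving step, using a shortened stand-in for `s`. The output names `part_1.txt` and `part_2.txt` are assumptions for illustration and do not come from the answer.

```python
from pathlib import Path

# Shortened stand-in for the full string `s` above
s = """Channel 9 (1 item)

A woman selling her caravan ...

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning ...

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)"""

# Same hard-coded slicing as in the answer
part_1 = s[s.index("Channel 9"):s.index("Southern Cross")]
part_2 = s[s.index("Southern Cross"):]

# Hypothetical file names; each part is written to its own file
for name, part in [("part_1.txt", part_1), ("part_2.txt", part_2)]:
    Path(name).write_text(part, encoding="utf-8")
```

Because the slice boundaries are literal strings, this only works for input that contains exactly these two headings.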

DYZ · Answer 3 · May 30, 2018

It looks like the lines starting with "Demographics: " act as the real dividers. I would use regular expressions in two ways: first, to split the text on these lines; second, to extract the lines themselves. The results can then be combined to reconstruct the blocks:

import re
DIVIDER = 'Demographics: .+' # Make it tunable, in case you change your mind
# `text` is assumed to hold the whole input as a single string
blocks_1 = re.split(DIVIDER, text)
blocks_2 = re.findall(DIVIDER, text)
blocks = ['\n\n'.join(pair) for pair in zip(blocks_1, blocks_2)]
blocks[0]
#Channel 9 (1 item)\n\nA woman selling her caravan near ... 
#... Demographics: 153,000 (male 16+) • 177,000 (female 16+)
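For reference, a self-contained variant of the same split-and-rejoin idea, run on a shortened stand-in for the input. The sample `text` and the names `bodies` and `dividers` are assumptions for illustration. Unlike the snippet above, it concatenates each pair directly and strips surrounding blank lines.

```python
import re

# Shortened stand-in for the real input
text = """Channel 9 (1 item)

Story one ...

Brief: Radio & TV
Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Story two ...

Brief: Radio & TV
Demographics: 4,000 (male 16+) • 3,000 (female 16+)
"""

DIVIDER = r'Demographics: .+'

bodies = re.split(DIVIDER, text)      # text between dividers, plus a trailing remainder
dividers = re.findall(DIVIDER, text)  # the divider lines themselves

# zip stops at the shorter list, so the trailing remainder is dropped
blocks = [(body + divider).strip() for body, divider in zip(bodies, dividers)]
```

Note that `.` does not match a newline by default, so a divider line that wraps onto a second line, as in the first block of the question, would only be matched up to the line break.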

abarnert · Answer 4 · May 30, 2018

Actually, I suspect you really want to break after a line starting with Demographics:, or before a line ending with (1 item) or (2 items) or similar.

But however you want to break, there are two steps:

Come up with a rule that you can turn into a function that you call on each line.
Write some code that groups things based on the result of that function.

Let's use your rule. A function for it could be:

def is_last_line(line):
    return line.strip().endswith('(female 16+)')

Now, here is one way you could group things using that function:

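# infile is assumed to be an already-open input file
i = 1
outfile = open(f'outfile{i}.txt', 'w')
for line in infile:
    # Write the line unchanged so line breaks are preserved
    outfile.write(line)
    if is_last_line(line):
        # Close the finished file before starting the next one
        outfile.close()
        i += 1
        outfile = open(f'outfile{i}.txt', 'w')
outfile.close()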

There are ways to make this much more concise using, for example, itertools.groupby, itertools.takewhile, iter, or other functions. Or you could write a generator function that still does things manually but yields groups of lines, which would make creating the new files much simpler (and would let us use with blocks). But spelling it out explicitly like this probably makes it easier for a novice to understand (and to debug and later extend), at the cost of some verbosity.
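The generator approach mentioned above could look roughly like this. It is only a sketch: the names `grouped` and `write_groups` and the `outfile` prefix parameter are assumptions, not part of the answer.

```python
def is_last_line(line):
    return line.strip().endswith('(female 16+)')

def grouped(lines, is_last):
    """Yield lists of lines; each list ends with a line for which is_last is true."""
    group = []
    for line in lines:
        group.append(line)
        if is_last(line):
            yield group
            group = []
    # Yield leftover lines only if they contain something besides whitespace
    if any(line.strip() for line in group):
        yield group

def write_groups(lines, prefix='outfile'):
    """Write each group to its own numbered file, e.g. outfile1.txt, outfile2.txt."""
    for i, group in enumerate(grouped(lines, is_last_line), start=1):
        with open(f'{prefix}{i}.txt', 'w') as outfile:
            outfile.writelines(group)

# Small in-memory example of the grouping step
sample = [
    "Channel 9 (1 item)\n",
    "Demographics: 153,000 (male 16+) • 177,000 (female 16+)\n",
    "\n",
    "Southern Cross Victoria Bendigo (1 item)\n",
    "Demographics: 4,000 (male 16+) • 3,000 (female 16+)\n",
    "\n",
]
groups = list(grouped(sample, is_last_line))
```

To split a real file, `write_groups` can be handed an open file object, since iterating over a file yields its lines. Each output file is opened in a with block, so it is closed as soon as its group has been written.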

For example, it's not very clear from how you worded your question whether you want that Demographics: line to appear in your output files. If you don't, it should be obvious how to change things:

    if not is_last_line(line):
        outfile.write(line)
    else:
        # Skip the divider line and move on to the next file
        outfile.close()
        i += 1
        outfile = open(f'outfile{i}.txt', 'w')
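The other rule suggested at the start of this answer, breaking before a line that ends with (1 item), (2 items) or similar, could be sketched the same way. The names `HEADER_RE`, `is_first_line` and `split_before_headers` are assumptions for illustration, and the regular expression is a guess at what the headers look like based on the two samples in the question.

```python
import re

# Matches a trailing "(1 item)", "(2 items)", "(10 items)", ...
HEADER_RE = re.compile(r'\(\d+ items?\)$')

def is_first_line(line):
    """True for section headers such as 'Channel 9 (1 item)'."""
    return HEADER_RE.search(line.strip()) is not None

def split_before_headers(lines):
    """Return a list of blocks; a new block starts at every header line."""
    blocks = []
    for line in lines:
        # Start a new block at a header, or for any text before the first header
        if is_first_line(line) or not blocks:
            blocks.append([])
        blocks[-1].append(line)
    return blocks

sample = [
    "Channel 9 (1 item)\n",
    "\n",
    "Story one ...\n",
    "Demographics: 153,000 (male 16+) • 177,000 (female 16+)\n",
    "\n",
    "Southern Cross Victoria Bendigo (1 item)\n",
    "\n",
    "Story two ...\n",
]
blocks = split_before_headers(sample)
```

With this rule the Demographics: line does not need to end in a fixed string, so it still works when that line is wrapped across two lines.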

Separate blocks of text in Python
