Regroup text chunks so that each one ends with a complete sentence - PullRequest
0 votes
/ February 28, 2019

I have three text chunks (in reality many more...) that each show part of a full text. However, the original text was not split correctly: some sentences are divided across two chunks.

text1 = {"We will talk about data about model specification parameter \
estimation and model application and the context where we will apply \
the simple example.Is an application where we would like to analyze \
the market for electric cars because"};

text2 = {"we are interested in the market of electric cars.The choice \
that we are interested in is the choice of each individual to \
purchase an electric car or not And we will see how"};

text3 = {"to address this question. Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "};

For example, text2 starts with "we are interested in the market of electric cars". This is an incomplete first sentence that actually began in text chunk 1 (see the last sentence there).

I want to make sure that each text chunk ends with a complete sentence. So I want to move each incomplete first sentence to the end of the preceding text chunk. For example, the result here would be:

text1corr = {"We will talk about data about model specification parameter \
estimation and model application and the context where we will apply \
the simple example.Is an application where we would like to analyze \
the market for electric cars because we are interested in the market of electric cars."};

text2corr = {"The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question."};

text3corr = {"Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "};

How can I do this in Python? Is it even possible?

Answers [ 2 ]

0 votes
/ February 28, 2019

You can use the zip_longest() function to iterate over pairs of consecutive strings:

from itertools import zip_longest
import re

# text1, text2, text3 are assumed to be plain strings (no surrounding braces)
l = [text1, text2, text3]
new_l = []

for i, j in zip_longest(l, l[1:], fillvalue=''):
    # remove leading and trailing spaces
    i, j = i.strip(), j.strip()
    # remove leading half sentence (it was already appended to the previous string)
    if i and i[0].islower():
        i = re.split(r'[.?!]', i, maxsplit=1)[-1].lstrip()
    # append half sentence from next string
    if i and i[-1].isalpha():
        j = re.split(r'[.?!]', j, maxsplit=1)[0]
        i = f"{i} {j}."
    new_l.append(i)

for i in new_l:
    print(i)

Output:

We will talk about data about model specification parameter estimation and model application and the context where we will apply the simple example.Is an application where we would like to analyze the market for electric cars because we are interested in the market of electric cars.
The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question.
Furthermore, it needs to be noted that this is only a model text and there is no content associated with it.
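
If you need this for many chunks, the same heuristics can be wrapped in a function. The following is only a sketch under the same assumptions as the loop above: a chunk that starts with a lowercase letter is treated as starting mid-sentence, and a chunk that ends with a letter is treated as ending mid-sentence. The names regroup and SENTENCE_END are made up for illustration. Unlike the loop above, it keeps the original terminator (".", "?" or "!") instead of always appending a period, and it does not handle a sentence that spans more than two chunks.

```python
import re
from itertools import zip_longest

# Hypothetical names; the heuristics mirror the loop shown above.
SENTENCE_END = re.compile(r"[.?!]")


def regroup(chunks):
    """Return chunks rearranged so that each one ends with a full sentence."""
    result = []
    for current, following in zip_longest(chunks, chunks[1:], fillvalue=""):
        current, following = current.strip(), following.strip()
        # Lowercase first letter: assume the chunk starts mid-sentence.
        # That fragment already went to the previous chunk, so drop it.
        if current and current[0].islower():
            parts = SENTENCE_END.split(current, maxsplit=1)
            current = parts[1].lstrip() if len(parts) > 1 else ""
        # Trailing letter: assume the last sentence is unfinished and borrow
        # the next chunk's text up to and including its first terminator.
        if current and current[-1].isalpha():
            match = SENTENCE_END.search(following)
            fragment = following[:match.end()] if match else following
            current = f"{current} {fragment}".strip()
        result.append(current)
    return result


chunks = [
    "First sentence. Second starts",
    "here and ends! Third one is",
    "complete now.",
]
for chunk in regroup(chunks):
    print(repr(chunk))
# 'First sentence. Second starts here and ends!'
# 'Third one is complete now.'
# ''
```

A chunk whose whole content was moved to its predecessor comes back as an empty string, so the number of chunks stays the same; filter those out if you do not need them.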

0 votes
/ February 28, 2019

text1 = "We will talk about data about model specification parameter \
estimation and model application and the context where we will apply \
the simple example.Is an application where we would like to analyze \
the market for electric cars because"

text2 = "we are interested in the market of electric cars.The choice \
that we are interested in is the choice of each individual to \
purchase an electric car or not And we will see how"

text3 = "to address this question. Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "

textList = [text1, text2, text3]

corrected_list = []
prev_incomplete_sentence = ''
for index, text in enumerate(textList):
    # drop the leading fragment (and the period after it) that was
    # already appended to the previous chunk
    if len(prev_incomplete_sentence) > 0:
        corrected_text = text[len(prev_incomplete_sentence) + 1:]
    else:
        corrected_text = text
    # append the next chunk's text up to its first period
    if index + 1 < len(textList):
        prev_incomplete_sentence = textList[index + 1].split('.')[0]
        corrected_text += ' ' + prev_incomplete_sentence
    corrected_list.append(corrected_text)

print(corrected_list)

Output:

['We will talk about data about model specification parameter estimation and model application and the context where we will apply the simple example.Is an application where we would like to analyze the market for electric cars because we are interested in the market of electric cars',
 'The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question',
 ' Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. ']
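
Note that the output above loses the period at the end of each moved fragment and keeps a leading space in the last chunk. One possible variation, shown here only as a sketch, uses str.partition() so the terminator travels with the fragment. The function name move_fragments and its terminator parameter are made up for illustration. Like the code above, it assumes every chunk after the first begins with a fragment that belongs to the previous chunk, and it only looks for a single terminator character.

```python
def move_fragments(chunks, terminator="."):
    """Move each chunk's leading fragment to the end of the previous chunk."""
    corrected = []
    skip = 0  # number of leading characters already moved to the previous chunk
    for index, text in enumerate(chunks):
        body = text[skip:].lstrip()
        skip = 0
        if index + 1 < len(chunks):
            # head is the text before the first terminator, sep the terminator
            # itself (an empty string if the next chunk contains none)
            head, sep, _ = chunks[index + 1].partition(terminator)
            body = f"{body} {head}{sep}"
            skip = len(head) + len(sep)
        corrected.append(body.strip())
    return corrected


chunks = [
    "A b c because",
    "we are d.The choice e how",
    "to address this. Furthermore, done. ",
]
print(move_fragments(chunks))
# ['A b c because we are d.', 'The choice e how to address this.', 'Furthermore, done.']
```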