How to efficiently clean text as a list of tokenized sentences
16 September 2018

Here is an example. Say I have a block of text:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer

paragraph = "There was a steaming mist in all the hollows, and it had roamed in its forlornness up the hill, like an evil spirit, seeking rest and finding none. A clammy and intensely cold mist, it made its slow way through the air in ripples that visibly followed and overspread one another, as the waves of an unwholesome sea might do. It was dense enough to shut out everything from the light of the coach-lamps but these its own workings, and a few yards of road; and the reek of the labouring horses steamed into it, as if they had made it all."
# sent_tokenize is imported by name above, so call it directly
sent_tokenize(paragraph)

Output:

['There was a steaming mist in all the hollows, and it had roamed in its forlornness up the hill, like an evil spirit, seeking rest and finding none.',
 'A clammy and intensely cold mist, it made its slow way through the air in ripples that visibly followed and overspread one another, as the waves of an unwholesome sea might do.',
 'It was dense enough to shut out everything from the light of the coach-lamps but these its own workings, and a few yards of road; and the reek of the labouring horses steamed into it, as if they had made it all.']

If I tokenized the paragraph into words instead, I could apply all kinds of text preprocessing:

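# Keep the token list; iterating over the raw string would yield characters
words = word_tokenize(paragraph)

# WordNetLemmatizer is instantiated once; lemmatize() takes one word at a time
lemmatizer = WordNetLemmatizer()
# Lowercase each token, then lemmatize it
lemmas = [lemmatizer.lemmatize(word.lower()) for word in words]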

and so on.

But these preprocessing methods usually expect a string containing a single word as input.

How can I preprocess the sentences while keeping them in this data structure, that is, a list with one entry per sentence?
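One possible way to keep the per-sentence structure is to split into sentences first and then clean the tokens of each sentence separately, which gives a list of token lists. The sketch below is only an illustration under stated assumptions: `preprocess_sentences`, `simple_sent_split`, and `simple_word_split` are hypothetical names, and the two regex-based splitters are naive stand-ins so the example runs without NLTK data. They are not equivalent to `sent_tokenize` or `word_tokenize`.

```python
import re
from typing import Callable, List


def simple_sent_split(text: str) -> List[str]:
    # Naive stand-in (assumption) for nltk.tokenize.sent_tokenize:
    # split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_split(sentence: str) -> List[str]:
    # Naive stand-in (assumption) for nltk.tokenize.word_tokenize:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess_sentences(
    text: str,
    sent_split: Callable[[str], List[str]] = simple_sent_split,
    word_split: Callable[[str], List[str]] = simple_word_split,
    normalize: Callable[[str], str] = str.lower,
    keep: Callable[[str], bool] = str.isalpha,
) -> List[List[str]]:
    """Return one list of cleaned tokens per sentence."""
    cleaned = []
    for sentence in sent_split(text):
        # Filter and normalize word by word, but group the results by sentence.
        tokens = [normalize(tok) for tok in word_split(sentence) if keep(tok)]
        cleaned.append(tokens)
    return cleaned


sample = "There was a steaming mist. It made its slow way through the air!"
result = preprocess_sentences(sample)
print(result)
```

With NLTK installed and its data downloaded, the same function should accept `sent_tokenize` and `word_tokenize` as `sent_split` and `word_split`, and a callable such as `lambda w: WordNetLemmatizer().lemmatize(w.lower())` as `normalize` (creating the lemmatizer once outside the lambda would be cheaper). If sentences are needed as strings again, each token list can be rejoined with `" ".join(tokens)`.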

...