Python - NLTK bigrams: keep <s> and </s> as single tokens - PullRequest
0 votes
/ November 13, 2018

I am trying to write a program that calculates bigram probabilities. My first step is to build the word pairs for each sentence.

Every sentence starts with <s> and ends with </s>. Suppose my sample sentence is <s> my name is python </s>; my output should then be the following (I include the p labels because I will compute the probabilities afterwards):

p(my | <s>)
p(name | my)
p(is | name)
p(python | is)
p(</s> | python)

But instead I get this result:

Counter({('<', 's'): 1, ('s', '>'): 1, ('>', 'my'): 1, ('my', 'name'): 1, ('name', 'is'): 1, ('is', 'python'): 1, ('python', '<'): 1, ('<', '/s'): 1, ('/s', '>'): 1})

How can I keep <s> and </s> as single tokens instead of having them split apart?

My code:

import nltk
from nltk.util import ngrams
from collections import Counter

text = "<s> my name is python </s>"
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)

print(Counter(bigrams))

Edit

Suppose I have a text file:

<s> a a b b c c </s>
<s> a c b c </s>
<s> b c c a b </s>

I then open this text file, apply the following operation to each line, and store the results in a list.

# Runs once per line of the file; bigramText is a list created beforehand.
temp = re.split(r"\s+", line.rstrip('\n'))
bigramText.append(temp)

So now my list contains:

[['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
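
For reference, here is a minimal, self-contained sketch of that reading step. It assumes one sentence per line and uses io.StringIO as a stand-in for the real file; the names sample_file and bigram_text are made up for illustration. It also uses str.split() rather than re.split, since split() with no arguments splits on any run of whitespace and ignores leading and trailing whitespace.

```python
import io

# Stand-in for the real text file (assumption: one sentence per line).
sample_file = io.StringIO(
    "<s> a a b b c c </s>\n"
    "<s> a c b c </s>\n"
    "<s> b c c a b </s>\n"
)

bigram_text = []
for line in sample_file:
    tokens = line.split()  # splits on any run of whitespace
    if tokens:             # skip blank lines
        bigram_text.append(tokens)

print(bigram_text)
```

With a real file, `open(path)` would take the place of the StringIO object; the loop itself stays the same.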

At this point I want to run the calculations that give me the bigram probabilities. I don't know whether my original question helps with that, but essentially I am trying to find out how many times each of these combinations occurs, i.e. how many times one letter appears right next to another.

Answers [ 2 ]

0 votes
/ November 13, 2018

The NLTK tokenizer generally mis-segments '<s>' and '</s>'. Remove them before calling the tokenizer, then add them back after tokenization.

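import nltk
from nltk.util import ngrams

text = "<s> my name is python </s>"
# Strip the markers, tokenize the plain text, then re-attach the markers as whole tokens.
clean_text = text.replace('<s>', '').replace('</s>', '')
token = ['<s>'] + nltk.word_tokenize(clean_text) + ['</s>']
bigrams = ngrams(token, 2)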
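
If the input contains several marked sentences, stripping and re-adding the markers by hand gets awkward. One possible alternative, which is not what the snippet above does, is a small regular-expression tokenizer that matches the markers before anything else. The sketch below uses only the standard library; the pattern and the name tokenize_keep_markers are hypothetical, and its treatment of punctuation and contractions is cruder than nltk.word_tokenize.

```python
import re
from collections import Counter

# Hypothetical pattern: sentence markers first, then word characters,
# then any single non-space, non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"</?s>|\w+|[^\w\s]")

def tokenize_keep_markers(text):
    """Split text into tokens while keeping <s> and </s> intact."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_keep_markers("<s> my name is python </s>")
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(tokens)
print(bigram_counts)
```

Because the marker alternative comes first in the pattern, "<" and ">" are only emitted as separate tokens when they are not part of a marker.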

0 votes
/ November 13, 2018

You should probably write your own bigram function, provided you can split the text on whitespace (which is usually the convention: "I'm" stays "I'm" rather than being split into "I" and "'m").

def custom_bigrams(tokens):
    # Pair each token with its successor; zip stops at the shorter sequence.
    return list(zip(tokens, tokens[1:]))

print(custom_bigrams(['<s>', 'my', 'name', 'is', 'python', '</s>']))

This prints:

[('<s>', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'python'), ('python', '</s>')]

To use it on your list, compute the bigrams for each sentence and pass them to the Counter's update method.

your_list = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]

from collections import Counter

c = Counter()
for x in your_list:
    c.update(custom_bigrams(x))
print(c)

Output:

 Counter({('b', 'c'): 3, ('<s>', 'a'): 2, ('a', 'b'): 2, ('c', 'c'): 2, ('c', '</s>'): 2, ('a', 'a'): 1, ('b', 'b'): 1, ('a', 'c'): 1, ('c', 'b'): 1, ('<s>', 'b'): 1, ('c', 'a'): 1, ('b', '</s>'): 1})
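
From counts like these, the probabilities the question asks about can be estimated. The sketch below assumes the plain maximum-likelihood estimate with no smoothing: p(word | previous) is the count of the pair (previous, word) divided by the number of bigrams that start with previous. The names context_counts and bigram_probability are made up for illustration.

```python
from collections import Counter

sentences = [
    ['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'],
    ['<s>', 'a', 'c', 'b', 'c', '</s>'],
    ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>'],
]

bigram_counts = Counter()
context_counts = Counter()  # how often each token is the first element of a bigram
for sent in sentences:
    pairs = list(zip(sent, sent[1:]))
    bigram_counts.update(pairs)
    context_counts.update(first for first, _ in pairs)

def bigram_probability(word, previous):
    """Unsmoothed estimate of p(word | previous); 0.0 if previous never starts a bigram."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]

print(bigram_probability('a', '<s>'))  # 2 of the 3 sentences start with 'a'
print(bigram_probability('c', 'b'))    # 'b' is followed by 'c' in 3 of its 5 bigrams
```

Unseen pairs get probability 0 under this estimate, so a real language model would normally add some form of smoothing.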