UnicodeDecodeError when running StanfordNERTagger - PullRequest
0 votes
/ 29 January 2019

I am using the NLTK wrapper for NER tagging with the Stanford 3-class model. The raw text is BBC news written in English, and I get a UnicodeDecodeError.

Here is my code:

from nltk.tag import StanfordNERTagger
st1 = StanfordNERTagger('/home/saurabh/saurabh-cair/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz', '/home/saurabh/saurabh-cair/stanford-ner-2018-10-16/stanford-ner.jar', encoding='utf-8')
file=open('/home/saurabh/saurabh-cair/model_training/bbc/data.txt','rt')
text=file.read()
file.close()

import nltk
words = nltk.word_tokenize(text)
xyz=st1.tag(words)

for i in xyz:   
    print(i)  

I got the following error:

Traceback (most recent call last):
  File "model_english.py", line 26, in <module>
    words = nltk.word_tokenize(text)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 95, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1241, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1291, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1281, in span_tokenize
    for sl in slices:
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1322, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 314, in _pair_iter
    for el in it:
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1297, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1343, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1478, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
    prev = next(it)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 584, in _annotate_first_pass
    for aug_tok in tokens:
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 550, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)  
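For reference, byte 0xc2 is the first byte of a multi-byte UTF-8 sequence (in this text most likely the "£" sign, encoded as 0xc2 0xa3), which Python 2's default ASCII codec cannot decode. A small illustration:

```python
# -*- coding: utf-8 -*-
raw = b'\xc2\xa3'            # UTF-8 bytes for the pound sign

print(raw.decode('utf-8'))   # decodes cleanly to the pound sign

try:
    raw.decode('ascii')      # 0xc2 is outside range(128)
except UnicodeDecodeError as e:
    print(e)
```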

I tried utf-8, ascii, and the default encoding, but none of them solved my problem.

The text data contains sentences like:

General Motors of the US is to pay Fiat 1.55bn euros ($2bn; £1.1bn) to get out of a deal which could have forced it to buy the Italian car maker outright.  

I am using Anaconda Python 2.7.
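A minimal sketch of one likely fix, assuming the file is UTF-8 encoded: the `encoding='utf-8'` argument to `StanfordNERTagger` only affects the tagger, not `open()`, so on Python 2.7 `file.read()` returns raw bytes and `word_tokenize` fails on the first non-ASCII character. Reading the file with `io.open` and an explicit encoding yields unicode text instead (the path below is a stand-in for the real `data.txt`):

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical sample standing in for data.txt; the pound sign is the
# multi-byte UTF-8 sequence (0xc2 0xa3) that trips the ASCII codec.
with io.open('data.txt', 'w', encoding='utf-8') as f:
    f.write(u'Fiat 1.55bn euros ($2bn; \u00a31.1bn)')

# io.open decodes the bytes on read, so word_tokenize receives unicode
with io.open('data.txt', 'rt', encoding='utf-8') as f:
    text = f.read()

print(type(text).__name__)  # 'unicode' on Python 2, 'str' on Python 3
```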
