I am using the NLTK wrapper for NER tagging with the Stanford 3-class model. When I run it on raw BBC news text written in English, I get a UnicodeDecodeError.
Here is my code:
from nltk.tag import StanfordNERTagger
st1 = StanfordNERTagger('/home/saurabh/saurabh-cair/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz', '/home/saurabh/saurabh-cair/stanford-ner-2018-10-16/stanford-ner.jar', encoding='utf-8')
file=open('/home/saurabh/saurabh-cair/model_training/bbc/data.txt','rt')
text=file.read()
file.close()
import nltk
words = nltk.word_tokenize(text)
xyz=st1.tag(words)
for i in xyz:
    print(i)
I got an error like this:
Traceback (most recent call last):
File "model_english.py", line 26, in <module>
words = nltk.word_tokenize(text)
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 95, in sent_tokenize
return tokenizer.tokenize(text)
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1241, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1291, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1281, in span_tokenize
for sl in slices:
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1322, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 314, in _pair_iter
for el in it:
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1297, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1343, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1478, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
prev = next(it)
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 584, in _annotate_first_pass
for aug_tok in tokens:
File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 550, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
I tried utf-8, ascii, and the default encoding, but none of them solved my problem.
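The attempts looked roughly like this (a sketch from memory): decoding the byte string explicitly before tokenizing, between reading the file and calling word_tokenize:

text = text.decode('utf-8')  # also tried 'ascii' and the default (no argument)
words = nltk.word_tokenize(text)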
The text data contains sentences like:
General Motors of the US is to pay Fiat 1.55bn euros ($2bn; £1.1bn) to get out of a deal which could have forced it to buy the Italian car maker outright.
I am using Anaconda Python 2.7.
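For what it's worth, the byte 0xc2 from the traceback looks like the first byte of '£' encoded in UTF-8 (as in the £1.1bn in the sample sentence above). In Python 2, open() returns a byte str rather than unicode, and a minimal sketch reproduces the same error when such bytes are decoded implicitly as ASCII:

# '£' is the two bytes '\xc2\xa3' in UTF-8; decoding them as ASCII fails
# with the same message as in the traceback above (Python 2):
'\xc2\xa3'.decode('ascii')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)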