Python NLTK ConllCorpus Reader - PullRequest

0 votes
/ April 12, 2020

I get an error when running the following code:

import nltk 
import random 

corp = nltk.corpus.ConllCorpusReader('.', 'tigercorpus-2.2.conll09.tar.gz',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='latin1')

tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)

Here is the traceback:

Traceback (most recent call last):
  File "C:\Users\Ich\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-103-53bc424d399a>", line 7, in <module>
    tagged_sents = list(corp.tagged_sents())
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\collections.py", line 482, in __len__
    return max(len(lst) for lst in self._lists)
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\collections.py", line 482, in <genexpr>
    return max(len(lst) for lst in self._lists)
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 240, in __len__
    for tok in self.iterate_from(self._toknum[-1]):
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 306, in iterate_from
    tokens = self.read_block(self._stream)
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 227, in _read_grid_block
    for block in read_blankline_block(stream):
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 604, in read_blankline_block
    line = stream.readline()
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\data.py", line 1220, in readline
    new_chars = self._read(readsize)
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\data.py", line 1458, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Users\Ich\anaconda3\lib\site-packages\nltk\data.py", line 1489, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Users\Ich\anaconda3\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Does anyone have an idea what is going on here?

Thanks for your help!
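
One possible reading of the last traceback line, offered as a guess rather than a confirmed diagnosis: a gzip stream always starts with the two bytes 0x1f 0x8b, so "byte 0x8b in position 1" is consistent with the reader trying to decode the compressed `tigercorpus-2.2.conll09.tar.gz` archive as if it were text. If that is the cause, unpacking the archive first and pointing `ConllCorpusReader` at the extracted file should avoid the error. Below is a minimal sketch under that assumption. The helper names (`is_gzip`, `extract_conll_files`, `load_tagged_sents`) and the `extracted` directory are made up for illustration, the name of the file inside the archive is not known here and is taken from whatever the archive contains, and the correct `encoding` for the extracted file needs to be checked.

```python
import tarfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def extract_conll_files(archive, dest):
    """Unpack a .tar.gz archive into dest; return the extracted file names.

    Only extract archives you trust: tar members can carry unsafe paths.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        tar.extractall(dest, members=members)
    return [m.name for m in members]


def load_tagged_sents(root, fileids, encoding="utf8"):
    """Read (word, tag) sentences using the column layout from the question."""
    # Imported here so the helpers above also work without NLTK installed.
    from nltk.corpus.reader import ConllCorpusReader

    reader = ConllCorpusReader(
        str(root),
        fileids,
        ["ignore", "words", "ignore", "ignore", "pos"],
        encoding=encoding,  # assumption: verify against the actual file
    )
    return list(reader.tagged_sents())


if __name__ == "__main__":
    archive = "tigercorpus-2.2.conll09.tar.gz"
    if is_gzip(archive):
        names = extract_conll_files(archive, "extracted")
        tagged_sents = load_tagged_sents("extracted", names)
        print(len(tagged_sents), "sentences loaded")
```

Running `is_gzip` on the file passed to the reader is a quick way to confirm or rule out this explanation before changing anything else.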

...