http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70
If I use the following Python code to parse the HTML page above, I get a UnicodeDecodeError.
# main.py: parse the HTML read from standard input as UTF-8
import sys
from lxml import html
doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
If I first filter the input with iconv -f utf-8 -t utf-8 -c and then run the same Python code, I still get a UnicodeDecodeError, this time at a different byte:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte
What is a robust filter (one that requires no knowledge of the HTML input's encoding) whose output always works with the Python code above? Thanks.
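To make concrete what kind of filter is meant, here is a minimal sketch using only the standard library. It assumes that lossy output is acceptable: every byte sequence that is not valid UTF-8 is replaced with U+FFFD, so text in another encoding is not recovered, only made decodable. The names sanitize_utf8 and main are hypothetical, chosen for illustration.

```python
import sys


def sanitize_utf8(data: bytes) -> bytes:
    """Return bytes that always decode as strict UTF-8.

    Undecodable bytes become U+FFFD, so the result is lossy whenever
    the input was not UTF-8 to begin with.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")


def main() -> None:
    # Use the binary buffers so Python's text layer never decodes stdin.
    sys.stdout.buffer.write(sanitize_utf8(sys.stdin.buffer.read()))
```

Saved as a script (say, a hypothetical sanitize.py that calls main() at the bottom), it could be placed in the pipeline in front of ./main.py in the same way iconv is used here.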
EDIT: here are the commands I used.
$ wget 'http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70'
$ ./main.py < 'view.html?doi=10.15430%2FJCP.2018.23.2.70'
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
$ iconv -f utf-8 -t utf-8 -c < 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | ./main.py
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte
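For reference, both error messages can be reproduced in isolation with Python's strict UTF-8 decoder. The sketch below uses made-up byte strings, not the bytes of the downloaded page, and the helper name describe_utf8_error is hypothetical. It only shows what the two messages mean; it makes no claim about which encoding the page actually uses.

```python
def describe_utf8_error(data: bytes):
    """Return (offset, reason) of the first strict UTF-8 decoding error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset; exc.reason is the text after the colon.
        return exc.start, exc.reason
    return None


# 0xb0 is a continuation byte, so it cannot begin a character.
print(describe_utf8_error(b"ok\xb0"))
# 0xed opens a three-byte sequence, but "A" is not a continuation byte.
print(describe_utf8_error(b"\xedA"))
# Valid UTF-8 produces no error.
print(describe_utf8_error("한".encode("utf-8")))
```

Note also that the last frames of both tracebacks are in codecs.py: the bytes are being decoded by the text-mode sys.stdin stream as lxml reads from it.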