I'm using pdfminer to convert a PDF to text, but some digits, some words, and characters such as '[' are missing from the output.
import sys
import importlib
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

path = r'C:\Users\User\Desktop\2002\1999-66.pdf'

def parse():
    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument()
    # Link the parser and the document to each other, then load the document
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize()
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in doc.get_pages():
        # Render the page so the aggregator has a layout to return
        interpreter.process_page(page)
        layout = device.get_result()
        for x in layout:
            if isinstance(x, LTTextBoxHorizontal):
                # Append each horizontal text box to the output file
                with open(r'C:/Users/User/Desktop/2002/3.txt', 'a') as f:
                    results = x.get_text()
                    f.write(results + '\n')

if __name__ == '__main__':
    parse()
I expect the output to be:
[BP] Sergey Brin and Larry Page. Google search engine. http://google.stanford.edu.
[CGMP98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Ecient crawling through
URL ordering. In To Appear: Proceedings of the Seventh International Web Conference
(WWW 98), 1998.
[Gar95] Eugene Gareld. New international professional society signals the maturing of scientometrics and informetrics. The Scientist, 9(16), Aug 1995. http://www.the-scientist.
[Gof71] William Goman. A mathematical method for analyzing the growth of a scientic
discipline. Journal of the ACM, 18(2):173{185, April 1971.
But the actual output is:
Sergey Brin and Larry Page. Google search engine. http:google.stanford.edu.
CGMP Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. E cient crawling through
url ordering. In To Appear: Proceedings of the Seventh International Web Conference
WWW , .
Eugene Gar eld. New international professional society signals the maturing of sciento-
metrics and informetrics. The Scientist, , Aug . http:www.the-scientist.
library.upenn.eduyr augustissi_ .html.
Gof William Go man. A mathematical method for analyzing the growth of a scienti c
discipline. Journal of the ACM, :, April .