Using pdfminer, but some words or numbers are not shown
October 9, 2019

I am using pdfminer to convert a PDF to text, but some digits, some words, and the '[' character do not appear in the output.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams

path = r'C:\Users\User\Desktop\2002\1999-66.pdf'
out_path = r'C:/Users/User/Desktop/2002/3.txt'

def parse():
    # Keep both files open for the whole run: the parser reads from fp
    # while pages are processed, and every text box is appended to f.
    with open(path, 'rb') as fp, open(out_path, 'a', encoding='utf-8') as f:
        # Link the parser and the document to each other, then load the PDF.
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize()
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed

        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in doc.get_pages():
            interpreter.process_page(page)
            layout = device.get_result()
            # Print and save only the horizontal text boxes on the page.
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    results = x.get_text()
                    print(results)
                    f.write(results)

if __name__ == '__main__':
    parse()

I expect the output to be:

[BP] Sergey Brin and Larry Page. Google search engine. http://google.stanford.edu.
[CGMP98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Ecient crawling through
URL ordering. In To Appear: Proceedings of the Seventh International Web Conference
(WWW 98), 1998.
[Gar95] Eugene Gareld. New international professional society signals the maturing of scientometrics and informetrics. The Scientist, 9(16), Aug 1995. http://www.the-scientist.
library.upenn.edu/yr1995/august/issi_950821.ht%ml.
[Gof71] William Goman. A mathematical method for analyzing the growth of a scientic
discipline. Journal of the ACM, 18(2):173{185, April 1971.

But the actual output is:

BP
Sergey Brin and Larry Page. Google search engine. http:google.stanford.edu.
CGMP     Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. E cient crawling through
url ordering. In To Appear: Proceedings of the Seventh International Web Conference
WWW     ,       .
Gar 
Eugene Gar eld. New international professional society signals the maturing of sciento-
metrics and informetrics. The Scientist,    , Aug       . http:www.the-scientist.
library.upenn.eduyr     augustissi_  .html.
Gof William Go man. A mathematical method for analyzing the growth of a scienti c
discipline. Journal of the ACM, :, April    .
...
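
To pin down exactly which characters are being dropped, a small comparison like the sketch below may help. It is only a diagnostic aid, not a fix: `missing_characters` is a hypothetical helper (not part of pdfminer) that ignores whitespace and counts the characters that occur in the expected text but not in the extracted text. The two sample strings are the first reference entry from the outputs above.

```python
from collections import Counter

def missing_characters(expected: str, actual: str) -> Counter:
    """Count non-whitespace characters that occur in `expected` more often than in `actual`."""
    expected_counts = Counter(ch for ch in expected if not ch.isspace())
    actual_counts = Counter(ch for ch in actual if not ch.isspace())
    # Counter subtraction keeps only the positive differences.
    return expected_counts - actual_counts

# First reference entry, copied from the expected and actual outputs above.
expected = "[BP] Sergey Brin and Larry Page. Google search engine. http://google.stanford.edu."
actual = "BP\nSergey Brin and Larry Page. Google search engine. http:google.stanford.edu."

for ch, count in sorted(missing_characters(expected, actual).items()):
    print(repr(ch), count)
```

For this entry the helper reports two '/' characters, one '[' and one ']'. Running it over the other entries should show whether the lost characters are limited to digits and punctuation, which would suggest the problem lies with how particular fonts in the PDF are decoded rather than with the layout analysis.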