Parsing PDF text with Python - PullRequest
0 votes
/ November 11, 2019

I have several PDF files, each with multiple pages. I want to extract only the required information from the full text. I managed to read the text and get it into a list, but I could not find a way to extract the required fields. Below is the code I have been able to write so far:

import PyPDF2
import io
import re
import pandas as pd

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
mypdf = open('C:/XXXX/XXXXX/Desktop/7-29-19 Office Availabilities 1.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)

entry=[]
for page in PDFPage.get_pages(mypdf, 
                              caching=True,
                              check_extractable=True):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()

    entry.append(text)
    # close open handles
    converter.close()
    fake_file_handle.close()


Flyer= [x.split() for x in entry if x.startswith('FL')]
print(Flyer)

Below is the output I have been able to get so far:

["FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible1) 104-112 E 1st St - Sanford, FL 32771Rand Complex-40,000 SF Class C Office Building  Renovated in 1988 Built in 1910Hotard RealtyMarie Hotard (407) 467-5397Building Notes:-7,0001 yr2ndVacantOffice/N$7.80/mgN15 MthsMarie Hotard (407) 467-5397PHotard RealtyCall to negotiate renovation needs. Located in the heart of downtown Sanford's historic district, over 5,000 square feet ofoffice space on the second floor above bustling First Street. Historical building with great potential.2) 110 W 1st St - Sanford, FL 32771The Welaka Building-25,797 SF Class B Loft/Creative Space Building  Renovated in 1997 Built in 1887Brenner Real Estate E.Charles E. Brenner (407) 677-1700Building Notes:-366Negotiable2nd / Suite 214VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate198Negotiable2nd / Suite 234VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,276Negotiable2nd / Suite 240VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate743Negotiable2nd / Suite 242VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,357Negotiable2nd / Suite 246VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate720Negotiable2nd / Suite 250VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real EstateCopyrighted report licensed to CBRE - 759852.7/29/2019Page 1\x0c",
 'FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible3) 734 N 3rd St - Leesburg, FL 347483rd Street Office Park-21,472 SF Class B Office Building  Built in 1974Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:City of Leesburg services and utilities including high speed DSL, fiber optic internet and phone systems all pre-wired and marked from central control room.ADA bathrooms & private executive and managerial offices.Security and central fire alarm system installed.1,564Negotiable1st / Suite Space 1VacantOffice/D$14.00/mgN29 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC4) 900 N 14th St - Leesburg, FL 34748Trophy Leesburg Offices-40,302 SF Class B Office Building  Built in 1981Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:Building has 24 hour access.Join GSA and the Social Security Administration in this great Leesburg Office.  Current layout is perfect for high density office user.  Small modifications can bemade to accommodate anything from a small college to a large medical tenant.Great Central Florida location along busy US 27 and not far from The Villages.5,150Negotiable2ndVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC2nd Floor maybe divided into smaller units11,248Negotiable3rdVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC3rd Floor has 3 units that maybe divided or combined.Copyrighted report licensed to CBRE - 759852.7/29/2019Page 2\x0c',

Desired output:

['Flyer Number',    'Address',  'Total SF', 'Class',    'Suite/Bldg',   'SF available', 'Rent/SF/Year', 'Term', 'Occupancy',    'User/Type',    'Leasing company',  'Contact',  'Listed',   'Divisible',
'FL 32771', '104-112 E 1st St - Sanford',   '40,000 SF',    'C',    'P 2nd',    '7000', '$7.80/mg', '1 yr', 'Vacant',   'Office/N', 'Hotard Realty',    'Marie Hotard (407) 467-5397',  '15 Mths',  'N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd/Suite 214',  '366',  '$22.00/fs ',   'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 234 ',   '198',  '$22.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 240',    '1276', '$16.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 242',    '743',  '$18.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 246',    '1357', '$16.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 250',    '720',  '$18.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N']

Please help!

1 Answer

0 votes
/ November 11, 2019
list = ["FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible1) 104-112 E 1st St - Sanford, FL 32771Rand Complex-40,000 SF Class C Office Building  Renovated in 1988 Built in 1910Hotard RealtyMarie Hotard (407) 467-5397Building Notes:-7,0001 yr2ndVacantOffice/N$7.80/mgN15 MthsMarie Hotard (407) 467-5397PHotard RealtyCall to negotiate renovation needs. Located in the heart of downtown Sanford's historic district, over 5,000 square feet ofoffice space on the second floor above bustling First Street. Historical building with great potential.2) 110 W 1st St - Sanford, FL 32771The Welaka Building-25,797 SF Class B Loft/Creative Space Building  Renovated in 1997 Built in 1887Brenner Real Estate E.Charles E. Brenner (407) 677-1700Building Notes:-366Negotiable2nd / Suite 214VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate198Negotiable2nd / Suite 234VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,276Negotiable2nd / Suite 240VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate743Negotiable2nd / Suite 242VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,357Negotiable2nd / Suite 246VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate720Negotiable2nd / Suite 250VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real EstateCopyrighted report licensed to CBRE - 759852.7/29/2019Page 1\x0c",
'FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible3) 734 N 3rd St - Leesburg, FL 347483rd Street Office Park-21,472 SF Class B Office Building  Built in 1974Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:City of Leesburg services and utilities including high speed DSL, fiber optic internet and phone systems all pre-wired and marked from central control room.ADA bathrooms & private executive and managerial offices.Security and central fire alarm system installed.1,564Negotiable1st / Suite Space 1VacantOffice/D$14.00/mgN29 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC4) 900 N 14th St - Leesburg, FL 34748Trophy Leesburg Offices-40,302 SF Class B Office Building  Built in 1981Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:Building has 24 hour access.Join GSA and the Social Security Administration in this great Leesburg Office.  Current layout is perfect for high density office user.  Small modifications can bemade to accommodate anything from a small college to a large medical tenant.Great Central Florida location along busy US 27 and not far from The Villages.5,150Negotiable2ndVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC2nd Floor maybe divided into smaller units11,248Negotiable3rdVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC3rd Floor has 3 units that maybe divided or combined.Copyrighted report licensed to CBRE - 759852.7/29/2019Page 2\x0c']
Requirements = ['Flyer Number',    'Address',  'Total SF', 'Class',    'Suite/Bldg',   'SF available', 'Rent/SF/Year', 'Term', 'Occupancy',    'User/Type',    'Leasing company',  'Contact',  'Listed',   'Divisible']
OutputList = []
for item in list:
    FlyerNumber = "FL" + (item.split("FL",1)[1])[:6]
    Address = (item.split(") ",1)[1]).split(",",1)[0]
    Squarefeet = str((item.split("SF Class",1)[0]).rsplit("-",1)[1]) + "SF"
    Class = (item.split("Class ",1)[1])[0]
    CollectedData = [FlyerNumber,Address,Squarefeet,Class]
    OutputList.extend(CollectedData)
print(OutputList)
FinalList = Requirements + OutputList
print(FinalList)

Printing OutputList gives you:

['FL 32771', '104-112 E 1st St - Sanford', '40,000 SF', 'C', 'FL 34748', '734 N 3rd St - Leesburg', '21,472 SF', 'B']

Printing FinalList gives you:

['Flyer Number', 'Address', 'Total SF', 'Class', 'Suite/Bldg', 'SF available', 'Rent/SF/Year', 'Term', 'Occupancy', 'User/Type', 'Leasing company', 'Contact', 'Listed', 'Divisible', 'FL 32771', '104-112 E 1st St - Sanford', '40,000 SF', 'C', 'FL 34748', '734 N 3rd St - Leesburg', '21,472 SF', 'B']
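Note that both lists hold only the first building on each page: every `split(..., 1)` stops at the first occurrence of its marker, so listings 2) and 4) from the sample output are skipped. One possible way around this is a regular expression applied with `re.finditer`. The sketch below is only an illustration: the pattern, the `extract_headers` helper and the `SAMPLE_PAGE` excerpt are assumptions inferred from the two pages quoted in the question, and the pattern will likely need adjusting for other reports.

```python
import re

# Assumed header layout, inferred from the two sample pages only:
#   "<n>) <address>, <ST> <zip><building name>-<total> SF Class <letter>"
HEADER_RE = re.compile(
    r"(?<![\d(])(?P<number>\d+)\) "   # listing number such as "1) ", but not "(407) "
    r"(?P<address>[^,]+?), "
    r"(?P<state_zip>[A-Z]{2} \d{5})"
    r"(?P<name>.*?)-"
    r"(?P<total_sf>[\d,]+ SF) Class (?P<building_class>[A-Z])"
)


def extract_headers(page_text):
    """Return [state_zip, address, total_sf, class] for every listing on a page."""
    return [
        [m["state_zip"], m["address"], m["total_sf"], m["building_class"]]
        for m in HEADER_RE.finditer(page_text)
    ]


# Shortened excerpt of the first page shown in the question.
SAMPLE_PAGE = (
    "ContactListedDivisible1) 104-112 E 1st St - Sanford, FL 32771"
    "Rand Complex-40,000 SF Class C Office Building  Renovated in 1988 "
    "Built in 1910Hotard RealtyMarie Hotard (407) 467-5397Building Notes:-"
    "7,0001 yr2ndVacantOffice/N$7.80/mgN15 MthsMarie Hotard (407) 467-5397"
    "PHotard RealtyHistorical building with great potential."
    "2) 110 W 1st St - Sanford, FL 32771The Welaka Building-25,797 SF "
    "Class B Loft/Creative Space Building"
)

for row in extract_headers(SAMPLE_PAGE):
    print(row)
```

The negative lookbehind keeps phone area codes such as "(407) " from being mistaken for listing numbers; it also means a listing number that directly follows a digit would be missed.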

I found that going through all of the Requirements was taking me a lot of time, so here is only a part of them, from Flyer Number to Class. Make sure the layout of your data does not change, otherwise the output may change as well.
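For the remaining per-suite fields (SF available, term, floor/suite, occupancy, use/type, rent, divisible, listed, contact), a regular expression is again one option. The following is a rough sketch, not a tested solution for the whole report: `SUITE_RE`, `extract_suites` and the field names are hypothetical, and the pattern assumes what the sample pages show (comma-grouped square footage, a term of "Negotiable" or "N yr", occupancy always "Vacant", use/type starting with "Office/").

```python
import re

# Assumed suite-row layout, inferred from the sample pages only:
#   <SF avail><term><floor/suite>Vacant[Partial Build-Out]<use><rent><Y|N><listed><contact>
SUITE_RE = re.compile(
    r"(?P<sf_avail>\d{1,3}(?:,\d{3})*)"      # "366" or "7,000" (assumes comma grouping)
    r"(?P<term>Negotiable|\d+ yrs?)"
    r"(?P<floor>.{1,40}?)"                   # e.g. "2nd" or "2nd / Suite 214"
    r"(?P<occupancy>Vacant)"
    r"(?:Partial Build-Out)?"
    r"(?P<use>Office/[A-Z])"
    r"(?P<rent>\$\d+\.\d{2}/[a-z]{2})"       # e.g. "$22.00/fs"
    r"(?P<divisible>[YN])"
    r"(?P<listed>\d+ (?:Mths|Wks))"
    r"(?P<contact>.+?\(\d{3}\) \d{3}-\d{4})"
)


def extract_suites(page_text):
    """Return one dict of fields per suite row found in the page text."""
    return [m.groupdict() for m in SUITE_RE.finditer(page_text)]


# Two suite rows copied from the first page shown in the question.
SAMPLE_SUITES = (
    "Building Notes:-366Negotiable2nd / Suite 214VacantOffice/D$22.00/fsN4 Wks"
    "Charles E. Brenner (407) 677-1700PBrenner Real Estate"
    "198Negotiable2nd / Suite 234VacantOffice/D$22.00/fsN4 Wks"
    "Charles E. Brenner (407) 677-1700PBrenner Real Estate"
)

for suite in extract_suites(SAMPLE_SUITES):
    print(suite)
```

Because the PDF text has no separators between fields, a pattern like this is fragile: a square footage written without a comma (for example "1500") or an occupancy other than "Vacant" would be parsed incorrectly or skipped.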

The code that has problems:

import PyPDF2
import io
import re
import pandas as pd

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
mypdf = open('C:/Users/renu.sharma/Desktop/7-29-19 Office Availabilities 1.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)

entry=[]
for page in PDFPage.get_pages(mypdf,  
                              caching=True, 
                              check_extractable=True):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO() 
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter) 
    page_interpreter.process_page(page)
    text = fake_file_handle.getvalue() 

    entry.append(text) 
    # close open handles
    converter.close()
    fake_file_handle.close()

Requirements = ['Flyer Number',    'Address',  'Total SF', 'Class',    'Suite/Bldg',   'SF available', 'Rent/SF/Year', 'Term', 'Occupancy',    'User/Type',    'Leasing company',  'Contact',  'Listed',   'Divisible']
OutputList = []
for item in entry:
    FlyerNumber = item.split("FL",1)[1])[:6]
    Address = (item.split(") ",1)[1]).split(",",1)[0]
    Squarefeet = str((item.split("SF Class",1)[0]).rsplit("-",1)[1]) + "SF"
    Class = (item.split("Class ",1)[1])[0]
    CollectedData = [FlyerNumber,Address,Squarefeet,Class]
    OutputList.extend(CollectedData)
FinalList = Requirements + OutputList 

print(OutputList)
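As written, the `FlyerNumber` line in this snippet has a closing parenthesis with no matching opening one, so Python raises a SyntaxError before anything runs; it also drops the "FL" prefix that the earlier version of the loop added. A minimal corrected sketch of the loop body follows. The `first_listing_fields` helper and the IndexError guard are additions for illustration, not part of the original code.

```python
def first_listing_fields(item):
    """Split-based extraction of the first listing on a page (hypothetical helper)."""
    try:
        flyer_number = "FL" + item.split("FL", 1)[1][:6]
        address = item.split(") ", 1)[1].split(",", 1)[0]
        square_feet = item.split("SF Class", 1)[0].rsplit("-", 1)[1] + "SF"
        building_class = item.split("Class ", 1)[1][0]
    except IndexError:
        # One of the markers ("FL", ") ", "-" or "Class ") is missing on this page.
        return None
    return [flyer_number, address, square_feet, building_class]


# Shortened excerpt of the first page shown in the question.
sample = (
    "ContactListedDivisible1) 104-112 E 1st St - Sanford, FL 32771"
    "Rand Complex-40,000 SF Class C Office Building"
)
print(first_listing_fields(sample))
```

In the original loop this would be used as `fields = first_listing_fields(item)`, extending `OutputList` only when `fields` is not None.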