I am trying to extract the text boxes from a large directory of .pptx files. The script below works fine for some PowerPoint presentations:
from pptx import Presentation
import glob

f = open("Scraped PPTX Data.txt", "a", encoding='utf-8')
for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                f.write(shape.text)
f.close()
Yet for many others (seemingly the very large ones), I get this huge wall of errors:
  File "C:\Users\GLD-POS3\Desktop\SIGNS\PPT_Scraper.py", line 9, in <module>
    prs = Presentation(eachfile)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 37, in from_file
    phys_reader, pkg_srels, content_types
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 70, in _load_serialized_parts
    for partname, blob, srels in part_walker:
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 106, in _walk_phys_parts
    phys_reader, part_srels, visited_partnames
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 106, in _walk_phys_parts
    phys_reader, part_srels, visited_partnames
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 103, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\phys_pkg.py", line 111, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 1432, in read
    return fp.read()
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 885, in read
    buf += self._read1(self.MAX_N)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 989, in _read1
    self._update_crc(data)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 917, in _update_crc
    raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'ppt/media/image170.jpeg'
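
The last frame shows the failure coming from the standard-library zipfile module rather than from python-pptx itself: a .pptx file is a ZIP archive, and the stored checksum of one embedded image does not match its data. A minimal diagnostic sketch, assuming the goal is only to find out which files in the directory are affected, is to run zipfile's own integrity check on each archive. The helper name first_bad_member is hypothetical and not part of the original script.

```python
import glob
import zipfile


def first_bad_member(path):
    """Return the name of the first archive member that fails its CRC
    check, or None if every member reads back cleanly.

    Hypothetical helper for diagnosis only; it does not repair anything.
    """
    with zipfile.ZipFile(path) as zf:
        # testzip() reads every member and returns the first bad name.
        return zf.testzip()


if __name__ == "__main__":
    for eachfile in glob.glob("*.pptx"):
        try:
            bad = first_bad_member(eachfile)
        except zipfile.BadZipFile as exc:
            # The archive could not be opened at all.
            print(eachfile, "-> not a readable ZIP archive:", exc)
            continue
        if bad is not None:
            print(eachfile, "-> bad CRC-32 in", bad)
```

Files reported by this check would presumably be the same ones that make Presentation(eachfile) fail, since python-pptx reads every part of the package when it opens it.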