I'm using Calibre to convert a PDF file to an EPUB file, but the results are messy and hard to read. An EPUB file is really just a collection of HTML files, and the conversion output is messy because Calibre interprets each line of the PDF file as a <p> element, which creates a lot of ugly line breaks in the EPUB file.
Since an EPUB is really a collection of HTML files, it can be parsed with Beautiful Soup. However, the program I wrote to find elements with the class "calibre1" (a normal paragraph) and combine them into a single element (so that there are no ugly line breaks) doesn't work, and I can't figure out why.
Can Beautiful Soup handle what I'm trying to do?
import os
from bs4 import BeautifulSoup

path = "C:\\Users\\Eunice\\Desktop\\eBook"
for pathname, directorynames, filenames in os.walk(path):
    # Get all HTML files in the target directory
    for file_name in filenames:
        # Open each HTML file, which is encoded using the "Latin1" encoding scheme
        with open(pathname + "\\" + file_name, 'r', encoding="Latin1") as file:
            # Create a list, which we will write our new HTML tags to later
            html_elem_list: list = []
            # Create a BS4 object
            soup = BeautifulSoup(file, 'html.parser')
            # Create a list of all BS4 elements, which we will traverse in the proceeding loop
            html_elements = [x for x in soup.find_all()]
            for html_element in html_elements:
                try:
                    # Find the element with a class called "calibre1," which is how Calibre designates normal body text in a book
                    if html_element.attrs['class'][0] in 'calibre1':
                        # Combine the next element with the previous element if both elements are part of the same body text
                        if html_elem_list[-1].attrs['class'][0] in 'calibre1':
                            # Remove nonbreaking spaces from this element before adding it to our list of elements
                            html_elem_list[-1].string = html_elem_list[-1].text.replace(
                                '\n', ' ') + html_element.text
                    # This element must not be of the "calibre1" class, so add it to the list of elements without combining it with the previous element
                    else:
                        html_elem_list.append(html_element)
                # This element must not have any class, so add it to the list of elements without combining it with the previous element
                except KeyError:
                    html_elem_list.append(html_element)
            # Create a string literal, which we will eventually write to our resultant file
            str_htmlfile = ''
            # For each element in the list of HTML elements, append the string representation of that element (which will be a line of HTML code) to the string literal
            for elem in html_elem_list:
                str_htmlfile = str_htmlfile + str(elem)
        # Create a new file with a distinct variation of the name of the original file, then write the resultant HTML code to that file
        with open(pathname + "\\" + '_modified_' + file_name, 'wb') as file:
            file.write(str_htmlfile.encode('Latin1'))
Here is the input:
<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html>
Here is what I expect:
<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.642</p>
</body></html>
Here is the actual output:
<html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html><body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body><p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
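For reference, tracing the script against the sample input suggests three things that would produce this output. First, `soup.find_all()` returns every tag, including the containers `html` and `body`, so converting each list entry to a string repeats the nested content. Second, `in 'calibre1'` is a substring test, so the class `calibre` on `body` also passes it. Third, a `calibre1` paragraph that follows a non-`calibre1` element is never appended to the list. Below is a minimal sketch of one alternative, not a definitive fix: it edits the tree in place and serializes the whole document once. The function name `merge_paragraphs` and its `css_class` parameter are made up for illustration, and the sketch assumes the paragraphs to merge are direct siblings. It joins the pieces with a space, so the trailing page number would come out as "slip. 642" rather than "slip.642".

```python
from bs4 import BeautifulSoup


def merge_paragraphs(markup: str, css_class: str = "calibre1") -> str:
    """Merge runs of adjacent <p> siblings carrying css_class into one <p>.

    Hypothetical helper for illustration; assumes the paragraphs are siblings.
    """
    soup = BeautifulSoup(markup, "html.parser")

    # find_all returns a materialized list, so removing tags while
    # looping over it does not disturb the iteration.
    for p in soup.find_all("p", class_=css_class):
        # Nearest previous sibling that is a tag (whitespace strings are skipped).
        prev = p.find_previous_sibling()
        if prev is None or prev.name != "p":
            continue
        # "class" is parsed as a list of class names; compare exactly, not by substring.
        if css_class not in prev.get("class", []):
            continue

        # Move this paragraph's children into the previous one, separated by a space.
        prev.append(" ")
        for child in list(p.contents):
            prev.append(child)
        # Drop the now-empty paragraph from the tree.
        p.decompose()

    # Serialize the whole document once instead of element by element.
    return str(soup)
```

To apply it to a file, read the file with `encoding="Latin1"`, pass the text to `merge_paragraphs`, and write the returned string back with the same encoding.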