I have XML structured like this:
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="191.745,592.218,249.042,603.5">
<textline bbox="68.031,540.828,372.755,553.566">
<new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
<text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="284.964,553.628,290.760,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">v</text>
<text font="NUMPTY+ImprintMTnum" bbox="290.382,553.628,295.477,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="295.333,553.628,301.707,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
<text font="NUMPTY+ImprintMTnum" bbox="301.563,553.628,305.390,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
<text font="NUMPTY+ImprintMTnum" bbox="305.245,553.628,311.620,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="311.475,553.628,315.992,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="315.847,553.628,320.942,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="320.798,553.628,324.625,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
<text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="331.445,553.639,337.241,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="337.097,553.639,340.924,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="340.312,553.639,343.560,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="343.416,553.639,346.319,566.366" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
<text font="NUMPTY+ImprintMTnum" bbox="365.139,553.628,368.387,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">-</text>
<text font="NUMPTY+ImprintMTnum" bbox="68.031,540.828,72.548,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="72.404,540.828,77.499,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="77.354,540.828,81.871,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="81.726,540.828,84.975,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="84.830,540.828,89.925,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
</new_line>
</textline>
<textline bbox="68.031,528.028,372.758,540.766">
<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="106.735,540.828,113.110,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">d</text>
<text font="NUMPTY+ImprintMTnum" bbox="112.965,540.828,118.061,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="117.916,540.828,121.164,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="121.020,540.828,124.268,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="124.124,540.828,129.219,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="129.074,540.828,131.978,553.310" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
</new_line>
</textline>
</textbox>
</page>
</pages>
It is the conversion of this part of the PDF (highlighted in the screenshot):
I have code that starts a new line every time a new text segment begins (a large gap between words, shown in the picture above; in our case, between "Rodri go" and " 1 "). I detect this using the "bbox" attribute in the XML. The code works as far as that goes, but it does not take hyphens into account, and a hyphen at the end of a printed line means the word continues on the next one. So I need "mi-seria" to stay together inside one new_line tag, with the next new_line starting at "della". If there is no hyphen, a new line should start.
I have a workaround that sort of works, but I need something more precise. So far, my output is correct for this segment:
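To show what "distance" means here, this is a minimal sketch of the measurement on three glyphs from the XML above ("i", "-", and the "s" that starts the next printed line). It assumes the distance is the difference between the first bbox values of consecutive glyphs; the helper names `bbox_x0` and `x0_deltas` are placeholders for illustration only.

```python
def bbox_x0(bbox):
    """Return the left edge (first value) of an 'x0,y0,x1,y1' string."""
    return float(bbox.split(",")[0])


def x0_deltas(bboxes):
    """Difference between each glyph's left edge and the previous glyph's."""
    xs = [bbox_x0(b) for b in bboxes]
    return [cur - prev for prev, cur in zip(xs, xs[1:])]


# bbox values copied from the XML above: "i", "-", then "s" on the next printed line.
sample = [
    "365.139,553.628,368.387,566.110",
    "368.242,553.628,372.759,566.110",
    "68.031,540.828,72.548,553.310",
]
deltas = x0_deltas(sample)
```

Inside the word the difference is about 3 points, while at the wrap after the hyphen it is about -300, so a negative difference alone cannot tell a hyphenated continuation from a real line start.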
<textline bbox="68.031,540.828,372.755,553.566">
<new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
<text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="284.964,553.628,290.760,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">v</text>
<text font="NUMPTY+ImprintMTnum" bbox="290.382,553.628,295.477,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="295.333,553.628,301.707,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
<text font="NUMPTY+ImprintMTnum" bbox="301.563,553.628,305.390,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
<text font="NUMPTY+ImprintMTnum" bbox="305.245,553.628,311.620,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="311.475,553.628,315.992,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="315.847,553.628,320.942,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="320.798,553.628,324.625,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
<text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="331.445,553.639,337.241,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="337.097,553.639,340.924,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="340.312,553.639,343.560,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="343.416,553.639,346.319,566.366" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
<text font="NUMPTY+ImprintMTnum" bbox="365.139,553.628,368.387,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">-</text>
<text font="NUMPTY+ImprintMTnum" bbox="68.031,540.828,72.548,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="72.404,540.828,77.499,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="77.354,540.828,81.871,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="81.726,540.828,84.975,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="84.830,540.828,89.925,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
</new_line>
</textline>
<textline bbox="68.031,528.028,372.758,540.766">
<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="106.735,540.828,113.110,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">d</text>
<text font="NUMPTY+ImprintMTnum" bbox="112.965,540.828,118.061,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="117.916,540.828,121.164,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="121.020,540.828,124.268,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="124.124,540.828,129.219,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
</new_line>
</textline>
</textbox>
But I have no code that says: "insert a new line only if the distance is < 0 (that is, a negative number) and there is no hyphen; otherwise do not".
My code so far:
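To make that rule concrete, here is a rough sketch of the check I have in mind, written on plain (character, x0) pairs rather than lxml elements. Everything in it is an assumption for illustration: the names `starts_new_line`, `group_glyphs` and `HYPHENS` are hypothetical, the threshold of 20 is simply the value my current code uses, and the set of hyphen characters is a guess.

```python
# Characters treated as an end-of-line hyphen (assumed set, adjust as needed).
HYPHENS = {"-", "\u00ad", "\u2010"}


def starts_new_line(prev_char, prev_x0, cur_x0, gap_threshold=20.0):
    """Decide whether the current glyph should open a new group."""
    delta = cur_x0 - prev_x0
    if delta > gap_threshold:
        # Large gap to the right: a new text segment begins.
        return True
    if delta < 0:
        # Jump back to the left: break unless the previous glyph was a hyphen.
        return prev_char not in HYPHENS
    return False


def group_glyphs(glyphs, gap_threshold=20.0):
    """Concatenate (char, x0) pairs into groups, one string per group."""
    groups = []
    prev = None
    for char, x0 in glyphs:
        if prev is None or starts_new_line(prev[0], prev[1], x0, gap_threshold):
            groups.append("")
        groups[-1] += char
        prev = (char, x0)
    return groups


# x0 values copied from the XML above: "mi-" / "seria" / "de".
sample = [("m", 355.660), ("i", 365.139), ("-", 368.242),
          ("s", 68.031), ("e", 72.404), ("r", 77.354),
          ("i", 81.726), ("a", 84.830),
          ("d", 106.735), ("e", 112.965)]
groups = group_glyphs(sample)
```

On these values "mi-seria" stays in one group and "de" starts the next, which is the behaviour I want; what I am missing is how to fit such a check cleanly into the lxml loop.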
import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('fe3.xml', parser)
root = tree.getroot()

# Get the first bbox value as a float
# Return None if not found
def getBBoxFirstValue(line):
    if line is not None:
        bb = line.attrib.get('bbox')
        if bb is not None:
            try:
                return float(bb.split(",")[0])
            except ValueError:
                pass
    return None

new_line = None
previous_bb = None
for x in tree.xpath('//text'):
    # Get the current bb value
    bb = getBBoxFirstValue(x)
    # Check that the current and previous values aren't empty
    if bb is not None and previous_bb is not None:
        #print(bb, previous_bb, (bb-previous_bb))
        #print(abs(bb-previous_bb))
        # If the distance from the previous bb is > 20 (or a very large negative jump)
        if (bb - previous_bb) > 20 or (bb - previous_bb) < -1000:
            # If new_line isn't empty, insert it into the parent tag at the position of the current tag
            if new_line is not None:
                x.getparent().insert(x.getparent().index(x), new_line)
            # Create a fresh "new_line" element
            new_line = etree.Element("new_line")
    # If a new_line exists (i.e. one distance > 20 has already been found), move the tag into it
    if new_line is not None:
        new_line.append(x)
    # Keep the latest non-empty first bbox value
    if bb is not None:
        previous_bb = bb

# Add the last new_line element if it isn't None
if new_line is not None:
    tree.xpath('//text')[-1].getparent().append(new_line)

newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("UTF-8")
print(newtree)
with open("output.xml", "wb") as f:
    f.write(newtree)
To sum up: how can I improve my code so that it inserts a new_line tag whenever the given distance is exceeded, but keeps the word together when there is a hyphen, and splits by opening a new new_line tag when there is no hyphen?