I have an XML file structured like this:
<?xml version="1.0" encoding="utf-8"?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="6" bbox="68.031,502.428,372.824,566.366">
<textline bbox="68.031,553.628,372.759,566.366">
<text font="PYNIYO+ImprintMTnum-Italic" bbox="68.031,553.639,76.375,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">T</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="76.231,553.639,79.479,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">i</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="79.334,553.639,83.161,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">t</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="83.017,553.639,88.112,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="87.968,553.639,91.216,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">l</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="91.071,553.639,96.167,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
</textline>
</textbox>
<textbox id="7" bbox="68.031,449.028,372.743,487.366">
<textline bbox="68.031,474.628,372.663,487.366">
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,475.072,72.786,485.757" colourspace="DeviceGray" ncolour="0" size="10.685">4</text>
<text font="NUMPTY+ImprintMTnum" bbox="75.718,474.628,84.061,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="83.883,474.628,90.258,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="90.080,474.628,95.175,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="94.997,474.628,98.246,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="98.068,474.628,103.163,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="102.985,474.628,105.889,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
<text font="NUMPTY+ImprintMTnum" bbox="107.991,474.628,119.116,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">…</text>
<text font="NUMPTY+ImprintMTnum" bbox="118.938,474.628,121.842,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
</textline>
</textbox>
</page>
</pages>
I would like to open a parent element new_line whenever the first value of the bbox attribute of a text tag is a certain distance away from that of its preceding sibling. I also want to preserve the nested textbox structure, but the output of my code does not preserve it. The code correctly detects this distance, but then something goes wrong.
Example of the output I would like to get:
<?xml version="1.0" encoding="utf-8"?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="6" bbox="68.031,502.428,372.824,566.366">
<textline bbox="68.031,553.628,372.759,566.366">
<new_line>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="68.031,553.639,76.375,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">T</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="76.231,553.639,79.479,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">i</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="79.334,553.639,83.161,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">t</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="83.017,553.639,88.112,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="87.968,553.639,91.216,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">l</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="91.071,553.639,96.167,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
</new_line>
</textline>
</textbox>
<textbox id="7" bbox="68.031,449.028,372.743,487.366">
<textline bbox="68.031,474.628,372.663,487.366">
<new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,475.072,72.786,485.757" colourspace="DeviceGray" ncolour="0" size="10.685">4</text>
<text font="NUMPTY+ImprintMTnum" bbox="75.718,474.628,84.061,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="83.883,474.628,90.258,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="90.080,474.628,95.175,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="94.997,474.628,98.246,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="98.068,474.628,103.163,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="102.985,474.628,105.889,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
<text font="NUMPTY+ImprintMTnum" bbox="107.991,474.628,119.116,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">…</text>
<text font="NUMPTY+ImprintMTnum" bbox="118.938,474.628,121.842,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
</new_line>
</textline>
</textbox>
</page>
</pages>
My code:
import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('fe3.xml', parser)
root = tree.getroot()

# Get the first bbox value as a float
# Return None if not found
def getBBoxFirstValue(line):
    if line is not None:
        bb = line.attrib.get('bbox')
        if bb is not None:
            try:
                return float(bb.split(",")[0])
            except ValueError:
                pass
    return None

new_line = None
previous_bb = None
for x in tree.xpath('//text'):
    # Get the current bb value
    bb = getBBoxFirstValue(x)
    # Check that the current and previous values aren't empty
    if bb is not None and previous_bb is not None:
        #print(bb, previous_bb, (bb-previous_bb))
        #print(abs(bb-previous_bb))
        # If the distance from the previous bb is large enough
        if (bb - previous_bb) > 20 or -50 <= (bb - previous_bb) < -1000:
            # If new_line isn't empty, insert it into the parent tag at the position of the current tag
            if new_line is not None:
                x.getparent().insert(x.getparent().index(x), new_line)
            # Create a new "new_line" element
            new_line = etree.Element("new_line")
    # If new_line exists (i.e. a large enough distance has already been found)
    if new_line is not None:
        new_line.append(x)
    # Keep the latest non-empty first bbox value
    if bb is not None:
        previous_bb = bb

# Add the last new_line element if it exists
if new_line is not None:
    tree.xpath('//text')[-1].getparent().append(new_line)

newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("UTF-8")
print(newtree)
with open("output.xml", "wb") as f:
    f.write(newtree)
What is the problem with my code?
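For reference, here is a minimal sketch of the grouping described above. It is not a definitive fix. It handles each textline separately, so every new_line wrapper is created inside the textline that already holds its text elements, and the page / textbox / textline nesting is left untouched. The names GAP, first_bbox_value, wrap_textline and wrap_all are hypothetical and do not come from the code above. The sketch uses only the "> 20" test, because the second range test in the code above can never be true as written. It also assumes a textline has only text children. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree offers the same calls used here (iter, remove, append, SubElement).

```python
import xml.etree.ElementTree as ET

# Hypothetical threshold, taken from the "> 20" test in the question.
GAP = 20.0


def first_bbox_value(elem):
    """Return the first bbox coordinate as a float, or None if missing or invalid."""
    bbox = elem.get("bbox")
    if bbox is None:
        return None
    try:
        return float(bbox.split(",")[0])
    except ValueError:
        return None


def wrap_textline(textline, gap=GAP):
    """Regroup the <text> children of one <textline> into <new_line> wrappers."""
    texts = [child for child in textline if child.tag == "text"]
    if not texts:
        return
    # Detach the <text> elements first; they are re-attached under the wrappers.
    for t in texts:
        textline.remove(t)
    current = ET.SubElement(textline, "new_line")
    previous = None
    for t in texts:
        x0 = first_bbox_value(t)
        # Start a new wrapper when the horizontal jump exceeds the threshold.
        if x0 is not None and previous is not None and x0 - previous > gap:
            current = ET.SubElement(textline, "new_line")
        current.append(t)
        if x0 is not None:
            previous = x0


def wrap_all(root, gap=GAP):
    """Apply wrap_textline to every <textline> under root and return root."""
    for textline in list(root.iter("textline")):
        wrap_textline(textline, gap)
    return root


# Small made-up sample: the third <text> starts more than 20 units after the second.
sample = """<pages><page id="1"><textbox id="7"><textline>
<text bbox="68.0,0,72.0,1">4</text>
<text bbox="75.7,0,84.0,1">L</text>
<text bbox="120.0,0,125.0,1">u</text>
</textline></textbox></page></pages>"""

root = wrap_all(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))
```

Because previous is reset for every textline, a jump between the last character of one line and the first character of the next never opens a wrapper in the wrong parent.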