Inserting a parent node into XML does not work as intended (Python)
26 April 2020

I have an XML file structured like this:

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="6" bbox="68.031,502.428,372.824,566.366">
<textline bbox="68.031,553.628,372.759,566.366">
<text font="PYNIYO+ImprintMTnum-Italic" bbox="68.031,553.639,76.375,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">T</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="76.231,553.639,79.479,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">i</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="79.334,553.639,83.161,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">t</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="83.017,553.639,88.112,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="87.968,553.639,91.216,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">l</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="91.071,553.639,96.167,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
            </textline>
        </textbox>
<textbox id="7" bbox="68.031,449.028,372.743,487.366">
<textline bbox="68.031,474.628,372.663,487.366">
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,475.072,72.786,485.757" colourspace="DeviceGray" ncolour="0" size="10.685">4</text>

<text font="NUMPTY+ImprintMTnum" bbox="75.718,474.628,84.061,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="83.883,474.628,90.258,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="90.080,474.628,95.175,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="94.997,474.628,98.246,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="98.068,474.628,103.163,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="102.985,474.628,105.889,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>

<text font="NUMPTY+ImprintMTnum" bbox="107.991,474.628,119.116,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">…</text>
<text font="NUMPTY+ImprintMTnum" bbox="118.938,474.628,121.842,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
</textline>
</textbox>
    </page>
</pages>

I would like to open a new_line parent element whenever the first value of the bbox attribute of a text tag differs from that of its preceding sibling by more than a certain distance. I also want to keep the nested textbox structure, but the output of my code does not preserve it. The code detects the distance correctly; something goes wrong after that.

An example of the output I would like to get:

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="6" bbox="68.031,502.428,372.824,566.366">
<textline bbox="68.031,553.628,372.759,566.366">
<new_line>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="68.031,553.639,76.375,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">T</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="76.231,553.639,79.479,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">i</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="79.334,553.639,83.161,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">t</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="83.017,553.639,88.112,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="87.968,553.639,91.216,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">l</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="91.071,553.639,96.167,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
          </new_line>
            </textline>
        </textbox>
<textbox id="7" bbox="68.031,449.028,372.743,487.366">
<textline bbox="68.031,474.628,372.663,487.366">
<new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,475.072,72.786,485.757" colourspace="DeviceGray" ncolour="0" size="10.685">4</text>

<text font="NUMPTY+ImprintMTnum" bbox="75.718,474.628,84.061,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="83.883,474.628,90.258,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="90.080,474.628,95.175,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="94.997,474.628,98.246,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="98.068,474.628,103.163,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="102.985,474.628,105.889,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>

<text font="NUMPTY+ImprintMTnum" bbox="107.991,474.628,119.116,487.110" colourspace="DeviceGray" ncolour="0" size="12.482">…</text>
<text font="NUMPTY+ImprintMTnum" bbox="118.938,474.628,121.842,487.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
</new_line>
</textline>
</textbox>
    </page>
</pages>

My code:

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('fe3.xml', parser)
root = tree.getroot()

# Get the first bbox value as a float
# Return None if it is missing or not a number
def getBBoxFirstValue(line):
    if line is not None:
        bb = line.attrib.get('bbox')
        if bb is not None:
            try:
                return float(bb.split(",")[0])
            except ValueError:
                pass
    return None


new_line        = None
previous_bb     = None

for x in tree.xpath('//text'):
    # Get current bb value
    bb = getBBoxFirstValue(x)
    # Check current and past values aren't empty
    if bb is not None and previous_bb is not None:
        #print(bb, previous_bb, (bb-previous_bb))
        #print(abs(bb-previous_bb))
        # If the distance from the previous bb exceeds the threshold
        if (bb - previous_bb) > 20 or - 50 <= (bb - previous_bb) < -1000:
            # If new_line isn't empty: it's inserted into parent tag at position of current tag

            if new_line is not None:
                x.getparent().insert(x.getparent().index(x), new_line)
            # A new "new_line" element is created
            new_line = etree.Element("new_line")

        # If new_line is not None (i.e. a gap above the threshold has already been found)
        if new_line is not None:
            new_line.append(x)

    # Keep latest non empty BBox 1st value
    if bb is not None:
        previous_bb = bb

# Add the last new_line element if it is not None
if new_line is not None:
    tree.xpath('//text')[-1].getparent().append(new_line)


newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("UTF-8")
print(newtree)
with open("output.xml", "wb") as f:
    f.write(newtree)

What is wrong with my code?
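
One possible way to keep the textbox nesting, offered as a minimal sketch rather than a definitive fix: process each textline separately, so that a new_line wrapper is always inserted into the textline its text elements came from. In lxml, append() moves an element out of its current parent, so a single new_line carried across the whole document can pull text elements out of one textbox and drop them into the next. Note also that the clause `- 50 <= (bb - previous_bb) < -1000` can never be true, because no number is both at least -50 and below -1000. The names wrap_text_runs and first_bbox_value, the 20.0 threshold, and the use of abs() to catch jumps in either direction are assumptions made for illustration, and the sample is a cut-down stand-in for fe3.xml.

```python
import lxml.etree as etree

GAP = 20.0  # assumed threshold, taken from the "> 20" test in the question


def first_bbox_value(element):
    """Return the first bbox coordinate as a float, or None if unavailable."""
    bbox = element.get("bbox")
    if bbox is None:
        return None
    try:
        return float(bbox.split(",")[0])
    except ValueError:
        return None


def wrap_text_runs(root, gap=GAP):
    """Wrap the <text> children of every <textline> in <new_line> elements."""
    # Materialise the list first so the tree is not modified mid-iteration.
    for textline in list(root.iter("textline")):
        new_line = None
        previous = None
        for text in textline.findall("text"):
            current = first_bbox_value(text)
            jumped = (
                current is not None
                and previous is not None
                and abs(current - previous) > gap  # assumption: any large jump
            )
            if new_line is None or jumped:
                new_line = etree.Element("new_line")
                # Put the wrapper where the text currently sits in this textline.
                textline.insert(textline.index(text), new_line)
            # append() moves the element; it stays inside the same textline.
            new_line.append(text)
            if current is not None:
                previous = current
    return root


# Cut-down, hypothetical sample (not the real fe3.xml).
sample = b"""<pages><page id="1">
<textbox id="6"><textline>
<text bbox="68.031,553.639,76.375,566.366">T</text>
<text bbox="76.231,553.639,79.479,566.366">i</text>
</textline></textbox>
<textbox id="7"><textline>
<text bbox="68.031,475.072,72.786,485.757">4</text>
<text bbox="120.000,474.628,128.000,487.110">L</text>
</textline></textbox>
</page></pages>"""

parser = etree.XMLParser(remove_blank_text=True)
root = wrap_text_runs(etree.fromstring(sample, parser))
print(etree.tostring(root, pretty_print=True).decode("utf-8"))
```

Because the first text of every textline always opens a wrapper, no text is left outside a new_line. On the XML from the question this should yield one new_line per textline, since none of the gaps between consecutive first bbox values there exceeds 20.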

...