I have several XML files that are the XML version of a PDF document. First I need to merge the XML files, and then find the words that end with a hyphen. When a word is hyphenated at the end of a line, the XML contains a separate tag for the hyphen (TCL with CHAR="-"). I need to identify this tag and join the last word of the line before it with the first word of the next line, which sits in a separate tag. I have the following code for merging:
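import xml.etree.ElementTree as ET

def run(files):
    first = None
    for filename in files:
        data = ET.parse(filename).getroot()
        if first is None:
            # The root of the first file becomes the merged root.
            first = data
        else:
            # Append the children of every later root to the first root.
            first.extend(data)
    if first is not None:
        # Note: tostring() returns the merged tree serialized as bytes.
        root = ET.tostring(first)
        return root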
and the following code for joining the words:
beg_line_cont = []
end_line_cont = []
for block in root:
    for para in block:
        for line in para:
            for word in line:
                if word.tag == 'TC':
                    line = word.text
                if word.tag == 'TCL' and word.attrib['CHAR']=='-':
                    beg_line_cont.append(line)
                    if word.tag == 'TC':
                        line = word.text
                        end_line_cont.append(line)
The joining code does not work: I can get the text of the line before TCL CHAR='-', but not the text of the line after it. Can anyone help?
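For reference, the first word after a hyphen lives in the following LINE element, so it cannot be reached while the loop is still looking at the TCL element; some state has to be carried over to the next line. Below is a minimal sketch of that idea, not a definitive solution. The function name `joined_words` is hypothetical, and it assumes the structure shown in the sample (LINE elements containing TC text tags, with the TCL hyphen tag as the last child of its LINE) and that `root` is a parsed Element rather than the bytes returned by `ET.tostring`.

```python
import xml.etree.ElementTree as ET


def joined_words(root):
    """Return the words that were split across lines by a <TCL CHAR="-"/> tag."""
    joined = []
    pending = None  # last word of the previous line, if that line ended with a hyphen tag
    for line in root.iter("LINE"):
        # Concatenate the text of all <TC> children of this line, then split on whitespace.
        text = "".join(tc.text or "" for tc in line.findall("TC"))
        words = text.split()
        # If the previous line ended with a hyphen tag, glue its last word to our first word.
        if pending is not None and words:
            joined.append(pending + words[0])
        pending = None
        # Assumption: a hyphenated line has <TCL CHAR="-"> as its last child element.
        children = list(line)
        if children and words:
            last = children[-1]
            if last.tag == "TCL" and last.get("CHAR") == "-":
                pending = words[-1]
    return joined


# Shortened version of the sample, with most attributes and FRMDEF tags left out.
sample = (
    '<PAR>'
    '<LINE><TC>beschreibt das freiwillige öffentliche Übernahme</TC>'
    '<TCL CHAR="-" TYPE="SYSTEMHYPHEN"/></LINE>'
    '<LINE><TC>angebot in Form eines Tauschangebots</TC>'
    '<TC> der ADO Properties S.A., einer Aktiengesell</TC>'
    '<TCL CHAR="-" TYPE="SYSTEMHYPHEN"/></LINE>'
    '<LINE><TC>schaft nach luxemburgischem Recht </TC></LINE>'
    '</PAR>'
)
result = joined_words(ET.fromstring(sample))
print(result)  # ['Übernahmeangebot', 'Aktiengesellschaft']
```

Because `root.iter("LINE")` walks every LINE in document order, the same function can be applied to the merged tree without nesting loops over blocks and paragraphs. If the merge function returns bytes, they would need to be parsed back with `ET.fromstring` first.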
Sample XML file:
</PAR>
<LPAR PBDPL="[D]137[L]120" PBCAMGTI="[G]LP6[T]Lead VJ" STRIKE="0"></LPAR>
<PAR PBDPL="[D]3360[P]3m" PBCAMGTI="[G]I2AS[I]0" TAPARADV="[HYP]1" BLMODE="3" STRIKE="0" UNIQID="d180d82ee84ff937">
<LINE>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Diese Angebotsunterlage (die „</TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Angebotsunterlage</TC>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>“) beschreibt das freiwillige öffentliche Übernahme</TC>
<TCL CHAR="-" WIDTH="67" CTLCHAR="-" CTLSTR="" TYPE="SYSTEMHYPHEN" VISIBLE="1" USE_SF_LDRVALUES="1"/></LINE>
<LINE>
<TC>angebot in Form eines Tauschangebots (das „</TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Angebot</TC>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>“) der ADO Properties S.A., einer Aktiengesell</TC>
<TCL CHAR="-" WIDTH="67" CTLCHAR="-" CTLSTR="" TYPE="SYSTEMHYPHEN" VISIBLE="1" USE_SF_LDRVALUES="1"/></LINE>
<LINE>
<TC>schaft nach luxemburgischem Recht </TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ITALIC" PNTSZSTR="" FONTNAME="" FACE="I" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>(société anonyme)</TC>
<FRMDEF NAME="ITALIC" PNTSZSTR="" FONTNAME="" FACE="I" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC> mit Sitz in Senningerberg, eingetragen im </TC>
</LINE>
<LINE>