input: html
<SPAN id=idxSpan><OBJECT id=IndiDocX codeBase="/IndiDocX.CAB#version=4,5,0,132" classid=clsid:43B180A2-396A-45CE-86D1-9680E4A9952C width=500 height=201 VIEWASTEXT><PARAM NAME="_ExtentX" VALUE="13229"><PARAM NAME="_ExtentY" VALUE="5318"><PARAM NAME="BackColor" VALUE="0"><PARAM NAME="ForeColor" VALUE="0"><PARAM NAME="Enabled" VALUE="True"><PARAM NAME="BackStyle" VALUE="0"><PARAM NAME="BorderStyle" VALUE="0"><PARAM NAME="iWidth" VALUE="800"><PARAM NAME="iHeight" VALUE="200"><PARAM NAME="MainDocUNID" VALUE="AAB092D735084A064825852D00372312"><PARAM NAME="ServerIP" VALUE="zboa3.sinopec.com"><PARAM NAME="DbPath" VALUE="/sinopec4/dep4809/swgl_4809.nsf"><PARAM NAME="DocForm" VALUE="frmIndiDocs"><PARAM NAME="FileInfos" VALUE="<!1!>6E09605B0382AFBA482585290033D844<file_unid>QFGG0L8JG5XF4C389PW5</file_unid><file_name>关于印发《党内关怀帮扶实施细则》的通〔2019〕67号 ).sep</file_name><file_size>42315</file_size><file_create>2020-3-12 17:34:23</file_create><file_update>2020-3-12 17:34:23</file_update><file_editmodel>0</file_editmodel><doc_unid>4825795A000CAA904825852D0001DA87</doc_unid></!1!><!2!>6E09605B0382AFBA482585290033D844<file_unid>NM6NEGOCXG5PSBMGFVMQ</file_unid><file_name>公司党内关怀帮扶实施标准.docx</file_name><file_size>20581</file_size><file_create>2020-3-12 17:34:26</file_create><file_update>2020-3-12 17:34:26</file_update><file_editmodel>0</file_editmodel><doc_unid>4825795A000CAA904825852D0001DAB0</doc_unid></!2!>
<!3!>6E09605B0382AFBA482585290033D844<file_unid>6M0ZGTE3H0FH4PN9QBT0</file_unid><file_name>公司发〔2020〕19号关于转发《关于印发〈党内关怀帮扶实施细则〉的通知》的通知.pdf</file_name><file_size>95471</file_size><file_create>2020-3-16 18:6:48</file_create><file_update>2020-3-16 18:6:48</file_update><file_editmodel>0</file_editmodel><doc_unid>4825795A000CAA904825852D0036ECE1</doc_unid></!3!>"
><PARAM NAME="Editable" VALUE="True"><PARAM NAME="WordTrack" VALUE="True"><PARAM NAME="WordLock" VALUE="True"><PARAM NAME="UpdInfoDocID" VALUE="4825795A000CAA9048258529003409BE"><PARAM NAME="SessionID" VALUE="554C571D3511CF5D338DC83F58767F21"><PARAM NAME="FileNum" VALUE="0"><PARAM NAME="FileNames" VALUE=""><PARAM NAME="FileSelNames" VALUE=""><PARAM NAME="LockForm" VALUE="True"><PARAM NAME="IsShowTrack" VALUE="True"><PARAM NAME="MenuValue" VALUE="11110000"><PARAM NAME="CanUseHandMark" VALUE="1"><PARAM NAME="CanHandMarkFile" VALUE="1"><PARAM NAME="CanClearHandMarkFile" VALUE="1"><PARAM NAME="HandMarkFileWidth" VALUE="6"><PARAM NAME="CanChangeHandMarkFile" VALUE="1"><PARAM NAME="Version" VALUE="V12"><PARAM NAME="WebServerVersion" VALUE="379"><PARAM NAME="EngFileName" VALUE="true"></OBJECT></SPAN>
I need a function that produces output in the following format:
[{'file_name':**value of <file_name>**,'url':http://server.com/**value of <doc_unid>**/$file/**value of <file_unid>****value of <file_name> ext part**}]
I think my code is bad, and I can't get the result. I'm using bs4 like this:
soup = BeautifulSoup(string_html, 'lxml', exclude_encodings='utf-8')
data = soup.find('param', attrs={'name': 'FileInfos'})['value']
soup_data = BeautifulSoup(data, 'lxml', exclude_encodings='utf-8')
for n in soup_data.find_all(name=['doc_unid','file_unid','file_name']):
    print(n.doc_unid)
Why doesn't this work? I also tried rewriting the <!N!> markers as <li> tags before parsing:
html4 = re.sub(r'(\<)(/?)\!(\d+\!)', r'<\g<2>li', html)
soup = BeautifulSoup(html4, 'lxml')
data = soup.find('param', attrs={'name': 'FileInfos'})['value']
data1 = '<ul>' + data + '</ul>'
soup_data = BeautifulSoup(data1, 'lxml')
for n in soup_data.children:
    print(n.doc_unid.string)
Why do I get only one result?
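For reference, here is a minimal sketch of the function I am after. It uses only the standard library's re module instead of bs4, on the assumption that every record is wrapped in a matching <!N!> ... </!N!> pair in the raw page source. The names extract_files, field and RECORD_RE are made up for illustration, http://server.com is the placeholder from the format above, and I am assuming the "ext part" includes the leading dot. As far as I can tell, the likely reasons the attempts above fail are that n.doc_unid searches inside each matched tag for a child named doc_unid (there is none), and that with the lxml parser soup_data.children yields only the single top-level <html> element, so the loop body runs once.

```python
import os
import re

# Matches one numbered record such as <!1!> ... </!1!>; the backreference
# \1 requires the closing marker to carry the same number as the opener.
RECORD_RE = re.compile(r"<!(\d+)!>(.*?)</!\1!>", re.DOTALL)


def field(record, tag):
    """Return the text inside <tag>...</tag> within one record, or None."""
    m = re.search(r"<{0}>(.*?)</{0}>".format(tag), record, re.DOTALL)
    return m.group(1) if m else None


def extract_files(html, base_url="http://server.com"):
    """Build the list of {'file_name', 'url'} dicts from the raw HTML."""
    results = []
    for _, record in RECORD_RE.findall(html):
        file_name = field(record, "file_name")
        file_unid = field(record, "file_unid")
        doc_unid = field(record, "doc_unid")
        if None in (file_name, file_unid, doc_unid):
            continue  # skip records that lack one of the three fields
        # Assumption: the extension is kept with its leading dot, e.g. ".docx".
        ext = os.path.splitext(file_name)[1]
        url = "{}/{}/$file/{}{}".format(base_url, doc_unid, file_unid, ext)
        results.append({"file_name": file_name, "url": url})
    return results
```

Calling extract_files(string_html) on the page source should then return one dict per <!N!> record, in document order.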