After several days of reading and searching the internet, I decided to ask for help here.
I have an HTML file containing a table, and I need to convert this HTML file to CSV.
A small sample of my HTML file:
<html>
<body>
<p class="timestamp">Fri 21 Jul 13:14:15 BST 2017
</p>
<h3>TAT Signal and TMH near C-terminus</h3>
<table>
<tr style = "background:#E7EBD8"><td>1</td><td>GCF_000688455.1_ASM68845v1_protein.faa.gz</td><td colspan = 4>Acidobacterium ailaaui</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MSRRTFVSSATAGLAALGALSSAAEGHAQLVWTSKNWKLAEFETLLREPARIRQVYDVTQ</td></tr>
<tr style = "background:#E7EBD8"><td>WP_026442391.1</td><td colspan = 5>hypothetical protein [Acidobacterium ailaaui]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Length: 233</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Number of predicted TMHs: 1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number of AAs in TMHs: 21.25002</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number, first 60 AAs: 1.35114</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Total prob of N-in: 0.67991</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>inside</td>
<td>1</td>
<td>201</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>TMhelix</td>
<td>202</td>
<td>224</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>outside</td>
<td>225</td>
<td>233</td>
</tr>
<tr style = "background:#D8EBEA"><td>2</td><td>GCF_000022565.1_ASM2256v1_protein.faa.gz</td><td colspan = 4>Acidobacterium capsulatum ATCC 51196</td></tr>
<tr style = "background:#D8EBEA"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum</td></tr>
<tr style = "background:#D8EBEA"><td>First 60 AAs</td><td colspan = 5>MKSISRRSFVTTAAAGMAALGSLGPALPAAQGQAVEMASDWDISSFNQLAQSPARVKQLF</td></tr>
<tr style = "background:#D8EBEA"><td>WP_012680923.1</td><td colspan = 5>Tat pathway signal sequence domain-containing protein [Acidobacterium capsulatum]</td></tr>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Length: 237</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Number of predicted TMHs: 1</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number of AAs in TMHs: 31.62059</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number, first 60 AAs: 5.92535</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Total prob of N-in: 0.86701</td>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>inside</td>
<td>1</td>
<td>205</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>TMhelix</td>
<td>206</td>
<td>228</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>outside</td>
<td>229</td>
<td>237</td>
</tr>
<tr style = "background:#E7EBD8"><td>3</td><td>GCF_000014005.1_ASM1400v1_protein.faa.gz</td><td colspan = 4>Candidatus Koribacter versatilis Ellin345</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Candidatus Koribacter; Candidatus Koribacter versatilis</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MGEKALMSKKPTIEEHLKATGVTRRSFVQLCGMLMAAAPIGLSLTSKASAQEVAKVVGKA</td></tr>
<tr style = "background:#E7EBD8"><td>WP_011525036.1</td><td colspan = 5>hydrogenase 2 small subunit [Candidatus Koribacter versatilis]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Length: 401</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Number of predicted TMHs: 1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number of AAs in TMHs: 19.93057</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number, first 60 AAs: 2.05251</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Total prob of N-in: 0.15168</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>outside</td>
<td>1</td>
<td>344</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>TMhelix</td>
<td>345</td>
<td>367</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>inside</td>
<td>368</td>
<td>401</td>
</tr>
</body>
</html>
I tried this Python script:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("file.html")
bsObj = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table")
rows = table.findAll("tr")
with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)
And I got this error:
Traceback (most recent call last):
  File "CleanTableTEST.py", line 18, in <module>
    rows = table.findAll("tr")
  File "/home/raven/.local/lib/python3.6/site-packages/bs4/element.py", line 2128, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
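As the message itself suggests, `findAll("table")` returns a list-like `ResultSet`, while `find("table")` returns a single element on which the rows can then be searched. The following is only a minimal sketch of that idea, not a definitive fix: `SAMPLE_HTML` and `table_to_csv` are hypothetical names introduced for illustration, the sample is a shortened excerpt in which every `<tr>` is explicitly closed (how unclosed rows are handled depends on the parser), and the CSV is written to an in-memory buffer instead of `editors.csv`.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, well-formed excerpt of the table: every <tr> is closed.
SAMPLE_HTML = """
<html><body>
<table>
<tr><td>1</td><td>GCF_000688455.1_ASM68845v1_protein.faa.gz</td><td colspan="4">Acidobacterium ailaaui</td></tr>
<tr><td>TMHMM</td><td>WP_026442391.1</td><td colspan="4">Length: 233</td></tr>
<tr>
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>inside</td>
<td>1</td>
<td>201</td>
</tr>
</table>
</body></html>
"""


def table_to_csv(html_text):
    """Return the first <table> found in html_text as CSV text."""
    soup = BeautifulSoup(html_text, "html.parser")
    # find() returns one Tag (or None); find_all()/findAll() returns a ResultSet.
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in table.find_all("tr"):
        cells = row.find_all(["td", "th"])
        # strip=True drops the surrounding whitespace and newlines of each cell.
        writer.writerow([cell.get_text(strip=True) for cell in cells])
    return out.getvalue()


csv_text = table_to_csv(SAMPLE_HTML)
print(csv_text)
```

To read a local file instead of the inline sample, the text could be loaded with `open("file.html", encoding="utf-8").read()` and passed to the same function; the encoding here is an assumption.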
I also tried this code:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("file.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")
with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)
And I got this error:
Traceback (most recent call last):
  File "CleanTable.py", line 17, in <module>
    table = soup.findAll("table", {"class":"wikitable"})[0]
IndexError: list index out of range
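One likely reading of this `IndexError` is that the `<table>` in the sample has no `class="wikitable"` attribute, so the filtered search returns an empty result and `[0]` has nothing to index. The sketch below illustrates that under stated assumptions: `SAMPLE_HTML` is a hypothetical one-row stand-in for the real file, and the variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row stand-in for the real file; the table has no class.
SAMPLE_HTML = (
    "<html><body><table>"
    "<tr><td>TMHMM</td><td>inside</td></tr>"
    "</table></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No table carries class="wikitable", so the filtered search is empty;
# indexing that empty result with [0] is what raises IndexError.
filtered = soup.find_all("table", {"class": "wikitable"})
print(len(filtered))

# Without the class filter the table is found.
table = soup.find("table")
first_row = [cell.get_text(strip=True) for cell in table.find_all("td")]
print(first_row)
```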
I have very little experience, so this is the result of several days of searching and editing code ...
Thank you very much