Converting an HTML file to CSV with Python
1 vote
/ April 29, 2020

After several days of reading and searching the internet, I decided to ask for help here.

I have an HTML file containing a table, and I need to convert this HTML file to CSV.

A small sample of my HTML file:

<html>
<body>
<p class="timestamp">Fri 21 Jul 13:14:15 BST 2017
</p>

<h3>TAT Signal and TMH near C-terminus</h3>
<table>
<tr style = "background:#E7EBD8"><td>1</td><td>GCF_000688455.1_ASM68845v1_protein.faa.gz</td><td colspan = 4>Acidobacterium ailaaui</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MSRRTFVSSATAGLAALGALSSAAEGHAQLVWTSKNWKLAEFETLLREPARIRQVYDVTQ</td></tr>
<tr style = "background:#E7EBD8"><td>WP_026442391.1</td><td colspan = 5>hypothetical protein [Acidobacterium ailaaui]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Length: 233</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number of AAs in TMHs: 21.25002</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number, first 60 AAs:  1.35114</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Total prob of N-in:        0.67991</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>inside</td>
<td>1</td>
<td>201</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>TMhelix</td>
<td>202</td>
<td>224</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>outside</td>
<td>225</td>
<td>233</td>
</tr>
<tr style = "background:#D8EBEA"><td>2</td><td>GCF_000022565.1_ASM2256v1_protein.faa.gz</td><td colspan = 4>Acidobacterium capsulatum ATCC 51196</td></tr>
<tr style = "background:#D8EBEA"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum</td></tr>
<tr style = "background:#D8EBEA"><td>First 60 AAs</td><td colspan = 5>MKSISRRSFVTTAAAGMAALGSLGPALPAAQGQAVEMASDWDISSFNQLAQSPARVKQLF</td></tr>
<tr style = "background:#D8EBEA"><td>WP_012680923.1</td><td colspan = 5>Tat pathway signal sequence domain-containing protein [Acidobacterium capsulatum]</td></tr>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Length: 237</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number of AAs in TMHs: 31.62059</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number, first 60 AAs:  5.92535</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Total prob of N-in:        0.86701</td>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>inside</td>
<td>1</td>
<td>205</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>TMhelix</td>
<td>206</td>
<td>228</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>outside</td>
<td>229</td>
<td>237</td>
</tr>
<tr style = "background:#E7EBD8"><td>3</td><td>GCF_000014005.1_ASM1400v1_protein.faa.gz</td><td colspan = 4>Candidatus Koribacter versatilis Ellin345</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Candidatus Koribacter; Candidatus Koribacter versatilis</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MGEKALMSKKPTIEEHLKATGVTRRSFVQLCGMLMAAAPIGLSLTSKASAQEVAKVVGKA</td></tr>
<tr style = "background:#E7EBD8"><td>WP_011525036.1</td><td colspan = 5>hydrogenase 2 small subunit [Candidatus Koribacter versatilis]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Length: 401</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number of AAs in TMHs: 19.93057</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number, first 60 AAs:  2.05251</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Total prob of N-in:        0.15168</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>outside</td>
<td>1</td>
<td>344</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>TMhelix</td>
<td>345</td>
<td>367</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>inside</td>
<td>368</td>
<td>401</td>
</tr>
</body>
</html>

I tried this Python script:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("file.html")
bsObj = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table")
rows = table.findAll("tr")

with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)

And I got this error:

Traceback (most recent call last):
  File "CleanTableTEST.py", line 18, in <module>
    rows = table.findAll("tr")
  File "/home/raven/.local/lib/python3.6/site-packages/bs4/element.py", line 2128, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

I also tried this code:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("file.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")

with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)

And I got this error:

Traceback (most recent call last):
  File "CleanTable.py", line 17, in <module>
    table = soup.findAll("table", {"class":"wikitable"})[0]
IndexError: list index out of range

I have very little experience, so this is the result of several days of searching and editing code.

Thank you very much.

1 Answer

0 votes
/ April 29, 2020

Your first example is very close. You just need to replace findAll with find when looking up the table, because you want a single table rather than a list of tables.
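To make the difference concrete, here is a minimal sketch. The inline markup is made up for illustration, and `find_all` is the current spelling of `findAll`. It shows that `find_all` returns a list-like `ResultSet`, which has no search methods of its own, while `find` returns a single `Tag`. It also suggests why the second attempt fails: the table in the sample has no `class` attribute, so filtering on `wikitable` yields an empty result and `[0]` raises `IndexError`.

```python
from bs4 import BeautifulSoup

# Tiny made-up document, only for demonstration.
soup = BeautifulSoup("<table><tr><td>a</td></tr></table>", "html.parser")

tables = soup.find_all("table")  # ResultSet: a list of matching tags
table = soup.find("table")       # a single Tag (or None if nothing matches)

print(type(tables).__name__)
print(type(table).__name__)

# Searching works on the Tag, not on the ResultSet.
print(len(table.find_all("tr")))

# No table carries class="wikitable", so this result is empty
# and indexing it with [0] would raise IndexError.
missing = soup.find_all("table", {"class": "wikitable"})
print(len(missing))
```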

If you change that line to the following, it should work as expected:

table = soup.find("table")
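Putting it together, a hedged sketch of the whole conversion might look like the code below. The names `SAMPLE`, `table_rows` and `write_rows` are hypothetical, and `SAMPLE` is a shortened stand-in for the question's file. The sketch parses the markup once; the first script builds two BeautifulSoup objects from the same response, and the second one would likely see an already-read stream. Several rows in the sample have no closing `</tr>`, and `html.parser` may nest the following rows inside them, so only each row's direct child cells are read.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened stand-in for the question's file (an assumption for illustration).
# The second row has no closing </tr>, like several rows in the sample.
SAMPLE = """\
<table>
<tr><td>1</td><td>GCF_000688455.1_ASM68845v1_protein.faa.gz</td><td colspan="4">Acidobacterium ailaaui</td></tr>
<tr><td>TMHMM</td><td>WP_026442391.1</td><td colspan="4">Length: 233</td>
<tr>
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>inside</td>
<td>1</td>
<td>201</td>
</tr>
</table>
"""


def table_rows(markup):
    """Return the first table in the markup as a list of rows of cell text."""
    soup = BeautifulSoup(markup, "html.parser")  # parse the markup once
    table = soup.find("table")  # a single Tag, or None if there is no table
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # recursive=False keeps only this row's own cells, even if the parser
        # nested later rows inside a row that was never closed.
        cells = tr.find_all(["td", "th"], recursive=False)
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def write_rows(rows, fileobj):
    """Write the rows as CSV to an already open text file object."""
    csv.writer(fileobj).writerows(rows)


# Demonstration on the inline sample, written to an in-memory buffer.
buffer = io.StringIO(newline="")
write_rows(table_rows(SAMPLE), buffer)
print(buffer.getvalue())
```

For the real data, assuming `file.html` is a local UTF-8 file, the same functions could be called with `open("file.html", encoding="utf-8")` as the source and `open("editors.csv", "w", newline="", encoding="utf-8")` as the destination. Note that `urlopen` expects a URL, so a bare file name such as `"file.html"` is better read with `open`.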