Processing untagged text data with BeautifulSoup
0 votes
/ May 28, 2020
B>DAY</B>, Arbitrator: Under the jurisdiction of the United States
Federal Government and the Federal Aviation Administration, the above
grievance arbitration was submitted to **Joseph L. Daly, Arbitrator**,
on August 15, 2017, at the Federal Aviation Administration South West
Regional Office Central Service Center, Fort Worth, Texas. Prior to
the arbitration hearing, the parties motions were made by the FAA
and NATCA to exclude witnesses from testifying at the arbitration
hearing. The arbitrator denied the motions by a written **decision dated
August 6, 2017**.</P>
<P>The parties filed post-hearing briefs on October 20, 2017. The
Opinion and Award was rendered on October 30, 2017.</P>

Above is the data from which I want to extract the decision date and the corresponding arbitrator's name, for example Joseph L. Daly.

My current code:

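from bs4 import BeautifulSoup

# Read the whole SGML file and parse it with the built-in HTML parser.
with open("file.sgm", "r") as f:
    contents = f.read()
soup = BeautifulSoup(contents, "html.parser")

# Print the text of every <P> element.
for p in soup.find_all("p"):
    data = p.text
    print(data)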

I can extract the paragraph text, but how do I now extract the relevant values from the data above?

1 Answer

0 votes
/ May 28, 2020
import re


data = """
B>DAY</B>, Arbitrator: Under the jurisdiction of the United States
Federal Government and the Federal Aviation Administration, the above
grievance arbitration was submitted to **Joseph L. Daly, Arbitrator**,
on August 15, 2017, at the Federal Aviation Administration South West
Regional Office Central Service Center, Fort Worth, Texas. Prior to
the arbitration hearing, the parties motions were made by the FAA
and NATCA to exclude witnesses from testifying at the arbitration
hearing. The arbitrator denied the motions by a written **decision dated
August 6, 2017**.</P>
<P>The parties filed post-hearing briefs on October 20, 2017. The
Opinion and Award was rendered on October 30, 2017.</P>
"""

# Capture everything between each pair of literal ** markers.
match = re.findall(r"\*\*([^*]*)\*\*", data)

print(match)

Output:

['Joseph L. Daly, Arbitrator', 'decision dated\nAugust 6, 2017']
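The second match still carries the "decision dated" prefix and an embedded newline. A minimal sketch of one way to tidy the two matches into a bare name and a date object is shown here. It assumes the `**` markers always wrap the arbitrator first and the decision date second (an assumption based on this single sample), that dates use the "Month D, YYYY" form, and that month names are parsed under an English locale. The helper name `clean_matches` is hypothetical.

```python
import re
from datetime import datetime


def clean_matches(matches):
    """Turn the two raw matches into (arbitrator_name, decision_date)."""
    # Collapse newlines and runs of spaces into single spaces.
    flat = [" ".join(m.split()) for m in matches]

    # Assumption: the first match looks like "<name>, Arbitrator".
    name = re.sub(r",\s*Arbitrator$", "", flat[0])

    # Assumption: the second match looks like "decision dated <Month D, YYYY>".
    date_text = re.sub(r"^decision dated\s+", "", flat[1])
    decision_date = datetime.strptime(date_text, "%B %d, %Y").date()

    return name, decision_date


raw = ['Joseph L. Daly, Arbitrator', 'decision dated\nAugust 6, 2017']
print(clean_matches(raw))
# ('Joseph L. Daly', datetime.date(2017, 8, 6))
```

If the order of the markers can vary between documents, the positional indexing above would need to be replaced by a check on the content of each match.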
...
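The `**` markers may only be emphasis added when the question was posted and might not exist in the real file; the question does not say either way. Under that assumption, a phrase-anchored pattern could be tried instead. The sketch below assumes the wording "submitted to <name>, Arbitrator" and "decision dated <Month D, YYYY>" seen in this sample; the names `NAME_RE`, `DATE_RE`, and `extract_fields` are hypothetical.

```python
import re

# Assumption: the name sits between "submitted to" and ", Arbitrator".
NAME_RE = re.compile(r"submitted\s+to\s+(.+?),\s+Arbitrator")
# Assumption: the date follows "decision dated" in "Month D, YYYY" form.
DATE_RE = re.compile(r"decision\s+dated\s+([A-Z][a-z]+\s+\d{1,2},\s+\d{4})")


def extract_fields(text):
    """Return (arbitrator_name, decision_date_text), with None for a miss."""
    # Drop literal ** markers if present, then collapse all whitespace
    # so that line breaks inside a phrase do not block the match.
    flat = " ".join(text.replace("**", "").split())
    name = NAME_RE.search(flat)
    date = DATE_RE.search(flat)
    return (
        name.group(1) if name else None,
        date.group(1) if date else None,
    )


sample = """grievance arbitration was submitted to Joseph L. Daly, Arbitrator,
on August 15, 2017. The arbitrator denied the motions by a written decision dated
August 6, 2017."""
print(extract_fields(sample))
# ('Joseph L. Daly', 'August 6, 2017')
```

The text of each paragraph returned by `soup.find_all('p')` in the question could be passed to `extract_fields` directly. Documents that phrase these sentences differently would need their own patterns.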