I have the following text from an HTML page:
page = """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).
General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""
I want to extract the text between Item 1 "Business" and Item 1A "Risk Factors". I can't use BeautifulSoup to locate the section because every page has a different HTML tag structure. I use the following code to get the text, but it doesn't work:
import re
from bs4 import BeautifulSoup

regexs = (
    r'bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.',  # pattern 1: a "bold" style attribute before the item subtitle
    r'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',  # pattern 2: a <b> tag before the item subtitle
    r'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>',  # pattern 3: a </b> tag after the item subtitle
    r'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b',  # pattern 4: a </b> tag after the item + description subtitle
)
for regex in regexs:
    # Search the HTML for the pattern, ignoring case; DOTALL lets "." match newlines too.
    match = re.search(regex, page, flags=re.IGNORECASE | re.DOTALL)
    if match:
        # match.group(1) is the text captured by the parentheses (.+?)
        soup = BeautifulSoup(match.group(1), "html.parser")
        # soup.text strips the HTML tags and keeps only the text
        # rawText = soup.text.encode('utf8')  # encode the Unicode text as UTF-8 bytes
        rawText = soup.text
Expected output:
Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).
General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.
I think the first regular expression should match this text, but it doesn't.
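One thing that can be checked directly against the sample above: pattern 1 expects `bold;">` (with a semicolon), while the sample's style attribute ends in `bold">`. The sketch below is only an illustration under that observation. The `sample` string is a shortened stand-in for the text above, and the `relaxed` pattern is an assumed variant, not a fix verified against the full filing, where other headings or the table of contents may match first.

```python
import re

# Shortened, hypothetical stand-in for the sample text in the question.
sample = (
    '<font style="DISPLAY: inline; FONT-WEIGHT: bold">Item 1. Business/</font> '
    'Unless otherwise indicated by the context, we use the terms "GE" and "GECC". '
    '<font style="DISPLAY: inline; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'
)

FLAGS = re.IGNORECASE | re.DOTALL

# Pattern 1 from the question: requires a semicolon right after "bold".
strict = r'bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.'

# Assumed variant: the semicolon (and whitespace before the quote) is optional.
relaxed = r'bold;?\s*\">\s*Item 1\.(.+?)bold;?\s*\">\s*Item 1A\.'

strict_match = re.search(strict, sample, flags=FLAGS)
relaxed_match = re.search(relaxed, sample, flags=FLAGS)

print("strict matched: ", strict_match is not None)
print("relaxed matched:", relaxed_match is not None)
```

Note that the captured group of the relaxed pattern still contains the rest of the heading and surrounding tag fragments, so it would still need the tag-stripping step from the code above.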
EDIT: Here is the actual .htm page and how I get its text:
# Import the libraries
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
response = requests.get(url, headers=HEADERS)
page = response.text
# Pre-process the HTML content: remove extra whitespace and combine everything into one line.
page = page.strip()  # remove whitespace at the beginning and end
page = page.replace('\n', ' ')  # replace \n (newline) with a space
page = page.replace('\r', '')  # remove \r (carriage return, e.g. from Windows line endings)
page = page.replace('&nbsp;', ' ')  # replace "&nbsp;" (HTML entity for a non-breaking space) with a space
page = page.replace('&#160;', ' ')  # replace "&#160;" (numeric form of the same entity) with a space
page = page.replace('\xa0', ' ')  # replace the non-breaking space character itself with a space
page = page.replace('/s/', ' ')  # replace the literal "/s/" marker with a space
while '  ' in page:
    page = page.replace('  ', ' ')  # collapse repeated spaces into one
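For comparison, the whitespace clean-up above could be condensed with the standard library. This is a minimal sketch, not an exact equivalent: `normalize_page` is a hypothetical helper name, it does not handle the `/s/` marker, and `html.unescape` decodes every entity (including `&lt;` and `&amp;`), which may alter markup that later regexes depend on.

```python
import html
import re


def normalize_page(raw: str) -> str:
    """Hypothetical helper: decode entities and collapse whitespace runs."""
    text = html.unescape(raw)         # "&nbsp;" and "&#160;" become "\xa0"
    text = text.replace('\xa0', ' ')  # non-breaking space -> regular space
    # Collapse any run of whitespace (spaces, \r, \n, tabs) into one space.
    return re.sub(r'\s+', ' ', text).strip()


cleaned = normalize_page('  Item&nbsp;1.&#160;Business\r\n   Unless\xa0otherwise ')
print(cleaned)
```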