How to extract the contents of each df1 (merit, disadvantages, df_tit) using BeautifulSoup? - PullRequest
0 votes
/ January 27, 2020

I have a question about extracting tags. This is the HTML structure.

<div class="content_body_ty1">
  <div class="us_label_wrap">..</div>
    <h2 class="us_label">
  <dl class="tc_list">
    <dt class = "merit">merit</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>
    <dt class = "disadvantages">disadvantages</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>
    <dt class = "df_tit">wish</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>

I want to extract the contents of each tag with a for loop and then put each item into its own list: 1) "merit", blah~~~~ 2) "disadvantages", blah~~~~ 3) "df_tit", blah~~~~

Here is my code:

maximum = 3
merit = [] 
disadv = []
tit = []
for page_number in range(1, maximum+1):
    URL = 'https://www.example.co.kr/companies/reviews/page={}'.format(page_number) 
    response = client.get(URL)
    print(page_number)
    whole_source = response.content.decode('utf-8')
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.find_all('dl', class_ = 'tc_list'): 
        if entry.find('dt', class_ = 'merit'):
            merit.append(entry.find('dd', class_ = 'df1')) 
        elif entry.find('dt', class_ = 'disadvantage'):
            disadv.append(entry.find('dd', class_ = 'df1'))
        elif entry.find('dt', class_ = 'df_tit'):
            tit.append(entry.find('dd', class_ = 'df1'))

How do I extract the contents of each tag? Please take a look at this problem. Thanks!

Answers [ 2 ]

1 vote
/ January 27, 2020

You can find each dd tag and use previous_sibling to check which category the element belongs to.

See the code below:

import requests
from bs4 import BeautifulSoup


html = '<div class="content_body_ty1"><div class="us_label_wrap">..</div><h2 class="us_label"><dl class="tc_list"><dt class = "merit">merit</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd><dt class = "disadvantages">disadvantages</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd><dt class = "df_tit">wish</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd></dl></div>'


maximum = 3
merit = []
disadv = []
tit = []
soup = BeautifulSoup(html, 'html.parser')

dl_list = soup.find('dl', class_ = 'tc_list')
for dd in dl_list.find_all('dd',{'class':'df1'}):
    if dd.previous_sibling:
        if 'merit' in dd.previous_sibling.get('class'):
            merit.append(dd.text)
        elif 'disadvantages' in dd.previous_sibling.get('class'):
            disadv.append(dd.text)
        elif 'df_tit' in dd.previous_sibling.get('class'):
            tit.append(dd.text)

print(merit)
print(disadv)
print(tit)

RESULTS:

[' blah~~~~~~~~blah~~~~']
[' blah~~~~~~~~blah~~~~']
[' blah~~~~~~~~blah~~~~']
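
One caveat with previous_sibling: when the markup is pretty-printed, as in the question, the node directly before each dd is usually a whitespace text node rather than the dt, and calling .get('class') on it would fail. The following is a minimal sketch, assuming the same dl/dt/dd structure as in the question, that uses find_previous_sibling('dt') to skip text nodes. The extract_sections helper, the buckets dictionary, and the "... text" strings are illustrative placeholders, not part of the original answer or the real page.

```python
from bs4 import BeautifulSoup

# Pretty-printed sample modeled on the question; the texts are placeholders.
html = """
<div class="content_body_ty1">
  <dl class="tc_list">
    <dt class="merit">merit</dt>
    <dd class="df1">
        <span> merit text </span>
    </dd>
    <dt class="disadvantages">disadvantages</dt>
    <dd class="df1">
        <span> disadvantages text </span>
    </dd>
    <dt class="df_tit">wish</dt>
    <dd class="df1">
        <span> wish text </span>
    </dd>
  </dl>
</div>
"""


def extract_sections(markup):
    """Group the text of each <dd class="df1"> by the class of the <dt> before it."""
    buckets = {"merit": [], "disadvantages": [], "df_tit": []}
    soup = BeautifulSoup(markup, "html.parser")
    for dl in soup.find_all("dl", class_="tc_list"):
        for dd in dl.find_all("dd", class_="df1"):
            # Nearest preceding <dt> sibling; whitespace text nodes are skipped.
            dt = dd.find_previous_sibling("dt")
            if dt is None:
                continue
            for cls in dt.get("class", []):
                if cls in buckets:
                    buckets[cls].append(dd.get_text(strip=True))
    return buckets


sections = extract_sections(html)
print(sections["merit"])
print(sections["disadvantages"])
print(sections["df_tit"])
```

In the question's scraping loop, the same helper could be called once per page and its lists appended to merit, disadv, and tit.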
0 votes
/ January 27, 2020
from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<div class="content_body_ty1">
  <div class="us_label_wrap">..</div>
    <h2 class="us_label">
  <dl class="tc_list">
    <dt class = "merit">merit</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>
    <dt class = "disadvantages">disadvantages</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>
    <dt class = "df_tit">wish</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>
  </dl>
</div>
'''
# If DT and DD are not one-to-one, we need to make some small changes
doc = SimplifiedDoc(html)
dl = doc.select('dl.tc_list')
# The first method.  
lst = dl.children
i = 0
N = len(lst)-1
while i<N:
  print (lst[i].text,lst[i+1].text)
  i+=2

# The second method.
dts = dl.dts
for dt in dts:
  print (dt.text,dt.next.text)

Result:

merit blah~~~~~~~~blah~~~~
disadvantages blah~~~~~~~~blah~~~~
wish blah~~~~~~~~blah~~~~
merit blah~~~~~~~~blah~~~~
disadvantages blah~~~~~~~~blah~~~~
wish blah~~~~~~~~blah~~~~
...
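
For readers who would rather stay with BeautifulSoup, the second method above (walk each dt and read the element that follows it) might be sketched roughly as below. It assumes every dt is followed by its own dd; if that one-to-one pairing does not hold, find_next_sibling('dd') would pick up a later dd and the loop would need an extra check. The pairs list is an illustrative name.

```python
from bs4 import BeautifulSoup

html = """
<dl class="tc_list">
  <dt class="merit">merit</dt>
  <dd class="df1"><span> blah~~~~~~~~blah~~~~</span></dd>
  <dt class="disadvantages">disadvantages</dt>
  <dd class="df1"><span> blah~~~~~~~~blah~~~~</span></dd>
  <dt class="df_tit">wish</dt>
  <dd class="df1"><span> blah~~~~~~~~blah~~~~</span></dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")
dl = soup.find("dl", class_="tc_list")

pairs = []
# Only direct <dt> children of the list, in document order.
for dt in dl.find_all("dt", recursive=False):
    # Next <dd> sibling; assumes a one-to-one dt/dd layout.
    dd = dt.find_next_sibling("dd")
    if dd is not None:
        pairs.append((dt.get_text(strip=True), dd.get_text(strip=True)))

for label, text in pairs:
    print(label, text)
```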