Question

I'm trying to extract the data between two elements, "Executives" and "Analysts", but I don't know how to proceed. My HTML:

<div class="content_part hid" id="article_participants">
<p>Wabash National Corporation (NYSE:<a title="" href="http://seekingalpha.com/symbol/wnc">WNC</a>)</p><p>Q4 2014 <span class="transcript-search-span" style="background-color: yellow;">Earnings</span> Conference <span class="transcript-search-span" style="background-color: rgb(243, 134, 134);">Call</span></p><p>February 04, 2015 10:00 AM ET</p>
<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>

I want to do this for a whole batch of files. My code so far:

from bs4 import BeautifulSoup
import requests
import textwrap
import os
from lxml import html
import csv

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            page=f.read()
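            # Parse the text already read into `page`; a second f.read() returns an empty string.
            soup = BeautifulSoup(page,'html.parser')
            match = soup.find('div',class_='content_part hid', id='article_participants')
            # Print inside the loop so every file is reported, not only the last one.
            print(match)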

I'm new to Python, so please bear with me.

My preferred output would be:

The headline can be found in the following HTML:

<div class="page_header_email_alerts" id="page_header">
      <h1>
        <span itemprop="headline">Wabash National's (WNC) CEO Richard Giromini on Q4 2014 Results - Earnings Call Transcript</span>
              </h1>

      <div id="article_info">
        <div class="article_info_pos">
          <span itemprop="datePublished" content="2015-02-04T21:48:03Z">Feb.  4, 2015  4:48 PM ET</span>
          <span id="title_article_comments"></span>
          <span class="print_hide"><span class="print_hide">&nbsp;|&nbsp;</span> <span>About:</span> <span id="about_primary_stocks"><a title="Wabash National Corporation" href="/symbol/WNC" sasource="article_primary_about_trc">Wabash National Corporation (WNC)</a></span></span>
          <span class="author_name_for_print">by: SA Transcripts</span>
            <span id="second_line_wrapper"></span>
        </div>

dabingsou · Answer 1 · February 10, 2020

Combined with your code:

import os
from simplified_scrapy.simplified_doc import SimplifiedDoc
directory ='C:/Research syntheses - Meta analysis/SeekingAlpha'
for filename in os.listdir(directory):
  if filename.endswith('.html'):
    fname = os.path.join(directory,filename)
    with open(fname, 'r') as f:
      page=f.read()
      doc = SimplifiedDoc(page)
      headline = doc.select('div#article_info>span#about_primary_stocks>a>text()')
      div = doc.select('div#article_participants')
      if not div: continue
      ps = div.getElements('p',start='<strong>Executives</strong>',end='<strong>Analysts</strong>')
      Executives = [p.text.split('-')[0].strip() for p in ps]
      ps = div.getElements('p',start='<strong>Analysts</strong>')
      Analysts = [p.text.split('-')[0].strip() for p in ps]
      print (headline)
      print (Executives)
      print (Analysts)

The following code is a self-contained example.

from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<div class="page_header_email_alerts" id="page_header">
  <h1>
    <span itemprop="headline">Wabash National's (WNC) CEO Richard Giromini on Q4 2014 Results - Earnings Call Transcript</span>
  </h1>
  <div id="article_info">
    <div class="article_info_pos">
      <span itemprop="datePublished" content="2015-02-04T21:48:03Z">Feb.  4, 2015  4:48 PM ET</span>
      <span id="title_article_comments"></span>
      <span class="print_hide"><span class="print_hide">&nbsp;|&nbsp;</span> <span>About:</span> <span id="about_primary_stocks"><a title="Wabash National Corporation" href="/symbol/WNC" sasource="article_primary_about_trc">Wabash National Corporation (WNC)</a></span></span>
      <span class="author_name_for_print">by: SA Transcripts</span>
        <span id="second_line_wrapper"></span>
    </div>
  </div>
</div>
<div class="content_part hid" id="article_participants">
<p>Wabash National Corporation (NYSE:<a title="" href="http://seekingalpha.com/symbol/wnc">WNC</a>)</p><p>Q4 2014 <span class="transcript-search-span" style="background-color: yellow;">Earnings</span> Conference <span class="transcript-search-span" style="background-color: rgb(243, 134, 134);">Call</span></p><p>February 04, 2015 10:00 AM ET</p>
<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
</div>
'''
doc = SimplifiedDoc(html)
headline = doc.select('div#article_info>span#about_primary_stocks>a>text()')
div = doc.select('div#article_participants')
ps = div.getElements('p',start='<strong>Executives</strong>',end='<strong>Analysts</strong>')
Executives = [p.text.split('-')[0].strip() for p in ps]
ps = div.getElements('p',start='<strong>Analysts</strong>')
Analysts = [p.text.split('-')[0].strip() for p in ps]

print (headline)
print (Executives)
print (Analysts)

Result:

Wabash National Corporation (WNC)
[u'Mike Pettit', u'Richard Giromini', u'Jeffery Taylor']
[u'Jeffery Taylor']

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

Janib Soomro · Answer 2 · February 10, 2020

@dabingsou has a good solution, but here is a much simpler approach that doesn't require any extra libraries:

import re

html = """<div class="content_part hid" id="article_participants">
<p>Wabash National Corporation (NYSE:<a title="" href="http://seekingalpha.com/symbol/wnc">WNC</a>)</p><p>Q4 2014 <span class="transcript-search-span" style="background-color: yellow;">Earnings</span> Conference <span class="transcript-search-span" style="background-color: rgb(243, 134, 134);">Call</span></p><p>February 04, 2015 10:00 AM ET</p>
<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>"""

# Capture everything from the Executives heading up to the next <strong> tag.
# re.DOTALL lets "." match newlines; the non-greedy ".+?" stops at the first
# following <strong> rather than the last one in the document.
soup = re.search(r"(<strong>Executives(.+?))<strong>", html, re.DOTALL)
print(soup.group(1))

Result (HTML):

<strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p>

Result (text):

from bs4 import BeautifulSoup as bs

print(bs(soup.group(1), "lxml").get_text())

Executives
Mike Pettit - Vice President of Finance and Investor Relations
Richard Giromini - President and Chief Executive Officer
Jeffery Taylor - Senior Vice President and Chief Financial Officer
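
For comparison, the same section can probably be pulled out with BeautifulSoup alone, which the question already imports. This is only a minimal sketch: it assumes each heading is a <p> wrapping a <strong>, as in the sample, and the helper name names_between is hypothetical, not part of any library.

```python
from bs4 import BeautifulSoup

html = """<div class="content_part hid" id="article_participants">
<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>
</div>"""


def names_between(soup, start_label, end_label):
    """Hypothetical helper: text of the <p> tags after the start heading, up to the end heading."""
    div = soup.find('div', id='article_participants')
    if div is None:
        return []
    start = div.find('strong', string=start_label)
    if start is None:
        return []
    lines = []
    # start.parent is the <p> that wraps the heading; walk its following <p> siblings.
    for p in start.parent.find_next_siblings('p'):
        text = p.get_text(strip=True)
        if text == end_label:
            break  # reached the next heading
        lines.append(text)
    return lines


soup = BeautifulSoup(html, 'html.parser')
executives = names_between(soup, 'Executives', 'Analysts')
print(executives)
```

Walking the parsed siblings avoids depending on how the HTML happens to be split across lines, which both the regex and the line-by-line approaches are sensitive to.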

ILovePython · Answer 3 · February 09, 2020

This isn't the most efficient way, but you can try:

with open(File_Path, 'r') as file:  # open the file (be careful with the encoding)
    text = file.readlines()  # read the content of the file as a list of lines
Goal = []  # will hold all the lines between Executives and Analysts
for indice, line in enumerate(text):
    if "<p><strong>Executives</strong></p>" in line:
        # Once the Executives line is found, add every following line to Goal
        # until a line containing <p><strong>Analysts</strong></p> appears
        # (or the file ends, so a missing marker cannot raise IndexError).
        i = 1
        while indice + i < len(text) and "<p><strong>Analysts</strong></p>" not in text[indice + i]:
            Goal.append(text[indice + i])
            i += 1
        break
print(Goal)

The most important part is the main loop, so you can adapt it to your program.

If you know the number of lines between Executives and Analysts, you can replace the while loop with:

Goal = text[indice + 1:indice + number_of_lines + 1]  # number_of_lines: the known line count

and remove: i = 1

This way you keep the tags (for example <p>...</p>) and the "\n" in all of your lines.

You can remove every "\n" from a line with the built-in string method:

line = line.replace("\n","")

There are several ways to get the data between the tags, for example using handle_data in HTMLParser, or you can use the findall function in re:

data_in_line = re.findall(r'>(.*?)<',line)

data_in_line will be a list of everything that matches the pattern r'>(.*?)<', that is, all the data between a '>' and the next '<'. For example, '<p>atest</p>' will return ['atest'].
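
The handle_data route mentioned above could look roughly like this with the standard library's html.parser. It is a sketch under assumptions, not a definitive implementation: the class name SectionParser and its attributes are hypothetical, and it assumes the two headings appear as standalone text nodes, as in the sample.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Hypothetical parser: collects the text found between two heading strings."""

    def __init__(self, start_label, end_label):
        super().__init__()
        self.start_label = start_label
        self.end_label = end_label
        self.inside = False  # True once the start heading has been seen
        self.lines = []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip the whitespace between tags
        if text == self.start_label:
            self.inside = True
        elif text == self.end_label:
            self.inside = False
        elif self.inside:
            self.lines.append(text)


sample = """<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>"""

parser = SectionParser('Executives', 'Analysts')
parser.feed(sample)
parser.close()
print(parser.lines)
```

Unlike the findall pattern, this yields only the text nodes between the two headings and none of the empty strings that appear between adjacent tags.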
