BeautifulSoup or requests not reading a section of a web page
0 votes
/ 03 February 2019

I am new to web scraping and am having trouble getting data from a web page.

I am trying to read this web page: https://www.timeanddate.com/weather/pakistan/lahore/historic?month=7&year=2018

and trying to get the wind speed data from the div element with the class wstext, but for some reason the page that the requests library fetches over the internet does not contain that class or some of its ancestors.

import requests
import bs4 as bs
import numpy as np

wind = np.random.rand(120)
dailyWindRecord = np.random.rand(30,4)

html = requests.get('https://www.timeanddate.com/weather/pakistan/lahore/historic?month=7&year=2018')

print(html.text)

soup = bs.BeautifulSoup(html.content, 'html5lib')

print(soup.prettify())

windList = soup.findAll('div')
print(windList)

I tried printing the HTML both as fetched directly by requests and after parsing it with BeautifulSoup, to see whether it contains this class, but I could not find it. Any help would be appreciated.

Answers [ 2 ]

0 votes
/ 03 February 2019

My investigation and a very, very dirty "solution"

1. BeautifulSoup is just fine

Look at the pandas solution - it works just fine.

Look at the pandas source - we see that pandas uses _BeautifulSoupHtml5LibFrameParser.

Ergo: BeautifulSoup is fine.

2. A "nitty-gritty, dirty" kind of solution with curl

Let's try curl:

$ curl https://www.timeanddate.com/weather/pakistan/lahore/historic\?month\=7\&year\=2018 > result.html   
$ less result.html

What we see here:

</script><script type="text/javascript">
var data={"copyright":"Contents are strictly for use by 
timeanddate.com","units": 
{"temp":"°C","prec":"mm","wind":"km\/h","baro":"mbar"},
"temp":        
[{"date":15304047E5,"temp":29},{"date":15304065E5,"temp":29},  
{"date":15304083E5,"temp":29},{"date":15304101E5,"temp":28},
...

I suppose this is the data the OP is looking for.

3. Possible solution

  1. Download the URL one way or another. curl / wget / requests - any of them should be fine
  2. From the downloaded HTML, extract var data.* Python str methods should be enough
  3. json.loads the extracted data
  4. Done

The beauty of this solution is that the data comes as is, with no decoding from an HTML <table>.
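
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested scraper: the helper name extract_data is made up for this example, and the sample string is a shortened copy of the curl output shown earlier, used in place of a real download.

```python
import json


def extract_data(html: str, marker: str = "var data=") -> dict:
    """Return the JSON object assigned to `var data` in the page source.

    `extract_data` and `marker` are names invented for this sketch.
    """
    # str.index raises ValueError if the marker is not in the page
    start = html.index(marker) + len(marker)
    # raw_decode parses one JSON value and ignores the JavaScript after it
    obj, _end = json.JSONDecoder().raw_decode(html[start:].lstrip())
    return obj


# Shortened sample based on the curl output above (not a real download)
sample = r'''<script type="text/javascript">
var data={"units":{"wind":"km\/h"},"temp":[{"date":15304047E5,"temp":29}]};
</script>'''

data = extract_data(sample)
print(data["units"]["wind"])
print(data["temp"][0]["temp"])
```

With the real page, the html argument would be the text downloaded in step 1, for example requests.get(url).text.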

PS

Personally, I like the pandas solution.

Because pandas is a great library.

But pandas is not needed to solve this problem.

0 votes
/ 03 February 2019

pandas can do the work for you, instead of using bs4 or requests:

import numpy as np
import pandas as pd

wind = np.random.rand(120)
dailyWindRecord = np.random.rand(30,4)

url = 'https://www.timeanddate.com/weather/pakistan/lahore/historic?month=7&year=2018'

tables = pd.read_html(url)

table = tables[1]

print (table.iloc[:,4])

Output:

print (table.iloc[:,4])
0       3 mph
1     No wind
2     No wind
3     No wind
4     No wind
5     No wind
6     No wind
7       3 mph
8       5 mph
9       6 mph
10      5 mph
11      5 mph
12      6 mph
13      5 mph
14    No wind
15      3 mph
16    No wind
17    No wind
18    No wind
19    No wind
20      5 mph
21    No wind
22      6 mph
23      6 mph
24      5 mph
25      6 mph
26      7 mph
27      7 mph
28      7 mph
29      3 mph
30      3 mph
31      3 mph
32      3 mph
33    No wind
34      3 mph
35      3 mph
36    No wind
37    No wind
38        NaN
Name: (Unnamed: 4_level_0, Wind), dtype: object
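
The column above holds strings such as "3 mph" and "No wind" plus a NaN, so it may need converting before any arithmetic. A minimal sketch of one way to do that, assuming "No wind" should count as 0 and missing cells should stay missing; parse_wind is a hypothetical helper, not part of pandas:

```python
import math
from typing import Optional


def parse_wind(value) -> Optional[float]:
    """Convert a cell such as '3 mph' or 'No wind' to a number.

    Hypothetical helper for illustration only.
    """
    # NaN (a float) or None marks a missing cell in the table
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip()
    if text.lower() == "no wind":
        return 0.0  # assumption: calm is treated as a speed of 0
    # '3 mph' -> '3' -> 3.0; the unit is dropped
    return float(text.split()[0])


cells = ["3 mph", "No wind", "5 mph", float("nan")]
speeds = [parse_wind(c) for c in cells]
print(speeds)  # [3.0, 0.0, 5.0, None]
```

The same function could be applied to the column with table.iloc[:, 4].map(parse_wind).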

Option 2:

You can find and pull out the JSON structure embedded in the HTML and then work with that. However, when I tried it, it gave the data for the whole month in 6-hour periods, not a single day by the hour:

import numpy as np
import requests
import bs4
import json

wind = np.random.rand(120)
dailyWindRecord = np.random.rand(30,4)

url = 'https://www.timeanddate.com/weather/pakistan/lahore/historic?month=7&year=2018'

response = requests.get(url)

soup = bs4.BeautifulSoup(response.text, 'html.parser')

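scripts = soup.find_all('script')
jsonObj = None

# Find the <script> that defines `var data=` and parse the JSON that follows it
for script in scripts:
    scriptText = script.string or ''
    if 'var data=' in scriptText:
        jsonStr = scriptText.split('var data=', 1)[1].lstrip()
        # raw_decode stops at the end of the JSON value, ignoring the trailing JavaScript
        jsonObj, _ = json.JSONDecoder().raw_decode(jsonStr)
        break

if jsonObj is None:
    raise SystemExit('"var data=" was not found in the page')

for item in jsonObj['detail']:
    date = item['ds']
    wind = item['wind']

    print('Date: %-40s   Wind: %s' % (date, wind))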

Output:

Date: Sunday, 1 July 2018, 00:00 — 06:00         Wind: 0.621
Date: Sunday, 1 July 2018, 06:00 — 12:00         Wind: 3.728
Date: Sunday, 1 July 2018, 12:00 — 18:00         Wind: 3.107
Date: Sunday, 1 July 2018, 18:00 — 00:00         Wind: 3.107
Date: Monday, 2 July 2018, 00:00 — 06:00         Wind: 1.864
Date: Monday, 2 July 2018, 06:00 — 12:00         Wind: 5.593
Date: Monday, 2 July 2018, 12:00 — 18:00         Wind: 8.7
Date: Monday, 2 July 2018, 18:00 — 00:00         Wind: 9.943
Date: Tuesday, 3 July 2018, 00:00 — 06:00        Wind: 10.564
Date: Tuesday, 3 July 2018, 06:00 — 12:00        Wind: 11.185
Date: Tuesday, 3 July 2018, 12:00 — 18:00        Wind: 9.943
Date: Tuesday, 3 July 2018, 18:00 — 00:00        Wind: 6.214
Date: Wednesday, 4 July 2018, 00:00 — 06:00      Wind: 6.836
Date: Wednesday, 4 July 2018, 06:00 — 12:00      Wind: 4.971
Date: Wednesday, 4 July 2018, 12:00 — 18:00      Wind: 6.214
Date: Wednesday, 4 July 2018, 18:00 — 00:00      Wind: 3.728
Date: Thursday, 5 July 2018, 00:00 — 06:00       Wind: 1.864
Date: Thursday, 5 July 2018, 06:00 — 12:00       Wind: 1.864
Date: Thursday, 5 July 2018, 12:00 — 18:00       Wind: 3.107
Date: Thursday, 5 July 2018, 18:00 — 00:00       Wind: 3.107
Date: Friday, 6 July 2018, 00:00 — 06:00         Wind: 1.864
Date: Friday, 6 July 2018, 06:00 — 12:00         Wind: 6.214
Date: Friday, 6 July 2018, 12:00 — 18:00         Wind: 6.836
Date: Friday, 6 July 2018, 18:00 — 00:00         Wind: 3.728
Date: Saturday, 7 July 2018, 00:00 — 06:00       Wind: 1.243
Date: Saturday, 7 July 2018, 06:00 — 12:00       Wind: 2.486
Date: Saturday, 7 July 2018, 12:00 — 18:00       Wind: 6.836
Date: Saturday, 7 July 2018, 18:00 — 00:00       Wind: 2.486
Date: Sunday, 8 July 2018, 00:00 — 06:00         Wind: 3.107
Date: Sunday, 8 July 2018, 06:00 — 12:00         Wind: 6.214
Date: Sunday, 8 July 2018, 12:00 — 18:00         Wind: 5.593
Date: Sunday, 8 July 2018, 18:00 — 00:00         Wind: 4.35
Date: Monday, 9 July 2018, 00:00 — 06:00         Wind: 5.593
Date: Monday, 9 July 2018, 06:00 — 12:00         Wind: 5.593
Date: Monday, 9 July 2018, 12:00 — 18:00         Wind: 6.214
Date: Monday, 9 July 2018, 18:00 — 00:00         Wind: 4.35
Date: Tuesday, 10 July 2018, 00:00 — 06:00       Wind: 6.836
Date: Tuesday, 10 July 2018, 06:00 — 12:00       Wind: 8.078
Date: Tuesday, 10 July 2018, 12:00 — 18:00       Wind: 6.836
Date: Tuesday, 10 July 2018, 18:00 — 00:00       Wind: 5.593
Date: Wednesday, 11 July 2018, 00:00 — 06:00     Wind: 6.214
Date: Wednesday, 11 July 2018, 06:00 — 12:00     Wind: 12.428
Date: Wednesday, 11 July 2018, 12:00 — 18:00     Wind: 8.078
Date: Wednesday, 11 July 2018, 18:00 — 00:00     Wind: 5.593
Date: Thursday, 12 July 2018, 00:00 — 06:00      Wind: 4.971
Date: Thursday, 12 July 2018, 06:00 — 12:00      Wind: 8.078
Date: Thursday, 12 July 2018, 12:00 — 18:00      Wind: 7.457
Date: Thursday, 12 July 2018, 18:00 — 00:00      Wind: 6.214
Date: Friday, 13 July 2018, 00:00 — 06:00        Wind: 5.593
Date: Friday, 13 July 2018, 06:00 — 12:00        Wind: 11.807
Date: Friday, 13 July 2018, 12:00 — 18:00        Wind: 9.321
Date: Friday, 13 July 2018, 18:00 — 00:00        Wind: 5.593
Date: Saturday, 14 July 2018, 00:00 — 06:00      Wind: 4.971
Date: Saturday, 14 July 2018, 06:00 — 12:00      Wind: 4.971
Date: Saturday, 14 July 2018, 12:00 — 18:00      Wind: 6.214
Date: Saturday, 14 July 2018, 18:00 — 00:00      Wind: 6.214
Date: Sunday, 15 July 2018, 00:00 — 06:00        Wind: 8.7
Date: Sunday, 15 July 2018, 06:00 — 12:00        Wind: 8.7
Date: Sunday, 15 July 2018, 12:00 — 18:00        Wind: 8.7
Date: Sunday, 15 July 2018, 18:00 — 00:00        Wind: 5.593
Date: Monday, 16 July 2018, 00:00 — 06:00        Wind: 4.971
Date: Monday, 16 July 2018, 06:00 — 12:00        Wind: 11.185
Date: Monday, 16 July 2018, 12:00 — 18:00        Wind: 11.185
Date: Monday, 16 July 2018, 18:00 — 00:00        Wind: 8.7
Date: Tuesday, 17 July 2018, 00:00 — 06:00       Wind: 7.457
Date: Tuesday, 17 July 2018, 06:00 — 12:00       Wind: 8.078
Date: Tuesday, 17 July 2018, 12:00 — 18:00       Wind: 6.836
Date: Tuesday, 17 July 2018, 18:00 — 00:00       Wind: 4.971
Date: Wednesday, 18 July 2018, 00:00 — 06:00     Wind: 3.728
Date: Wednesday, 18 July 2018, 06:00 — 12:00     Wind: 2.486
Date: Wednesday, 18 July 2018, 12:00 — 18:00     Wind: 6.214
Date: Wednesday, 18 July 2018, 18:00 — 00:00     Wind: 4.971
Date: Thursday, 19 July 2018, 00:00 — 06:00      Wind: 4.971
Date: Thursday, 19 July 2018, 06:00 — 12:00      Wind: 5.593
Date: Thursday, 19 July 2018, 12:00 — 18:00      Wind: 6.214
Date: Thursday, 19 July 2018, 18:00 — 00:00      Wind: 1.864
Date: Friday, 20 July 2018, 00:00 — 06:00        Wind: 2.486
Date: Friday, 20 July 2018, 06:00 — 12:00        Wind: 5.593
Date: Friday, 20 July 2018, 12:00 — 18:00        Wind: 8.078
Date: Friday, 20 July 2018, 18:00 — 00:00        Wind: 3.728
Date: Saturday, 21 July 2018, 00:00 — 06:00      Wind: 0.621
Date: Saturday, 21 July 2018, 06:00 — 12:00      Wind: 1.243
Date: Saturday, 21 July 2018, 12:00 — 18:00      Wind: 2.486
Date: Saturday, 21 July 2018, 18:00 — 00:00      Wind: 7.457
Date: Sunday, 22 July 2018, 00:00 — 06:00        Wind: 4.971
Date: Sunday, 22 July 2018, 06:00 — 12:00        Wind: 6.836
Date: Sunday, 22 July 2018, 12:00 — 18:00        Wind: 4.35
Date: Sunday, 22 July 2018, 18:00 — 00:00        Wind: 4.35
Date: Monday, 23 July 2018, 00:00 — 06:00        Wind: 2.486
Date: Monday, 23 July 2018, 06:00 — 12:00        Wind: 6.214
Date: Monday, 23 July 2018, 12:00 — 18:00        Wind: 6.836
Date: Monday, 23 July 2018, 18:00 — 00:00        Wind: 4.971
Date: Tuesday, 24 July 2018, 00:00 — 06:00       Wind: 3.107
Date: Tuesday, 24 July 2018, 06:00 — 12:00       Wind: 7.457
Date: Tuesday, 24 July 2018, 12:00 — 18:00       Wind: 4.35
Date: Tuesday, 24 July 2018, 18:00 — 00:00       Wind: 2.486
Date: Wednesday, 25 July 2018, 00:00 — 06:00     Wind: 1.243
Date: Wednesday, 25 July 2018, 06:00 — 12:00     Wind: 3.728
Date: Wednesday, 25 July 2018, 12:00 — 18:00     Wind: 6.836
Date: Wednesday, 25 July 2018, 18:00 — 00:00     Wind: 7.457
Date: Thursday, 26 July 2018, 00:00 — 06:00      Wind: 7.457
Date: Thursday, 26 July 2018, 06:00 — 12:00      Wind: 9.321
Date: Thursday, 26 July 2018, 12:00 — 18:00      Wind: 11.185
Date: Thursday, 26 July 2018, 18:00 — 00:00      Wind: 7.457
Date: Friday, 27 July 2018, 00:00 — 06:00        Wind: 6.836
Date: Friday, 27 July 2018, 06:00 — 12:00        Wind: 5.593
Date: Friday, 27 July 2018, 12:00 — 18:00        Wind: 4.35
Date: Friday, 27 July 2018, 18:00 — 00:00        Wind: 4.35
Date: Saturday, 28 July 2018, 00:00 — 06:00      Wind: 3.728
Date: Saturday, 28 July 2018, 06:00 — 12:00      Wind: 6.214
Date: Saturday, 28 July 2018, 12:00 — 18:00      Wind: 1.864
Date: Saturday, 28 July 2018, 18:00 — 00:00      Wind: 3.728
Date: Sunday, 29 July 2018, 00:00 — 06:00        Wind: 3.107
Date: Sunday, 29 July 2018, 06:00 — 12:00        Wind: 6.836
Date: Sunday, 29 July 2018, 12:00 — 18:00        Wind: 5.593
Date: Sunday, 29 July 2018, 18:00 — 00:00        Wind: 2.486
Date: Monday, 30 July 2018, 00:00 — 06:00        Wind: 1.864
Date: Monday, 30 July 2018, 06:00 — 12:00        Wind: 3.728
Date: Monday, 30 July 2018, 12:00 — 18:00        Wind: 4.971
Date: Monday, 30 July 2018, 18:00 — 00:00        Wind: 2.486
Date: Tuesday, 31 July 2018, 00:00 — 06:00       Wind: 1.243
Date: Tuesday, 31 July 2018, 06:00 — 12:00       Wind: 6.836
Date: Tuesday, 31 July 2018, 12:00 — 18:00       Wind: 6.836
Date: Tuesday, 31 July 2018, 18:00 — 00:00       Wind: 3.107
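
Since the output has four 6-hour periods per day, the readings can be regrouped into one row per day, which seems to be what the (30, 4) dailyWindRecord array in the question was meant to hold. A minimal sketch under that assumption; group_by_day is a name made up for this example:

```python
def group_by_day(values, per_day=4):
    """Split a flat list of readings into consecutive rows of `per_day` items."""
    if len(values) % per_day:
        raise ValueError("number of readings is not a multiple of per_day")
    return [values[i:i + per_day] for i in range(0, len(values), per_day)]


# First eight readings from the output above (1 and 2 July 2018)
winds = [0.621, 3.728, 3.107, 3.107, 1.864, 5.593, 8.7, 9.943]
daily = group_by_day(winds)
print(daily)
```

Note that July has 31 days, so the full month gives 31 rows rather than 30. With NumPy, np.array(winds).reshape(-1, 4) produces the same layout as an array.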

Below is a breakdown of the JSON format for wind:

[image: breakdown of the JSON structure for wind]

...