BeautifulSoup 4 - scraping an element (h2) outside of a div
0 votes
/ July 10, 2020

I am trying to scrape some football match data from the following website:

https://liveonsat.com/uk-england-all-football.php

Looking at the site's source code, I was able to determine that most of the information (team names, start times and channels) is contained in an outer div (div class="blockfix"). I can successfully scrape this data using the code below:

import requests
import time
import csv
import sys
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import messagebox
from tkinter import *
from PIL import ImageTk, Image


def makesoup(url):
    page=requests.get(url)
    return BeautifulSoup(page.text,"lxml")
   
    
    
def matchscrape(g_data):
    
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                    print(i.get_text().strip())
        except:
            pass
            
            
            
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))

    
        
        
root = tk.Tk()
root.resizable(False, False)
root.geometry("600x600")
root.wm_title("liveonsat scraper")
Label = tk.Label(root, text = 'liveonsat scraper', font = ('Comic Sans MS',18))
button = tk.Button(root, text="Scrape Matches", command=start)
button3 = tk.Button(root,  text = "Quit Program", command=quit)
Label.pack()
button.pack()
button3.pack()
status_label = tk.Label(text="")
status_label.pack()
root.mainloop()

I get, for example, the following output:

Output

The issue I am having is that one element (the date of the matches) is located outside of that div (div class="blockfix"), and I am unsure how to retrieve it. I tried changing the following code:

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))

to

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"})) 

as this element contains the h2 tag holding the date of the matches (h2 class="time_head"). When I attempt this, however, I get a completely different and incorrect output (see the code below):

def matchscrape(g_data):
    
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            matchdate = item.find_all("h2", {"class": "time_head"})[0].text
            print(matchdate)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                    print(i.get_text().strip())
        except:
            pass
            
            
            
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"}))

Incorrect output (only one match name, time and date are printed, followed by hundreds of channel names):

wrongoutput

To clarify further: the end result I am trying to achieve is to scrape and print each match together with its start time, the channels showing it and the date on which it is shown.

Thank you to anyone who can provide guidance or assistance to me with this issue. If further clarification or anything else is required I will be more than happy to provide.

Update: Below is the HTML code for one match, as requested in the comments. The element I am having trouble retrieving is h2 class="time_head".

 Football  Friday, 10th  July  English Championship - Week 43  
image Huddersfield v Luton Town image
image
<!-- around icon 4--> ST: 18:00
<!-- around icon 4 ENDS-->
<!-- around all tables of a channel type 4-->

1 Answer

1 vote
/ July 10, 2020

Here is how this can be achieved:

import requests
import re
import unidecode
from bs4 import BeautifulSoup

# Get page source
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

response = requests.get('https://liveonsat.com/uk-england-all-football.php', headers=headers)
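# Name the parser explicitly (lxml, as in the question) to avoid the "no parser specified" warning
soup = BeautifulSoup(response.content, "lxml")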

# process results

for match in soup.find_all('div',class_='blockfix'):
    # Competitors: using a regex, look for a div whose text holds two team names separated by ' v '
    competitors = match.find('div', text = re.compile(r'(.*) v (.*)')).text
    # Match date: the nearest preceding h2 tag with time_head as its class attribute
    match_date  = match.find_previous('h2',class_='time_head').text
    # Match time
    fLeft_time_live = match.find('div',class_='fLeft_time_live').text.strip()
    # Container holding the channel links
    channels = match.find('div',class_='fLeft_live')
    print("Competitors ", competitors)
    print("Match date", match_date)
    print("Match time", fLeft_time_live)
    
    #Grab tv transmission times
    for channel in channels.find_all('a'):
        # if the show time is available, it is stored in the link's "onmouseover" attribute
        # we try to parse this attribute, otherwise we just display the channel name
        try:
            show_date = BeautifulSoup(channel.get('onmouseover')).text
        except:
            print("  " ,channel.text.strip().replace('📺',''), "- no time displayed - ",)
            continue
        show_date  = unidecode.unidecode(show_date )
        #Some regex logic to extract the show date
        pattern = r"CAPTION, '(.*)'\)"
        show_date  = re.search(pattern,show_date ).group(1)
        print("  ", show_date )
        
    print()

Output

Competitors  Huddersfield v Luton Town
Match date Friday, 10th  July
Match time ST: 19:00
   beIN Connect MENA  - no time displayed - 
   beIN Connect TURKEY  - no time displayed - 
   beIN Sports MENA 12 HD  - 2020-07-10 17:00:00
   beIN Sports MENA 2 HD  - 2020-07-10 17:00:00
   beIN Sports Turkey 4 HD  - 2020-07-10 17:00:00
   Eleven Sports 2 Portugal HD  - 2020-07-10 17:00:00
   ....
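
The key call here is find_previous, which walks backwards through the document from each blockfix div and returns the nearest earlier h2 with the time_head class. A minimal offline sketch of that idea follows. The markup is a hypothetical, heavily simplified stand-in for the real page, and "Team A" to "Team D" and the second date are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption): the real page is far more complex.
html = """
<h2 class="time_head">Friday, 10th July</h2>
<div class="blockfix"><div class="fix">Huddersfield v Luton Town</div></div>
<div class="blockfix"><div class="fix">Team A v Team B</div></div>
<h2 class="time_head">Saturday, 11th July</h2>
<div class="blockfix"><div class="fix">Team C v Team D</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for block in soup.find_all("div", class_="blockfix"):
    # Nearest h2.time_head that appears before this block in document order
    date = block.find_previous("h2", class_="time_head").get_text(strip=True)
    fixture = block.find("div", class_="fix").get_text(strip=True)
    results.append((date, fixture))

for date, fixture in results:
    print(date, "|", fixture)
```

Each match is paired with the date heading that precedes it, even though the heading is not inside the blockfix div.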

EDIT: fixed the match date extraction.
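
As for why the td height="50" attempt printed a single match followed by hundreds of channels: my guess, assuming that td wraps the whole listing, is that find_all(...)[0] only ever returns the first fix div inside it, while the chan_col search collects the channels of every match at once. A small sketch under that assumption, with made-up markup and channel names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): one outer td wrapping every match block.
html = """
<table><tr><td height="50">
  <h2 class="time_head">Friday, 10th July</h2>
  <div class="blockfix">
    <div class="fix">Huddersfield v Luton Town</div>
    <table><tr><td class="chan_col">Channel 1</td></tr></table>
  </div>
  <div class="blockfix">
    <div class="fix">Team A v Team B</div>
    <table><tr><td class="chan_col">Channel 2</td><td class="chan_col">Channel 3</td></tr></table>
  </div>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("td", {"height": "50"})

# Searching from the outer td: [0] keeps only the first match...
first_match_only = outer.find_all("div", {"class": "fix"})[0].text
# ...while the channel search returns the channels of all matches together.
all_channels = [td.get_text(strip=True) for td in outer.find_all("td", {"class": "chan_col"})]

# Searching inside each blockfix div keeps channels attached to their own match.
per_match = {
    block.find("div", class_="fix").text: [
        td.get_text(strip=True) for td in block.find_all("td", class_="chan_col")
    ]
    for block in outer.find_all("div", class_="blockfix")
}

print(first_match_only)
print(all_channels)
print(per_match)
```

This is why iterating over the blockfix divs and looking backwards for the date tends to be the simpler route.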

...