I am trying to scrape some football match data from the following site:
https://liveonsat.com/uk-england-all-football.php
Looking at the page source, I was able to determine that most of the information (team names, start time and channels) is contained in an outer div (div class="blockfix"). I can successfully scrape this data using the code below:
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import messagebox
from tkinter import *
from PIL import ImageTk, Image

def makesoup(url):
    page=requests.get(url)
    return BeautifulSoup(page.text,"lxml")

def matchscrape(g_data):
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                print(i.get_text().strip())
        except:
            pass

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))

root = tk.Tk()
root.resizable(False, False)
root.geometry("600x600")
root.wm_title("liveonsat scraper")
Label = tk.Label(root, text = 'liveonsat scraper', font = ('Comic Sans MS',18))
button = tk.Button(root, text="Scrape Matches", command=start)
button3 = tk.Button(root, text = "Quit Program", command=quit)
Label.pack()
button.pack()
button3.pack()
status_label = tk.Label(text="")
status_label.pack()
root.mainloop()
For example, I get the following output:
The issue I am having is that one element (the date of the matches) is contained outside of the div (div class="blockfix"). I am unsure how to retrieve this data. I tried changing the code below:
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))
to
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"}))
as this element contains the h2 tag with the date of the matches (h2 class="time_head"), but when I attempt this I get completely different, incorrect output (see the code below):
def matchscrape(g_data):
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            matchdate = item.find_all("h2", {"class": "time_head"})[0].text
            print(matchdate)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                print(i.get_text().strip())
        except:
            pass

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"}))
Incorrect output: only one match name, time and date are printed, followed by hundreds of channel names.
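My guess at the cause, sketched on a made-up fragment: if the td with height="50" wraps every match on the page, then `find_all(...)[0]` only ever returns the first match inside it, while the channel loop walks every chan_col cell in the whole container. The markup, match names and channel names below are placeholders, and the real page layout is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one outer td wrapping two matches.
SAMPLE_HTML = """
<table><tr><td height="50">
  <h2 class="time_head">Friday, 10 July</h2>
  <div class="blockfix">
    <div class="fix">Match 1</div>
    <table><tr><td class="chan_col">Channel A</td></tr></table>
  </div>
  <div class="blockfix">
    <div class="fix">Match 2</div>
    <table><tr><td class="chan_col">Channel B</td></tr></table>
  </div>
</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the outer td carries height="50", so there is a single container.
containers = soup.find_all("td", {"height": "50"})
outer = containers[0]

# [0] picks the first match in the container; "Match 2" is never reached.
first_match = outer.find_all("div", {"class": "fix"})[0].get_text(strip=True)

# The channel search is not limited to one match, so it collects them all.
all_channels = [td.get_text(strip=True)
                for td in outer.find_all("td", {"class": "chan_col"})]

print(len(containers), first_match, all_channels)
```

If that assumption holds, it would explain one match name followed by every channel on the page.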
To clarify further: the end result I am trying to achieve is to scrape and print each match, its start time, the channels showing it, and the date on which it is shown.
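A minimal sketch of the result I am after, assuming each date heading (h2 class="time_head") appears in the document before the blockfix divs it applies to. It keeps iterating over the blockfix divs and uses Beautiful Soup's `find_previous` to look backwards for the nearest date heading. The sample markup, the second date, the extra team names and the channel names are placeholders, and `scrape_matches` and `text_of` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure is assumed.
SAMPLE_HTML = """
<div>
  <h2 class="time_head">Friday, 10 July</h2>
  <div class="blockfix">
    <div class="fix">Huddersfield - Luton Town</div>
    <div class="fLeft_time_live">ST: 18:00</div>
    <table><tr><td class="chan_col">Channel A</td>
               <td class="chan_col">Channel B</td></tr></table>
  </div>
  <h2 class="time_head">Saturday, 11 July</h2>
  <div class="blockfix">
    <div class="fix">Team C - Team D</div>
    <div class="fLeft_time_live">ST: 12:30</div>
    <table><tr><td class="chan_col">Channel C</td></tr></table>
  </div>
</div>
"""


def text_of(parent, name, css_class):
    """Return the stripped text of the first matching descendant, or None."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag else None


def scrape_matches(soup):
    """Build one dict per blockfix div, paired with the preceding date."""
    rows = []
    for block in soup.find_all("div", class_="blockfix"):
        # find_previous searches backwards in document order, so it can
        # reach an h2 that sits outside (before) the current blockfix div.
        date_tag = block.find_previous("h2", class_="time_head")
        rows.append({
            "date": date_tag.get_text(strip=True) if date_tag else None,
            "match": text_of(block, "div", "fix"),
            "start": text_of(block, "div", "fLeft_time_live"),
            "channels": [td.get_text(strip=True)
                         for td in block.find_all("td", class_="chan_col")],
        })
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = scrape_matches(soup)
for m in matches:
    print(m["date"], m["match"], m["start"], ", ".join(m["channels"]), sep=" | ")
```

Because the channel search is scoped to each blockfix div, every match keeps only its own channels, and several matches listed under one heading would share the same date.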
Thank you to anyone who can provide guidance or assistance to me with this issue. If further clarification or anything else is required I will be more than happy to provide.
Update: below is the HTML for one match, as requested in the comments. The element I am having trouble displaying is h2 class="time_head".
Football Friday, 10 July English Championship - Week 43
Huddersfield - Luton Town
<!-- around icon 4-->
ST: 18:00
<!-- around icon 4 ENDS-->
<!-- around all tables of a channel type 4-->