Python: scrape a parent URL for links, then the child URLs of those links, then table data, and save to a readable file
0 votes
/ 16 June 2020

I would like to scrape all the race URLs from the https://gg.co.uk/tips/today page (for example https://gg.co.uk/racing/16-jun-2020/thirsk-1300), then loop through each of those URLs to get the form-profile links such as https://gg.co.uk/racing/form-profile-2703975, and then parse the table on each form-profile page and write one CSV file per race (for example, one file for https://gg.co.uk/racing/16-jun-2020/thirsk-1300). Example output format:

PLACE  DATE           GOING         DISTANCE  CLASS    TIME  COURSE     JOCKEY
       16th Jun 2020  Good to Soft  7f        Class 5  1:00  Thirsk     F Norton
       4th Jun 2020   Standard      6f        Class 5  4:30  Newcastle  J Fanning

I managed to scrape the links, but I cannot then scrape each link and write the output to CSV:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://gg.co.uk/tips/today')
base_url = 'https://gg.co.uk'
soup = BeautifulSoup(page.text, 'html.parser')

# collect the relative race links; a set drops duplicates
link_set = set()
for link in soup.find_all('a', {'class': 'winning-post'}):
    web_links = link.get("href")
    print(base_url + web_links)
    link_set.add(web_links)
print(link_set)

1 Answer

0 votes
/ 17 June 2020

This script gets all the form-profile-xxx URLs from https://gg.co.uk/racing/16-jun-2020/thirsk-1300, then takes from each profile page the table row that belongs to this race and saves the rows to a CSV file:

import csv
import requests
from bs4 import BeautifulSoup


url = 'https://gg.co.uk/racing/16-jun-2020/thirsk-1300'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
# each runner on the race page links to its own form-profile page
for a in soup.select('a[href^="/racing/form-profile-"]'):
    u = 'https://gg.co.uk' + a['href']
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    # on the profile page, pick the table row that links back to this race
    row = s.select_one('tr:has(a[href="{}"])'.format(url.replace('https://gg.co.uk', '')))
    if not row:
        continue
    # the text fragments inside each cell are joined with newlines
    tds = [td.get_text(strip=True, separator='\n') for td in row.select('td')]
    print(tds)
    all_data.append(tds)

# utf-8 keeps names such as "D OʼMeara" writable regardless of the platform default
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)
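
The script above handles a single race. To cover every race linked from https://gg.co.uk/tips/today, as the question asks, the same steps could be wrapped in a loop. The following is only a rough sketch and has not been run against the live site. It assumes the race links on the tips page still carry the winning-post class used in the question, and that a race page may link to the same profile more than once. The names fetch, race_links, race_rows, csv_name and save_all_races are made up for illustration.

```python
import csv
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://gg.co.uk'


def fetch(url):
    """Download a page and return its HTML (a stub can be passed in for offline tests)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def race_links(tips_html):
    """Return the absolute race URLs found on the tips page, without duplicates."""
    soup = BeautifulSoup(tips_html, 'html.parser')
    # Assumption: race links carry the 'winning-post' class, as in the question.
    hrefs = [a['href'] for a in soup.select('a.winning-post[href]')]
    return sorted({urljoin(BASE_URL, href) for href in hrefs})


def race_rows(race_url, fetch=fetch):
    """Collect, from each runner's form profile, the table row for this race."""
    race_path = urlparse(race_url).path
    soup = BeautifulSoup(fetch(race_url), 'html.parser')
    rows, seen = [], set()
    for a in soup.select('a[href^="/racing/form-profile-"]'):
        profile_url = urljoin(BASE_URL, a['href'])
        if profile_url in seen:  # assumption: visit each profile only once
            continue
        seen.add(profile_url)
        profile = BeautifulSoup(fetch(profile_url), 'html.parser')
        row = profile.select_one('tr:has(a[href="{}"])'.format(race_path))
        if row:
            rows.append([td.get_text(strip=True, separator='\n')
                         for td in row.select('td')])
    return rows


def csv_name(race_url):
    """Build a file name such as '16-jun-2020_thirsk-1300.csv' from a race URL."""
    parts = urlparse(race_url).path.strip('/').split('/')
    return '_'.join(parts[1:]) + '.csv'


def save_all_races(fetch=fetch):
    """Write one CSV file per race linked from the tips page."""
    for race_url in race_links(fetch(BASE_URL + '/tips/today')):
        with open(csv_name(race_url), 'w', newline='', encoding='utf-8') as csvfile:
            csv.writer(csvfile).writerows(race_rows(race_url, fetch))
```

Calling save_all_races() would then write one file per race into the current directory. Passing fetch in as a parameter is just a convenience so the parsing can be tried on saved HTML without hitting the site.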

Prints:

['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9\n5\nF Norton\nM Johnston', '5/6\nWon']
['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9st 5lb\nF Norton\nM Johnston', '5/6\nWon']
['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9st 5lb\nF Norton\nM Johnston', '5/6\nWon']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9\n5\nS Donohoe\nC Fellowes', '5/2']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9st 5lb\nS Donohoe\nC Fellowes', '5/2']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9st 5lb\nS Donohoe\nC Fellowes', '5/2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9\n5\nKevin Stott\nK A Ryan', '12/1\n2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9st 5lb\nKevin Stott\nK A Ryan', '12/1\n2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9st 5lb\nKevin Stott\nK A Ryan', '12/1\n2']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9\n0\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9st\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9st\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['5th\n6\n4', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMy Best Friend\n9\n5\nD Nolan\nD OʼMeara', '15/2\n4.25']
['5th\n6\n4', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMy Best Friend\n9st 5lb\nD Nolan\nD OʼMeara', '15/2\n4.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9\n5\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9st 5lb\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9st 5lb\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9\n5\nBen Robinson\nOllie Pears', '40/1\n7']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9st 5lb\nBen Robinson\nOllie Pears', '40/1\n7']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9st 5lb\nBen Robinson\nOllie Pears', '40/1\n7']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9\n5\nD Allan\nT D Easterby', '33/1\n27']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9st 5lb\nD Allan\nT D Easterby', '33/1\n27']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9st 5lb\nD Allan\nT D Easterby', '33/1\n27']
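
Each cell in the rows printed above still holds several values separated by newlines, while the question asks for one value per column (PLACE, DATE, GOING, DISTANCE, CLASS, TIME, COURSE, JOCKEY). A possible way to flatten a row is sketched below. It assumes the cell layout seen in the printed rows: place first in cell one, then date, going, distance and class in cell two, "time course" at the start of cell three, and the jockey second to last in cell three, just before the trainer. The flatten_row helper and the HEADER list are illustrative names, not part of the original script.

```python
import csv
import io

HEADER = ['PLACE', 'DATE', 'GOING', 'DISTANCE', 'CLASS', 'TIME', 'COURSE', 'JOCKEY']


def flatten_row(tds):
    """Turn one scraped row (newline-separated cell strings) into flat columns."""
    place = tds[0].split('\n')[0]
    date, going, distance, race_class = tds[1].split('\n')[:4]
    runner = tds[2].split('\n')
    # the first entry looks like '1:00 Thirsk': time, then course
    time, course = runner[0].split(' ', 1)
    # assumption: the jockey is second to last, the trainer is last
    jockey = runner[-2]
    return [place, date, going, distance, race_class, time, course, jockey]


# one of the rows from the printed output
sample = ['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5',
          '1:00 Thirsk\nHigh Peak\n9st 5lb\nF Norton\nM Johnston', '5/6\nWon']

# write to an in-memory buffer here; a real file object works the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerow(flatten_row(sample))
print(buffer.getvalue())
```

In the original script, writer.writerow(flatten_row(row)) could replace writer.writerow(row) if this layout holds for every row.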

It also saves data.csv (screenshot from LibreOffice):

[Screenshot: data.csv opened in LibreOffice]

...