Extracting the content inside <script type="text/javascript"> and $(function () {...}) - PullRequest
0 votes
/ May 29, 2019

I am working on scraping data from websites. I was able to extract the content inside the <script> tag, but inside it there is a $(function () {...}) block, and I want to extract the content from that.

import urllib.request
from bs4 import BeautifulSoup
import json

url = 'https://www.broadwayinbound.com/shows/'
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
results = soup.find_all('script', {'type': 'text/javascript'})
r = []
for result in results:
    # Keep only the script that defines the `shows` array
    if 'var shows = [' in result.text:
        r.append(result.text)
print(r[0])

I want to extract only the contents of `var shows`:

{"Id":"12680","ClientClassCode":"default","ShowName":"Ain't Too Proud - The Life and Times of The Temptations","ShowCode":"AINTPROUD","SortName":"Ain't Too Proud - The Life and Times of The Temptations","ShowLogo":"/product-resources/Aint-Too-Proud-Temptations-Musical-Broadway-Group-Sales-Show-Tickets-500-102318.jpg","ShowLogoText":"Ain't Too Proud - The Life and Times of The Temptations Tickets | Broadway......

Answers [ 2 ]

1 vote
/ May 29, 2019

Assuming the rest of your code works, a simple regular expression should do the trick :)

import urllib.request
import re
import json
from bs4 import BeautifulSoup

url = 'https://www.broadwayinbound.com/shows/'
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
results = soup.find_all('script', {'type': 'text/javascript'})
r = []
for result in results:
    if 'var shows = [' in result.text:
        # Capture the array literal assigned to `shows`
        x = re.findall(r"var shows = (\[.*\])", result.text)
        if x:
            r.append(x[0])

shows = json.loads(r[0])  # parse the captured array once
print(shows)
print(shows[0]["Id"])
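
The regular expression above only matches when the whole array sits on one line, because `.` does not cross newlines by default. A minimal sketch of a multi-line variant follows. It assumes the array literal is valid JSON and that the assignment ends with `];`. `SAMPLE_SCRIPT` and `extract_shows` are hypothetical names used for illustration, not part of the page or of the answer.

```python
import json
import re

# Hypothetical stand-in for the text of the matching <script> tag.
SAMPLE_SCRIPT = """
$(function () {
    var shows = [
        {"Id": "1", "ShowName": "Example Show A"},
        {"Id": "2", "ShowName": "Example Show B"}
    ];
    initShows(shows);
});
"""


def extract_shows(script_text):
    """Return the list assigned to `var shows`, or None if it is absent."""
    # re.DOTALL lets `.` match newlines, so the array may span several lines.
    # The greedy `.*` stops at the LAST `];` in the text, which is an
    # assumption about the page: no other `];` follows the array.
    match = re.search(r"var shows = (\[.*\]);", script_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


shows = extract_shows(SAMPLE_SCRIPT)
print(shows[0]["Id"])
```

If the real script contains another `];` after the array, the greedy match would capture too much and `json.loads` would raise an error, so this pattern may need tightening for the actual page.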

0 votes
/ May 29, 2019

You will have to manipulate the string. Essentially, this gives you a list of parsed JSON objects:

import requests
from bs4 import BeautifulSoup
import json 

url = 'https://www.broadwayinbound.com/shows/'
response = requests.get(url)
data = response.text     # a `str`, already decoded by requests
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
results = soup.find_all('script', {'type': 'text/javascript'})
r = []


for result in results :
    if 'var shows = [' in result.text:
        jsonStr = result.text

        jsonStr = jsonStr.split('var shows = [')[1]
        jsonStr = jsonStr.rsplit('];',1)[0]

        jsonStr_list = jsonStr.split('{"Id":')[1:]

        for each in jsonStr_list:
            # Drop surrounding whitespace and the comma that separated
            # this object from the next one
            each = each.strip().rstrip(',')

            # Restore the prefix removed by split() and parse the object
            jsonTemp = '{"Id":' + each
            jsonObj = json.loads(jsonTemp)

            r.append(jsonObj)

Output (truncated):

print(r)
[{'Id': '12680', 'ClientClassCode': 'default', 'ShowName': "Ain't Too Proud - The Life and Times of The Temptations", 'ShowCode': 'AINTPROUD', 'SortName': "Ain't Too Proud - The Life and Times of The Temptations", 'ShowLogo': '/product-resources/Aint-Too-Proud-Temptations-Musical-Broadway-Group-Sales-Show-Tickets-500-102318.jpg', 'ShowLogoText': "Ain't Too Proud - The Life and Times of The Temptations Tickets | Broadway Inbound", 'ShowPromo': '', 'ShowPromoText': '', 'Description': "<em>Ain't Too Proud</em> is the electrifying new musical that follows The Temptations' extraordinary journey from the streets of Detroit to the Rock & Roll Hall of Fame.<br /><br />Five guys. One dream. And a sound that would make music history. With their signature dance moves and unmistakable harmonies, they rose to the top of the charts creating an amazing 42 Top Ten Hits with 14 reaching number one. The rest is history — how they met, the groundbreaking heights they hit, and how personal and political conflicts threatened to tear the group apart as the United States fell into civil unrest. 
This thrilling story of brotherhood, family, loyalty, and betrayal is set to the beat of the group's treasured hits, including “My Girl,” “Just My Imagination,” “Get Ready,” “Papa Was a Rolling Stone,” and so many more.<br /><br />After breaking house records at Berkeley Rep, The Kennedy Center, and at the Ahmanson Theater, <em>Ain't Too Proud</em>, written by three time Obie Award winner Dominique Morisseau, directed by two-time Tony Award® winner Des McAnuff (<em>Jersey Boys</em>), and featuring choreography by Tony nominee Sergio Trujillo (<em>Jersey Boys</em>, <em>On Your Feet</em>), now brings the untold story of this legendary quintet to irresistible life on Broadway.", 'Category': 'Broadway', 'CategoryCode': 'BW', 'ShowType': 'Musical', 'ShowTypeCode': 'MUSICAL', 'Rating': 'Might not be suitable for younger children', 'RatingCode': 'PT', 'City': 'New York', 'CityCode': 'NYCA', 'FirstPerformance': '2/28/2019', 'NextPerformance': '5/30/2019', 'NextPerformanceTime': '7:00 PM', 'OnSaleThrough': '6/7/2020', 'Weekdays': ['fr', 'mo', 'sa', 'su', 'th', 'tu', 'we'], 'MinPrice': '42.00', 'MaxPrice': '385.90', 'GroupMinimum': '10', 'MaximumTickets': '25', 'VenueName': 'Imperial Theatre', 'Url': '/shows/aint-too-proud-the-life-and-times-of-the-temptations/', 'BroadwayCollectionEN': 'http://www.broadwaycollection.com/shows/https://www.broadwaycollection.com/shows/aint-too-proud/', 'BroadwayCollectionES': 'http://www.broadwaycollection.com/es/shows/https://www.broadwaycollection.com/es/shows/aint-too-proud/', 'BroadwayCollectionDE': 'http://www.broadwaycollection.com/de/shows/https://www.broadwaycollection.com/de/shows/aint-too-proud/', 'BroadwayCollectionJA': 'http://www.broadwaycollection.com/ja/shows/https://www.broadwaycollection.com/ja/shows/aint-too-proud/', 'BroadwayCollectionPT': 'http://www.broadwaycollection.com/pt-br/shows/https://www.broadwaycollection.com/pt-br/shows/aint-too-proud/', 'BroadwayCollectionZH': 
'http://www.broadwaycollection.com/zh-hans/shows/https://www.broadwaycollection.com/zh-hans/shows/aint-too-proud/', 'RunTime': '2 hours and 30 minutes, including intermission', 'ShowLetUsKnow': False}, ...]
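
Splitting on `{"Id":` depends on the exact key order and spacing in the page. As a possible alternative, the standard library's `json.JSONDecoder.raw_decode` can parse the array in one pass, starting at the `[` and ignoring whatever JavaScript follows it. This is only a sketch, assuming the array literal is valid JSON. `MARKER`, `parse_shows`, and the sample string are hypothetical names and data used for illustration.

```python
import json

# Text that immediately precedes the array in the script (assumed spacing).
MARKER = "var shows = "


def parse_shows(script_text):
    """Parse the JSON array that follows MARKER; return [] if it is absent."""
    start = script_text.find(MARKER)
    if start == -1:
        return []
    # raw_decode parses one JSON value beginning exactly at the given index
    # and returns (value, end_index); trailing JavaScript is left untouched.
    shows, _end = json.JSONDecoder().raw_decode(script_text, start + len(MARKER))
    return shows


# Hypothetical script text: brackets inside strings and a later array
# do not confuse the decoder.
sample = (
    '$(function () { var shows = [{"Id": "1", "ShowName": "A [test] show"}, '
    '{"Id": "2", "ShowName": "B"}]; var other = [1, 2]; });'
)
for show in parse_shows(sample):
    print(show["Id"], show["ShowName"])
```

Because the decoder stops at the end of the first complete value, there is no need to locate the closing `];` or to strip commas between objects.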