How can I replace all URLs in an HTML document with their final redirect targets?
0 votes
/ April 11, 2020

I would prefer to use BeautifulSoup, since I am already using it for other things, but any Python solution will do.

    s = BeautifulSoup(bodyhtml, features="lxml")
    items = s.find_all("div", {"class": "text-block"})
    # I want to replace all URLs in `items` with their final redirect.

Here is an example URL:

https://tracking.tldrnewsletter.com/CL0/https:%2F%2Farstechnica.com%2Finformation-technology%2F2020%2F04%2Fmeet-dark_nexus-quite-possibly-the-most-potent-iot-botnet-ever%2F/1/0100017163ab9f84-cfdbd3c3-ef8c-4b34-b2a0-f6f4b8f78359-000000/BEB0JUmMqamX4piPthkn_oJ78cjvd6UocEmGf7iO5Pk=136

Here is items[5] (all items have the same structure):

<div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a><br/><br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span><br/></span><br/></div>

1 Answer

1 vote
/ April 11, 2020

Select the relevant `a` elements. Replace the tracking prefix of each `href` attribute with an empty string, assuming the prefix is the same for every link. Discard everything after the first `/` that remains. Then unquote (percent-decode) the result, like so:

from bs4 import BeautifulSoup
from urllib.parse import unquote


html = """
<head>

    <body>
        <p>
            <div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
                <br/>
                <br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
                <br/>
                </span>
                <br/>
            </div>
        </p>

        </body>
</head>
"""

s = BeautifulSoup(html, features="lxml")
for a in s.select('div.text-block a'):
    # Drop the tracking prefix, keep only the first path segment, then percent-decode it.
    a['href'] = unquote(a['href'].replace("https://tracking.tldrnewsletter.com/CL0/", "").split('/')[0])
print(s)

Output:

<html><head>
</head><body>
<p>
</p><div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://www.polygon.com/2020/4/8/21213551/google-stadia-free-pro-subscription"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
<br/>
<br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
<br/>
</span>
<br/>
</div>
</body>
</html>
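
If some links in the document might not be tracking links, the same string manipulation can be wrapped in a small helper that leaves other URLs untouched. This is only a sketch: the name `unwrap_tracking_url` and the `TRACKING_PREFIX` constant are made up for illustration, and it assumes the destination is always the first percent-encoded path segment after the prefix, as in the two example URLs in this thread.

```python
from urllib.parse import unquote

# Assumption: every tracked link starts with exactly this prefix.
TRACKING_PREFIX = "https://tracking.tldrnewsletter.com/CL0/"


def unwrap_tracking_url(href, prefix=TRACKING_PREFIX):
    """Return the destination embedded in a tracking link, or href unchanged."""
    if not href.startswith(prefix):
        return href  # not a tracking link, so leave it alone
    # The destination is the first path segment after the prefix, percent-encoded.
    encoded_target = href[len(prefix):].split("/", 1)[0]
    return unquote(encoded_target)
```

Inside the loop over the selected elements, the assignment would then become `a['href'] = unwrap_tracking_url(a['href'])`.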
...
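
Note that the approach above decodes the destination from the link text and never contacts the tracking server. If the destination were not embedded in the URL, one alternative would be to actually follow the redirects. The following is a rough sketch using only the standard library; `final_url` and `resolve_all` are hypothetical names, and it assumes the tracker answers with ordinary HTTP redirects (not JavaScript or meta-refresh pages). Each lookup costs a network request and will probably be counted as a click by the tracker.

```python
from urllib.request import urlopen


def final_url(url, timeout=10):
    """Follow HTTP redirects for url and return the address of the final response."""
    with urlopen(url, timeout=timeout) as response:
        return response.url


def resolve_all(hrefs, resolve=final_url):
    """Map each distinct href to its resolved address; keep the original on failure."""
    resolved = {}
    for href in hrefs:
        if href in resolved:
            continue  # already looked up, avoid a second request
        try:
            resolved[href] = resolve(href)
        except (OSError, ValueError):
            # Network errors (URLError is an OSError) or malformed URLs: leave as is.
            resolved[href] = href
    return resolved
```

The resulting mapping could then be applied in the same kind of loop, for example `a['href'] = mapping.get(a['href'], a['href'])`, where `mapping` is the dictionary returned by `resolve_all`.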