Question

Я пытаюсь скомпилировать файлы патентов с веб-страницы USPTO с BeautifulSoup.

df['link']
urls=df['link'].to_numpy()
urls
for i in urls:
    page = requests.get(i)
    ## storing the content of the page in a variable
    txt = page.text
    ## creating BeautifulSoup object
    soup = bs4.BeautifulSoup(txt, 'html.parser')
    soup

, однако, он печатает только один из URL, а не все 5 ссылок. Мне нужны все 5 ссылок в текстовом виде.

Любые предложения приветствуются. Приветствия

ССЫЛКИ, КОТОРЫЕ Я НУЖЕН СКАЧАТЬ #

array(['http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=2&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=3&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=4&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=5&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n'],
      dtype=object)

αԋɱҽԃ αмєяιcαη · Answer 1 · 28 марта 2020

import pandas as pd


def Main(url):
    for item in range(1, 6):
        df = pd.read_html(url.format(item), attrs={
                          'width': '100%'}, skiprows=1, match=r"^\d{4}/\d+")[0]
        df.to_csv("data.csv", index=False, mode="a")


Main("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r={}&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n")

Вывод: Просмотр онлайн

Снимок экрана:

Новый код для каждого запроса на использование в комментариях:

import requests


def Main(url):
    for item in range(1, 6):
        with requests.Session() as req:
            r = req.get(url.format(item))
            with open("data.txt", 'a') as f:
                f.writelines(r.text)


Main("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r={}&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n")

удаление нескольких URL с помощью bs4

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

1 Ответ

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

удаление нескольких URL с помощью bs4

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

1 Ответ

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Похожие темы