Getting information about lawyers from different links on a site - PullRequest
1 vote
/ June 6, 2019

I am an absolute beginner at web scraping with Python and know very little about Python programming in general. I am trying to extract information about lawyers in the state of Tennessee. The web page has several links; inside those are more links, and inside those are the individual lawyers.

Could you please tell me which steps I should follow?

I have gotten as far as extracting the links on the first page, but I only need the city links, whereas I currently have every link with an href attribute. How can I iterate over them and continue?

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

links = [item['href'] for item in soup.select('a')]
print(links)

It prints:

C:\Users\laptop\AppData\Local\Programs\Python\Python36-32\python.exe C:/Users/laptop/.PyCharmCE2017.1/config/scratches/scratch_1.py
['https://www.superlawyers.com', 'https://attorneys.superlawyers.com', 'https://ask.superlawyers.com', 'https://video.superlawyers.com', ....

All the links are extracted whereas I only need the links of the cities. Kindly help.

Answers [ 3 ]

2 votes
/ June 6, 2019

Without a regular expression:

cities = soup.find('div', class_="three_browse_columns")
for city in cities.find_all('a'):
    print(city['href'])
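The container-based approach can be checked against a minimal HTML snippet. The class name three_browse_columns comes from the answer above; the markup itself is invented for illustration, and html.parser is used here simply to avoid requiring lxml:

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup mimicking the city-list container on the page,
# plus one unrelated link outside the container.
html = """
<div class="three_browse_columns">
  <a href="https://attorneys.superlawyers.com/tennessee/memphis/">Memphis</a>
  <a href="https://attorneys.superlawyers.com/tennessee/nashville/">Nashville</a>
</div>
<a href="https://www.superlawyers.com">Home</a>
"""

soup = bs(html, "html.parser")
# Restricting find_all to the container skips the unrelated link.
cities = soup.find("div", class_="three_browse_columns")
links = [a["href"] for a in cities.find_all("a")]
print(links)
```

Only the two links inside the container are returned; the site-wide link is excluded.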
1 vote
/ June 7, 2019

A faster approach is to use the parent element's id and then select the a tags inside it:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://attorneys.superlawyers.com/tennessee/')
soup = bs(r.content, 'lxml')
cities = [item['href'] for item in soup.select('#browse_view a')]
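To "continue" from here, the same pattern can be repeated one level down: fetch each city page and collect the lawyer-profile links from it. Below is a minimal sketch; the extract_links helper, the lawyer_list id, and the sample markup are all hypothetical, so the real pages may need a different selector:

```python
from bs4 import BeautifulSoup as bs

def extract_links(html, selector):
    """Parse an HTML page and return the hrefs matched by a CSS selector."""
    soup = bs(html, "html.parser")
    return [a["href"] for a in soup.select(selector)]

# Hypothetical city page; the id and profile URLs are made up for illustration.
city_html = """
<div id="lawyer_list">
  <a href="https://profiles.superlawyers.com/tennessee/memphis/lawyer/jane-doe/abc.html">Jane Doe</a>
  <a href="https://profiles.superlawyers.com/tennessee/memphis/lawyer/john-roe/def.html">John Roe</a>
</div>
"""
lawyers = extract_links(city_html, "#lawyer_list a")
print(lawyers)

# Against the live site, the loop over the collected city links would look like
# (not executed here; inspect a city page to find the right selector first):
# for city in cities:
#     r = requests.get(city, headers={'User-agent': 'Super Bot 9000'})
#     lawyers += extract_links(r.text, "#lawyer_list a")
```

Adding a short time.sleep between requests in that loop is also a good idea so the crawl stays polite.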
1 vote
/ June 6, 2019

Use the re regular-expression module and match the city value in the href.

from bs4 import BeautifulSoup as bs
import requests
import re

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

links = [item['href'] for item in soup.find_all('a',href=re.compile('https://attorneys.superlawyers.com/tennessee/'))]
print(links)

Output:

['https://attorneys.superlawyers.com/tennessee/alamo/', 'https://attorneys.superlawyers.com/tennessee/bartlett/', 'https://attorneys.superlawyers.com/tennessee/brentwood/', 'https://attorneys.superlawyers.com/tennessee/bristol/', 'https://attorneys.superlawyers.com/tennessee/chattanooga/', 'https://attorneys.superlawyers.com/tennessee/clarksville/', 'https://attorneys.superlawyers.com/tennessee/cleveland/', 'https://attorneys.superlawyers.com/tennessee/clinton/', 'https://attorneys.superlawyers.com/tennessee/columbia/', 'https://attorneys.superlawyers.com/tennessee/cookeville/', 'https://attorneys.superlawyers.com/tennessee/cordova/', 'https://attorneys.superlawyers.com/tennessee/covington/', 'https://attorneys.superlawyers.com/tennessee/dayton/', 'https://attorneys.superlawyers.com/tennessee/dickson/', 'https://attorneys.superlawyers.com/tennessee/dyersburg/', 'https://attorneys.superlawyers.com/tennessee/elizabethton/', 'https://attorneys.superlawyers.com/tennessee/franklin/', 'https://attorneys.superlawyers.com/tennessee/gallatin/', 'https://attorneys.superlawyers.com/tennessee/germantown/', 'https://attorneys.superlawyers.com/tennessee/goodlettsville/', 'https://attorneys.superlawyers.com/tennessee/greeneville/', 'https://attorneys.superlawyers.com/tennessee/henderson/', 'https://attorneys.superlawyers.com/tennessee/hendersonville/', 'https://attorneys.superlawyers.com/tennessee/hixson/', 'https://attorneys.superlawyers.com/tennessee/huntingdon/', 'https://attorneys.superlawyers.com/tennessee/huntsville/', 'https://attorneys.superlawyers.com/tennessee/jacksboro/', 'https://attorneys.superlawyers.com/tennessee/jackson/', 'https://attorneys.superlawyers.com/tennessee/jasper/', 'https://attorneys.superlawyers.com/tennessee/johnson-city/', 'https://attorneys.superlawyers.com/tennessee/kingsport/', 'https://attorneys.superlawyers.com/tennessee/knoxville/', 'https://attorneys.superlawyers.com/tennessee/la-follette/', 
'https://attorneys.superlawyers.com/tennessee/lafayette/', 'https://attorneys.superlawyers.com/tennessee/lafollette/', 'https://attorneys.superlawyers.com/tennessee/lawrenceburg/', 'https://attorneys.superlawyers.com/tennessee/lebanon/', 'https://attorneys.superlawyers.com/tennessee/lenoir-city/', 'https://attorneys.superlawyers.com/tennessee/lewisburg/', 'https://attorneys.superlawyers.com/tennessee/lexington/', 'https://attorneys.superlawyers.com/tennessee/madisonville/', 'https://attorneys.superlawyers.com/tennessee/manchester/', 'https://attorneys.superlawyers.com/tennessee/maryville/', 'https://attorneys.superlawyers.com/tennessee/memphis/', 'https://attorneys.superlawyers.com/tennessee/millington/', 'https://attorneys.superlawyers.com/tennessee/morristown/', 'https://attorneys.superlawyers.com/tennessee/murfreesboro/', 'https://attorneys.superlawyers.com/tennessee/nashville/', 'https://attorneys.superlawyers.com/tennessee/paris/', 'https://attorneys.superlawyers.com/tennessee/pleasant-view/', 'https://attorneys.superlawyers.com/tennessee/pulaski/', 'https://attorneys.superlawyers.com/tennessee/rogersville/', 'https://attorneys.superlawyers.com/tennessee/sevierville/', 'https://attorneys.superlawyers.com/tennessee/sewanee/', 'https://attorneys.superlawyers.com/tennessee/shelbyville/', 'https://attorneys.superlawyers.com/tennessee/somerville/', 'https://attorneys.superlawyers.com/tennessee/spring-hill/', 'https://attorneys.superlawyers.com/tennessee/springfield/', 'https://attorneys.superlawyers.com/tennessee/tullahoma/', 'https://attorneys.superlawyers.com/tennessee/white-house/', 'https://attorneys.superlawyers.com/tennessee/winchester/', 'https://attorneys.superlawyers.com/tennessee/woodlawn/']

If you want to use a CSS selector instead, use the following code.

from bs4 import BeautifulSoup as bs
import requests
res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')
links = [item['href'] for item in soup.select('a[href^="https://attorneys.superlawyers.com/tennessee"]')]
print(links)
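One thing to watch with a prefix match like this: if a city link appears in more than one place on the page, the list will contain duplicates. A small, self-contained way to de-duplicate while keeping the first-seen order (the sample list here is made up):

```python
# dict.fromkeys preserves insertion order (Python 3.7+), so converting
# back to a list drops duplicates without reordering the links.
links = [
    "https://attorneys.superlawyers.com/tennessee/memphis/",
    "https://attorneys.superlawyers.com/tennessee/nashville/",
    "https://attorneys.superlawyers.com/tennessee/memphis/",
]
unique_links = list(dict.fromkeys(links))
print(unique_links)
```

Using set(links) would also de-duplicate, but it does not preserve the original order.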

Output:

The same list of city URLs as shown above.