Good morning, world
I'm new to Python and trying things out. I'm trying to remove duplicate links from the run below. It currently finds 253 links. Can anyone help me with this?
    import requests
    from bs4 import BeautifulSoup
    import csv

    page = "https://www.census.gov/programs-surveys/popest.html"
    r = requests.get(page)
    raw_html = r.text
    soup = BeautifulSoup(raw_html, 'html.parser')
    links = soup.find_all("a")
    print('Number of links retrieved: ', len(links))
A set does not preserve order, so I used a list instead and cleaned up each href before checking for duplicates. Now len(links) is 123.
    from bs4 import BeautifulSoup
    import requests

    r = requests.get("https://www.census.gov/programs-surveys/popest.html")
    soup = BeautifulSoup(r.text, 'html.parser')

    links = []
    for item in soup.find_all("a", href=True):
        item = item.get("href")
        # Anything that does not start with "h" is treated as relative
        # and prefixed with the site root
        if not item.startswith("h"):
            item = f"https://www.census.gov/{item}"
        # Keep only the first occurrence so the page order is preserved
        if item not in links:
            links.append(item)
            print(item)

    print(len(links))
Output:
    https://www.census.gov/#content
    https://www.census.gov/en.html
    https://www.census.gov/topics/population/age-and-sex.html
    https://www.census.gov/businessandeconomy
    https://www.census.gov/topics/education.html
    https://www.census.gov/topics/preparedness.html
    https://www.census.gov/topics/employment.html
    https://www.census.gov/topics/families.html
    https://www.census.gov/topics/population/migration.html
    https://www.census.gov/geo
    https://www.census.gov/topics/health.html
    https://www.census.gov/topics/population/hispanic-origin.html
    https://www.census.gov/topics/housing.html
    https://www.census.gov/topics/income-poverty.html
    https://www.census.gov/topics/international-trade.html
    https://www.census.gov/topics/population.html
    https://www.census.gov/topics/population/population-estimates.html
    https://www.census.gov/topics/public-sector.html
    https://www.census.gov/topics/population/race.html
    https://www.census.gov/topics/research.html
    https://www.census.gov/topics/public-sector/voting.html
    https://www.census.gov/about/index.html
    https://www.census.gov/data
    https://www.census.gov/academy
    https://www.census.gov/about/what/admin-data.html
    https://www.census.gov/data/data-tools.html
    https://www.census.gov/developers/
    https://www.census.gov/data/experimental-data-products.html
    https://www.census.gov/data/related-sites.html
    https://www.census.gov/data/software.html
    https://www.census.gov/data/tables.html
    https://www.census.gov/data/training-workshops.html
    https://www.census.gov/library/visualizations.html
    https://www.census.gov/library.html
    https://www.census.gov/AmericaCounts
    https://www.census.gov/library/audio.html
    https://www.census.gov/library/fact-sheets.html
    https://www.census.gov/library/photos.html
    https://www.census.gov/library/publications.html
    https://www.census.gov/library/video.html
    https://www.census.gov/library/working-papers.html
    https://www.census.gov/programs-surveys/are-you-in-a-survey.html
    https://www.census.gov/programs-surveys/decennial-census/2020census-redirect.html
    https://www.census.gov/2020census
    https://www.census.gov/programs-surveys/acs
    https://www.census.gov/programs-surveys/ahs.html
    https://www.census.gov/programs-surveys/abs.html
    https://www.census.gov/programs-surveys/asm.html
    https://www.census.gov/programs-surveys/cog.html
    https://www.census.gov/programs-surveys/cbp.html
    https://www.census.gov/programs-surveys/cps.html
    https://www.census.gov/EconomicCensus
    https://www.census.gov/internationalprograms
    https://www.census.gov/programs-surveys/metro-micro.html
    https://www.census.gov/popest
    https://www.census.gov/programs-surveys/popproj.html
    https://www.census.gov/programs-surveys/saipe.html
    https://www.census.gov/programs-surveys/susb.html
    https://www.census.gov/programs-surveys/sbo.html
    https://www.census.gov/sipp/
    https://www.census.gov/programs-surveys/surveys-programs.html
    https://www.census.gov/newsroom.html
    https://www.census.gov/partners
    https://www.census.gov/programs-surveys/sis.html
    https://www.census.gov/NAICS
    https://www.census.gov/library/reference/code-lists/schedule/b.html
    https://www.census.gov/data/developers/data-sets/Geocoding-services.html
    https://www.census.gov/about-us
    https://www.census.gov/about/who.html
    https://www.census.gov/about/what.html
    https://www.census.gov/about/business-opportunities.html
    https://www.census.gov/careers
    https://www.census.gov/fieldjobs
    https://www.census.gov/about/history.html
    https://www.census.gov/about/policies.html
    https://www.census.gov/privacy
    https://www.census.gov/regions
    https://www.census.gov/about/contact-us/staff-finder.html
    https://www.census.gov/about/contact-us.html
    https://www.census.gov/about/faqs.html
    https://www.commerce.gov/
    https://www.census.gov//en.html
    https://www.census.gov//programs-surveys.html
    https://www.census.gov//popest
    https://www.census.gov//programs-surveys/popest/about.html
    https://www.census.gov//programs-surveys/popest/data.html
    https://www.census.gov//programs-surveys/popest/geographies.html
    https://www.census.gov//programs-surveys/popest/guidance.html
    https://www.census.gov//programs-surveys/popest/guidance-geographies.html
    https://www.census.gov//programs-surveys/popest/library.html
    https://www.census.gov//programs-surveys/popest/news.html
    https://www.census.gov//programs-surveys/popest/technical-documentation.html
    https://www.census.gov//programs-surveys/popest/data/tables.html
    https://www.census.gov//programs-surveys/popest/about/schedule.html
    https://www.census.gov//newsroom/press-releases/2019/popest-nation.html
    https://www.census.gov//newsroom/press-releases/2019/popest-nation/popest-nation-spanish.html
    https://www.census.gov//newsroom/press-releases/2019/new-years-2020.html
    https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-national.html
    https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-state.html
    https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-county.html
    https://www.census.gov//library/publications/2015/demo/p25-1142.html
    https://www.census.gov//library/publications/2010/demo/p25-1139.html
    https://www.census.gov//library/publications/2010/demo/p25-1138.html
    https://www.census.gov//programs-surveys/popest/library/publications.html
    https://www.census.gov//library/visualizations/2020/comm/superbowl.html
    https://www.census.gov//library/visualizations/2019/comm/slower-growth-nations-pop.html
    https://www.census.gov//library/visualizations/2019/comm/happy-new-year-2020.html
    https://www.census.gov//programs-surveys/popest/library/visualizations.html
    https://www.census.gov/#
    https://www.census.gov/#uscb-nav-skip-header
    https://www.census.gov/newsroom/blogs.html
    https://www.census.gov/newsroom/stories.html
    https://www.facebook.com/uscensusbureau
    https://twitter.com/uscensusbureau
    https://www.linkedin.com/company/us-census-bureau
    https://www.youtube.com/user/uscensusbureau
    https://www.instagram.com/uscensusbureau/
    https://www.census.gov/quality/
    https://www.census.gov/datalinkage
    https://www.census.gov/about/policies/privacy/privacy-policy.html#accessibility
    https://www.census.gov/foia
    https://www.usa.gov/
    https://www.census.gov// 
    123
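One thing worth noting about the output: a few links still show up twice, once with a single slash and once with a doubled slash (for example https://www.census.gov/en.html and https://www.census.gov//en.html), because hrefs that already start with "/" get another "/" from the prefix. A possible way around that, as a rough sketch only, is to resolve each href with urllib.parse.urljoin from the standard library and then deduplicate with dict.fromkeys, which keeps first-seen order. The helper name unique_links and the sample hrefs are made up for illustration and are not part of the code above.

```python
from urllib.parse import urljoin


def unique_links(hrefs, base_url):
    """Resolve each href against base_url and drop duplicates,
    keeping the order in which links were first seen.

    Hypothetical helper for illustration only.
    """
    resolved = (urljoin(base_url, href) for href in hrefs)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(resolved))


page = "https://www.census.gov/programs-surveys/popest.html"
# Made-up sample hrefs that mix absolute, root-relative, and fragment forms
sample = ["/en.html", "https://www.census.gov/en.html", "/popest", "#content", "/popest"]

for link in unique_links(sample, page):
    print(link)
```

With the real page, the hrefs could be fed in as (a["href"] for a in soup.find_all("a", href=True)). The count would then likely differ from 123, since the slash variants collapse into one entry and fragment links such as #content resolve against the page URL rather than the site root.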
Convert it to a set and that will remove the duplicates:

    links = set(soup.find_all("a"))

Output:

    Number of links retrieved: 244
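A possible reason the count only drops to 244 here: the set holds the tag elements themselves, not the URLs, so two anchors that point to the same address but differ in their link text or attributes still count as different items. If the goal is unique URLs, it may work better to put the href strings into the set. Here is a small sketch with made-up HTML (the markup and variable names are assumptions for illustration, not taken from the census.gov page):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not taken from the census.gov page
html = """
<a href="/en.html">Home</a>
<a href="/en.html">English</a>
<a href="/popest">Population Estimates</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A set of tags: every anchor here differs in its markup, so none is dropped
unique_tags = set(soup.find_all("a"))

# A set of href strings: only the link targets are compared
unique_hrefs = {a["href"] for a in soup.find_all("a", href=True)}

print(len(unique_tags))
print(len(unique_hrefs))
print(sorted(unique_hrefs))
```

Keep in mind that a set does not keep the order of the links on the page, so this fits best when order does not matter.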