Scraping links from a website fails with a 403 error (Python)
0 votes
/ January 13, 2019

I'm trying to scrape links from a list of links (all pointing to different pages of the same site), but I keep getting a 403 error. Here are examples of the links I'm trying to scrape:

https://www.spectatornews.com/page/6/?s=band

https://www.spectatornews.com/page/7/?s=band

etc.

Here is my code:

getarticles = []

from bs4 import BeautifulSoup
import urllib.request

for i in listoflinks:
    resp = urllib.request.urlopen(i)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

    for link in soup.find_all('a', href=True):

        getarticles.append(link['href'])

I tried some of the answers from "HTTP error 403 in Python 3 Web Scraping", but without much success. I'm not sure whether I'm applying them correctly to the whole list of links. I tried one of the solutions there that sets a header, but it returns HTTP Error 406: Not Acceptable.

Here is the code I tried as a fix:

getarticles = []
from bs4 import BeautifulSoup
import urllib.request

for i in listoflinks:
    req=urllib.request.Request(i, headers={'User-Agent': 'Mozilla/5.0'})
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

    for link in soup.find_all('a', href=True):

        getarticles.append(link['href'])

Any help is greatly appreciated. I'm very new to this, so the more you can explain, the better. I just want to collect the links from my list of pages!

Thanks

Answers [ 2 ]

0 votes
/ January 14, 2019

I want to say up front that I rarely use the urllib or urllib3 libraries. However, I tried the scrapy shell terminal command, and also the requests library without a user agent, and got a 200 response both times.

I noticed that you didn't declare a parser type when you created "soup".

 soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

I'm much more comfortable with scrapy's parser, even though it is heavier, but if I remember correctly you should declare the parser type explicitly (otherwise BeautifulSoup guesses one and emits a warning), for example

soup = BeautifulSoup(resp, "lxml")

Bitto Bennichan says he managed to get a 200 response with urllib.request, so try his changes, which simply supply a full user-agent string.

I would suggest using the requests library instead. I think it would be a simple enough change.

from bs4 import BeautifulSoup
import requests

listoflinks = ['https://www.spectatornews.com/page/6/?s=band', 'https://www.spectatornews.com/page/7/?s=band']

getarticles = []

for i in listoflinks:
    resp = requests.get(i)
    soup = BeautifulSoup(resp.content, "lxml")

    for link in soup.find_all('a', href=True):

        getarticles.append(link['href'])

The getarticles list printed this:

'https://www.spectatornews.com/category/showcase/',
 'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/',
 'https://www.spectatornews.com/category/showcase/',
 'https://www.spectatornews.com/page/5/?s=band',
 'https://www.spectatornews.com/?s=band',
 'https://www.spectatornews.com/page/2/?s=band',
 'https://www.spectatornews.com/page/3/?s=band',
 'https://www.spectatornews.com/page/4/?s=band',
 'https://www.spectatornews.com/page/5/?s=band',
 'https://www.spectatornews.com/page/7/?s=band',
 'https://www.spectatornews.com/page/8/?s=band',
 'https://www.spectatornews.com/page/9/?s=band',
 'https://www.spectatornews.com/page/127/?s=band',
 'https://www.spectatornews.com/page/7/?s=band',
 'https://www.spectatornews.com',
 'https://www.spectatornews.com/feed/rss/',
 '#',
 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
 'https://www.snapchat.com/add/spectator news',
 'https://www.instagram.com/spectatornews/',
 'http://twitter.com/spectatornews',
 'http://facebook.com/spectatornews',
 '/',
 'https://snosites.com/why-sno/',
 'http://snosites.com',
 'https://www.spectatornews.com/wp-login.php',
 '#top',
 '/',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/category/currents/',
 'https://www.spectatornews.com/category/sports/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/category/multimedia-2/',
 'https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/category/currents/',
 'https://www.spectatornews.com/category/sports/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/category/multimedia-2/',
 '/',
 'https://www.spectatornews.com/about/',
 'https://www.spectatornews.com/about/editorial-policy/',
 'https://www.spectatornews.com/about/correction-policy/',
 'https://www.spectatornews.com/about/bylaws/',
 'https://www.spectatornews.com/advertise/',
 'https://www.spectatornews.com/contact/',
 'https://www.spectatornews.com/staff/',
 'https://www.spectatornews.com/submit-a-letter/',
 'https://www.spectatornews.com/submit-a-news-tip/',
 '/',
 'https://www.spectatornews.com',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/category/currents/',
 'https://www.spectatornews.com/category/sports/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/category/multimedia-2/',
 '/',
 'https://www.spectatornews.com/feed/rss/',
 '#',
 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
 'https://www.snapchat.com/add/spectator news',
 'https://www.instagram.com/spectatornews/',
 'http://twitter.com/spectatornews',
 'http://facebook.com/spectatornews',
 'https://www.spectatornews.com/campus-news/2002/05/09/late-night-bus-service-idea-abandoned-due-to-expense/',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/opinion/2002/03/21/yates-deserved-what-she-got-husband-also-to-blame/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/opinion/2001/11/29/air-force-concert-band-inspires-zorn-arena-audience/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/campus-news/2001/10/25/goth-style-bands-will-entertain-at-halloween-costume-concert/',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/campus-news/2001/04/19/campus-group-will-host-hemp-event-with-bands-information/',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
 'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
 'https://www.spectatornews.com/staff/?writer=Alanna%20Huggett',
 'https://www.spectatornews.com/category/currents/',
 'https://www.spectatornews.com/tag/geekcon/',
 'https://www.spectatornews.com/tag/tv10/',
 'https://www.spectatornews.com/tag/uwec/',
 'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
 'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
 'https://www.spectatornews.com/staff/?writer=Kar%20Wei%20Cheng',
 'https://www.spectatornews.com/category/column-2/',
 'https://www.spectatornews.com/category/multimedia-2/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/tag/accessories/',
 'https://www.spectatornews.com/tag/fashion/',
 'https://www.spectatornews.com/tag/multimedia/',
 'https://www.spectatornews.com/tag/winter/',
 'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
 'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
 'https://www.spectatornews.com/staff/?writer=Julia%20Van%20Allen',
 'https://www.spectatornews.com/category/multimedia-2/',
 'https://www.spectatornews.com/tag/dancing/',
 'https://www.spectatornews.com/tag/harry-potter/',
 'https://www.spectatornews.com/tag/smom/',
 'https://www.spectatornews.com/tag/student-ministry-of-magic/',
 'https://www.spectatornews.com/tag/uwec/',
 'https://www.spectatornews.com/tag/yule/',
 'https://www.spectatornews.com/tag/yule-ball/',
 'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
 'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
 'https://www.spectatornews.com/staff/?writer=Madeline%20Fuerstenberg',
 'https://www.spectatornews.com/category/column-2/',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/tag/1950/',
 'https://www.spectatornews.com/tag/1975/',
 'https://www.spectatornews.com/tag/2000/',
 'https://www.spectatornews.com/tag/articles/',
 'https://www.spectatornews.com/tag/spectator/',
 'https://www.spectatornews.com/tag/throwback/',
 'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
 'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
 'https://www.spectatornews.com/staff/?writer=Taylor%20Reisdorf',
 'https://www.spectatornews.com/category/column-2/',
 'https://www.spectatornews.com/category/currents/',
 'https://www.spectatornews.com/tag/altoona/',
 'https://www.spectatornews.com/tag/boss-women/',
 'https://www.spectatornews.com/tag/business-women/',
 'https://www.spectatornews.com/tag/cherish-woodford/',
 'https://www.spectatornews.com/tag/crossfit/',
 'https://www.spectatornews.com/tag/crossfit-river-prairie/',
 'https://www.spectatornews.com/tag/eau-claire/',
 'https://www.spectatornews.com/tag/fitness/',
 'https://www.spectatornews.com/tag/gym/',
 'https://www.spectatornews.com/tag/local/',
 'https://www.spectatornews.com/tag/nicole-randall/',
 'https://www.spectatornews.com/tag/river-prairie/',
 'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
 'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
 'https://www.spectatornews.com/staff/?writer=Lea%20Kopke',
 'https://www.spectatornews.com/category/currents/',
 'https://www.spectatornews.com/tag/bad-art/',
 'https://www.spectatornews.com/tag/fmdown/',
 'https://www.spectatornews.com/tag/ghosts-of-the-sun/',
 'https://www.spectatornews.com/tag/music/',
 'https://www.spectatornews.com/tag/pablo-center/',
 'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
 'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
 'https://www.spectatornews.com/staff/?writer=Stephanie%20Janssen',
 'https://www.spectatornews.com/category/column-2/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/tag/satire/',
 'https://www.spectatornews.com/tag/sleepy/',
 'https://www.spectatornews.com/tag/tator/',
 'https://www.spectatornews.com/tag/uw-eau-claire/',
 'https://www.spectatornews.com/tag/uwec/',
 'https://www.spectatornews.com/page/6/?s=band',
 'https://www.spectatornews.com/?s=band',
 'https://www.spectatornews.com/page/2/?s=band',
 'https://www.spectatornews.com/page/3/?s=band',
 'https://www.spectatornews.com/page/4/?s=band',
 'https://www.spectatornews.com/page/5/?s=band',
 'https://www.spectatornews.com/page/6/?s=band',
 'https://www.spectatornews.com/page/8/?s=band',
 'https://www.spectatornews.com/page/9/?s=band',
 'https://www.spectatornews.com/page/10/?s=band',
 'https://www.spectatornews.com/page/127/?s=band',
 'https://www.spectatornews.com/page/8/?s=band',
 'https://www.spectatornews.com',
 'https://www.spectatornews.com/feed/rss/',
 '#',
 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
 'https://www.snapchat.com/add/spectator news',
 'https://www.instagram.com/spectatornews/',
 'http://twitter.com/spectatornews',
 'http://facebook.com/spectatornews',
 '/',
 'https://snosites.com/why-sno/',
 'http://snosites.com',
 'https://www.spectatornews.com/wp-login.php',
 '#top',
 '/',
 'https://www.spectatornews.com/category/campus-news/',
 'https://www.spectatornews.com/category/currents/',
 'https://www.spectatornews.com/category/sports/',
 'https://www.spectatornews.com/category/opinion/',
 'https://www.spectatornews.com/category/multimedia-2/']
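
The list above repeats many entries and mixes absolute URLs with relative ones ('/') and bare fragments ('#', '#top'). If that gets in the way, a minimal sketch like the following could tidy it up using only the standard library. The helper name clean_links and the decision to drop fragment-only links are assumptions for illustration.

```python
from urllib.parse import urljoin


def clean_links(hrefs, base_url):
    """Resolve hrefs against base_url, skip '#...' links, de-duplicate in order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        # Assumption: links that only point at a fragment of the same page
        # ('#', '#top') are not interesting and can be dropped.
        if href.startswith("#"):
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones.
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


sample = [
    "/",
    "#",
    "#top",
    "https://www.spectatornews.com/category/sports/",
    "/",
    "https://www.spectatornews.com/category/sports/",
]
print(clean_links(sample, "https://www.spectatornews.com/page/6/?s=band"))
```

In the loop from the answer, base_url would be the page the links were scraped from (the variable i).
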

0 votes
/ January 13, 2019

403 FORBIDDEN

The server understood the request but refuses to authorize it.

406 NOT ACCEPTABLE

The target resource does not have a current representation that would be acceptable to the user agent, according to the proactive negotiation header fields received in the request, and the server is unwilling to supply a default representation.
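
As a small aside, the standard library already knows the names of these status codes, which can make error messages easier to read. The following is only a sketch; describe_status is a hypothetical helper name. With urllib, a failed urlopen raises urllib.error.HTTPError, and its code attribute is what you would pass in.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'CODE Phrase' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (unknown status)"
    return f"{status.value} {status.phrase}"


for code in (403, 406):
    print(describe_status(code))
```
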

Your User-Agent may be the problem. I was able to get output by changing it to a full browser string:

from bs4 import BeautifulSoup
import urllib.request
listoflinks=['https://www.spectatornews.com/page/6/?s=band','https://www.spectatornews.com/page/6/?s=band']
getarticles = []
for i in listoflinks:
    req = urllib.request.Request(
        i,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'),features="lxml")
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
print(getarticles)

Output

['https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/about/', 'https://www.spectatornews.com/about/editorial-policy/', 'https://www.spectatornews.com/about/correction-policy/', 'https://www.spectatornews.com/about/bylaws/', 'https://www.spectatornews.com/advertise/', 'https://www.spectatornews.com/contact/', 'https://www.spectatornews.com/staff/', 'https://www.spectatornews.com/submit-a-letter/', 'https://www.spectatornews.com/submit-a-news-tip/', '/', 'https://www.spectatornews.com', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 'http://facebook.com/spectatornews', 'https://www.spectatornews.com/campus-news/2004/05/06/english-fest-draws-speakers-bands/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/campus-news/2004/05/03/burgers-on-the-grill-bands-on-the-scene/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2004/04/29/hempfest-celebrates-its-10th-year-with-11-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/29/pat-mcgee-band-rocks-mad-town/', 'https://www.spectatornews.com/category/showcase/', 
'https://www.spectatornews.com/showcase/2004/04/22/leinenkugels-battle-of-the-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/08/on-the-music-scene-band-makes-mondays-better/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/03/18/on-the-music-scene-band-carries-on-duluozs-work/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/10/09/jamband-grooving-to-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/05/01/joepalooza-set-with-5-bands-one-drummer/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/campus-news/2003/05/01/hempfest-features-nine-bands/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2003/02/17/houston-based-band-reaching-out-to-college-students-on-tour/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/?s=band', 'https://www.spectatornews.com/page/2/?s=band', 'https://www.spectatornews.com/page/3/?s=band', 'https://www.spectatornews.com/page/4/?s=band', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com/page/8/?s=band', 'https://www.spectatornews.com/page/9/?s=band', 'https://www.spectatornews.com/page/127/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 
'http://facebook.com/spectatornews', '/', 'https://snosites.com/why-sno/', 'http://snosites.com', 'https://www.spectatornews.com/wp-login.php', '#top', '/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', 'https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/about/', 'https://www.spectatornews.com/about/editorial-policy/', 'https://www.spectatornews.com/about/correction-policy/', 'https://www.spectatornews.com/about/bylaws/', 'https://www.spectatornews.com/advertise/', 'https://www.spectatornews.com/contact/', 'https://www.spectatornews.com/staff/', 'https://www.spectatornews.com/submit-a-letter/', 'https://www.spectatornews.com/submit-a-news-tip/', '/', 'https://www.spectatornews.com', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 'http://facebook.com/spectatornews', 'https://www.spectatornews.com/campus-news/2004/05/06/english-fest-draws-speakers-bands/', 'https://www.spectatornews.com/category/campus-news/', 
'https://www.spectatornews.com/campus-news/2004/05/03/burgers-on-the-grill-bands-on-the-scene/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2004/04/29/hempfest-celebrates-its-10th-year-with-11-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/29/pat-mcgee-band-rocks-mad-town/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/22/leinenkugels-battle-of-the-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/08/on-the-music-scene-band-makes-mondays-better/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/03/18/on-the-music-scene-band-carries-on-duluozs-work/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/10/09/jamband-grooving-to-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/05/01/joepalooza-set-with-5-bands-one-drummer/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/campus-news/2003/05/01/hempfest-features-nine-bands/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2003/02/17/houston-based-band-reaching-out-to-college-students-on-tour/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/?s=band', 'https://www.spectatornews.com/page/2/?s=band', 'https://www.spectatornews.com/page/3/?s=band', 'https://www.spectatornews.com/page/4/?s=band', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com/page/8/?s=band', 
'https://www.spectatornews.com/page/9/?s=band', 'https://www.spectatornews.com/page/127/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 'http://facebook.com/spectatornews', '/', 'https://snosites.com/why-sno/', 'http://snosites.com', 'https://www.spectatornews.com/wp-login.php', '#top', '/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/']
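
Since the fix above comes down to the request headers, it may help to build the request in one place. The following is a sketch only: build_request and BROWSER_HEADERS are hypothetical names, and the Accept header is an assumption (406 is about content negotiation, so a browser-like Accept value seems plausible, but I have not verified that this site needs it). Nothing is sent over the network here; the request object is only constructed and inspected.

```python
import urllib.request

BROWSER_HEADERS = {
    # Same full user-agent string as in the code above.
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/35.0.1916.47 Safari/537.36"
    ),
    # Assumption: a browser-like Accept header, not confirmed as required.
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build (but do not send) a Request carrying browser-like headers."""
    return urllib.request.Request(url, headers=dict(headers))


req = build_request("https://www.spectatornews.com/page/6/?s=band")
# urllib stores header names capitalized, so 'User-Agent' becomes 'User-agent'.
print(req.get_header("User-agent"))
print(req.get_header("Accept"))
```

The resulting object can be passed to urllib.request.urlopen exactly like req in the loop above.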

Edit to handle 404 errors:

Some of the links in your list may be unavailable. One option is to wrap the request in a try-except block so that those links are skipped and the remaining ones are still processed.

So the final code would be

from bs4 import BeautifulSoup
import urllib.error
import urllib.request
listoflinks=['https://www.spectatornews.com/page/6/?s=band','https://www.spectatornews.com/page/6/?s=band','https://www.spectatornews.com/page/100099?s=band','http://sdfgsdjhgfjsgdhfgsj.com']
getarticles = []
for i in listoflinks:
    req = urllib.request.Request(
        i,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            # Skip this link; without `continue` the parsing below would run
            # on a stale (or undefined) resp.
            print("Unavailable link", i, " skipping---")
            continue
        raise
    except urllib.error.URLError:
        # Raised when the host cannot be reached, e.g. the made-up domain above.
        print("Unreachable link", i, " skipping---")
        continue
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'), features="lxml")
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
print(getarticles)
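
If you want to check the skip-on-error logic without hitting the site, one option is to move the loop into a function that takes the download step as a parameter. This is a sketch under several assumptions: collect_links, fetch_html and LinkCollector are hypothetical names, and the standard library's html.parser is swapped in for BeautifulSoup purely so the example has no third-party dependencies.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/35.0.1916.47 Safari/537.36"
)


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (stdlib stand-in for soup.find_all)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher: download one page with a browser-like User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def collect_links(urls, fetch=fetch_html):
    """Return (links, failed); URLs whose fetch raises URLError are skipped."""
    links, failed = [], []
    for url in urls:
        try:
            html = fetch(url)
        except urllib.error.URLError as exc:  # HTTPError is a subclass
            failed.append((url, str(exc)))
            continue
        parser = LinkCollector()
        parser.feed(html)
        links.extend(parser.links)
    return links, failed
```

Calling collect_links(listoflinks) would use the real fetcher; passing a fake fetch function lets you simulate a 404 or an unreachable host. Unlike the code above, this version skips every HTTP error, not only 404, which may or may not be what you want.
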
...