I'm trying to learn web scraping with Python, and as an exercise I want to scrape all of the audio files from this website.
Here is my current code:
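from bs4 import BeautifulSoup
import requests
# Download the raw HTML that the server returns for the page
source = requests.get('https://www.nasa.gov/connect/sounds/index.html').text
# Parse it with the lxml parser and print the whole parsed document
soup = BeautifulSoup(source, 'lxml')
print(soup)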
However, I don't think it is retrieving all of the HTML from the page, because this is the output I get:
<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
<head>
<meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<meta content="NASA" property="og:site_name"/>
<link href="http://www.w3.org/1999/xhtml/vocab" rel="profile"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="text/html" name="dc.format"/>
<meta content="Text" name="dc.type"/>
<meta content="und" name="dc.language"/>
<meta content="/connect/sounds/index.html" name="dc.identifier"/>
<meta content="2015-01-26T09:44-05:00" name="dc.date"/>
<meta content="Jim Wilson" name="dc.creator"/>
<meta content="Audio and Ringtones" name="dc.title"/>
<meta content="/connect/sounds/index.html" property="twitter:url"/>
<meta content="11348282" property="twitter:site:id"/>
<meta content="@NASA" property="twitter:site"/>
<meta content="article" property="og:type"/>
<link href="/connect/sounds/index.html" rel="shortlink"/>
<meta content="NASA.gov brings you the latest images, videos and news from America's space agency. Get the latest updates on NASA missions, watch NASA TV live, and learn about our quest to reveal the unknown and benefit all humankind." name="description"/>
<meta content="http://www.nasa.gov/sites/default/files/images/potw1335a_0.jpg" property="twitter:image1"/>
<meta content="NASA.gov brings you the latest images, videos and news from America's space agency. Get the latest updates on NASA missions, watch NASA TV live, and learn about our quest to reveal the unknown and benefit all humankind." property="og:description"/>
<meta content="http://www.nasa.gov/sites/default/files/files/nasa_insignia_300.jpg" property="og:image"/>
<meta content="gallery" property="twitter:card"/>
<meta content="NASA brings you images, videos and features from the unique perspective of America's space agency. Get updates on missions, watch NASA TV, read blogs, view the latest discoveries, and
more." property="twitter:description"/>
<meta content="http://www.nasa.gov/sites/default/files/images/astro.jpg" property="twitter:image0"/>
<meta content="http://www.nasa.gov/sites/default/files/images/earth_1000.jpg" property="twitter:image2"/>
<link href="/connect/sounds/index.html" rel="canonical"/>
<meta content="http://www.nasa.gov/sites/default/files/images/Aeroplane.jpeg" property="twitter:image3"/>
<meta content="Audio and Ringtones" property="og:title"/>
<meta content="http://www.nasa.gov/connect/sounds/index.html" property="og:url"/>
<meta content="Audio and Ringtones" property="twitter:title"/>
<meta content="http://www.nasa.gov" property="twitter:image"/>
<meta content="Drupal 7 (http://drupal.org)" name="generator"/>
<script type="application/ld+json">{
"@context": "http://schema.org",
"@graph": [
{
"@type": "WebPage",
"@id": "https://www.nasa.gov/connect/sounds/index.html",
"name": "Audio and Ringtones",
"description": "NASA.gov brings you the latest images, videos and news from America\u0027s space agency. Get the latest updates on NASA missions, watch NASA TV live, and learn about our quest to reveal the unknown and benefit all humankind.",
"author": {
"@type": "Organization",
"@id": "https://www.nasa.gov/connect/sounds/index.html",
"name": "NASA",
"url": "https://www.nasa.gov",
"sameAs": [
"https://twitter.com/nasa",
"https://www.facebook.com/nasa",
"https://instagram.com/nasa",
"https://plus.google.com/+NASA"
]
},
"publisher": {
"@type": "Organization",
"@id": "https://www.nasa.gov/connect/sounds/index.html",
"name": "NASA",
"url": "https://www.nasa.gov",
"sameAs": "https://twitter.com/nasa,https://www.facebook.com/nasa,https://instagram.com/nasa,https://plus.google.com/+NASA",
"logo": {
"@type": "ImageObject",
"url": "https://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg",
"width": "110",
"height": "92"
}
}
},
{
"@type": "WebSite",
"@id": "www.nasa.gov",
"name": "NASA",
"url": "www.nasa.gov"
}
]
}</script>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=10.0" name="viewport"/>
<title>Audio and Ringtones | NASA</title>
<meta content="%7B%22modulePrefix%22%3A%22nasa%22%2C%22environment%22%3A%22development%22%2C%22baseURL%22%3A%22/%22%2C%22locationType%22%3A%22none%22%2C%22EmberENV%22%3A%7B%22FEATURES%22%3A%7B%7D%7D%2C%22APP%22%3A%7B%22LOG_ACTIVE_GENERATION%22%3Atrue%2C%22LOG_VIEW_LOOKUPS%22%3Atrue%7D%2C%22contentSecurityPolicyHeader%22%3A%22Content-Security-Policy-Report-Only%22%2C%22contentSecurityPolicy%22%3A%7B%22default-src%22%3A%22%27none%27%22%2C%22script-src%22%3A%22%27self%27%20%27unsafe-eval%27%22%2C%22font-src%22%3A%22%27self%27%22%2C%22connect-src%22%3A%22%27self%27%22%2C%22img-src%22%3A%22%27self%27%22%2C%22style-src%22%3A%22%27self%27%22%2C%22media-src%22%3A%22%27self%27%22%7D%2C%22exportApplicationGlobal%22%3Atrue%7D" name="nasa/config/environment"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<style>
@import url("/sites/all/modules/custom/scald_before_after_image/scald_before_after_image.css?");
@import url("/sites/all/modules/custom/scald_htmlsnippet/scald_htmlsnippet.css?");
@import url("/sites/all/modules/custom/scald_iframe/scald_iframe.css?");
</style>
<link href="/sites/all/themes/custom/nasatwo/css/vendor.css?" media="all" rel="stylesheet" type="text/css"/>
<link href="/sites/all/themes/custom/nasatwo/css/nasa.css?" media="all" rel="stylesheet" type="text/css"/>
<script id="_fed_an_ua_tag" language="javascript" src="https://dap.digitalgov.gov/Universal-Federated-Analytics-Min.js?agency=NASA&yt=true&dclink=true"></script>
<script type="text/javascript">
// DO NOT MODIFY BELOW THIS LINE *****************************************
;(function (g) {
var d = document, am = d.createElement('script'), h = d.head || d.getElementsByTagName("head")[0], fsr = 'fsReady',
aex = {
"src": "//gateway.answerscloud.com/nasa-gov/production/gateway.min.js",
"type": "text/javascript",
"async": "true",
"data-vendor": "fs",
"data-role": "gateway"
};
for (var attr in aex){am.setAttribute(attr, aex[attr]);}h.appendChild(am);g[fsr] = function () {var aT = '__' + fsr + '_stk__';g[aT] = g[aT] || [];g[aT].push(arguments);};
})(window);
// DO NOT MODIFY ABOVE THIS LINE *****************************************
</script>
<script>window.landingPageID = 336285</script>
<script>window.Drupal = {behaviors: {}};</script>
<script src="/sites/all/themes/custom/nasatwo/js/vendor.js?"></script>
<script src="/sites/all/themes/custom/nasatwo/js/nasa.js?"></script>
</head>
<body class="html not-front not-logged-in page-node page-node- page-node-336285 node-type-landing-page-2015 section-connect">
<div class="l-page ember-init-hide">
<header class="l-header container-fluid" role="banner"></header>
<div class="l-main">
<div class="l-content container-fluid" id="main" role="main">
<script>
window.forcedRoute = "landingPage";
window.cardFeed = [];
</script>
</div>
</div>
<footer class="l-footer container-fluid" role="contentinfo">
<script async="async" src="//script.crazyegg.com/pages/scripts/0070/1109.js"></script>
</footer>
</div>
<script>
/**
* © 2011-2014 iPerceptions, Inc. All rights reserved. Do not distribute.
* iPerceptions provides this code 'as is' without warranty of any kind,
* either express or implied.
*/
window.iperceptionskey = 'CTS00001';
(function () {
var a = document.createElement('script'),
b = document.getElementsByTagName('body')[0];
a.type = 'text/javascript';
a.async = true;
a.src = '//universal.iperceptions.com/wrapper.js';b.appendChild(a);
})();
</script>
</body>
</html>
As you can see, the hyperlinks to the downloadable audio files do not appear in the output at all. Yet when you open the page in a browser and inspect it, the links are there, so the request does not seem to be pulling the whole page down. Any ideas why this might be? Thanks for any help.
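To make the symptom concrete, here is a minimal sketch of how one might check a fetched document for audio links. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; with BeautifulSoup the equivalent lookup would be `soup.find_all('a', href=True)`. The extension list, the `AudioLinkCollector` and `find_audio_links` names, and the sample anchor in `rendered_like_html` are all assumptions made for illustration, not anything taken from the NASA page.

```python
from html.parser import HTMLParser

# Assumed set of audio file extensions; adjust to whatever the site actually serves.
AUDIO_EXTENSIONS = ('.mp3', '.wav', '.m4r', '.ogg')


class AudioLinkCollector(HTMLParser):
    """Collects href values of <a> tags that point at audio files."""

    def __init__(self):
        super().__init__()
        self.audio_links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Ignore any query string when checking the file extension.
        if href and href.lower().split('?')[0].endswith(AUDIO_EXTENSIONS):
            self.audio_links.append(href)


def find_audio_links(html):
    """Return every audio link found in the given HTML string."""
    collector = AudioLinkCollector()
    collector.feed(html)
    return collector.audio_links


# Trimmed copy of the main content area from the output above:
# the container holds only a script, no anchors.
fetched_html = '''
<div class="l-content container-fluid" id="main" role="main">
<script>
window.forcedRoute = "landingPage";
window.cardFeed = [];
</script>
</div>
'''

# Hypothetical markup of the kind a browser's inspector might show.
rendered_like_html = '<a href="/sounds/example.mp3">Example clip</a>'

print(find_audio_links(fetched_html))
print(find_audio_links(rendered_like_html))
```

Run against the trimmed output above, the first call finds nothing, which matches what I see: the main content container in the downloaded HTML has no anchors in it at all.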