Special characters in a URL when fetching data with requests.get
0 votes
/ August 3, 2020

I ran into problems fetching content from IRIs that contain special characters. I am working strictly with the requests module. Here are some of the URLs that cause trouble:

https://cwur.org/2018-19/King's-College-London.php

https://cwur.org/2018-19/University-of-Wisconsin–Madison.php

import requests
res = requests.get('https://cwur.org/2018-19/University-of-São-Paulo.php')
res.text

Answers [ 2 ]

1 vote
/ August 3, 2020

To get a 200 response, pass a User-Agent in the request headers.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
res = requests.get('https://cwur.org/2018-19/University-of-São-Paulo.php', headers=headers)
print(res.status_code)
print("---" * 10)
print(res.text)

Output:

200
------------------------------
<html lang="en">
<head>

<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->

<meta name="description" content="The Center for World University Rankings (CWUR) is a leading consulting organization and publisher of the largest academic ranking of global universities.">

<meta name="keywords" content="ranking, rankings, university, universities, college, colleges, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, world, top, best, global, Ranking universitario mundial, Classement mondial des universités, Weltweites Universitätsranking, Zentrum für weltweite Universitätsrankings, Ranking mundial universitário, Рейтинг университетов мира, разработки рейтинга университетов мира, Ranking de universidades del mundo, subject, subjects, journal, journals, ranking by subjects, country ranking, country rankings">

<link rel="icon" type="image/png" href="../../favicon.png" />

<!-- Bootstrap core CSS -->
<link href="../../dist/css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="../../assets/css/ie10-viewport-bug-workaround.css" rel="stylesheet">

<!-- Custom styles for this template -->
<link href="../../starter-template.css" rel="stylesheet">

<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->

<style type="text/css">
/* CSS used here will be applied after bootstrap.css */

.navbar-custom {
    color: #FFFFFF;
    background-color: #222222;
    border-color: #222222;
}

</style>
<title> University of São Paulo Ranking | CWUR World University Rankings 2018-2019</title>
</head>

<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
      <div class="container">
            <div class="navbar-header">
              <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
                <span class="sr-only">Toggle navigation</span>
                <span class="icon-bar"></span>
                <span class="icon-bar"></span>
                <span class="icon-bar"></span>
              </button>
              <a href="http://cwur.org"><img src="../images/logo_944_400.png" height="50"></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
            </div>
            <div id="navbar" class="navbar-collapse collapse">
              <ul class="nav navbar-nav">
                <li><a href="../about.php" style="color:white">About</a></li>

                <li class="dropdown">
                  <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false" style="color:white">World University Rankings <span class="caret"></span></a>
                  <ul class="dropdown-menu">
                    <li class="dropdown-header">World University Rankings</li>
                    <li><a href="../2020-21.php">2020-21</a></li>
                    <li><a href="../2019-20.php">2019-20</a></li>
                    <li><a href="../2018-19.php">2018-19</a></li>
                    <li><a href="../2017.php">2017</a></li>
                    <li><a href="../2016.php">2016</a></li>
                    <li><a href="../2015.php">2015</a></li>
                    <li><a href="../2014.php">2014</a></li>
                    <li><a href="../2013.php">2013</a></li>
                    <li><a href="../2012.php">2012</a></li>
                    <li role="separator" class="divider"></li>
                    <li class="dropdown-header">University Rankings by Country</li>
                    <li><a href="../2018-19/country.php">2018-19</a></li>
                    <li><a href="../2017/country.php">2017</a></li>
                    <li><a href="../2016/country.php">2016</a></li>
                    <li><a href="../2015/country.php">2015</a></li>
                    <li><a href="../2014/country.php">2014</a></li>
                    <li role="separator" class="divider"></li>

                    <li><a href="../2017/subjects.php">Rankings by Subject</a></li>
                  </ul>
                </li>
                <li class="dropdown">
                  <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false" style="color:white">Methodology <span class="caret"></span></a>
                  <ul class="dropdown-menu">
                    <li><a href="../methodology/world-university-rankings.php">World University Rankings</a></li>
                    <li><a href="../methodology/subject-rankings.php">Subject Rankings</a></li>
                  </ul>
                </li>
                <li><a href="../media.php" style="color:white">Media</a></li>
              </ul>
            </div>
          </div>
        </nav>


<div class="container">
 <div class="page-header">
  <h4> University of São Paulo Ranking - CWUR World University Rankings 2018-2019</h4>
  <!-- Go to www.addthis.com/dashboard to customize your tools -->
<div class="addthis_toolbox addthis_default_style addthis_32x32_style"> <a class="addthis_button_preferred_1"></a> <a class="addthis_button_preferred_2"></a> <a class="addthis_button_preferred_3"></a> <a class="addthis_button_preferred_4"></a><a class="addthis_button_compact"></a></div> </div>

 <div class="row">
  <div class="col-md-8">

    <table class="table table-bordered table-hover">
 <tr><td><b>Institution Name</b></td><td>University of São Paulo </td></tr>
 <tr><td><b>Native Name</b></td><td>Universidade de São Paulo </td></tr>
 <tr><td><b>Location</b></td><td>Brazil</td></tr>
 <tr><td><b>World Rank</b></td><td>77</td></tr>
 <tr><td><b>National Rank</b></td><td>1</td></tr>
 <tr><td><b>Quality of Education Rank</b></td><td>583</td></tr>
 <tr><td><b>Alumni Employment Rank</b></td><td>256</td></tr>
 <tr><td><b>Quality of Faculty Rank</b></td><td>109</td></tr>
 <tr><td><b>Research Output Rank</b></td><td>4</td></tr>
 <tr><td><b>Quality Publications Rank</b></td><td>60</td></tr>
 <tr><td><b>Influence Rank</b></td><td>162</td></tr>
 <tr><td><b>Citations Rank</b></td><td>139</td></tr>
 <tr><td><b>Overall Score</b></td><td>82.6</td></tr>
 <tr><td><b>Domain</b></td><td>usp.br</td></tr>
    </table>

  </div>
  <div class="col-md-4">
   <div class="table-responsive">
    <table class="table table-bordered table-hover">
 <tr><td><a href="http://cwur.org/2020-21.php">Top 2000 Universities (2020-21)</a></td></tr>
 <tr><td><a href="http://cwur.org/2019-20.php">Top 2000 Universities (2019-20)</a></td></tr>
 <tr><td><a href="http://cwur.org/2018-19.php">Top 1000 Universities (2018-19)</a></td></tr>
 <tr><td><a href="http://cwur.org/2018-19/country.php">Ranking by Country (2018-2019)</a></td></tr>
 <tr><td><a href="http://cwur.org/2017.php">Top 1000 Universities (2017)</a></td></tr>
 <tr><td><a href="http://cwur.org/2017/country.php">Ranking by Country (2017)</a></td></tr>
 <tr><td><a href="http://cwur.org/2017/subjects.php">Rankings by Subject</a></td></tr>
 <tr><td><a href="http://cwur.org/2016.php">Top 1000 Universities (2016)</a></td></tr>
 <tr><td><a href="http://cwur.org/2016/country.php">Ranking by Country (2016)</a></td></tr>
 <tr><td><a href="http://cwur.org/2015.php">Top 1000 Universities (2015)</a></td></tr>
 <tr><td><a href="http://cwur.org/2015/country.php">Ranking by Country (2015)</a></td></tr>
 <tr><td><a href="http://cwur.org/2014.php">Top 1000 Universities (2014)</a></td></tr>
 <tr><td><a href="http://cwur.org/2014/country.php">Ranking by Country (2014)</a></td></tr>
    </table>
   </div>
  </div>
 </div>
   <p>Copyright &copy; 2012-2020 Center for World University Rankings</p>

</div>


<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>window.jQuery || document.write('<script src="../../assets/js/vendor/jquery.min.js"><\/script>')</script>
<script src="../../dist/js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="../../assets/js/ie10-viewport-bug-workaround.js"></script>

<!-- Go to www.addthis.com/dashboard to customize your tools -->
<script type="text/javascript" src="//s7.addthis.com/js/300/addthis_widget.js#pubid=ra-5316b43f5ee1fc57"></script>
</body>
</html>

Update:

If the URL reaches you as UTF-8 bytes that were mis-decoded as Latin-1 (for example `S\xc3\xa3o` instead of `São`), you can convert it back to a proper string:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
url = "https://cwur.org/2018-19/University-of-S\xc3\xa3o-Paulo.php"
new_url = url.encode("iso-8859-1").decode()
res = requests.get(new_url, headers=headers)
print(res.status_code)
print("---" * 10)
print(res.text)
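
To make the conversion above easier to follow, here is a minimal sketch that uses only the standard library and makes no network calls. The helper names `fix_mojibake` and `iri_to_uri` are made up for this illustration and are not part of requests. The first helper repeats the Latin-1 round trip from the update. The second shows what the special characters look like once percent-encoded as UTF-8. requests normally performs similar quoting on its own when it prepares a request, so the second helper is there to make the encoding visible, not because it is required.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as ISO-8859-1 (Latin-1)."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake, or not repairable this way: leave the text unchanged.
        return text


def iri_to_uri(iri: str) -> str:
    """Percent-encode the path and query of an IRI as UTF-8 (hypothetical helper)."""
    parts = urlsplit(iri)
    # Keep path separators and any escapes that are already present.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = fix_mojibake("https://cwur.org/2018-19/University-of-S\xc3\xa3o-Paulo.php")
print(url)
print(iri_to_uri(url))
```

Either form of the URL can then be passed to `requests.get` together with the headers shown above.
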
0 votes
/ August 3, 2020

I recommend trying to store the data returned by .get() in a dictionary and then using the pprint module to display it neatly:

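import requests
from pprint import pprint

url = 'https://cwur.org/2018-19/University-of-Wisconsin–Madison.php'
res = requests.get(url)

# Printing the status code shows whether the request was successful
print("Status code:", res.status_code)

# res.json() only works when the response body is JSON; if the server
# returns HTML instead, parsing fails, so fall back to the raw text
try:
    r_dict = res.json()
    pprint(r_dict)
except ValueError:
    print(res.text)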

If you get status code 200, the API call was successful. Additional documentation on the other response status codes: link. I hope this helps you track down the problem with your URL.
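
As a rough illustration of reading status codes, the sketch below maps a numeric code to its standard reason phrase and broad category using the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical, and the codes passed to it are arbitrary examples, not responses observed from the site in the question.

```python
from http import HTTPStatus

# Broad categories, keyed by the first digit of the status code.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the registered standard statuses.
        return f"{code} (unknown status)"
    return f"{code} {status.phrase} ({CATEGORIES[code // 100]})"


for example_code in (200, 403, 404):
    print(describe_status(example_code))
```

With requests, `describe_status(res.status_code)` would label the response, and `res.raise_for_status()` can be used to raise an exception for any 4xx or 5xx code instead of checking by hand.
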

...