Why does the page return nothing for my XPath query? - PullRequest
0 votes
/ September 28, 2018

I've come across a few web pages where XPath queries that WORK in two different Chrome XPath checker extensions return nothing when I run them from my PHP page. I'm wondering whether these pages use some kind of XPath blocker or something similar (yes, I check their robots.txt for permission). Or maybe it's some other voodoo? Thanks for any help you can provide!

Here are the relevant lines from my code (edited to add more):

    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, $this->getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 10);

    // Grab the data.
    $html = curl_exec($c);
    curl_close($c);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $jsonScripts = $xpath->query('//script[@type="application/ld+json"]');
    if ($TEST) { echo "there are " . $jsonScripts->length . " JSONs<br>"; }

And here is the relevant markup from a web page that returns nothing:

<script type="application/ld+json">{"@context":"http:\/\/schema.org\/","@type":"Recipe","name":"Healthy Garlic Scallops Recipe","author":{"@type":"Person","name":"Florentina"},"datePublished":"2015-07-29T22:39:18+00:00","description":"Italian garlic scallops, seared to a golden perfection in a cast iron pan and cooked in healthy clarified butter for the ultimate seafood meal!","image":["https:\/\/ciaoflorentina.com\/wp-content\/uploads\/2015\/07\/Garlic-Scallops-Healthy-4.jpg"],"recipeYield":"2","prepTime":"PT5M","cookTime":"PT5M","totalTime":"PT10M","recipeIngredient":["1 lb large scallops","1\/4 c clarified butter ghee","5 cloves garlic (grated)","1  large lemon (zested)","1\/4 c Italian parsley (roughly chopped)","1\/2 tsp sea salt + more to taste","1\/4 tsp peppercorn medley (freshly ground)","1\/4 tsp red pepper flakes","A pinch of sweet paprika","1 tsp extra virgin olive oil"],"recipeInstructions":[{"@type":"HowToStep","text":"Make sure to pat dry the scallops on paper towels very well before cooking."},{"@type":"HowToStep","text":"Heat up a large cast iron skillet on medium flame."},{"@type":"HowToStep","text":"Meanwhile in a medium bowl toss the scallops with a drizzle of olive oil or butter ghee, just enough to coat it all over. Sprinkle them with the sea salt, cracked pepper, red pepper flakes and sweet paprika. Toss to coat gently."},{"@type":"HowToStep","text":"Add a little drizzle of butter ghee to the hot skillet, just enough to coat the bottom. Add the scallops making sure not to overcrowd the pan, and sear for about 2 minutes on each side until nicely golden. ( Use a small spatula to flip them over individually )"},{"@type":"HowToStep","text":"Add the butter ghee to the skillet with the scallops and then add the garlic. Remove from heat and using a spatula push the garlic around to infuse the sauce for about 30 seconds. The heat from the skillet will be enough for the garlic to work its magic into the butter. 
This is how you avoid that pungent burnt garlicky taste we don\u2019t like."},{"@type":"HowToStep","text":"We are just looking to extract all that sweetness from the garlic, and this is how you do it, without burning."},{"@type":"HowToStep","text":"Squeeze half of the lemon all over the scallops and move the skillet around a little so it combines with the butter. Sprinkle with the minced parsley, lemon zest and a drizzle of extra virgin olive oil. Serve with crusty bread or al dente capellini noodles."}],"recipeCategory":["Main Dishes"],"recipeCuisine":["Italian"],"aggregateRating":{"@type":"AggregateRating","ratingValue":"5","ratingCount":"8"}}</script>

1 Answer

0 votes
/ September 29, 2018

It looks like the server (Nginx) is encoding the response twice, apparently with gzip, but only sometimes. Your code is fine; if you don't get the expected results, you can try running the body through gzdecode. I hacked together this test script to demonstrate.

<?php
$url = 'http://ciaoflorentina.com/garlic-scallops-recipe-healthy/';

$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko');
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 10);

// Grab the data.
$html = curl_exec($c);
curl_close($c);

$iterations = 0;
do
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $xpath = new DOMXpath($dom);

    $jsonScripts = $xpath->query('//script[@type="application/ld+json"]');
    $nodeCount = $jsonScripts->length;

    echo "there are " . $nodeCount . " JSONs".PHP_EOL;

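    if($nodeCount == 0)
    {
        // No matches: the body may still be gzip data (double encoded by the server).
        // gzdecode() returns false on non-gzip input, so stop rather than parse false.
        $decoded = @gzdecode($html);
        if($decoded === false)
        {
            break;
        }
        $html = $decoded;
    }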

    $iterations++;
} while($nodeCount==0 && $iterations < 2);
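
The retry loop above decodes only after the XPath query comes back empty. As a variation, here is a minimal sketch that inspects the body itself: a gzip stream starts with the bytes 0x1f 0x8b, so the body can be decoded until that signature disappears. The helper name `decodeGzipLayers` and the `$maxLayers` limit are hypothetical and not part of the original script, and the sample `$page` string is made up to simulate a doubly encoded response without touching the network.

```php
<?php
// Hypothetical helper (not from the original answer): remove up to
// $maxLayers layers of gzip encoding from a response body.
function decodeGzipLayers(string $body, int $maxLayers = 3): string
{
    for ($i = 0; $i < $maxLayers; $i++) {
        // A gzip stream begins with the magic bytes 0x1f 0x8b.
        if (strncmp($body, "\x1f\x8b", 2) !== 0) {
            break;
        }
        $decoded = @gzdecode($body);
        if ($decoded === false) {
            // Looked like gzip but could not be decoded; keep what we have.
            break;
        }
        $body = $decoded;
    }
    return $body;
}

// Simulate a page that was gzip-encoded twice (assumed sample data).
$page = '<html><head><script type="application/ld+json">{"@type":"Recipe"}</script></head></html>';
$html = decodeGzipLayers(gzencode(gzencode($page)));

// True when both layers were removed.
var_dump($html === $page);
```

The returned string can then be passed to `DOMDocument::loadHTML` exactly as in the script above. Checking the signature first avoids calling `gzdecode` on a body that is already plain HTML.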