I have been trying to build a project that scrapes questions from Quora by topic, using this repository as a base: https://github.com/Theminijohn/quora-scraper
As shown on that page, the follower count is extracted as expected for each question. However, when I run the same code on my system, the follower count comes out as zero for every question, even when it is not actually zero.
The Followers column is always zero, as shown here
The code responsible for extracting the follower count is:
follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i
Everything else works as expected. What am I missing here?
Edit: the full code is as follows:
require 'rubygems'
require 'ruby-progressbar'
require 'nokogiri'
require 'csv'
require 'date'
require 'pry'

ENGAGEMENT_THRESHOLD = 5

# init progressbar
progressbar = ProgressBar.create(format: '%a %bᗧ%i %p%% %t',
                                 progress_mark: ' ',
                                 remainder_mark: '・')

# parse file
doc = File.open("input.html") { |x| Nokogiri::HTML(x) }
questions = doc.css('.TopicAllQuestionsList .pagedlist_item')

# identifiers
canonical_link = doc.at('link[rel="canonical"]')['href']
topic_name = canonical_link.match(/quora.com\/topic\/(.*)/)[1]

# update progressbar
progressbar.total = questions.count

# prepare csv (write the header row only when the file does not exist yet)
unless File.exist?('quora-data.csv')
  CSV.open("quora-data.csv", "w+") do |csv|
    csv << [
      "Topic", "Title", "Followers", "Answers", "Ratio", "Engagement potential",
      "Last action", "Parsed time", "Question link"
    ]
  end
end

questions.each do |q|
  link = "https://www.quora.com" + q.css('a.question_link').attr('href').value
  title = q.css('a.question_link').text.strip
  answer_count = q.css('.QuestionFooter .answer_count_prominent').text.strip.to_i
  follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i
  ratio = "#{follower_count}/#{answer_count}"

  if answer_count == 0
    take_action = (follower_count >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  else
    # note: both counts are integers, so this is integer division
    take_action = ((follower_count / answer_count) >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  end

  # timestamps
  raw_time = q.css('.QuestionFooter .question_timestamp').text.strip
  last_action = raw_time.include?("Last requested") ? "Requested" : "Followed"

  if raw_time.include?('ago')
    if raw_time.scan(/(\d*)h/).flatten.any?
      hours_ago = raw_time.scan(/(\d*)h/).flatten[0].to_f
      parsed_time = (DateTime.now - (hours_ago / 24)).strftime('%Y-%m-%d')
    elsif raw_time.scan(/(\d*)m/).flatten.any?
      minutes_ago = raw_time.scan(/(\d*)m/).flatten[0].to_f
      # subtract the parsed number of minutes, expressed in days
      parsed_time = (DateTime.now - (minutes_ago / 24 / 60)).strftime('%Y-%m-%d')
    end
  else
    if raw_time.count("0-9") > 0
      parsed_time = Date.parse(raw_time).strftime("%Y-%m-%d")
    else
      # weekday name only: if it parses to a future date, it refers to last week
      parsed_time =
        (Date.today < Date.parse(raw_time)) ? (Date.parse(raw_time) - 7) : Date.parse(raw_time)
    end
  end

  CSV.open("quora-data.csv", "a+") do |csv|
    csv << [
      topic_name, title, follower_count, answer_count, ratio,
      take_action, last_action, parsed_time, link
    ]
  end

  # move progressbar
  progressbar.increment
end
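For reference, the relative-timestamp branch of the script can be isolated into a small helper, which makes it easier to check on its own. This is only a sketch: the name `parse_relative_time` and the `now` parameter are hypothetical (they are not part of the original scraper), and it assumes the timestamp text looks like "Last followed 5h ago" or "Last requested 30m ago".

```ruby
require 'date'

# Hypothetical helper (not in the original script): turns a relative
# timestamp such as "Last followed 5h ago" into a YYYY-MM-DD string.
# `now` is injectable so the result is reproducible.
def parse_relative_time(raw_time, now = DateTime.now)
  # Only relative timestamps are handled here; anything else yields nil.
  return nil unless raw_time.include?('ago')

  if (m = raw_time.match(/(\d+)\s*h/))
    # DateTime arithmetic is in days, so hours are divided by 24.
    (now - Rational(m[1].to_i, 24)).strftime('%Y-%m-%d')
  elsif (m = raw_time.match(/(\d+)\s*m/))
    # Minutes are divided by the number of minutes in a day.
    (now - Rational(m[1].to_i, 24 * 60)).strftime('%Y-%m-%d')
  end
end

fixed_now = DateTime.new(2020, 1, 2, 3, 0, 0)
puts parse_relative_time('Last followed 5h ago', fixed_now)
puts parse_relative_time('Last requested 30m ago', fixed_now)
```

Passing a fixed `now` keeps the result independent of when the script runs; absolute dates and weekday names would still need the `Date.parse` branch from the full script.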
<!DOCTYPE html>
<!-- saved from url=(0099)file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora.html -->
<html lang="en" class="js-wf-loaded"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><link rel="icon" href="https://qsf.fs.quoracdn.net/-3-images.favicon.ico-26-ae77b637b1e7ed2c.ico"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q-icons.q-icons.woff2-26-9afc20a49e3ef2cf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular.woff2-26-7ace3bc4cbe404d9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular_italic.woff2-26-9d81ab3229809d01.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold.woff2-26-b55bf39d9018ace9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold_italic.woff2-26-4c39f22524232bf2.woff2"><script src="./input_files/sdk.js.download" async="" crossorigin="anonymous"></script><script src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js.download" async="" crossorigin="anonymous"></script><script async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/analytics.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/widgets.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js(1).download"></script><script type="text/javascript">window.Q = {"fontFamilies": ["q-icons", "q_serif"], "errorSamplingRate": 1.0, "revision": 
"41e9b4435b78728ddf351e72a6dc45ca9708ebc2", "subdomainSuffix": "quora.com"};window["webpackManifest"] = {"ads_manager": "https://qsc.fs.quoracdn.net/-3-chunk.web.ads_manager.js.out-34-1e09a2ca57288a3c.webpack", "content_widgets": "https://qsc.fs.quoracdn.net/-3-chunk.web.content_widgets.js.out-34-9a6c124eee999cb7.webpack", "dev": "https://qsc.fs.quoracdn.net/-3-chunk.web.dev.js.out-34-5d22ece0a38f03a1.webpack", "internal": "https://qsc.fs.quoracdn.net/-3-chunk.web.internal.js.out-34-2e41b1b9af1f0f88.webpack", "qtext2": "https://qsc.fs.quoracdn.net/-3-chunk.web.qtext2.js.out-34-b3d77df0693a06da.webpack", "main": "https://qsc.fs.quoracdn.net/-3-chunk.web.main.js.out-34-835b38fb05330b9f.webpack", "firebase": "https://qsc.fs.quoracdn.net/-3-chunk.web.firebase.js.out-34-eadc5f3144befc37.webpack", "publisher_dashboard": "https://qsc.fs.quoracdn.net/-3-chunk.web.publisher_dashboard.js.out-34-0c43bcc87e209b23.webpack"};window["webpackChunks"] = ["main"];window["PAGE_IS_MOBILE"] = false;var assetErrs=[];document.addEventListener("DOMContentLoaded",function(e){if(0!==assetErrs.length){var s="assets="+encodeURIComponent(JSON.stringify(assetErrs)),t=new XMLHttpRequest;t.open("POST","/ajax/log_browser_asset_load_error_3RD_PARTY_POST",!0),t.setRequestHeader("Content-Type","application/x-www-form-urlencoded; charset=UTF-8"),t.setRequestHeader("Accept","*/*"),t.send(s.replace(/%20/g,"+"))}}),window.addAssetErr=function(e){e&&assetErrs.push(e)};
The full HTML file can be found here: https://drive.google.com/file/d/1_X86tq5TTw4ikk-hQ2Ixd13Y_hR4scBg/view?usp=sharing
The HTML that contains the follower count:
<div class="FollowActionItem ItemComponent primary_item u-relative"><span id="wVP1Ux4a11"><a class="ui_button ui_button--styled ui_button--FlatStyle ui_button--FlatStyle--gray ui_button--size_regular u-inline-block ui_button--non_link ui_button--supports_icon ui_button--has_icon" href="#" role="button" action_click="QuestionFollow" action_target="{&quot;qid&quot;: 44394942, &quot;type&quot;: &quot;question&quot;}" id="__w2_wVP1Ux4a27_button"><div class="ui_button_inner" id="__w2_wVP1Ux4a27_inner"><div class="ui_button_icon_wrapper u-relative u-flex-inline"><div id="__w2_wVP1Ux4a27_icon"><span class="ui_button_icon" aria-hidden="true"><svg width="24px" height="24px" viewBox="0 0 24 24" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g stroke="none" fill="none" fill-rule="evenodd" stroke-linecap="round">
<g id="follow" class="icon_svg-stroke" stroke="#666" stroke-width="1.5">
<path d="M14.5,19 C14.5,13.3369229 11.1630771,10 5.5,10 M19.5,19 C19.5,10.1907689 14.3092311,5 5.5,5" id="lines"></path>
<circle id="circle" cx="7.5" cy="17" r="2" class="icon_svg-fill" fill="none"></circle>
</g>
</g>
</svg></span></div></div><div class="ui_button_label_count_wrapper"><span class="ui_button_label" id="__w2_wVP1Ux4a27_label">Follow</span><span class="ui_button_count" aria-hidden="true" id="__w2_wVP1Ux4a27_count_wrapper"><span class="bullet"> · </span><span class="ui_button_count_inner" id="__w2_wVP1Ux4a27_count">1</span></span></div></div></a></span></div>