How do I scrape question follower counts from Quora with Ruby? - PullRequest
2 votes
/ March 22, 2019

I have been trying to build a project that scrapes questions from Quora by topic, using this repository as a base: https://github.com/Theminijohn/quora-scraper. As shown on that page, the follower count is extracted for each question as expected. When I run the same code on my machine, however, the follower count for every question comes out as zero, even when it is not. The Followers column in the output always contains zero.

The code responsible for extracting the follower count is:

follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i

Everything else works as expected. What am I missing here?

Edit: the full code is as follows:

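require 'rubygems'
require 'ruby-progressbar'
require 'nokogiri' # lowercase: the capitalized name fails on case-sensitive file systems
require 'csv'
require 'date'     # Date and DateTime are used in the timestamp parsing
require 'pry'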

ENGAGEMENT_THRESHOLD = 5

# init progressbar
progressbar = ProgressBar.create( format:         '%a %bᗧ%i %p%% %t',
                                  progress_mark:  ' ',
                                  remainder_mark: '・')

# parse file
doc = File.open("input.html") { |x| Nokogiri::HTML(x) }
questions = doc.css('.TopicAllQuestionsList .pagedlist_item')

# identifiers
canonical_link = doc.at('link[rel="canonical"]')['href']
topic_name = canonical_link.match(/quora.com\/topic\/(.*)/)[1]

# update progressbar
progressbar.total = questions.count

# prepare csv
unless File.exist?('quora-data.csv')
  CSV.open("quora-data.csv", "w+") do |csv|
    csv << [
      "Topic", "Title", "Followers", "Answers", "Ratio", "Engagement potential",
      "Last action", "Parsed time", "Question link"
    ]
  end
end

questions.each do |q|
  link = "https://www.quora.com" + q.css('a.question_link').attr('href').value
  title = q.css('a.question_link').text.strip
  answer_count = q.css('.QuestionFooter .answer_count_prominent').text.strip.to_i
  follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i
  ratio = "#{follower_count}/#{answer_count}"

  if answer_count == 0
    take_action = (follower_count >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  else
    take_action = ((follower_count / answer_count) >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  end

  # timestamps
  raw_time = q.css('.QuestionFooter .question_timestamp').text.strip
  last_action = raw_time.include?("Last requested") ? "Requested" : "Followed"

  if raw_time.include?('ago')
    if raw_time.scan(/(\d*)h/).flatten.any?
      hours_ago = raw_time.scan(/(\d*)h/).flatten[0].to_f
      parsed_time = (DateTime.now - (hours_ago / 24)).strftime('%Y-%m-%d')
    elsif raw_time.scan(/(\d*)m/).flatten.any?
      minutes_ago = raw_time.scan(/(\d*)m/).flatten[0].to_f
      parsed_time = (DateTime.now - (minutes_ago / 24 / 60)).strftime('%Y-%m-%d')
    end
  else
    if raw_time.count("0-9") > 0
      parsed_time = Date.parse(raw_time).strftime("%Y-%m-%d")
    else
      parsed_time =
        (Date.today < Date.parse(raw_time)) ? (Date.parse(raw_time) - 7) : Date.parse(raw_time)
    end
  end

  CSV.open("quora-data.csv", "a+") do |csv|
    csv << [
      topic_name, title, follower_count, answer_count, ratio,
      take_action, last_action, parsed_time, link
    ]
  end

  # move progressbar
  progressbar.increment
end

<!DOCTYPE html>
<!-- saved from url=(0099)file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora.html -->
<html lang="en" class="js-wf-loaded"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><link rel="icon" href="https://qsf.fs.quoracdn.net/-3-images.favicon.ico-26-ae77b637b1e7ed2c.ico"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q-icons.q-icons.woff2-26-9afc20a49e3ef2cf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular.woff2-26-7ace3bc4cbe404d9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular_italic.woff2-26-9d81ab3229809d01.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold.woff2-26-b55bf39d9018ace9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold_italic.woff2-26-4c39f22524232bf2.woff2"><script src="./input_files/sdk.js.download" async="" crossorigin="anonymous"></script><script src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js.download" async="" crossorigin="anonymous"></script><script async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/analytics.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/widgets.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js(1).download"></script><script type="text/javascript">window.Q = {"fontFamilies": ["q-icons", "q_serif"], "errorSamplingRate": 1.0, "revision": 
"41e9b4435b78728ddf351e72a6dc45ca9708ebc2", "subdomainSuffix": "quora.com"};window["webpackManifest"] = {"ads_manager": "https://qsc.fs.quoracdn.net/-3-chunk.web.ads_manager.js.out-34-1e09a2ca57288a3c.webpack", "content_widgets": "https://qsc.fs.quoracdn.net/-3-chunk.web.content_widgets.js.out-34-9a6c124eee999cb7.webpack", "dev": "https://qsc.fs.quoracdn.net/-3-chunk.web.dev.js.out-34-5d22ece0a38f03a1.webpack", "internal": "https://qsc.fs.quoracdn.net/-3-chunk.web.internal.js.out-34-2e41b1b9af1f0f88.webpack", "qtext2": "https://qsc.fs.quoracdn.net/-3-chunk.web.qtext2.js.out-34-b3d77df0693a06da.webpack", "main": "https://qsc.fs.quoracdn.net/-3-chunk.web.main.js.out-34-835b38fb05330b9f.webpack", "firebase": "https://qsc.fs.quoracdn.net/-3-chunk.web.firebase.js.out-34-eadc5f3144befc37.webpack", "publisher_dashboard": "https://qsc.fs.quoracdn.net/-3-chunk.web.publisher_dashboard.js.out-34-0c43bcc87e209b23.webpack"};window["webpackChunks"] = ["main"];window["PAGE_IS_MOBILE"] = false;var assetErrs=[];document.addEventListener("DOMContentLoaded",function(e){if(0!==assetErrs.length){var s="assets="+encodeURIComponent(JSON.stringify(assetErrs)),t=new XMLHttpRequest;t.open("POST","/ajax/log_browser_asset_load_error_3RD_PARTY_POST",!0),t.setRequestHeader("Content-Type","application/x-www-form-urlencoded; charset=UTF-8"),t.setRequestHeader("Accept","*/*"),t.send(s.replace(/%20/g,"+"))}}),window.addAssetErr=function(e){e&&assetErrs.push(e)};

The full HTML file is available here: https://drive.google.com/file/d/1_X86tq5TTw4ikk-hQ2Ixd13Y_hR4scBg/view?usp=sharing

The HTML that contains the follower count:

<div class="FollowActionItem ItemComponent primary_item u-relative"><span id="wVP1Ux4a11"><a class="ui_button ui_button--styled ui_button--FlatStyle ui_button--FlatStyle--gray ui_button--size_regular u-inline-block ui_button--non_link ui_button--supports_icon ui_button--has_icon" href="#" role="button" action_click="QuestionFollow" action_target="{&quot;qid&quot;: 44394942, &quot;type&quot;: &quot;question&quot;}" id="__w2_wVP1Ux4a27_button"><div class="ui_button_inner" id="__w2_wVP1Ux4a27_inner"><div class="ui_button_icon_wrapper u-relative u-flex-inline"><div id="__w2_wVP1Ux4a27_icon"><span class="ui_button_icon" aria-hidden="true"><svg width="24px" height="24px" viewBox="0 0 24 24" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
    <g stroke="none" fill="none" fill-rule="evenodd" stroke-linecap="round">
        <g id="follow" class="icon_svg-stroke" stroke="#666" stroke-width="1.5">
            <path d="M14.5,19 C14.5,13.3369229 11.1630771,10 5.5,10 M19.5,19 C19.5,10.1907689 14.3092311,5 5.5,5" id="lines"></path>
            <circle id="circle" cx="7.5" cy="17" r="2" class="icon_svg-fill" fill="none"></circle>
        </g>
    </g>
</svg></span></div></div><div class="ui_button_label_count_wrapper"><span class="ui_button_label" id="__w2_wVP1Ux4a27_label">Follow</span><span class="ui_button_count" aria-hidden="true" id="__w2_wVP1Ux4a27_count_wrapper"><span class="bullet"> · </span><span class="ui_button_count_inner" id="__w2_wVP1Ux4a27_count">1</span></span></div></div></a></span></div>
...