Getting Open Graph details inside an RSS / Atom feed
0 votes / 21 September 2019

In my adventures in RSS parsing, I've come across some sites that embed Open Graph metadata inside the RSS feed. Some of the OG data would be useful to extract alongside the other standard RSS feed elements that I can already pull out with feedparser in Python.

I haven't found good references on how to read OG metadata from RSS feeds in general, and I'm not sure where to start extracting OG metadata from an individual item as I iterate over the feed with the feedparser library.

In one case in particular, I found a blog that embeds a short summary in the OG metadata but puts the full post in the standard RSS description field. In that case I'd prefer the short summary, and I'll most likely add some logic to decide which of the duplicated but differently populated fields to keep.

This feed from Google is a good example of this in action, though I've noticed it's inconsistent across the blog's posts and feeds (hooray for standards!):

https://cloudblog.withgoogle.com/products/gcp/rss/

<item>
<title>
Moving a publishing workflow to BigQuery for new data insights
</title>
<link>
https://cloud.google.com/blog/products/data-analytics/moving-a-publishing-workflow-to-bigquery-for-new-data-insights/
</link>
<description>
<html><head></head><body><div class="block-paragraph"><div class="rich-text"><p>Google Cloud’s technology powers both our customers as well as our internal teams. Recently, the Solutions Architect team decided to move an internal process to use BigQuery to streamline and better focus efforts across the team. </p><p>The Solutions Architect team publishes <a href="https://cloud.google.com/docs/tutorials">reference guides</a> for customers to use as they build applications on Google Cloud. Our publishing process has many steps, including outline approval, draft, peer review, technical editing, legal review, PR approval and finally, publishing on our site. This process involves collaboration across the technical editing, legal, and PR teams.</p></div></div><div class="block-image_full_width"><div class="article-module h-c-page"><div class="h-c-grid"><figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "><img alt="publishing process.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/publishing_process.0903034614780631.max-1000x1000.png"/></figure></div></div></div><div class="block-paragraph"><div class="rich-text"><p>With so many steps and people involved, it’s important that we effectively collaborate. Our team uses a collaboration tool running on <a href="https://cloud.google.com/">Google Cloud Platform</a> (GCP) as a central repository and workflow for our reference guides. </p><h2>Increased data needs required more sophisticated tools</h2><p>As our team of solution architects grew and our reporting needs became more sophisticated, we realized that we couldn’t effectively provide the insights that we needed directly in our existing collaboration tool. For example, we needed to build and share status dashboards of our reference guides, build a roadmap for upcoming work, and analyze how long our solutions take to publish, from outline approval to publication. We also needed to share this information outside our team, but didn’t want to share unnecessary information by broadly granting access to our entire collaboration instance.</p><h2>Building a script with BigQuery on the back end</h2><p>Since our collaboration tool provides a robust and flexible REST API, we decided to write an export script which stored the results in <a href="https://cloud.google.com/bigquery/">BigQuery</a>. We chose BigQuery because we knew that we could write advanced queries against the data and then use <a href="https://datastudio.google.com/">Data Studio</a> to build our dashboards. Using BigQuery for analysis provided a scalable solution that is well-integrated into other GCP tools and has support for both batch and real-time inserts using the <a href="https://cloud.google.com/bigquery/streaming-data-into-bigquery">streaming API</a>.</p></div></div><div class="block-image_full_width"><div class="article-module h-c-page"><div class="h-c-grid"><figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "><img alt="script with BigQuery.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/script_with_BigQuery.0682026612930467.max-1000x1000.png"/></figure></div></div></div><div class="block-paragraph"><div class="rich-text"><p>We used a simple Python script to read the issues from the API and then insert the entries into BigQuery using the <a href="https://cloud.google.com/bigquery/streaming-data-into-bigquery">streaming API</a>. 
We chose the streaming API, rather than Cloud Pub/Sub or Cloud Dataflow, because we wanted to repopulate the BigQuery content with the latest data several times a day. The <a href="https://pypi.org/project/google-api-python-client/">Google API Python client</a> library was an obvious choice, because it provides an idiomatic way to interact with the Google APIs, including the BigQuery streaming API. </p><p>Since this data would only be used for reporting purposes, we opted to keep only the most recent version of the data as extracted. There were two reasons for this decision:</p><ol><li> <b>Master data</b>: There would never be any question about which data was the master version of the data. </li><li><b>Historical data</b>: We had no use cases that required capturing any historical data that wasn’t already captured in the data extract. </li></ol><p>Following common extract, transform, load (<a href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a>) best practices, we used a staging table and a separate production table so that we could load data into the staging table without impacting users of the data. The design we created based on ETL best practices called for first deleting all the records from the staging table, loading the staging table, and then replacing the production table with the contents. </p><p>When using the streaming API, the BigQuery streaming buffer remains active for about 30 to 60 minutes or more after use, which means that you can’t delete or change data during that time. Since we used the streaming API, we scheduled the load every three hours to balance getting data into BigQuery quickly and being able to subsequently delete the data from the staging table during the load process.</p><p>Once our data was in BigQuery, we could write SQL queries directly against the data or use any of the wide range of <a href="https://cloud.google.com/bigquery/providers/">integrated tools</a> available to analyze the data. We chose <a href="https://datastudio.google.com/">Data Studio</a> for visualization because it’s well-integrated with BigQuery, offers customizable dashboard capabilities, provides the ability to collaborate, and of course, is free. </p><p>Because BigQuery datasets can be shared with users, this opened up the usability of the data for whomever was granted access and also had appropriate authorization. This also meant that we could combine this data in BigQuery with other datasets. For example, we track the online engagement metrics for our reference guides and load them into BigQuery. With both datasets in BigQuery, it made it easy to factor in the online engagement numbers to build dashboards.</p><h2>Creating a sample dashboard</h2><p>One of the biggest reasons that we wanted to create reporting against our publishing process is to track the publishing process over time. Data Studio made it easy to build a dashboard with charts, similar to the two charts below. 
Building the dashboard in Data Studio allowed us to easily analyze our publication metrics over time and then share the specific dashboards with teams outside ours.</p></div></div><div class="block-image_full_width"><div class="article-module h-c-page"><div class="h-c-grid"><figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "><img alt="sample dashboard.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/sample_dashboard.max-1000x1000.png"/></figure></div></div></div><div class="block-paragraph"><div class="rich-text"><h2>Monitoring the load process</h2><p>Monitoring is an important part of any ETL pipeline. <a href="https://cloud.google.com/monitoring">Stackdriver Monitoring</a> provides monitoring, alerting and dashboards for GCP environments. We opted to use the <a href="https://cloud.google.com/logging/docs/reference/libraries#client-libraries-install-python">Google Cloud Logging</a> module in the Python load script, because this would generate logs for errors in Stackdriver Logging that we could use for error alerting in Stackdriver Monitoring. We set up a Stackdriver Monitoring Workspace specifically for the project with the load process. We then created a management dashboard to track any application errors. We set up alerts to send an SMS notification whenever errors appeared in the load process log files. Here’s a look at the dashboards in the Stackdriver Workspace:</p></div></div><div class="block-image_full_width"><div class="article-module h-c-page"><div class="h-c-grid"><figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "><img alt="Monitoring the load process.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Monitoring_the_load_process.max-1000x1000.png"/></figure></div></div></div><div class="block-paragraph"><div class="rich-text"><p>And this shows the details of the alerts we set up:</p></div></div><div class="block-image_full_width"><div class="article-module h-c-page"><div class="h-c-grid"><figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "><img alt="details of the alerts.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/details_of_the_alerts.max-1000x1000.png"/></figure></div></div></div><div class="block-paragraph"><div class="rich-text"><p>BigQuery provides the flexibility for you to meet your business or analytical needs, whether they’re petabyte-sized or not. BigQuery’s streaming API means that you can stream data directly into BigQuery and provide end users with rapid access to data. Data Studio provides an easy-to-use integration with BigQuery that makes it simple to develop advanced dashboards. The cost-per-query approach means that you’ll pay for what you store and analyze, though BigQuery also offers <a href="https://cloud.google.com/bigquery/pricing#flat_rate_pricing">flat-rate pricing</a> if you have a high number of large queries. For our team, we’ve been able to gain considerable new insights into our publishing process using BigQuery, which have helped us both refine our publishing process and focus more effort on the most popular technical topics. </p><p>If you haven’t already, check out what BigQuery can do using the <a href="https://cloud.google.com/bigquery/public-data/">BigQuery public datasets</a> and see <a href="https://cloud.google.com/docs/tutorials">what else you can do with GCP in our reference guides</a>.</p></div></div></body></html>
</description>
<pubDate>Fri, 20 Sep 2019 16:00:00 -0000</pubDate>
<guid>
https://cloud.google.com/blog/products/data-analytics/moving-a-publishing-workflow-to-bigquery-for-new-data-insights/
</guid>
<category>Google Cloud Platform</category>
<category>Data Analytics</category>
<media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCP_Data_Analytics3.max-600x600.jpg" width="540" height="540"/>
<og xmlns:og="http://ogp.me/ns#">
<type>article</type>
<title>
Moving a publishing workflow to BigQuery for new data insights
</title>
<description>
Using BigQuery from Google Cloud can help streamline an internal process, like web publishing, to get better data insights faster.
</description>
</og>
</item>

Note the full description in the standard RSS element, but the much cleaner description in the OG element:

<og xmlns:og="http://ogp.me/ns#">
<type>article</type>
<title>
Moving a publishing workflow to BigQuery for new data insights
</title>
<description>
Using BigQuery from Google Cloud can help streamline an internal process, like web publishing, to get better data insights faster.
</description>
...