Remove HTML from the body of a text file
1 vote
/ March 4, 2020

I am currently writing a function that gets an album's review and rating by downloading the page from Pitchfork and stripping out the HTML. The result should be a list of 2 elements: the review and the score for that album. This is what I have so far; I am still working out what to return, the regular expression for the HTML part, and the paste0 call. Thank you for your time!

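library(RCurl)    # getURL()
library(stringr)  # str_replace()

pitchfork = function(url){
  save = getURL(url)               # download the raw HTML as one string
  cat(save,file = "review.txt")    # keep a copy on disk
  # strip everything up to and including the opening tag of the review body
  a1 = '<div class="contents dropcap"><p>'
  b1 = str_replace(save, paste0("^.*",a1),"")
  # strip everything from the end-of-review marker onwards
  a2 = '</div><a class="end-mark-container" href="/">'
  b2 = str_replace(b1, paste0(a2,".*$"),"")
  # TODO: still undecided what to return here
}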

1 Answer

0 votes
/ March 4, 2020

How about something like this?

library(xml2)
library(rvest)
library(tidyverse)

url <- "http://pitchfork.com/reviews/albums/grimes-miss-anthropocene"
html <- read_html(url)

review <- html %>%
    html_nodes("p") %>%
    html_text() %>%
    enframe("paragraph_no", "text")
review
## A tibble: 14 x 2
#   paragraph_no text
#          <int> <chr>
# 1            1 Best new music
# 2            2 Grimes’ first project as a bona fide pop star is more morose th…
# 3            3 In 2011, Grimes was eager to say in an interview that she had “…
# 4            4 Miss Anthropocene is Grimes’ fifth album and her first as that …
# 5            5 The result is a record that’s more morose than her previous wor…
# 6            6 In November 2018, Grimes released “We Appreciate Power,” a coll…
# 7            7 When Grimes veers away from high concept toward examining intim…
# 8            8 Miss Anthropocene thrills when it reveals a refined, linear evo…
# 9            9 So much about the actual music of Miss Anthropocene succeeds th…
#10           10 And that’s the obstacle, the slimy mouthfeel, standing in the w…
#11           11 Correction: An earlier version of this review erroneously state…
#12           12 Listen to our Best New Music playlist on Spotify and Apple Musi…
#13           13 Buy: Rough Trade
#14           14 (Pitchfork may earn a commission from purchases made through af…

review is a tibble that contains the review split into paragraphs; some further cleaning may be needed (for example, removing the first and last rows).
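The cleaning step mentioned above could look roughly like the following. This is only a sketch on a small stand-in tibble (`review_demo` and its texts are made up for illustration), and it assumes that the first and last rows are the ones to drop, which will not hold for every page.

```r
library(tibble)

# Hypothetical stand-in for the scraped `review` tibble
review_demo <- enframe(
    c("Best new music",
      "First paragraph of the review.",
      "Second paragraph of the review.",
      "(Affiliate disclaimer.)"),
    "paragraph_no", "text")

# Drop the first and the last row (assumed to be non-review text)
body_rows <- review_demo[-c(1, nrow(review_demo)), ]

# Collapse the remaining paragraphs into a single string
review_text <- paste(body_rows$text, collapse = "\n\n")
review_text
```

Collapsing to one string is optional; it is shown because the question asks for the review as a single list element.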

For the score, we can use a class attribute selector

score <- html %>% html_nodes("[class='score']") %>% html_text() %>% as.numeric()
score
#[1] 8.2
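To illustrate what the attribute selector does, here is a small offline sketch on made-up markup (the `bnm` class and both values are hypothetical, not taken from Pitchfork's pages). Under standard CSS selector semantics, `[class='score']` only matches elements whose class attribute is exactly `score`, whereas `.score` matches any element that has `score` among its classes.

```r
library(xml2)
library(rvest)

# Hypothetical markup: one element with class "score" only,
# and one that carries an additional class
page <- read_html(
    '<div><span class="score">8.2</span><span class="score bnm">9.0</span></div>')

# Exact attribute match: only the first span qualifies
exact <- page %>% html_nodes("[class='score']") %>% html_text() %>% as.numeric()

# Class selector: both spans qualify
any_score <- page %>% html_nodes(".score") %>% html_text() %>% as.numeric()

exact
any_score
```

Which of the two is appropriate depends on the page's actual markup, so it is worth checking the selector against the live HTML before relying on it.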

Wrapping up (in a function)

Let's wrap everything in a function that returns a list with the review as a tibble and the score as a number.

get_pitchfork_data <- function(url) {
    html <- read_html(url)
    list(
        # all <p> elements, one row per paragraph
        review = html %>%
            html_nodes("p") %>%
            html_text() %>%
            trimws() %>%
            enframe("paragraph_no", "text"),
        # elements whose class attribute is exactly "score"
        score = html %>%
            html_nodes("[class='score']") %>%
            html_text() %>%
            as.numeric())
}
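Because the function above needs network access, the same extraction logic can be tried offline on an inline HTML string. The sketch below is one possible variant, not part of the original answer: the helper name `parse_review` and the sample markup are hypothetical, and the `NA` fallback for a missing score is an added assumption about what the caller would want.

```r
library(xml2)
library(rvest)
library(tibble)

# Hypothetical helper: same extraction, but on an already parsed document
parse_review <- function(html) {
    score <- html %>%
        html_nodes("[class='score']") %>%
        html_text() %>%
        as.numeric()
    list(
        review = html %>%
            html_nodes("p") %>%
            html_text() %>%
            trimws() %>%
            enframe("paragraph_no", "text"),
        # no matching node gives numeric(0); return NA in that case
        score = if (length(score) == 0) NA_real_ else score[1])
}

# Made-up markup standing in for a review page
doc <- read_html(
    '<div><span class="score">8.2</span><p> First. </p><p>Second.</p></div>')
result <- parse_review(doc)

# A page without a score element
no_score <- parse_review(read_html("<p>Only text</p>"))
```

Separating parsing from downloading also makes it easier to rerun the extraction on a saved copy of the page.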

Test 1:

Grimes - Miss Anthropocene

get_pitchfork_data("http://pitchfork.com/reviews/albums/grimes-miss-anthropocene")
#$review
## A tibble: 14 x 2
#   paragraph_no text
#          <int> <chr>
# 1            1 Best new music
# 2            2 Grimes’ first project as a bona fide pop star is more morose th…
# 3            3 In 2011, Grimes was eager to say in an interview that she had “…
# 4            4 Miss Anthropocene is Grimes’ fifth album and her first as that …
# 5            5 The result is a record that’s more morose than her previous wor…
# 6            6 In November 2018, Grimes released “We Appreciate Power,” a coll…
# 7            7 When Grimes veers away from high concept toward examining intim…
# 8            8 Miss Anthropocene thrills when it reveals a refined, linear evo…
# 9            9 So much about the actual music of Miss Anthropocene succeeds th…
#10           10 And that’s the obstacle, the slimy mouthfeel, standing in the w…
#11           11 Correction: An earlier version of this review erroneously state…
#12           12 Listen to our Best New Music playlist on Spotify and Apple Musi…
#13           13 Buy: Rough Trade
#14           14 (Pitchfork may earn a commission from purchases made through af…
#
#$score
#[1] 8.2

Test 2:

Radiohead - OK Computer (reissue)

get_pitchfork_data("https://pitchfork.com/reviews/albums/radiohead-ok-computer-oknotok-1997-2017/")
#$review
## A tibble: 12 x 2
#   paragraph_no text
#          <int> <chr>
# 1            1 Best new reissue
# 2            2 Twenty years on, Radiohead revisit their 1997 masterpiece with …
# 3            3 As they regrouped to figure out what their third album might be…
# 4            4 It’s still funny to think, two decades later, that Thom Yorke’s…
# 5            5 It’s unclear what happened to that album. OK Computer obviously…
# 6            6 OKNOTOK is something a little more interesting than a remaster …
# 7            7 But “Lift’s” reputation for positivity might be a little confus…
# 8            8 The most fun to be had with OKNOTOK is in these line-blurring m…
# 9            9 This fondness for camp and schlock has always been latent in Ra…
#10           10 The ghost of Bond followed them once they decamped from their s…
#11           11 Radiohead have been at least as brilliant at packaging and posi…
#12           12 Now that they have arrived at an autumnal, valedictory stage in…
#
#$score
#[1] 10
...