У меня есть таблицы в формате, аналогичном этому ...
что я пытаюсь извлечь текст и ссылки с помощью R.
# write the HTML code from R to reproduce
x <- "
<html>
<head>
</head>
<body>
<table>
<tbody>
<tr>
<th>site</th>
<th>country</th>
</tr>
<tr>
<td> <a href='http://www.nbc.com'>NBC</a> <a href='https://www.cnn.com'>CNN</a> <a href='https://www.nytimes.com'>NY Times</a> </td>
<td> US</td>
</tr>
<tr>
<td> <a href='http://www.dw-world.de/'>DW</a> </td>
<td> DE</td>
</tr>
<tr>
<td></td>
<td>FR</td>
</tr>
<tr>
<td> <a href='http://www.bbc.co.uk'>BBC</a> <a href='https://www.itv.co.uk'>ITV</a></td>
<td> UK</td>
</tr>
</tbody>
</table>
</body>
</html>"
write.table(x = x, file = "table.html", quote = FALSE,
col.names = FALSE,
row.names = FALSE)
file.show("table.html")
В конечном итоге я хочу получить такой аккуратный кадр данных ...
# # A tibble: 7 x 3
# site site_name country
# <chr> <chr> <chr>
# 1 http://www.nbc.com NBC US
# 2 https://www.cnn.com CNN US
# 3 https://www.nytimes.com NY Times US
# 4 http://www.dw-world.de/ DW DE
# 5 NA NA FR
# 6 http://www.bbc.co.uk BBC UK
# 7 https://www.itv.co.uk ITV UK
Я играл с rvest
функциями, но не могу извлечь ссылки и вспомнить, из какой строки они поступают, чтобы построить фрейм данных, как указано выше?
library(tidyverse)
library(rvest)
h <- read_html("table.html")
# a table without any of the links... no good
h %>%
html_table() %>%
.[[1]]
# site country
# 1 NBC CNN NY Times US
# 2 DW DE
# 3 FR
# 4 BBC ITV UK
# pulls the site urls
h %>%
html_nodes("a") %>%
html_attr("href")
# [1] "http://www.nbc.com" "https://www.cnn.com" "https://www.nytimes.com" "http://www.dw-world.de/" "http://www.bbc.co.uk" "https://www.itv.co.uk"
# pulls the site names
h %>%
html_nodes("a") %>%
html_text()
# [1] "NBC" "CNN" "NY Times" "DW" "BBC" "ITV"
# looks promising, perhaps can combine with results from html_table()
library(XML)
tables <- getNodeSet(htmlParse("table.html"), "//table")
hrefFun <- function(x){
xpathSApply(x,'./a',xmlAttrs)
}
readHTMLTable(doc = tables[[1]], elFun = hrefFun)
# V1 V2
# 1 list() list()
# 2 c(href = "http://www.nbc.com", href = "https://www.cnn.com", href = "https://www.nytimes.com") list()
# 3 http://www.dw-world.de/ list()
# 4 list() list()
# 5 c(href = "http://www.bbc.co.uk", href = "https://www.itv.co.uk") list()
# looks promising for the rows.. don't know where to go from here
h %>%
html_nodes("tr")
# {xml_nodeset (5)}
# [1] <tr>\n<th>site</th>\r\n<th>country</th>\r\n</tr>\n
# [2] <tr>\n<td> <a href="http://www.nbc.com">NBC</a> <a href="https://www.cnn.com">CNN</a> <a href="https://www.nytimes.com">NY Times</a> ...
# [3] <tr>\n<td> <a href="http://www.dw-world.de/">DW</a> </td>\r\n<td> DE</td>\r\n</tr>\n
# [4] <tr>\n<td></td>\r\n<td>FR</td>\r\n</tr>\n
# [5] <tr>\n<td> <a href="http://www.bbc.co.uk">BBC</a> <a href="https://www.itv.co.uk">ITV</a>\n</td>\r\n<td> UK</td>\r\n</tr>