We can use the following sequence of manipulations.
library(pdftools)  # pdf_text()
library(stringr)   # str_split(), str_split_fixed()

# pdf_text() returns a character vector with one element per page
dfl <- pdf_text(url)
# Keep only the pages that contain table rows (drop the first and last pages)
dfl <- dfl[2:(length(dfl) - 1)]
# Get rid of the last line (the footer) on every page
dfl <- gsub("\nFTSE Russell \\| FTSE 100 – Historic Additions and Deletions, November 2018[ ]+?\\d{1,2} of 12\n", "", dfl)
# Split not simply on \n, but on a \n that is immediately followed by a date
# (positive lookahead); a small illustration of this split appears after the output below
dfl <- str_split(dfl, pattern = "(\n)(?=\\d{2}-\\w{3}-\\d{2})")
# For each page...
dfl <- lapply(dfl, function(df) {
  # Split each record into 4 columns on an optional \n followed by at least
  # two spaces (sometimes a record yields 5 pieces because of the issue you
  # mentioned, which is why str_split_fixed() with n = 4 is useful here).
df <- str_split_fixed(df, "(\n)*[ ]{2,}", 4)
  # Collapse any remaining "\n plus at least two spaces" sequences (they can
  # only be left over in the last column) into a single space.
df <- gsub("(\n)*[ ]{2,}", " ", df)
colnames(df) <- c("Date", "Added", "Deleted", "Notes")
df[df == ""] <- NA
  # The first row holds the page header rather than data, so drop it
  data.frame(df[-1, ])
})
head(dfl[[1]])
# Date Added Deleted Notes
# 1 19-Jan-84 Charterhouse J Rothschild Eagle Star Corporate Event - Acquisition of Eagle Star by BAT Industries
# 2 02-Apr-84 Lonrho Magnet & Southerns <NA>
# 3 02-Jul-84 Reuters Edinburgh Investment Trust <NA>
# 4 02-Jul-84 Woolworths Barratt Development <NA>
# 5 19-Jul-84 Enterprise Oil Bowater Corporation Corporate Event - Sub division of company into Bowater Inds and Bowater Inc
# 6 01-Oct-84 Willis Faber Wimpey (George) & Co <NA>
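As a small illustration of the lookahead split used above (the string x below is invented, not taken from the PDF): a plain split on \n would also break the wrapped Notes line, whereas the lookahead splits only before a new date.

library(stringr)
# Two records glued together; the second \n sits inside a wrapped Notes field
x <- "19-Jan-84  A plc  B plc  Note line 1\ncontinued note\n02-Apr-84  C plc  D plc"
str_split(x, "(\n)(?=\\d{2}-\\w{3}-\\d{2})")[[1]]
# [1] "19-Jan-84  A plc  B plc  Note line 1\ncontinued note"
# [2] "02-Apr-84  C plc  D plc"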
I suppose that in the end you will want a single data frame rather than a list of them. For that you can use do.call(rbind, dfl).
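For example (full_df is just a hypothetical name; the dplyr::bind_rows() line is an optional alternative, not part of the pipeline above):

# Stack the per-page data frames into a single one; this works because every
# page was given the same four column names above
full_df <- do.call(rbind, dfl)
# Equivalent option if you already use dplyr:
# full_df <- dplyr::bind_rows(dfl)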