Question

The code below creates some test data:

df <- data.frame(page_id = c(3,3,3,3,3), element_id = c(19, 22, 26, 31, 31), 
                 text = c("The Protected Percentage of your property value thats has been chosen is 0%", 
                          "The Arrangement Fee payable at complettion: £50.00", 
                          "Interest rate is fixed for the life of the period is: 5.40%", 
                          "The Benchmark rate that will be used to calculate any early repayment 2.08%", 
                          "The property value used in this scenario is 275,000.00"))

I have many PDF files from which I want to extract the same information using regular expressions. I have managed to extract everything I need from one PDF file. The code for that is below, with comments:

library("textreadr")
library("pdftools")
library("tidyverse")
library("tidytext")
library("tm")

# read in the PDF file
Off_let_data <- read_pdf("50045400_K021_2017-V001_300547.pdf")

# list the PDF files in the folder and keep only the first one
files <- list.files(pattern = "\\.pdf$")[1]

# extract the account number from the first pdf file
acc_num <- str_extract(files, "^\\d+")

# The RegEx's used to extract the relevant information
protec_per_reg <- "Protected\\sP\\w+\\sof"
Arr_Fee_reg <- "^The\\sArrangement\\sF\\w+"
Fix_inter_reg <- "Fixed\\sI\\w+\\sR\\w+"
Bench_rate_reg <- "Benchmark\\sR\\w+\\sthat"

# create a df that only includes the page 3 rows which match any of the above regexes
Off_let <- Off_let_data %>%
  filter(page_id == 3,
         str_detect(text, protec_per_reg) |
           str_detect(text, Arr_Fee_reg) |
           str_detect(text, Fix_inter_reg) |
           str_detect(text, Bench_rate_reg))

# Now only extract the numbers from the above DF
off_let_num <- str_extract(Off_let$text, "\\d+\\.?\\d+")

# The first element is always a NA value - based on the structure of these PDF files
# replace the first element of this character vector with the below
off_let_num[is.na(off_let_num)] <- str_extract(Off_let$text, "\\d+%")[[1]] 
off_let_num

The variable off_let_num is a character vector holding the 4 elements that are required from the PDF file.

Now I would like to apply all of these steps to a folder containing many PDF files. I have already managed to read all the PDF files into separate data frames, using the code below:

# list all PDF files in the working directory (pattern is a regex, not a glob)
file_list <- list.files(pattern = "\\.pdf$")

# Read each PDF file into a separate data frame named off_<file name>
for (file_name in file_list) {
  assign(paste0("off_", sub("\\.pdf$", "", file_name)), read_pdf(file_name))
}

I now have many data frames in my workspace. I would like to apply the same process that I applied to the single PDF file at the start to all of the data frames whose names begin with 'off'.

I assume one approach is to turn the initial process into a function and then call that function on every data frame beginning with 'off'. The results should be collected in a data frame that contains all 4 elements extracted from each of these PDF files. I am not sure how to achieve this. Please help!
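One possible shape for that function-plus-loop idea is sketched below. It is only a sketch under assumptions: the helper name extract_offer_values, the column labels, and the single number regex are invented for illustration and are not part of the original code. It also assumes the PDFs are kept in a named list rather than in separate 'off_...' objects, and it uses the test data from the question as a stand-in for read_pdf() output.

```r
library(stringr)

# Regexes copied from the question; the names are labels chosen for this sketch.
patterns <- c(
  protected_pct   = "Protected\\sP\\w+\\sof",
  arrangement_fee = "^The\\sArrangement\\sF\\w+",
  fixed_rate      = "Fixed\\sI\\w+\\sR\\w+",
  benchmark_rate  = "Benchmark\\sR\\w+\\sthat"
)

# Hypothetical helper (not from the question): for one data frame shaped like
# read_pdf() output, take the first line on the given page that matches each
# regex and return the first number found on that line.
extract_offer_values <- function(pdf_df, page = 3) {
  page_text <- pdf_df$text[pdf_df$page_id == page]
  vapply(patterns, function(p) {
    hit <- page_text[which(str_detect(page_text, p))][1]  # NA if no line matches
    # Assumption: one pattern covering "0", "50.00" and "5.40" stands in for
    # the original two-step extraction with the NA patch-up.
    str_extract(hit, "\\d+(?:\\.\\d+)?")
  }, character(1))
}

# Stand-in for read_pdf() output: the test data from the question.
df <- data.frame(page_id = c(3,3,3,3,3), element_id = c(19, 22, 26, 31, 31),
                 text = c("The Protected Percentage of your property value thats has been chosen is 0%",
                          "The Arrangement Fee payable at complettion: £50.00",
                          "Interest rate is fixed for the life of the period is: 5.40%",
                          "The Benchmark rate that will be used to calculate any early repayment 2.08%",
                          "The property value used in this scenario is 275,000.00"))

# In practice the list could be built from the folder, for example (untested here):
# files    <- list.files(pattern = "\\.pdf$")
# pdf_list <- setNames(lapply(files, textreadr::read_pdf), str_extract(files, "^\\d+"))
pdf_list <- list("50045400" = df)

# One row per PDF, keyed by account number, with one column per extracted value.
results <- do.call(rbind, lapply(names(pdf_list), function(acc) {
  data.frame(acc_num = acc, as.list(extract_offer_values(pdf_list[[acc]])),
             stringsAsFactors = FALSE)
}))
print(results)
```

Keeping the data frames in a named list avoids having to look up objects by name with assign() and get(), and a regex that finds no matching line in a given PDF produces NA in that column rather than an error, so one unusual file does not stop the whole run.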

Extracting data from multiple PDF files in R
