I have successfully extracted text from a single PDF file using a combination of magick-r and tesseract, but I am hitting roadblocks when trying to process multiple images (this is for a non-profit organization).
Answers in bash are welcome, but I ask that they be complete and not skip the tesseract component.
The answers to this question are aimed at cleaning an image without OCR, so I am not sure how I could integrate the first answer here.
Image data:
My process:
library(tesseract)
library(dplyr)
library(stringr)
library(pdftools)
library(readr)
library(magick)
library(purrr)
# original data
#pdf <- https://github.com/pembletonc/Project44_Text_Extraction/blob/master/test-data/001_0145.pdf
#image file (note that size here doesn't match processing below because of 2mb limit)
file_name <- tools::list_files_with_exts(dir = "./test-data", exts = "pdf")
page_count <- pdf_info(file_name)$pages
multi_files <- list(pdftools::pdf_convert(file_name, pages = 1:page_count,
                                          filenames = paste0("./test-data/", "page", 1:page_count, ".png"), dpi = 250))
#or just get the file extensions for the file if already created
#multi_files <- list(tools::list_files_with_exts(dir = "./test-data", exts = "png"))
To read the images in as magick objects:
multi_images <- map(multi_files, image_read)
This creates a magick pointer object, printed as a tibble, with the images joined together as frames:
[[1]]
# A tibble: 5 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>
1 PNG     3243   2010 sRGB       FALSE        0 98x98
2 PNG     3247   2013 sRGB       FALSE  4515441 98x98
3 PNG     3243   2013 sRGB       FALSE  4559229 98x98
4 PNG     3247   2010 sRGB       FALSE  4270145 98x98
5 PNG     3247   2010 sRGB       FALSE  3212528 98x98
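One possible reading of this output, offered as a sketch rather than a confirmed diagnosis: because the file paths are wrapped in list(), the mapping function is called once with all five paths, which would explain a single element holding five frames. The base R snippet below illustrates only the list-versus-vector difference; the file names are hypothetical stand-ins and no images are read.

```r
# Hypothetical file names standing in for the PNGs created above.
paths <- paste0("./test-data/page", 1:5, ".png")

# Wrapping the vector in list() produces ONE element that holds all five paths.
wrapped <- list(paths)
length(wrapped)  # 1
length(paths)    # 5

# lapply() (like purrr::map()) calls the function once per element, so with
# the wrapped version the function receives all five paths in a single call.
calls_wrapped <- lapply(wrapped, function(p) length(p))
calls_plain   <- lapply(paths, function(p) length(p))
```

Here calls_wrapped has one entry (a call that saw five paths), while calls_plain has five entries (each call saw one path).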
How do I access each PNG in this object so that I can clean it and run it through OCR?
multi_text_clean <- function(images){
  Map(function(x) {
    x %>%
      image_crop(geometry_area(width = 2200, height = 1600, y_off = 500, x_off = 650)) %>%
      image_resize("2000x") %>%
      image_background("white", flatten = TRUE) %>%
      image_noise(noisetype = "Uniform") %>% # Adds uniform noise (image_reducenoise() is the noise peak elimination filter)
      image_enhance() %>% # Enhance image (minimize noise)
      image_normalize() %>%
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_contrast(sharpen = 1) %>%
      #image_deskew(threshold = 40) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  }, images)
}
This only runs it on the first image:
text_list <- multi_text_clean(multi_images)
(text_multi <- stringr::str_split(text_list, pattern = "\\s{5,}"))
[[1]]
[1] "Weather clear all day. A small arms inspection held at 1400 hrs. A recce party went\njout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nfor B Coy personnel by our YMCA Supervisor."
[2] ")\nWeather clear and cold all day. Personnel packed equipment early in the morning and |~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,,\nPW brought in by A Coy at 1800 hrs. PW was a deserter from 304 Regt 2 Pz division.\nNo other activity during the day. Patrols were sent out during the night by all coys}) u\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chateawv .\n\\Vieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys."
[3] "“y\neather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\npatrols sent out byall coys."
[4] "f\nWeather fair all day. No enemy was seen during the day. A Coy sent out patrols during\ntthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\nling but the CO cancelled it. Two Polish deserters from the German army walked into\n|A Coy lines."
[5] "iz\nWeather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\nnew location at 0830 hrs. Unit started to move to new location at 1200 hrs, Unit Bs\narrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets."
[6] "| 9\neather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.| |\nQuiet all day. No enemy activity during the day."
[7] "|\neather overcast and snowing. Intelligence Section set up another OP at MR 268814.\nNo enemy activity during the day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan, 1945."
[8] ":"
[9] "‘\nWeather clear and cold, Bm started to move at 0830 hrs. Bn reached Champlon"
[10] "&\nFamenine, MR 3182 at 1230 hrs. Bn relieved the HLI. Coys immediately took up"
[11] ":\npositions for all around defence."
[12] "4\n"
How can I run this over each image in this magick object?
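A minimal sketch of one way this might be approached, assuming the magick object supports length() and single-bracket subsetting so that each frame can be pulled out and processed separately. The helper names split_frames and process_each_frame are hypothetical and not part of magick or tesseract; the commented usage lines assume both packages are installed and the PNGs exist.

```r
# Split any object that supports length() and `[` into a list of
# one-element pieces. For a multi-frame magick image, each piece is
# assumed to be a single frame.
split_frames <- function(x) {
  lapply(seq_len(length(x)), function(i) x[i])
}

# Apply a per-frame function to every piece and collect the results.
# `process_frame` is a hypothetical stand-in for the cleaning + OCR steps.
process_each_frame <- function(x, process_frame) {
  lapply(split_frames(x), process_frame)
}

# Possible usage with the objects from the question (not run here):
# frames <- multi_images[[1]]
# text_list <- process_each_frame(frames, function(frame) {
#   png_raw <- magick::image_write(frame, format = "png")
#   tesseract::ocr(png_raw)
# })
```

Writing each frame out on its own, rather than writing the whole multi-frame object to a single PNG, is the design choice being illustrated; whether it resolves the behaviour seen above has not been verified against the test data.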