How to (quickly) batch-process multiple images and run them through tesseract
0 votes
/ October 4, 2019

I have successfully extracted text from a single PDF using a combination of the R magick package and tesseract, but I am hitting roadblocks when trying to process several images (this is for a non-profit).

Answers in bash are welcome, but I ask that they be complete and not skip the tesseract component.

The answers to this other question focus on cleaning up the image without OCR, so I am not sure how to integrate the first answer here.

Image data: (image not shown)

My process:

library(tesseract)
library(dplyr)
library(stringr)
library(pdftools)
library(readr)
library(magick)
library(purrr)
# original data
#pdf <- https://github.com/pembletonc/Project44_Text_Extraction/blob/master/test-data/001_0145.pdf

#image file (note that size here doesn't match processing below because of 2mb limit)

file_name <- tools::list_files_with_exts(dir = "./test-data", exts = "pdf")
page_count <- pdf_info(file_name)$pages  

multi_files <- list(pdftools::pdf_convert(file_name, page = 1:page_count,
                                          filenames = paste0("./test-data/", "page", 1:page_count, ".png"),dpi = 250))

#or just get the file paths if the PNG files were already created
#multi_files <- list(tools::list_files_with_exts(dir = "./test-data", exts = "png"))

To read the images in as magick objects:

multi_images <- map(multi_files, image_read)

This creates a list holding a single magick pointer object (printed as a tibble), with the images joined as frames of one object:

[[1]]
# A tibble: 5 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG     3243   2010 sRGB       FALSE        0 98x98  
2 PNG     3247   2013 sRGB       FALSE  4515441 98x98  
3 PNG     3243   2013 sRGB       FALSE  4559229 98x98  
4 PNG     3247   2010 sRGB       FALSE  4270145 98x98  
5 PNG     3247   2010 sRGB       FALSE  3212528 98x98  
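
A minimal sketch of how such a multi-frame object can be inspected and split, assuming only the magick package. The two built-in ImageMagick images written to temporary files are stand-ins for the page PNGs, and the variable names are illustrative.

```r
library(magick)

# Stand-ins for the page PNGs: two built-in ImageMagick images written to
# temporary files (an assumption so the sketch runs without the test data).
paths <- c(tempfile(fileext = ".png"), tempfile(fileext = ".png"))
image_write(image_read("logo:"), path = paths[1], format = "png")
image_write(image_read("rose:"), path = paths[2], format = "png")

# Reading a character vector of paths gives ONE magick object
# with one frame per file.
pages <- image_read(paths)
n_pages <- length(pages)  # number of frames in the object

# `[` returns a magick image holding only the selected frame.
first_page <- pages[1]

# Split the object into a plain list of single-frame images.
page_list <- lapply(seq_along(pages), function(i) pages[i])
```

Each element of page_list can then be cleaned and passed to OCR independently.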

How do I access each PNG in this object so that I can clean it and run it through OCR?

multi_text_clean <- function(images){

  Map(function(x) {
    x %>% 
      image_crop(geometry_area(width = 2200, height = 1600, y_off = 500, x_off = 650)) %>%  
      image_resize("2000x") %>%
      image_background("white", flatten = TRUE) %>% 
      image_noise(noisetype = "Uniform") %>%          # Reduce noise in image using a noise peak elimination filter
      image_enhance() %>%                             # Enhance image (minimize noise)
      image_normalize() %>% 
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_contrast(sharpen = 1) %>%
      #image_deskew(threshold = 40) %>% 
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  }, images)

}

This only runs on the first image:

text_list <-  multi_text_clean(multi_images)
(text_multi <- stringr::str_split(text_list, pattern = "\\s{5,}"))

[[1]]
 [1] "Weather clear all day. A small arms inspection held at 1400 hrs. A recce party went\njout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nfor B Coy personnel by our YMCA Supervisor."                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
 [2] ")\nWeather clear and cold all day. Personnel packed equipment early in the morning and |~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,,\nPW brought in by A Coy at 1800 hrs. PW was a deserter from 304 Regt 2 Pz division.\nNo other activity during the day. Patrols were sent out during the night by all coys}) u\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chateawv .\n\\Vieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys."
 [3] "“y\neather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\npatrols sent out byall coys."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
 [4] "f\nWeather fair all day. No enemy was seen during the day. A Coy sent out patrols during\ntthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\nling but the CO cancelled it. Two Polish deserters from the German army walked into\n|A Coy lines."                                                                                                                                                                                                                                                                                                                          
 [5] "iz\nWeather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\nnew location at 0830 hrs. Unit started to move to new location at 1200 hrs, Unit   Bs\narrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
 [6] "| 9\neather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.| |\nQuiet all day. No enemy activity during the day."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
 [7] "|\neather overcast and snowing. Intelligence Section set up another OP at MR 268814.\nNo enemy activity during the day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan, 1945."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
 [8] ":"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
 [9] "‘\nWeather clear and cold, Bm started to move at 0830 hrs. Bn reached Champlon"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
[10] "&\nFamenine, MR 3182 at 1230 hrs. Bn relieved the HLI. Coys immediately took up"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[11] ":\npositions for all around defence."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[12] "4\n"                                                                                                                                                                                                                             

How can I run this over every image in this magick object?
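
One likely cause: wrapping the path vector in list() produces a list of length one, so map() calls image_read() once with all the paths and Map() then runs the cleaning function once, on a single multi-frame object. The sketch below loops over the paths instead, so each PNG is read, cleaned, and passed to tesseract on its own. It is an illustration under stated assumptions, not a tested fix for this data set: the helper names ocr_one_page and ocr_all_pages are hypothetical, the native pipe |> needs R 4.1 or later, and image_noise() (which adds noise) is swapped for image_reducenoise() to match the intent stated in the original comment.

```r
library(magick)
library(tesseract)

# Hypothetical helper: clean ONE single-frame image and return its OCR text.
# The crop and resize values are copied from the question.
ocr_one_page <- function(page, engine) {
  page |>
    image_crop(geometry_area(width = 2200, height = 1600,
                             x_off = 650, y_off = 500)) |>
    image_resize("2000x") |>
    image_background("white", flatten = TRUE) |>
    image_reducenoise() |>               # swapped in for image_noise()
    image_enhance() |>
    image_normalize() |>
    image_convert(type = "Grayscale") |>
    image_trim(fuzz = 40) |>
    image_contrast(sharpen = 1) |>
    image_write(format = "png", density = "300x300") |>  # raw PNG bytes
    tesseract::ocr(engine = engine)
}

# Hypothetical helper: read and OCR each path separately, so every page
# goes through the full pipeline. `paths` must be a plain character vector,
# not a list wrapping one.
ocr_all_pages <- function(paths,
                          engine = tesseract(options = list(preserve_interword_spaces = 1))) {
  vapply(paths,
         function(p) ocr_one_page(image_read(p), engine),
         character(1),
         USE.NAMES = FALSE)
}
```

Assuming the PNG files already exist in ./test-data, this could be called as text_by_page <- ocr_all_pages(tools::list_files_with_exts(dir = "./test-data", exts = "png")), which returns one character string per page. Reading one file at a time also avoids holding every page in memory at once.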

1 Answer

1 vote
/ October 4, 2019

You can do the following in ImageMagick.

Input: (image not shown)

convert img.jpg -negate -lat 20x20+10% -negate img_lat.jpg
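
For readers staying in R, the command above can be approximated with the magick package, which provides image_negate() and image_lat(). This is a hedged sketch: lat_clean is a hypothetical helper name, the geometry string is copied from the command above, the native pipe |> needs R 4.1 or later, and the result may differ slightly from the command-line output depending on the ImageMagick version magick is built against.

```r
library(magick)

# Hypothetical helper mirroring:
#   convert in.jpg -negate -lat 20x20+10% -negate out.jpg
# Negate, apply local adaptive thresholding, then negate back.
lat_clean <- function(img, geometry = "20x20+10%") {
  img |>
    image_negate() |>
    image_lat(geometry = geometry) |>
    image_negate()
}
```

The cleaned image can then be written with image_write() and passed to tesseract::ocr() in the same way as in the question.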



Alternatively, I have a bash shell script called textcleaner that uses ImageMagick and can be run as follows:

textcleaner -f 20 -o 10 img.jpg img_textcleaner.jpg


