How do I parse images with tika-app-*.jar?
September 19, 2018

I downloaded tika-app-1.18.jar and jdk-8u181-linux-i586.tar.gz, extracted the JDK archive, and added its bin directory to the PATH environment variable. When I try to extract the text from an image, Tika prints the file's metadata along with some warnings, but none of the text in the image. When I pass a text file instead of an image, it works fine. Does tika-app-1.18.jar need other .jar files to parse images?

The output of some commands is shown below:

#java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) Server VM (build 25.181-b13, mixed mode)

#java -jar tika-app-1.18.jar --version
Sep 19, 2018 12:19:09 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Sep 19, 2018 12:19:10 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Sep 19, 2018 12:19:10 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Apache Tika 1.18

#java -jar tika-app-1.18.jar images/nitin.png -t
Sep 19, 2018 12:19:45 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Sep 19, 2018 12:19:46 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Sep 19, 2018 12:19:46 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Transparency Alpha" content="none"/>
<meta name="tiff:ImageLength" content="460"/>
<meta name="Compression CompressionTypeName" content="deflate"/>
<meta name="Data BitsPerSample" content="8 8 8"/>
<meta name="Data PlanarConfiguration" content="PixelInterleaved"/>
<meta name="Dimension VerticalPixelSize" content="0.26462027"/>
<meta name="IHDR" content="width=819, height=460, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none"/>
<meta name="Chroma ColorSpaceType" content="RGB"/>
<meta name="Content-Length" content="6837"/>
<meta name="tiff:BitsPerSample" content="8 8 8"/>
<meta name="Content-Type" content="image/png"/>
<meta name="height" content="460"/>
<meta name="gAMA" content="45455"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.image.ImageParser"/>
<meta name="pHYs" content="pixelsPerUnitXAxis=3779, pixelsPerUnitYAxis=3779, unitSpecifier=meter"/>
<meta name="Chroma Gamma" content="0.45455"/>
<meta name="Dimension PixelAspectRatio" content="1.0"/>
<meta name="resourceName" content="nitin.png"/>
<meta name="sRGB" content="Perceptual"/>
<meta name="Compression NumProgressiveScans" content="1"/>
<meta name="Dimension HorizontalPixelSize" content="0.26462027"/>
<meta name="Chroma BlackIsZero" content="true"/>
<meta name="Compression Lossless" content="true"/>
<meta name="width" content="819"/>
<meta name="Dimension ImageOrientation" content="Normal"/>
<meta name="tiff:ImageWidth" content="819"/>
<meta name="Chroma NumChannels" content="3"/>
<meta name="Data SampleFormat" content="UnsignedIntegral"/>
<title/>
</head>
<body/></html>
...
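The output above lists TesseractOCRParser under X-Parsed-By, yet the body is empty, so one way to narrow the problem down is to run Tesseract directly on the same image, outside of Tika. The following is only a sketch under assumptions: `check_ocr` is a hypothetical helper name, and it assumes that the `tesseract` binary on PATH is the one Tika invokes. It uses `tesseract --list-langs` to show the installed language data and `tesseract <image> stdout` to print the recognized text.

```shell
# Hypothetical helper (not part of Tika): run Tesseract on its own
# to see whether OCR works outside of tika-app.
# Assumption: the `tesseract` found on PATH is the one Tika calls.
check_ocr() {
    img="$1"

    # Tika can only apply OCR if the tesseract binary is reachable.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract: not found on PATH"
        return 1
    fi

    if [ ! -f "$img" ]; then
        echo "image not found: $img"
        return 1
    fi

    # Show installed language data; Tesseract's default language is
    # "eng", so it should appear in this list.
    tesseract --list-langs

    # Print the recognized text to stdout instead of writing a file.
    tesseract "$img" stdout
}
```

Called as `check_ocr images/nitin.png`, this would show whether Tesseract itself produces any text for the image. If it prints nothing or fails here, the cause is likely in the Tesseract installation or the image rather than in missing .jar files.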