I am running this Java program in Google Colab:
https://github.com/allenai/science-parse
This is the code I am using:
# Get cli
!wget https://github.com/allenai/science-parse/releases/download/v2.0.3/science-parse-cli-assembly-2.0.3.jar
# install wget
!pip install wget
# Install Java
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!java -version
currentOut = 'outputFile'
currentIn = 'inputFile'
!java -Xmx6600M -jar science-parse-cli-assembly-2.0.3.jar {currentIn} -o {currentOut}
This is the Science Parse command-line interface, described here:
https://github.com/allenai/science-parse/blob/master/cli/README.md
It says:
RunSP can parse multiple files at the same time. You can parse thousands of PDFs like this. It will try to parse as many of them in parallel as your computer allows.
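To get a feel for what "as your computer allows" could mean on a Colab VM, a minimal sketch like the following reports how many CPU cores the runtime can see. It assumes that the parallelism is tied to the number of available CPU cores, which the README does not state explicitly; `visible_cpu_cores` is a hypothetical helper name.

```python
import os

def visible_cpu_cores():
    """Return the number of CPU cores this process may use."""
    # On Linux (which Colab runs on), the affinity mask reflects the cores
    # actually available to the process; fall back to os.cpu_count() elsewhere.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

print(f"Visible CPU cores: {visible_cpu_cores()}")
```

Running this once in a CPU runtime and once in a GPU runtime would show whether the core count differs between the two modes.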
Currently in Colab, in both GPU and CPU mode, it appears to run with 3 workers, judging only by the log output:
01:00:22.397 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.RunSP$ - Finished 10183.pdf
01:00:22.397 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.RunSP$ - Starting 11270.pdf
01:00:22.603 [ForkJoinPool-1-worker-3] INFO org.allenai.scienceparse.RunSP$ - Finished 11596.pdf
01:00:22.603 [ForkJoinPool-1-worker-3] INFO org.allenai.scienceparse.RunSP$ - Starting 13086.pdf
01:00:22.706 [main] INFO org.allenai.scienceparse.RunSP$ - Finished 12954.pdf
01:00:22.706 [main] INFO org.allenai.scienceparse.RunSP$ - Starting 13581.pdf
01:00:23.872 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.RunSP$ - Finished 11270.pdf
01:00:23.877 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.RunSP$ - Starting 12734.pdf
01:00:24.183 [main] WARN org.allenai.scienceparse.Parser - Exception Page 5 is an image and allow OCR is turned off while getting sections. Section data will be missing.
01:00:24.190 [main] INFO org.allenai.scienceparse.RunSP$ - Finished 13581.pdf
01:00:24.190 [main] INFO org.allenai.scienceparse.RunSP$ - Starting 11083.pdf
01:00:24.460 [ForkJoinPool-1-worker-3] INFO org.allenai.scienceparse.RunSP$ - Finished 13086.pdf
01:00:24.460 [ForkJoinPool-1-worker-3] INFO org.allenai.scienceparse.RunSP$ - Starting 12247.pdf
01:00:25.723 [ForkJoinPool-1-worker-1] WARN org.allenai.scienceparse.Parser - Exception Page 4 is an im
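Rather than eyeballing the output, the number of distinct threads can be counted from the log text. This is only a sketch: it assumes the line format shown above (timestamp, thread name in square brackets, then the message), and `count_threads` is a hypothetical helper, not part of Science Parse.

```python
import re
from collections import Counter

# Captures the thread name in brackets right after the timestamp,
# e.g. "[ForkJoinPool-1-worker-3]" or "[main]".
THREAD_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} \[([^\]]+)\]")

def count_threads(log_text):
    """Map each thread name to the number of 'Starting' lines it logged."""
    counts = Counter()
    for line in log_text.splitlines():
        match = THREAD_RE.match(line)
        if match and " - Starting " in line:
            counts[match.group(1)] += 1
    return counts

# A few lines copied from the log above.
sample = """\
01:00:22.397 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.RunSP$ - Starting 11270.pdf
01:00:22.603 [ForkJoinPool-1-worker-3] INFO org.allenai.scienceparse.RunSP$ - Starting 13086.pdf
01:00:22.706 [main] INFO org.allenai.scienceparse.RunSP$ - Starting 13581.pdf
01:00:23.877 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.RunSP$ - Starting 12734.pdf
"""

counts = count_threads(sample)
print(len(counts), dict(counts))
# prints: 3 {'ForkJoinPool-1-worker-1': 2, 'ForkJoinPool-1-worker-3': 1, 'main': 1}
```

On these lines it reports three threads: two pool workers plus `main`, which matches the 3 workers I see.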
Is there a way to run this on the Colab GPU so that even more workers run in parallel?