Java runtime exception in Solr while indexing data
22 March 2019

I am using Nutch 2.3.1, Solr 6.5.1, and MongoDB to crawl and index data. Crawling works with up to 5 URLs in the seed.text file, but when I try to crawl 499 URLs, the indexing step fails with the following error.

> $ runtime/local/bin/nutch solrindex http://localhost:8983/solr/nutch -all
    IndexingJob: starting
    SolrIndexerJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local505251134_0001
            at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
            at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
            at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
            at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
            at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

My Nutch log file looks like this:

> 2019-03-22 16:45:07,991 INFO  indexer.IndexingJob - IndexingJob: starting
2019-03-22 16:45:08,203 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2019-03-22 16:45:08,203 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2019-03-22 16:45:08,204 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2019-03-22 16:45:08,204 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2019-03-22 16:45:08,208 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2019-03-22 16:45:08,358 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2019-03-22 16:45:09,206 WARN  conf.Configuration - file:/tmp/hadoop-USER/mapred/staging/USER505251134/.staging/job_local505251134_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2019-03-22 16:45:09,208 WARN  conf.Configuration - file:/tmp/hadoop-USER/mapred/staging/USER505251134/.staging/job_local505251134_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2019-03-22 16:45:09,264 WARN  conf.Configuration - file:/tmp/hadoop-USER/mapred/local/localRunner/USER/job_local505251134_0001/job_local505251134_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2019-03-22 16:45:09,265 WARN  conf.Configuration - file:/tmp/hadoop-USER/mapred/local/localRunner/USER/job_local505251134_0001/job_local505251134_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2019-03-22 16:45:09,390 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2019-03-22 16:45:09,408 INFO  solr.SolrMappingReader - source: content dest: content
2019-03-22 16:45:09,408 INFO  solr.SolrMappingReader - source: title dest: title
2019-03-22 16:45:09,408 INFO  solr.SolrMappingReader - source: host dest: host
2019-03-22 16:45:09,408 INFO  solr.SolrMappingReader - source: batchId dest: batchId
2019-03-22 16:45:09,408 INFO  solr.SolrMappingReader - source: boost dest: boost
2019-03-22 16:45:09,408 INFO  solr.SolrMappingReader - source: digest dest: digest
2019-03-22 16:45:09,408 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2019-03-22 16:45:09,410 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2019-03-22 16:45:09,411 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2019-03-22 16:45:09,411 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2019-03-22 16:45:09,411 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2019-03-22 16:45:09,411 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2019-03-22 16:45:09,411 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2019-03-22 16:45:09,625 INFO  solr.SolrIndexWriter - Adding 250 documents
2019-03-22 16:45:09,934 INFO  solr.SolrIndexWriter - Adding 250 documents
2019-03-22 16:45:10,317 INFO  solr.SolrIndexWriter - Adding 129 documents
2019-03-22 16:45:10,395 INFO  solr.SolrIndexWriter - Adding 129 documents
2019-03-22 16:45:10,466 WARN  mapred.LocalJobRunner - job_local505251134_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=jp.or.nhk.www:http] multiple values encountered for non multiValued field meta_description: [NHK??????????????????????????????????????????????????NHK???????????????????, Japanese public broadcaster's official website with online news, profile, and press releases.]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=jp.or.nhk.www:http] multiple values encountered for non multiValued field meta_description: [NHK??????????????????????????????????????????????????NHK???????????????????, Japanese public broadcaster's official website with online news, profile, and press releases.]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:97)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-03-22 16:45:11,278 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local505251134_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

I tried restarting the database as described in this, but that did not resolve the error.
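
The `RuntimeException: job failed` shown on the console is only a wrapper. The log above contains the underlying Solr error: the document `jp.or.nhk.www:http` carries two values for `meta_description`, which Solr reports as a non-multiValued field. As a minimal sketch for finding such entries in a longer log, the Python snippet below extracts the document ID and field name from messages of this shape. The function name `find_multivalue_errors` and the abbreviated `sample` string are illustrative assumptions, not part of Nutch or Solr.

```python
import re

# Matches Solr's "multiple values encountered" message in the form seen in the log above.
MULTI_VALUE_ERROR = re.compile(
    r"ERROR: \[doc=(?P<doc>[^\]]+)\] multiple values encountered "
    r"for non multiValued field (?P<field>\w+):"
)


def find_multivalue_errors(log_text):
    """Return unique (doc id, field name) pairs in order of first appearance."""
    seen = []
    for match in MULTI_VALUE_ERROR.finditer(log_text):
        pair = (match.group("doc"), match.group("field"))
        if pair not in seen:
            seen.append(pair)
    return seen


# Hypothetical sample: the "Caused by" line from the log, with the field values abbreviated.
sample = (
    "Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: "
    "ERROR: [doc=jp.or.nhk.www:http] multiple values encountered for non multiValued "
    "field meta_description: [..., ...]"
)

print(find_multivalue_errors(sample))
```

A message like this generally means that either the field has to be declared with `multiValued="true"` in the Solr schema, or the indexer has to send only one value for it. Which of the two fits this setup is not settled by the log alone.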

...