I've just installed Nutch 1.16 on Fedora 30. I went through the bootstrap steps: injecting the initial seed list (inject), generating a fetch list, fetching, parsing, updating the crawldb, and inverting links.
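Roughly, the sequence looked like this (the urls seed directory and exact paths follow the standard tutorial layout, not necessarily my exact invocations):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/2* | tail -1)   # most recent segment
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Before indexing, I updated index-writers.xml: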
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<writers xmlns="http://lucene.apache.org/nutch"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">
  <writer id="indexer_csv_1" class="org.apache.nutch.indexwriter.csv.CSVIndexWriter">
    <parameters>
      <param name="fields" value="id,title,content"/>
      <param name="charset" value="UTF-8"/>
      <param name="separator" value=","/>
      <param name="valuesep" value="|"/>
      <param name="quotechar" value="&quot;"/>
      <param name="escapechar" value="&quot;"/>
      <param name="maxfieldlength" value="4096"/>
      <param name="maxfieldvalues" value="12"/>
      <param name="header" value="true"/>
      <param name="outpath" value="csvindexwriter"/>
    </parameters>
    <mapping>
      <copy />
      <rename />
      <remove />
    </mapping>
  </writer>
</writers>
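For reference, my expectation from this config: with header=true the writer should emit the field names first, and quotechar should only appear around values that need it. A sketch of what I expected under the csvindexwriter output path (the data row is made up):

id,title,content
http://example.org/,Example Domain,"Body text, with commas, so the field is quoted"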
Then I run:
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/2020* -filter -normalize -deleteGone
Below is the error I run into, and I'm not sure why:
2020-01-31 12:03:09,385 INFO crawl.LinkDb - LinkDb: finished at 2020-01-31 12:03:09, elapsed: 00:00:04
2020-01-31 12:04:24,945 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-01-31 12:04:25,260 INFO segment.SegmentChecker - Segment dir is complete: crawl/segments/20200127084916.
2020-01-31 12:04:25,264 INFO segment.SegmentChecker - Segment dir is complete: crawl/segments/20200127093759.
2020-01-31 12:04:25,268 INFO segment.SegmentChecker - Segment dir is complete: crawl/segments/20200130115418.
2020-01-31 12:04:25,271 INFO segment.SegmentChecker - Segment dir is complete: crawl/segments/20200131101723.
2020-01-31 12:04:25,273 INFO indexer.IndexingJob - Indexer: starting at 2020-01-31 12:04:25
2020-01-31 12:04:25,282 INFO indexer.IndexingJob - Indexer: deleting gone documents: true
2020-01-31 12:04:25,282 INFO indexer.IndexingJob - Indexer: URL filtering: true
2020-01-31 12:04:25,283 INFO indexer.IndexingJob - Indexer: URL normalizing: true
2020-01-31 12:04:25,283 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2020-01-31 12:04:25,283 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2020-01-31 12:04:25,284 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200127084916
2020-01-31 12:04:25,286 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200127093759
2020-01-31 12:04:25,288 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200130115418
2020-01-31 12:04:25,290 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200131101723
2020-01-31 12:04:26,115 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2020-01-31 12:04:26,116 INFO mapreduce.Job - Running job: job_local1773068951_0001
2020-01-31 12:04:27,120 INFO mapreduce.Job - Job job_local1773068951_0001 running in uber mode : false
2020-01-31 12:04:27,122 INFO mapreduce.Job - map 0% reduce 0%
2020-01-31 12:04:34,127 INFO mapreduce.Job - map 100% reduce 0%
2020-01-31 12:04:45,868 INFO indexer.IndexWriters - Index writer org.apache.nutch.indexwriter.solr.SolrIndexWriter identified.
2020-01-31 12:04:45,965 WARN exchange.Exchanges - No exchange was configured. The documents will be routed to all index writers.
2020-01-31 12:04:46,272 INFO indexer.IndexerOutputFormat - Active IndexWriters :
SolrIndexWriter:
┌────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────┐
│type │Specifies the SolrClient implementation to use. This is a string value of one of the following "cloud" or│http │
│ │"http". The values represent CloudSolrServer or HttpSolrServer respectively. │ │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│url │Defines the fully qualified URL of Solr into which data should be indexed. Multiple URL can be provided│http://localhost:8983/solr/nutch│
│ │using comma as a delimiter. When the value of type property is cloud, the URL should not include any│ │
│ │collections or cores; just the root Solr path. │ │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│collection │The collection used in requests. Only used when the value of type property is cloud. │ │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│commitSize │Defines the number of documents to send to Solr in a single update batch. Decrease when handling very│100 │
│ │large documents to prevent Nutch from running out of memory. Note: It does not explicitly trigger a server│ │
│ │side commit. │ │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│weight.field│Field's name where the weight of the documents will be written. If it is empty no field will be used. │ │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│auth │Whether to enable HTTP basic authentication for communicating with Solr. Use the username and password│false │
│ │properties to configure your credentials. │ │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│username │The username of Solr server. │username │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│password │The password of Solr server. │password │
└────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────┘
2020-01-31 12:04:46,448 INFO solr.SolrIndexWriter - Indexing 72/72 documents
2020-01-31 12:04:46,449 INFO solr.SolrIndexWriter - Deleting 0 documents
2020-01-31 12:04:46,490 INFO solr.SolrIndexWriter - Indexing 72/72 documents
2020-01-31 12:04:46,490 INFO solr.SolrIndexWriter - Deleting 0 documents
2020-01-31 12:04:46,528 WARN mapred.LocalJobRunner - job_local1773068951_0001
java.lang.Exception: java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/nutch
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:558)
Caused by: java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/nutch
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:282)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:250)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:214)
at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:264)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:346)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/nutch
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:650)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:247)
... 12 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542)
... 16 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:606)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
... 26 more
2020-01-31 12:04:47,133 INFO mapreduce.Job - Job job_local1773068951_0001 failed with state FAILED due to: NA
2020-01-31 12:04:47,167 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=2027841168
FILE: Number of bytes written=3564196112
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=711822
Map output records=711822
Map output bytes=224057287
Map output materialized bytes=225563661
Input split bytes=3175
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=225563661
Reduce input records=0
Reduce output records=0
Spilled Records=711822
Shuffled Maps =19
Failed Shuffles=0
Merged Map outputs=19
GC time elapsed (ms)=667
Total committed heap usage (bytes)=16629366784
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=124418962
File Output Format Counters
Bytes Written=0
2020-01-31 12:04:47,167 ERROR indexer.IndexingJob - Indexing job did not succeed, job status:FAILED, reason: NA
2020-01-31 12:04:47,168 ERROR indexer.IndexingJob - Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:231)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:240)
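For what it's worth, the refused connection is reproducible outside of Nutch, so nothing appears to be listening on the Solr port from the log (which I'd expect, since I only configured the CSV writer and never set up Solr):

curl http://localhost:8983/solr/nutch
# fails with "Connection refused", matching the stack trace above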
Any ideas why the CSV index writer isn't working? What strikes me in the log is that SolrIndexWriter is identified rather than CSVIndexWriter, even though my index-writers.xml only defines the CSV writer, so maybe Nutch isn't reading the copy of the file I edited?
Regards,