Google Cloud Search indexer: "Indexer: java.io.IOException: Job failed!"
0 votes / 02 December 2018

I'm a young developer, relatively new to Google Cloud Platform products and to Google Cloud Search in particular. I have been trying to follow the https://developers.google.com/cloud-search/docs/guides/apache-nutch-connector tutorial.

What I did was simply reproduce the tutorial, modifying the nutch-site.xml file as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|more|metadata)|indexer-google-cloud-search|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>



<property>
  <name>gcs.config.file</name>
  <value>/home/joys/Downloads/apache-nutch-1.14/sdk-configuration.properties</value>
  <description>Location of GCS Connector SDK configuration file.</description>
</property>

<property>
  <name>gcs.uploadFormat</name>
  <value>text</value>
  <description>Content upload format for the Cloud Search connector: "raw" or "text".</description>
</property>

<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content while fetching.</description>
</property>

<property>
  <name>http.agent.name</name>
  <value>Joy Spider</value>
  <description>HTTP 'User-Agent' request header sent by the crawler.</description>
</property>



<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>


<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>


<property>
  <name>metatags.names</name>
  <value>metatag.*</value>
  <description>Names of the metatags to extract, separated by commas. Use '*' to extract all metatags.</description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.*</value>
  <description>Comma-separated list of keys from the parse metadata to index.</description>
</property>


<property>
  <name>index.metadata</name>
  <value>metatag.*</value>
  <description>Comma-separated list of document metadata keys to index.</description>
</property>


<property>
  <name>http.robot.rules.whitelist</name>
  <value>*</value>
  <description>Comma-separated list of hostnames or IP addresses for which robots.txt rules are ignored.</description>
</property>

</configuration>

sdk-configuration.properties looks like this:

# Required properties for accessing data source
# (These values are created by the admin before running the connector)
api.sourceId=id

# Path to service account credentials
api.serviceAccountPrivateKeyFile=/path/to/.json

#connector.runOnce=true

defaultAcl.mode=FALLBACK
defaultAcl.public=true

api.rootUrl=https://cloudsearch.googleapis.com

# The schema name is read from the data source and used for repository structured data. The default is an empty string.
structuredData.localSchema=schema.json

#The metadata attribute that contains the value corresponding to the document title. The default value is an empty string.
itemMetadata.title.field=title



#The metadata attribute that contains the value for the document URL for search results. 
itemMetadata.sourceRepositoryUrl.field=url


#The content language for documents being indexed
itemMetadata.contentLanguage.field=languageCode

#The object type used by the site, as defined in the data source schema object definitions. The connector won't index any structured data if this property is not specified.

# Note: This configuration property points to a value rather than a metadata attribute, and the .field and .defaultValue suffixes are not supported.
itemMetadata.objectType=file

#The metadata attribute that contains the value for the last modification timestamp for the document. 
itemMetadata.updateTime.field=updateAt

#The metadata attribute that contains the value for the document creation timestamp. 
itemMetadata.createTime.field=updateAt

contentTemplate.templateName.title=filetitle
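
For context, structuredData.localSchema points at schema.json and itemMetadata.objectType is set to file, so the schema registered for the data source should contain an object definition named file onto which the crawled metadata gets mapped. A minimal schema.json of that shape would look roughly like the sketch below (the property names here are only assumptions for illustration, not my actual schema):

{
  "objectDefinitions": [
    {
      "name": "file",
      "propertyDefinitions": [
        {
          "name": "title",
          "isReturnable": true,
          "textPropertyOptions": {
            "retrievalImportance": { "importance": "HIGHEST" }
          }
        },
        {
          "name": "languageCode",
          "isReturnable": true,
          "textPropertyOptions": {
            "retrievalImportance": { "importance": "DEFAULT" }
          }
        }
      ]
    }
  ]
}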

Also, I did not add the -addBinaryContent -base64 options in crawl.sh. The tutorial says those options should be passed only when the gcs.uploadFormat parameter is missing or set to "raw", and I set it to "text".
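
For comparison, if gcs.uploadFormat were missing or set to "raw", the nutch index invocation inside crawl.sh would need those flags appended, roughly like this (a sketch mirroring the index command shown further below, with <segment> standing in for the segment directory; this is not my actual script):

apache-nutch-1.14/bin/nutch index crawl-test/crawldb -linkdb crawl-test/linkdb crawl-test/segments/<segment> -addBinaryContent -base64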

Everything goes fine until GCS starts indexing, at which point I always get this error:

2018-12-01 21:39:53,368 INFO  gcs.GoogleCloudSearchIndexWriter - Starting up!
2018-12-01 21:40:01,002 WARN  mapred.LocalJobRunner - job_local1304604211_0001
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.NullPointerException
    at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.getValueExtractor(StructuredData.java:375)
    at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.lambda$new$3(StructuredData.java:294)
    at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
    at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.<init>(StructuredData.java:294)
    at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.lambda$init$1(StructuredData.java:234)
    at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
    at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.init(StructuredData.java:231)
    at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.initFromConfiguration(StructuredData.java:199)
    at org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter.open(GoogleCloudSearchIndexWriter.java:104)
    at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:77)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-12-01 21:40:01,414 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

The error is thrown at this line of crawl.sh:

apache-nutch-1.14/bin/nutch index crawl-test//crawldb -linkdb crawl-test//linkdb crawl-test//segments/20181201213917
Failed with exit value 255.

that is, at the index command. I'm running out of ideas and no longer have any clue how to fix this.

Searching around the web, I found that I should locate the mapred-site.xml file in the hadoop folder and put the following into it:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

and put this into hadoop's yarn-site.xml:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

but that didn't work for me. Any ideas?
