Trying to get started with the Jest REST API for Elasticsearch 5.x
August 27, 2018

I'm trying to configure Nutch 1.14 to use the indexer-elastic-rest plugin with ES 5.3 (or whichever version I can get it working with). I'm completely new to this; I'm very used to the old bin/nutch crawl command, which handled everything in one step (so nice :)).

These are my nutch-site.xml settings:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>file.content.limit</name>
    <value>65536</value><!--65536 is default, 90000 or more for prod-->
    <description>The length limit for downloaded content using the file://
    protocol, in bytes. If this value is nonnegative (>=0), content longer
    than it will be truncated; otherwise, no truncation at all. Do not
    confuse this setting with the http.content.limit setting.
    </description>
  </property>

  <!-- HTTP properties -->
  <property>
    <name>http.agent.name</name>
    <value>Nutch Spider</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    </description>
  </property>

  <!--web db properties-->
  <property>
    <name>http.agent.version</name>
    <value>0.0.0</value>
    <description>A version string to advertise in the User-Agent
     header.</description>
  </property>

  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value><!--true-->
    <description>If true, when adding new links to a page, links from
    the same host are ignored.  This is an effective way to limit the
    size of the link database, keeping only the highest quality
    links.
    </description>
  </property>

  <property>
    <name>db.max.inlinks</name>
    <value>10000</value><!--10000-->
    <description>Maximum number of Inlinks per URL to be kept in LinkDb.
    If "invertlinks" finds more inlinks than this number, only the first
    N inlinks will be stored, and the rest will be discarded.
    </description>
  </property>


  <!-- Applicable plugins-->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-elastic|indexer-elastic-rest|urlnormalizer-(pass|regex|basic)</value>
    <description> At the very least, I needed to add the parse-html, urlfilter-regex, and the indexer-elastic.
    </description>
  </property>

  <!--https://wiki.apache.org/nutch/IndexMetatags - next 3 properties are taken from above wiki url-->
  <!-- Used only if plugin parse-metatags is enabled. -->
  <property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description> Names of the metatags to extract, separated by ','.
    Use '*' to extract all metatags. Prefixes the names with 'metatag.'
    in the parse-metadata. For instance to index description and keywords,
    you need to activate the plugin index-metadata and set the value of the
    parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
  </property>

  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords</value>
    <description>
    Comma-separated list of keys to be taken from the parse metadata to generate fields.
    Can be used e.g. for 'description' or 'keywords' provided that these values are generated
    by a parser (see parse-metatags plugin)
    </description>
  </property>

  <property>
    <name>index.metadata</name>
    <value>description,keywords</value>
    <description>
    Comma-separated list of keys to be taken from the metadata to generate fields.
    Can be used e.g. for 'description' or 'keywords' provided that these values are generated
    by a parser (see parse-metatags plugin), and property 'metatags.names'.
    </description>
  </property>

And these are my Elasticsearch REST properties (they replaced the regular Elasticsearch properties):

<!--elasticsearch rest properties-->
<property>
    <name>elastic.rest.host</name>
    <value>localhost</value>
    <description>The hostname to send documents to using Elasticsearch Jest. Both host
        and port must be defined</description>
</property>

<property>
    <name>elastic.rest.port</name>
    <value>9200</value>
    <description>The port to connect to using Elasticsearch Jest.</description>
</property>

<property>
    <name>elastic.rest.index</name>
    <value>search-index</value>
    <description>Default index to send documents to.</description>
</property>

<property>
    <name>elastic.rest.index.languages</name>
    <value></value>
    <description>
        A list of strings denoting the supported languages (e.g. `en,de,fr,it`).
        If this value is empty all documents will be sent to index ${elastic.rest.index}.
        If not empty the Rest client will distribute documents in different indices based on their `lang` property.
        Indices are named with the following schema: ${elastic.rest.index}${elastic.rest.index.separator}${lang} (e.g. `nutch_de`).
        Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink} (e.g. `nutch_others`).
    </description>
</property>

<property>
    <name>elastic.rest.index.separator</name>
    <value>_</value>
    <description>
        Default value is `_`. Is used only if `elastic.rest.index.languages` is defined to build the index name (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${lang}). 
    </description>
</property>

<property>
    <name>elastic.rest.index.sink</name>
    <value>others</value>
    <description>
        Default value is `others`. Is used only if `elastic.rest.index.languages` is defined to build the index name where to store documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink}).
    </description>
</property>

<property>
    <name>elastic.rest.type</name>
    <value>doc</value>
    <description>Default type to send documents to.</description>
</property>

<property>
    <name>elastic.rest.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
</property>

<property>
    <name>elastic.rest.max.bulk.size</name>
    <value>26214400</value>
    <description>Maximum size of the bulk in bytes.</description>
</property>

<property>
    <name>elastic.rest.https</name>
    <value>false</value>
    <description>
        "true" to enable https, "false" to disable https
        If you've disabled http access (by forcing https), be sure to
        set this to true, otherwise you might get "connection reset by peer".
    </description>
</property>

<property>
    <name>elastic.rest.user</name>
    <value></value>
    <description>Username for auth credentials (only used when https is enabled)</description>
</property>

<property>
    <name>elastic.rest.password</name>
    <value></value>
    <description>Password for auth credentials (only used when https is enabled)</description>
</property>

<property>
    <name>elastic.rest.trustallhostnames</name>
    <value>false</value>
    <description>
        "true" to trust elasticsearch server's certificate even if its listed domain name does not
        match the domain they are hosted on
        "false" to check if the elasticsearch server's certificate's listed domain is the same domain
        that it is hosted on, and if it doesn't, then fail to index
        (only used when https is enabled)
    </description>
</property>
</configuration>
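To check my own reading of the `elastic.rest.index.languages`, `elastic.rest.index.separator` and `elastic.rest.index.sink` descriptions above, here is a minimal sketch of the index naming they describe. It is based only on the description text, not on the plugin's source; the function names `parse_languages` and `resolve_index` are hypothetical, and the defaults are simply the values from my config.

```python
def parse_languages(value):
    """Split a comma-separated list such as 'en,de,fr,it' into a list of codes."""
    return [code.strip() for code in value.split(",") if code.strip()]


def resolve_index(lang, base="search-index", languages=(), separator="_", sink="others"):
    """Return the index name a document with the given `lang` would go to,
    following the property descriptions (an assumption, not the plugin code)."""
    # Empty language list: every document goes to the base index.
    if not languages:
        return base
    # Supported language: <index><separator><lang>.
    if lang in languages:
        return f"{base}{separator}{lang}"
    # Unsupported language: <index><separator><sink>.
    return f"{base}{separator}{sink}"
```

With my settings (`elastic.rest.index.languages` left empty), this reading means every document should end up in `search-index`, so the separator and sink values should not come into play.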

With these settings I get this error when Nutch tries to send the crawl data to ES for indexing:

ElasticIndexWriter
    elastic.cluster : elastic prefix cluster
    elastic.host : hostname
    elastic.port : port
    elastic.index : elastic index command 
    elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
    elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
    elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
    elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
    elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)

ElasticRestIndexWriter
    elastic.rest.host : hostname
    elastic.rest.port : port
    elastic.rest.index : elastic index command 
    elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250) 
    elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

I replaced the regular Elasticsearch properties that I normally use with the TransportClient.

I tried running bin/nutch startserver, and my terminal gives no output. I'm not quite sure how to get the Nutch server to start using the REST API with 1.x.

I'm running it in "local mode" (on a laptop).

Should I use Postman instead of the terminal to work with the REST API? It looks like Nutch 1.x ships with version 2 of the Jest REST API. I think I successfully upgraded it and its dependencies; I can see the new jar files in /runtime/local/plugins/indexer-elastic-rest
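As far as I can tell, a GUI client like Postman should not be required just to confirm that Elasticsearch answers over REST, since any HTTP client can request the root endpoint. A minimal sketch, assuming the host, port and https values from the `elastic.rest.*` properties above and no authentication; the function names `rest_base_url` and `elasticsearch_version` are hypothetical:

```python
import json
from urllib.request import urlopen


def rest_base_url(host="localhost", port=9200, https=False):
    """Build the base URL from the elastic.rest.host/port/https values."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}:{port}"


def elasticsearch_version(base_url, timeout=5):
    """GET the root endpoint and return the version number the server reports.

    Raises urllib.error.URLError if nothing is listening at base_url.
    """
    with urlopen(base_url + "/", timeout=timeout) as response:
        info = json.load(response)
    return info["version"]["number"]
```

Calling `elasticsearch_version(rest_base_url())` from a Python shell on the laptop would at least show whether something is listening on localhost:9200 and which ES version it reports. It says nothing about the Nutch server started by bin/nutch startserver, which is a separate service.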

...