Nutch 1.16 parsechecker problem with file:/ directory input
0 votes
March 31, 2020

Following up on Nutch 1.16 skipping file:/ links to directories when crawling the filesystem, I have been trying (and failing) to get Nutch to crawl various directories and subdirectories on a Windows 10 installation, invoking the commands from Cygwin. The dirs/seed.txt file used to start the crawl contains the following:

file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/

Running cat ./dirs/seed.txt | ./bin/nutch normalizerchecker -stdin to check how Nutch normalizes the URLs (with the default regex-normalize.xml) gives

 file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
 file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
 file:/localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/

Running cat ./dirs/seed.txt | ./bin/nutch filterchecker -stdin returns:

+file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/

That is, all three directories are considered valid. So far so good, but then running the following:

cat ./dirs/seed.txt | ./bin/nutch parsechecker -stdin

produces the same error for all three directories, namely:

Fetch failed with protocol status: notfound(14), lastModified=0

The log files do not tell me much about what went wrong either, only that Nutch will not read the input no matter what: for each entry the logs contain nothing more than a "fetching directory X" message ...

So what exactly is going on here? For completeness, I am also including the nutch-site.xml, regex-urlfilter.txt, and regex-normalize.xml files.

nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>http.agent.name</name>
 <value>NutchSpiderTest</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchSpiderTest,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>I am just testing nutch, please tell me if it's bothering your website</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
 <name>file.content.limit</name>
 <value>-1</value>
 <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>The crawler is not restricted to the directories that you specified in the
    Urls file but it is jumping into the parent directories as well. For your own crawlings you can
    change this behavior (set to false) the way that only directories beneath the directories that you specify get
    crawled.</description>
</property>


<property>
    <name>parser.skip.truncated</name>
    <value>false</value>
    <description>Boolean value for whether we should skip parsing for truncated documents. By default this
        property is activated due to extremely high levels of CPU which parsing can sometimes take.
    </description>
</property>
<!-- the following is just an attempt at using a solution I found elsewhere, didn't work -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>

</configuration>

regex-urlfilter.txt:

# The default url filter.
# Better for whole-internet crawling.
# Please comment/uncomment rules to your needs.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http: ftp: mailto: and https: urls
-^(http|ftp|mailto|https):

# This change is not necessary but may make your life easier.  
# Any file types you do not want to index need to be added to the list otherwise 
# Nutch will often try to parse them and fail in doing so as it doesnt know 
# how to deal with a lot of binary file types.:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$

# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
+.*(/[^/]+)/[^/]+\1/[^/]+\1/

# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
#   can be leaked by placing links pointing to web interfaces of services
#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
#   or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
#     http://localhost:8080
#     http://127.0.0.1/ .. http://127.255.255.255/
#     http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
#     10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
#     192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
#     172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)

# accept anything else
+.

regex-normalize.xml:

<?xml version="1.0"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
     This is intended so that users can specify substitutions to be
     done on URLs using the Java regex syntax, see
     https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
     The rules are applied to URLs in the order they occur in this file.  -->

<!-- WATCH OUT: an xml parser reads this file an ampersands must be
     expanded to &amp; -->

<!-- The following rules show how to strip out session IDs, default pages, 
     interpage anchors, etc. Order does matter!  -->
<regex-normalize>

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex> -->

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>

<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>

<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>

<!-- normalize file:/// protocol prefix: -->
<!--  keep one single slash (NUTCH-1483) -->
<regex>
  <pattern>^file://+</pattern>
  <substitution>file:/</substitution>
</regex>

<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->

<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

</regex-normalize>

Any idea what I am doing wrong here?

1 Answer

0 votes
April 2, 2020

Nutch's file: protocol implementation "fetches" local files by creating a File object from the path component of the URL: /cygdrive/c/Users/abc/Desktop/anotherdirectory/. As pointed out in the discussion "Is there a java sdk for cygwin?", Java does not translate Cygwin paths, but replacing cygdrive/c/ with c:/ should work.
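As a rough illustration of why the fetch fails, consider what a Windows JVM does with the two path forms. This is only a minimal sketch: the directory name is taken from the seed file above and is assumed to exist on the machine in question.

import java.io.File;

public class CygwinPathCheck {
    public static void main(String[] args) {
        // Path component as it appears in the seed URL:
        File cygwinStyle = new File("/cygdrive/c/Users/abc/Desktop/anotherdirectory/");
        // The same directory addressed via its Windows drive letter:
        File windowsStyle = new File("C:/Users/abc/Desktop/anotherdirectory/");

        // A Windows JVM knows nothing about Cygwin's /cygdrive mount, so the
        // first form does not resolve; a fetcher checking it would report
        // the directory as not found.
        System.out.println(cygwinStyle.exists());   // expected: false
        System.out.println(windowsStyle.exists());  // expected: true
    }
}

Applied to the seed file from the question, that presumably means rewriting the entries to the drive-letter form, e.g. file:/c:/Users/abc/Desktop/anotherdirectory/.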

...