Nutch 1.16 skips file:/ directory-style links when crawling the file system
0 votes
/ March 24, 2020

I am trying to run Nutch as a crawler over some local directories, using examples taken both from the main tutorial (https://cwiki.apache.org/confluence/display/nutch/FAQ#FAQ-HowdoIindexmylocalfilesystem?) and from other sources. Nutch crawls the web without any problems, but for some reason it refuses to crawl local directories.

My configuration files look as follows:

regex-urlfilter:

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# This change is not necessary but may make your life easier.  
# Any file types you do not want to index need to be added to the list otherwise 
# Nutch will often try to parse them and fail in doing so as it doesnt know 
# how to deal with a lot of binary file types.:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$

# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
#   can be leaked by placing links pointing to web interfaces of services
#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
#   or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
#     http://localhost:8080
#     http://127.0.0.1/ .. http://127.255.255.255/
#     http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
#     10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
#     192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
#     172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)

# accept anything else
+.

nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>http.agent.name</name>
 <value>NutchSpiderTest</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchSpiderTest,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>I am just testing nutch, please tell me if it's bothering your website</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
 <name>file.content.limit</name>
 <value>-1</value>
 <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>The crawler is not restricted to the directories that you specified in the
    Urls file but it is jumping into the parent directories as well. For your own crawlings you can
    change this behavior (set to false) the way that only directories beneath the directories that you specify get
    crawled.</description>
</property>

</configuration>

And finally, I commented out this part of regex-normalize.xml:

<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->
<!-- we do not need this with files
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
-->

I run Nutch on Cygwin under Windows 10, from a distribution built with ant, in the runtime/local directory, with the command:

bin/crawl -s dirs dircrawl 2 >& dircrawl.log

Here dirs is a folder containing the following seed.txt file (I tried including different variants of the links, since it is not clear which one should work, though that may just be down to me not finding an unambiguous answer):

/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/

dircrawl is the directory where I want the crawl to be saved, and the number of rounds / max depth is set to "2". After a few seconds the Nutch crawl produces the following hadoop.log file:

2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: starting at 2020-03-24 14:08:58
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: crawlDb: dircrawl/crawldb
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: urlDir: dirs
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2020-03-24 14:08:58,948 INFO  crawl.Injector - Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
2020-03-24 14:08:59,011 WARN  impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:08:59,888 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:08:59,890 INFO  mapreduce.Job - Running job: job_local1269520609_0001
2020-03-24 14:09:00,897 WARN  crawl.Injector - Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
2020-03-24 14:09:00,902 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2020-03-24 14:09:00,906 INFO  mapreduce.Job - Job job_local1269520609_0001 running in uber mode : false
2020-03-24 14:09:00,908 INFO  mapreduce.Job -  map 0% reduce 0%
2020-03-24 14:09:01,158 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:01,447 WARN  zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:01,461 INFO  crawl.Injector - Injector: overwrite: false
2020-03-24 14:09:01,461 INFO  crawl.Injector - Injector: update: false
2020-03-24 14:09:01,924 INFO  mapreduce.Job -  map 100% reduce 100%
2020-03-24 14:09:01,926 INFO  mapreduce.Job - Job job_local1269520609_0001 completed successfully
2020-03-24 14:09:01,951 INFO  mapreduce.Job - Counters: 31
    File System Counters
        FILE: Number of bytes read=1857050
        FILE: Number of bytes written=3067581
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=5
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=6
        Input split bytes=289
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=6
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=13
        Total committed heap usage (bytes)=402653184
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    injector
        urls_filtered=5
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=239
2020-03-24 14:09:02,022 INFO  crawl.Injector - Injector: Total urls rejected by filters: 5
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total new urls injected: 0
2020-03-24 14:09:02,054 INFO  crawl.Injector - Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: starting at 2020-03-24 14:09:08
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: filtering: false
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: normalizing: true
2020-03-24 14:09:08,715 INFO  crawl.Generator - Generator: topN: 50000
2020-03-24 14:09:08,879 WARN  impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:09:10,418 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:09:10,424 INFO  mapreduce.Job - Running job: job_local828841059_0001
2020-03-24 14:09:11,450 INFO  mapreduce.Job - Job job_local828841059_0001 running in uber mode : false
2020-03-24 14:09:11,453 INFO  mapreduce.Job -  map 0% reduce 0%
2020-03-24 14:09:11,784 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2020-03-24 14:09:11,784 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2020-03-24 14:09:11,784 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2020-03-24 14:09:11,816 WARN  zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:12,073 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:12,475 INFO  mapreduce.Job -  map 100% reduce 100%
2020-03-24 14:09:12,505 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:13,485 INFO  mapreduce.Job - Job job_local828841059_0001 completed successfully
2020-03-24 14:09:13,502 INFO  mapreduce.Job - Counters: 30
    File System Counters
        FILE: Number of bytes read=2784859
        FILE: Number of bytes written=4605489
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=0
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=28
        Input split bytes=156
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=28
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=15
        Total committed heap usage (bytes)=603979776
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=98
    File Output Format Counters 
        Bytes Written=16
2020-03-24 14:09:13,502 INFO  crawl.Generator - Generator: number of items rejected during selection:
2020-03-24 14:09:13,521 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...

Meanwhile, dircrawl.log contains:

Injecting seed URLs
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch inject dircrawl/crawldb dirs
Injector: starting at 2020-03-24 14:08:58
Injector: crawlDb: dircrawl/crawldb
Injector: urlDir: dirs
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 5
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
24 Mar 2020 14:09:02 : Iteration 1 of 2
Generating a new segment
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true dircrawl/crawldb dircrawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2020-03-24 14:09:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: number of items rejected during selection:
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

So now I am stuck. I have tried reverting some of my changes, but no matter what I do, I cannot get the configuration to work with local directories. Does anyone know what I am doing wrong?

1 Answer

0 votes
/ March 25, 2020

The problems with crawling file: URLs, and why the number of slashes matters, are described in NUTCH-1483:

  • these seed URLs should work:
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
  • this one does not, because cygdrive is taken as the host name:
file://cygdrive/c/Users/abc/Desktop/anotherdirectory/
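The difference between these forms comes down to how the URL is split into host and path. Below is a minimal sketch, assuming Nutch relies on java.net.URL for parsing (the MalformedURLException in the question's log suggests so). The class name FileUrlHostDemo and the helper describe are made up for illustration and are not part of Nutch:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class (not part of Nutch): shows how java.net.URL
// splits each seed form from the question into host and path.
class FileUrlHostDemo {

    /** Returns "host=<host> path=<path>" for a parseable URL, or the parser's error message. */
    static String describe(String seed) {
        try {
            URL url = new URL(seed);
            return "host=" + url.getHost() + " path=" + url.getPath();
        } catch (MalformedURLException e) {
            // A bare path such as /cygdrive/c/... has no scheme and ends up here.
            return "malformed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] seeds = {
            "/cygdrive/c/Users/abc/Desktop/adirectory/",
            "file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file://cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/"
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> " + describe(seed));
        }
    }
}
```

With two slashes, the first path segment (cygdrive) is read as the host and the path starts at /c/, whereas the one-slash and three-slash forms leave the host empty and keep the full path.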

I can confirm that crawling file systems works with Nutch 1.16 on Linux (no Windows at hand). Notes:

  • urlfilter-validator is intended for Internet URLs only, because the host name must contain a dot
  • the urlnormalizer-regex configuration file contains a special rule that fixes the number of slashes after file:
  • there is also a "normalizerchecker" tool
  • you can also try "parsechecker" to quickly check which form of file: URL definitely works with your configuration:

$> bin/nutch parsechecker file://var/www/html/
fetching: file://var/www/html/
Fetch failed with protocol status: notfound(14), lastModified=0

$> bin/nutch parsechecker file:///var/www/html/
fetching: file:///var/www/html/
parsing: file:///var/www/html/
...
Status: success(1,0)
Title: Index of /mnt/data/var_www_html
Outlinks: 2
  outlink: toUrl: file:/mnt/data/ anchor: ../
  outlink: toUrl: file:/mnt/data/var_www_html/index.html anchor: index.html
...
  • finally, you should also check all Nutch properties prefixed with "file".
...
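
To narrow down which filter rejects the seeds, it can help to replay the active rules of the question's regex-urlfilter file by hand. The sketch below is a rough approximation, not Nutch code: it assumes first-match-wins semantics (as the comments at the top of that file describe) and assumes a rule applies when the pattern is found anywhere in the URL. The class name FirstMatchFilterSketch is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the regex URL filter: first matching rule decides,
// '+' accepts, '-' rejects, and no match at all means the URL is ignored.
class FirstMatchFilterSketch {

    // Active (uncommented) rules copied from the regex-urlfilter file in the question.
    static final String[] RULES = {
        "-^(http|ftp|mailto):",
        "-(?i)\\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$",
        "-[?*!@=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "+."
    };

    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            // Assumption: the pattern may match anywhere in the URL (find, not matches).
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        String[] urls = {
            "file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "http://example.com/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (accepts(url) ? "accepted" : "rejected"));
        }
    }
}
```

Under these assumptions the file: seeds pass the regex rules, so the rejections counted by the Injector would have to come from another plugin in plugin.includes. That is consistent with the note on urlfilter-validator above, though the sketch alone does not prove it.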