Apache Nutch does not work for https URLs
June 10, 2019

I have Apache Nutch configured with HBase and Solr, and it works fine for http URLs. I now need to crawl https URLs. After some searching, I found that I need to enable protocol-httpclient, so I updated my nutch-default.xml and nutch-site.xml to include protocol-httpclient. Running bin/nutch fetch -all produces the following output:

FetcherJob: starting at 2019-06-10 09:44:14
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching https://some-url.gov (queue crawl delay=5000ms)
fetching http://some-url.gov (queue crawl delay=5000ms)
-finishing thread FetcherThread2, activeThreads=2
-finishing thread FetcherThread3, activeThreads=2
-finishing thread FetcherThread4, activeThreads=2
-finishing thread FetcherThread5, activeThreads=2
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread8, activeThreads=2
Fetcher: throughput threshold: -1
-finishing thread FetcherThread9, activeThreads=2
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 2 pages, 0 errors, 0.4 0 pages/s, 185 185 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=0
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
-finishing thread FetcherThread9, activeThreads=0
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2019-06-10 09:44:26, time elapsed: 00:00:12
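
For reference, the configuration change described above would look roughly like the fragment below. This is only a sketch: `plugin.includes` is the Nutch property that selects active plugins, but every plugin in the value other than `protocol-httpclient` is an assumption based on a typical default list and may differ from the actual setup (for example, an indexer plugin for Solr may also be present). Since nutch-site.xml overrides nutch-default.xml, setting the property in nutch-site.xml alone is normally sufficient.

```xml
<!-- conf/nutch-site.xml (sketch; plugin list other than protocol-httpclient is assumed) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-httpclient takes the place of protocol-http so that https URLs can be fetched -->
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>
```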

Please advise. Thanks for your help.

...