How do I get the second element of each list out of ThreadPoolExecutor when submitting requests?
2 votes
/ January 20, 2020

The Python documentation for ThreadPoolExecutor includes this request example:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

Now suppose the list of URLs were set up like this instead:

URLS = [['http://www.foxnews.com/','American'],
        ['http://www.cnn.com/','American'],
        ['http://europe.wsj.com/', 'European'],
        ['http://www.bbc.co.uk/', 'European'],
        ['http://some-made-up-domain.com/','Unknown']]

You can easily extract the URL by indexing the list:

future_to_url = {executor.submit(load_url, url[0], 60): url[0] for url in URLS}

What I'm struggling with is how to extract the region (index 1) from this list so that it is available in the as_completed loop, and the print would look something like:

print('%r %r page is %d bytes' % (region, url, len(data)))

1 Answer

3 votes
/ January 23, 2020

You can convert the URLS list into a dictionary (url_region_mapper) that maps each URL to its region, so that given a URL you can look up which region it belongs to.

import concurrent.futures
import urllib.request

URLS = [['http://www.foxnews.com/','American'],
        ['http://www.cnn.com/','American'],
        ['http://europe.wsj.com/', 'European'],
        ['http://www.bbc.co.uk/', 'European'],
        ['http://some-made-up-domain.com/','Unknown']]

url_region_mapper = dict(URLS)

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url[0], 60): url[0] for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r %r page is %d bytes' % (url_region_mapper[url], url, len(data)))
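
As a quick offline illustration of what dict(URLS) builds, and of why repeated URLs are a problem for this approach, here is a small sketch. The `pairs` list with a repeated URL is a made-up input, not something from the question.

```python
# Each inner [url, region] list becomes one key/value pair.
URLS = [['http://www.cnn.com/', 'American'],
        ['http://www.bbc.co.uk/', 'European']]
url_region_mapper = dict(URLS)
print(url_region_mapper['http://www.bbc.co.uk/'])

# Hypothetical input: the same URL listed under two regions.
pairs = [['http://example.com/', 'American'],
         ['http://example.com/', 'European']]
collapsed = dict(pairs)

# Dictionary keys are unique, so the later pair overwrites the earlier one.
print(len(collapsed))
print(collapsed['http://example.com/'])
```

Because only one region survives per URL, the lookup is safe only when every URL appears once in the list.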

If the same URL can appear more than once and map to different regions, then instead of storing only the URL as the value in future_to_url, you can store both the URL and the region as a list.

future_to_url = {executor.submit(load_url, url[0], 60): [url[0], url[1]] for url in URLS}

import concurrent.futures
import urllib.request

URLS = [['http://www.foxnews.com/','American'],
        ['http://www.cnn.com/','American'],
        ['http://europe.wsj.com/', 'European'],
        ['http://www.bbc.co.uk/', 'European'],
        ['http://some-made-up-domain.com/','Unknown']]

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url[0], 60): [url[0], url[1]] for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future][0]
        region = future_to_url[future][1]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r %r page is %d bytes' % (region, url, len(data)))
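
The same idea can be written a bit more compactly by mapping each future to a (url, region) tuple and unpacking it in the loop. The sketch below is only one possible variant: `fetch_all`, `fake_loader`, and `entries` are hypothetical names introduced here, and `fake_loader` stands in for load_url so the example runs without network access.

```python
import concurrent.futures

def fetch_all(entries, loader, max_workers=5):
    """Run loader(url) for each (url, region) pair.

    Returns a list of (region, url, outcome) tuples, where outcome is
    the payload size in bytes or the exception that was raised.
    """
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future to the whole (url, region) pair, not just the URL.
        future_to_entry = {executor.submit(loader, url): (url, region)
                           for url, region in entries}
        for future in concurrent.futures.as_completed(future_to_entry):
            url, region = future_to_entry[future]
            try:
                data = future.result()
            except Exception as exc:
                results.append((region, url, exc))
            else:
                results.append((region, url, len(data)))
    return results

# Hypothetical stand-in for load_url: no network, fails for one domain.
def fake_loader(url):
    if 'made-up' in url:
        raise ValueError('cannot resolve ' + url)
    return b'x' * len(url)

entries = [('http://www.cnn.com/', 'American'),
           ('http://www.bbc.co.uk/', 'European'),
           ('http://some-made-up-domain.com/', 'Unknown')]

for region, url, outcome in fetch_all(entries, fake_loader):
    if isinstance(outcome, Exception):
        print('%r %r generated an exception: %s' % (region, url, outcome))
    else:
        print('%r %r page is %d bytes' % (region, url, outcome))
```

To use it against real pages, you could pass something like `lambda url: load_url(url, 60)` as the loader. Results arrive in completion order, so the printed lines are not guaranteed to follow the order of `entries`.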
...