How do I use Airflow's DataFlowPythonOperator for a Beam pipeline?
6 votes
/ March 6, 2020

Before switching to DataFlowPythonOperator, I ran the pipeline with Airflow's BashOperator, and it worked fine. My Beam pipeline requires specific arguments; here is the command I used in BashOperator.

For context: this Beam pipeline converts a CSV file to Parquet.

    python /home/airflow/gcs/pyFile.py --runner DataflowRunner --project my-project --jobname my-job --num-workers 3 --temp_location gs://path/Temp/ --staging_location gs://path/Staging/ --input gs://path/*.txt --odir gs://path/output --ofile current

These are the required arguments I have to pass for my Beam pipeline to work correctly.

Now, how do I pass these parameters to DataFlowPythonOperator?

I tried, but I don't understand where exactly each parameter should go. This is roughly what I tried:

    task1 = DataFlowPythonOperator(
        task_id='my_task',
        py_file='/home/airflow/gcs/pyfile.py',
        gcp_conn_id='google_cloud_default',
        options={
            "num-workers": 3,
            "input": 'gs://path/*.txt',
            "odir": 'gs://path/',
            "ofile": 'current',
            "jobname": 'my-job'
        },
        dataflow_default_options={
            "project": 'my-project',
            "staging_location": 'gs://path/Staging/',
            "temp_location": 'gs://path/Temp/',
        },
        dag=dag
    )

With the current script (I'm not sure whether it is in the correct format), this is what I get in the logs:

    [2020-03-06 05:08:48,070] {base_task_runner.py:115} INFO - Job 810: Subtask my_task [2020-03-06 05:08:48,070] {cli.py:545} INFO - Running <TaskInstance: test-df-po.my_task 2020-02-29T00:00:00+00:00 [running]> on host airflow-worker-69b88ff66d-5wwrn
[2020-03-06 05:08:48,245] {taskinstance.py:1059} ERROR - 'int' object has no attribute '__len__'
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 381, in execute
    self.py_file, self.py_options)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 240, in start_python_dataflow
    label_formatter)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 368, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 197, in _start_dataflow
    cmd = command_prefix + self._build_cmd(variables, label_formatter)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 266, in _build_cmd
    elif value is None or value.__len__() < 1:
AttributeError: 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,247] {base_task_runner.py:115} INFO - Job 810: Subtask my_task [2020-03-06 05:08:48,245] {taskinstance.py:1059} ERROR - 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task Traceback (most recent call last):
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     result = task_copy.execute(context=context)
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 381, in execute
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     self.py_file, self.py_options)
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 240, in start_python_dataflow
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     label_formatter)
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 368, in wrapper
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     return func(self, *args, **kwargs)
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 197, in _start_dataflow
[2020-03-06 05:08:48,250] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     cmd = command_prefix + self._build_cmd(variables, label_formatter)
[2020-03-06 05:08:48,250] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 266, in _build_cmd
[2020-03-06 05:08:48,251] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     elif value is None or value.__len__() < 1:
[2020-03-06 05:08:48,251] {taskinstance.py:1082} INFO - Marking task as UP_FOR_RETRY
[2020-03-06 05:08:48,253] {base_task_runner.py:115} INFO - Job 810: Subtask my_task AttributeError: 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,254] {base_task_runner.py:115} INFO - Job 810: Subtask my_task [2020-03-06 05:08:48,251] {taskinstance.py:1082} INFO - Marking task as UP_FOR_RETRY
[2020-03-06 05:08:48,331] {base_task_runner.py:115} INFO - Job 810: Subtask my_task Traceback (most recent call last):
[2020-03-06 05:08:48,332] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/bin/airflow", line 7, in <module>
[2020-03-06 05:08:48,334] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     exec(compile(f.read(), __file__, 'exec'))
[2020-03-06 05:08:48,334] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/bin/airflow", line 37, in <module>
[2020-03-06 05:08:48,334] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     args.func(args)
[2020-03-06 05:08:48,335] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/utils/cli.py", line 74, in wrapper
[2020-03-06 05:08:48,336] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     return f(*args, **kwargs)
[2020-03-06 05:08:48,336] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/bin/cli.py", line 551, in run
[2020-03-06 05:08:48,337] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     _run(args, dag, ti)
[2020-03-06 05:08:48,338] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/bin/cli.py", line 469, in _run
[2020-03-06 05:08:48,338] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     pool=args.pool,
[2020-03-06 05:08:48,339] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/utils/db.py", line 74, in wrapper
[2020-03-06 05:08:48,340] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     return func(*args, **kwargs)
[2020-03-06 05:08:48,341] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
[2020-03-06 05:08:48,342] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     result = task_copy.execute(context=context)
[2020-03-06 05:08:48,342] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 381, in execute
[2020-03-06 05:08:48,343] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     self.py_file, self.py_options)
[2020-03-06 05:08:48,343] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 240, in start_python_dataflow
[2020-03-06 05:08:48,344] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     label_formatter)
[2020-03-06 05:08:48,345] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 368, in wrapper
[2020-03-06 05:08:48,345] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     return func(self, *args, **kwargs)
[2020-03-06 05:08:48,346] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 197, in _start_dataflow
[2020-03-06 05:08:48,347] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     cmd = command_prefix + self._build_cmd(variables, label_formatter)
[2020-03-06 05:08:48,349] {base_task_runner.py:115} INFO - Job 810: Subtask my_task   File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 266, in _build_cmd
[2020-03-06 05:08:48,350] {base_task_runner.py:115} INFO - Job 810: Subtask my_task     elif value is None or value.__len__() < 1:
[2020-03-06 05:08:48,350] {base_task_runner.py:115} INFO - Job 810: Subtask my_task AttributeError: 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,638] {helpers.py:308} INFO - Sending Signals.SIGTERM to GPID 8481
[2020-03-06 05:08:48,697] {helpers.py:286} INFO - Process psutil.Process(pid=8481, status='terminated') (8481) terminated with exit code -15

See the dataflow_operator documentation.

1 Answer

3 votes
/ March 6, 2020

In gcp_dataflow_hook.py, _build_cmd() iterates over options and builds the command-line arguments. The exception is raised at "elif value is None or value.__len__() < 1:" because the value of num-workers, 3, is an integer, and integers have no __len__ method. You just need to pass it as the string '3' instead:

    options={
        "num-workers": '3',
        "input": 'gs://path/*.txt',
        "odir": 'gs://path/',
        "ofile": 'current'
    },
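If the option values come from elsewhere in the DAG as numbers, one way to avoid this error is to convert them to strings before handing the dict to the operator. The following is a minimal sketch, not part of Airflow: stringify_options is a hypothetical helper name, and it assumes that None values (bare flags) and a "labels" entry should be passed through unchanged, since the hook treats those two cases separately.

```python
def stringify_options(options):
    """Return a copy of options with plain values converted to str.

    Hypothetical helper (not an Airflow API). None values and the
    "labels" entry are left untouched, because the hook code quoted in
    this answer handles those cases separately.
    """
    result = {}
    for key, value in options.items():
        if value is None or key == "labels":
            result[key] = value
        else:
            result[key] = str(value)
    return result


# The integer 3 becomes the string '3'; existing strings are unchanged.
options = stringify_options({
    "num-workers": 3,
    "input": "gs://path/*.txt",
    "odir": "gs://path/",
    "ofile": "current",
})
print(options)
```

The resulting dict can then be passed as options=options. Note that str() turns a boolean True into the text 'True', which may not be what the pipeline expects for a flag, so flags are probably better expressed as None.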

DataFlowHook._build_cmd():

    @staticmethod
    def _build_cmd(variables, label_formatter):
        command = ["--runner=DataflowRunner"]
        if variables is not None:
            for attr, value in variables.items():
                if attr == 'labels':
                    command += label_formatter(value)
                elif value is None or value.__len__() < 1:
                    command.append("--" + attr)
                else:
                    command.append("--" + attr + "=" + value)
        return command
...
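The behaviour can be checked without Airflow by copying the function body above into a plain function. This is only a sketch for experimentation: build_cmd mirrors the quoted logic, no_labels is a hypothetical stand-in for the hook's label formatter, and the "verbose" key is an invented flag used to show the None case.

```python
def build_cmd(variables, label_formatter):
    """Standalone copy of the _build_cmd logic quoted above."""
    command = ["--runner=DataflowRunner"]
    if variables is not None:
        for attr, value in variables.items():
            if attr == "labels":
                command += label_formatter(value)
            elif value is None or value.__len__() < 1:
                command.append("--" + attr)
            else:
                command.append("--" + attr + "=" + value)
    return command


def no_labels(labels):
    # Hypothetical stand-in for the hook's label formatter; not called here.
    return []


# An integer value reproduces the AttributeError from the question.
try:
    build_cmd({"num-workers": 3}, no_labels)
    int_error = None
except AttributeError as exc:
    int_error = str(exc)

# String values become --key=value; None becomes a bare --key flag.
ok_cmd = build_cmd(
    {"num-workers": "3", "ofile": "current", "verbose": None},
    no_labels,
)
print(int_error)
print(ok_cmd)
```

Even if the length check did not fail, the final branch concatenates value onto a string, so a non-string value would still raise an error there. Every value in options therefore needs to be a string or None.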