Quantile binning with labels in a Dask DataFrame
April 5, 2019

I am using Dask to manipulate datasets. I want to bin a column that contains non-unique values by quantile and then label each bin.

In pandas this is fairly simple:

tags = range(4, 0, -1)
groups = pd.qcut(df.column_name.rank(method='first'), q=4, labels=tags)
df['ranks'] = groups.values
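
To make the pandas behaviour concrete, here is a minimal self-contained sketch of the same three lines on a toy frame. The sample values are made up for illustration; only `column_name`, `tags`, and `ranks` come from the snippet above. Ranking with `method='first'` breaks ties by order of appearance, so the ranks are unique and `qcut` can split them into four equal-sized bins even though the raw values repeat.

```python
import pandas as pd

# Hypothetical sample data with repeated values (not from the original question).
df = pd.DataFrame({"column_name": [5, 5, 5, 1, 9, 9, 2, 7]})

# Labels run from 4 (lowest quartile) down to 1 (highest quartile).
tags = list(range(4, 0, -1))

# rank(method='first') assigns unique ranks 1..n, breaking ties by position,
# so the quantile edges computed on the ranks never collide.
ranks = df["column_name"].rank(method="first")
groups = pd.qcut(ranks, q=4, labels=tags)
df["ranks"] = groups.values

print(df)
```

With eight rows, each of the four labels ends up on exactly two rows.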

But I have no idea how to do this in Dask, or whether to use map or map_partitions.

I have read the documentation over and over to no avail. I also found a similar question, but its answer lacks the explanation I need.

My initial code:

data['tot_top_amt'].map_partitions(pd.qcut,4, duplicates='drop')

But I get an error:

Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/utils.py", line 162, in raise_on_meta_error
    yield
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/core.py", line 3740, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/reshape/tile.py", line 306, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/reshape/tile.py", line 350, in _bins_to_cuts
    dtype=dtype)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/reshape/tile.py", line 457, in _format_labels
    v = adjust(labels[0].left)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/indexes/interval.py", line 1303, in __getitem__
    mask = self._isnan[value]
IndexError: index 0 is out of bounds for axis 0 with size 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dask_eda.py", line 7, in <module>
    a = data['tot_top_amt'].map_partitions(pd.qcut,4, duplicates='drop')
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/core.py", line 568, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/core.py", line 3779, in map_partitions
    meta = _emulate(func, *args, udf=True, **kwargs2)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/core.py", line 3740, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/utils.py", line 179, in raise_on_meta_error
    raise ValueError(msg)
ValueError: Metadata inference failed in `qcut`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
IndexError('index 0 is out of bounds for axis 0 with size 0',)

Traceback:
---------
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/utils.py", line 162, in raise_on_meta_error
    yield
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/dataframe/core.py", line 3740, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/reshape/tile.py", line 306, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/reshape/tile.py", line 350, in _bins_to_cuts
    dtype=dtype)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/reshape/tile.py", line 457, in _format_labels
    v = adjust(labels[0].left)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/pandas/core/indexes/interval.py", line 1303, in __getitem__
    mask = self._isnan[value]
...