TypeError in a Dask dataframe when converting to pandas with compute()
08 May 2018

I can't figure out what is wrong with the code below:

I am using Dask to merge several dataframes. After the merge, I want to find the unique values in one column. I get a TypeError when converting from Dask to pandas with unique().compute(), but I can't work out what the actual problem is. The message says that a str cannot be interpreted as an int, yet the code runs on some files and fails on others. I also can't find anything wrong with the structure of the data.

Any suggestions?

import pandas as pd
import dask.dataframe as dd

# Everything is fine until merging
# I have put several print(markers) to find the problem code

print('dask cols')
print(df_by_dask_merged.columns)
print()
print(dask_cols)
print()

print('find unique contigs values in dask dataframe')
pd_df = df_by_dask_merged['contig']
print(pd_df)
print()
print('mark 02')

# this is the problem code ??
pd_df_contig = pd_df.unique().compute()  
print(pd_df_contig)

print('mark 03')

Terminal output:

dask cols
Index(['contig', 'pos', 'ref', 'all-alleles', 'ms01e_PI', 'ms01e_PG_al',
       'ms02g_PI', 'ms02g_PG_al', 'all-freq'],
      dtype='object')

['contig', 'pos', 'ref', 'all-alleles', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al', 'all-freq']


find unique contigs values in dask dataframe
Dask Series Structure:
npartitions=1
int64
  ...
Name: contig, dtype: int64
Dask Name: getitem, 52 tasks

mark 02
Traceback (most recent call last):
  File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2145, in get_value
    return tslib.get_value_box(s, key)
  File "pandas/tslib.pyx", line 880, in pandas.tslib.get_value_box (pandas/tslib.c:17368)
  File "pandas/tslib.pyx", line 889, in pandas.tslib.get_value_box (pandas/tslib.c:17042)
TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "merge_haplotype.py", line 305, in <module>
    main()
  File "merge_haplotype.py", line 152, in main
    pd_df_contig = pd_df.unique().compute()
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/base.py", line 155, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/base.py", line 404, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py", line 3404, in apply_and_enforce
    df = func(*args, **kwargs)
  File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/utils.py", line 687, in __call__
    return getattr(obj, self.method)(*args, **kwargs)
  File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 4133, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 4229, in _apply_standard
    results[i] = func(v)
  File "merge_haplotype.py", line 249, in <lambda>
    apply(lambda row : update_cols(row, sample_name), axis=1, meta=(int))
  File "merge_haplotype.py", line 278, in update_cols
    if 'N|N' in df_by_dask[sample_name + '_PG_al']:
  File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2153, in get_value
    raise e1
  File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2139, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3338)
  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3041)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
KeyError: ('ms02g_PG_al', 'occurred at index 0')
...
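
Reading the traceback from the bottom up, the failure is not in unique() itself. It is raised inside update_cols (merge_haplotype.py, line 278), called from the apply(...) lambda on line 249, and it only surfaces at compute() because Dask runs the recorded steps at that point. The final KeyError says the row being processed has no 'ms02g_PG_al' column. One possible cause, which the post does not confirm because the surrounding loop is not shown, is that the lambda picks up sample_name late: if it is created inside a loop over samples and executed afterwards, every deferred call sees the last sample name. The sketch below illustrates that effect with plain Python only. The frames dictionary, the task lists and run_all are hypothetical stand-ins for the Dask pipeline, not the author's code.

```python
# Stand-in for a lazy pipeline: work is recorded now and run later,
# much as Dask defers .apply() until .compute() is called.
# All names below (frames, late, early, run_all) are hypothetical.

def update_cols(row, sample_name):
    # Raises KeyError if the row has no column for this sample.
    return 'N|N' in row[sample_name + '_PG_al']

# Each "frame" holds only the columns of its own sample.
frames = {
    'ms01e': [{'contig': 1, 'ms01e_PG_al': 'A|T'}],
    'ms02g': [{'contig': 1, 'ms02g_PG_al': 'N|N'}],
}

# Late binding: each lambda looks up sample_name when it is CALLED,
# so once the loop has finished they all see the last value, 'ms02g'.
late = []
for sample_name, rows in frames.items():
    late.append(lambda rows=rows: [update_cols(r, sample_name) for r in rows])

# Early binding: a default argument freezes the current sample_name.
early = []
for sample_name, rows in frames.items():
    early.append(
        lambda rows=rows, name=sample_name: [update_cols(r, name) for r in rows]
    )

def run_all(tasks):
    """Execute the deferred tasks, as compute() would."""
    return [task() for task in tasks]

try:
    run_all(late)
    late_error = None
except KeyError as exc:
    # The first task runs on the 'ms01e' rows but asks for 'ms02g_PG_al'.
    late_error = exc.args[0]

early_result = run_all(early)
```

If this is what happens in the real script, binding the name when the lambda is created (for example with a default argument, as in the early list) would be one way to check. If the cause lies elsewhere, printing the columns of the frame that apply(...) is called on should show whether 'ms02g_PG_al' is really missing at that step.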