Question

I have a DataFrame in the following format.
df

A   B  Target
5   4   3
1   3   4

I compute the correlation of each column (except the target) with the target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
The problem is that my actual DataFrame has shape (216, 72391), which takes at least 30 minutes to process on my system. Is there a way to parallelize this using a GPU? I need to compute values of this kind multiple times, so I can't wait the usual 30 minutes each time.

Mohammed Kashif · Answer 1 · 08 March 2019

Here I have tried to implement your operation using numba:

import numpy as np
import pandas as pd
from numba import jit, prange

#
#------------You can ignore the code starting from here---------
#
# Create a random DF with cols_size = 72391 and row_size = 300,
# plus a 'target' column
df_dict = {}
for i in range(0, 72391):
  df_dict[i] = np.random.randint(100, size=300)
target_array = np.random.randint(100, size=300)

df = pd.DataFrame(df_dict)
df['target'] = target_array
# ----------Ignore code till here. This is just to generate dummy data-------

# Assume df is your original DataFrame
target_array = df['target'].values.astype(np.float64)

# You can choose to restore this column later,
# but for now we remove it, since we will
# call df.values and find the correlation of each
# column with the target
df.drop(['target'], inplace=True, axis=1)

# This function takes a 2D numpy array and a target array as input.
# The 2D array holds the data of all the columns, transposed so that
# each row is one column of df, i.e. its shape is (72391, 300),
# while the target array's shape is (300,).
# We find the correlation of each row with the target array.
def do_stuff(df_values, target_arr):
  # Allocate an array to store the result
  # df_values.shape[0] = 72391, equal to no. of columns in df
  result = np.empty(df_values.shape[0])

  # Iterate over each column; prange lets numba run the iterations in parallel
  for i in prange(df_values.shape[0]):

    # Find the correlation of a column with the target column
    result[i] = np.corrcoef(df_values[i], target_arr)[0, 1]
  return result

# Compile the function do_stuff with numba
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)

# Transpose so that each row is one column of df, and make the array contiguous
df_values = np.ascontiguousarray(df.values.T, dtype=np.float64)

# This contains all the correlations
result_array = do_stuff_numba(df_values, target_array)

Link to the Colab notebook.
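As a cross-check for the loop above, the same per-column Pearson correlation can be written without an explicit loop. The sketch below is not part of the original answer: it assumes a scaled-down random frame (216 rows, 1000 columns; the names `features`, `vectorized` and `with_pandas` are illustrative) and compares a plain NumPy formula against pandas' `DataFrame.corrwith`. Unlike `df.corr()`, neither variant builds the full column-by-column correlation matrix.

```python
import numpy as np
import pandas as pd

# Dummy data; sizes are assumptions (the real frame has 72391 columns)
rng = np.random.default_rng(0)
n_rows, n_cols = 216, 1000
df = pd.DataFrame(rng.integers(0, 100, size=(n_rows, n_cols)).astype(np.float64))
df["target"] = rng.integers(0, 100, size=n_rows).astype(np.float64)

features = df.drop(columns="target")
x = features.to_numpy()        # shape (n_rows, n_cols)
t = df["target"].to_numpy()    # shape (n_rows,)

# Center every column and the target
xc = x - x.mean(axis=0)
tc = t - t.mean()

# Pearson r per column: sum(xc * tc) / sqrt(sum(xc^2) * sum(tc^2))
vectorized = (xc.T @ tc) / np.sqrt((xc ** 2).sum(axis=0) * (tc ** 2).sum())

# The same quantity using pandas only
with_pandas = features.corrwith(df["target"]).to_numpy()
```

Both arrays hold one correlation per feature column, in column order, so either can be compared element by element with `result_array` from the numba version.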

Nethale · Answer 2 · 08 March 2019

You should take a look at dask. It should be able to do what you want and more. It parallelizes most DataFrame functions.
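The answer does not show any code, so here is one possible way the dask suggestion could look. This is only a sketch under assumptions: it uses `dask.array` rather than a dask DataFrame, random data of reduced size, and an arbitrary chunk width of 500 columns. The columns are split into chunks, and the per-column Pearson correlation with the target is computed chunk by chunk.

```python
import numpy as np
import dask.array as da

# Dummy data; sizes and chunk width are assumptions
rng = np.random.default_rng(0)
n_rows, n_cols = 216, 2000
values = rng.integers(0, 100, size=(n_rows, n_cols)).astype(np.float64)
target = rng.integers(0, 100, size=n_rows).astype(np.float64)

# Split the columns into chunks that dask can process in parallel
x = da.from_array(values, chunks=(n_rows, 500))

# Center every column and the target
xc = x - x.mean(axis=0)
tc = target - target.mean()

# Pearson r per column, built lazily as a dask task graph
cov = (xc * tc[:, None]).sum(axis=0)
norm = da.sqrt((xc ** 2).sum(axis=0)) * np.sqrt((tc ** 2).sum())

# Nothing is computed until .compute() is called
dask_corr = (cov / norm).compute()
```

Whether this beats a single-process NumPy computation depends on the data size and the scheduler in use; a 216 x 72391 float array is roughly 125 MB, so it still fits in memory on most machines.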

Parallel programming approach to solve pandas problems
