Question

I have a very slow part of my workflow that could probably use some optimization, since it took me days to run.

I have a dictionary called "databaseHash" where the KEY is a timepoint in seconds since the epoch and the VALUE is a DataFrame of many columns by one row (containing the object's data for that timepoint).

I am trying to assemble a DataFrame for time series analysis. Basically, I take a number of timepoints from this dictionary, concatenate them column-wise, rename the columns so that they are unique and sequential, and then take those rows and concatenate them row-wise to make the final DataFrame.
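For concreteness, a toy version of this layout might look like the sketch below. The names `toy_hash`, `price`, and `volume` and all values are made up for illustration; the real frames have many more columns.

```python
import pandas as pd

# Hypothetical toy data: keys are epoch seconds, values are one-row frames.
toy_hash = {
    t: pd.DataFrame({"price": [float(i)], "volume": [10.0 * i]})
    for i, t in enumerate(range(1000, 1030, 10))
}

# Column-wise concat with a sequential suffix per timepoint gives one wide row.
row = pd.concat(
    [toy_hash[t].add_suffix(f"_{i}") for i, t in enumerate(sorted(toy_hash))],
    axis=1,
)

print(list(row.columns))
# ['price_0', 'volume_0', 'price_1', 'volume_1', 'price_2', 'volume_2']
```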

import gc

import pandas as pd
from joblib import Parallel, delayed


def getExampleRow(inputs):

    global databaseHash 
    #KEY: (int) timepoint in seconds since epoch
    #VALUE: (dataFrame) N Columns x 1 Row

    concatThese = [] #To hold all the dataFrames for a single row

    count = 0 #To create unique, sequential names for each set of columns from each timepoint.

    #Go through the range of desired timepoints
    for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
        concatThese.append(databaseHash[currentTimepoint].add_suffix("_"+str(count)))
        #For each timepoint, append the dataframe and add a suffix to each column name.
        count += 1

    #Target timepoints are the data points in the future that the previously appended rows are intended to predict. 
    targetCount = 0
    #For each target timepoint (predicting multiple future timepoints)
    for targetTimepoint in inputs[2]:
        #Do the same concatenating as the previous loop
        concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
        #Change the column names for the target rows
        targetCount += 1

    return pd.concat(concatThese, axis=1) #concat and return the single row dataframe



parallelInputs = []
#Generate all the appropriate time points, here for reference, probably not important to optimize. 
while max(targetTimepoints) < largestTimepoint:

    parallelInputs.append((startingTrainingTimepoint, endingTrainingTimepoint, targetTimepoints, secondsSpacing, numBufferPoints))
    #Create the list of inputs for multiprocessing

    ####UPDATE THE VARIABLES####
    offset += secondsSpacing
    startingTrainingTimepoint = earliestTimepoint + offset
    endingTrainingTimepoint = startingTrainingTimepoint+numTrainingPoints*secondsSpacing
    targetTimepoints = [endingTrainingTimepoint + x*secondsSpacing for x in numBufferPoints]
    ############################

df = None
#Run in parallel and calculate all example rows
results = Parallel(n_jobs=50, verbose=7)(delayed(getExampleRow)(i) for i in parallelInputs)
df = pd.concat(results, axis=0)
results = None
gc.collect()

I ran a profiler on the getExampleRow() function; here are the results:

Timer unit: 1e-06 s

Total time: 0.994672 s
File: <ipython-input-5-631cda7694d5>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def getExampleRow(inputs):
 2                                               #onle input index 2 is an array, the taret Timepoint
 3                                               global databaseHash 
 4         1          4.0      4.0      0.0      concatThese = []
 5         1          2.0      2.0      0.0      count = 0
 6       541        820.0      1.5      0.1      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
 7       540     824867.0   1527.5     82.9          concatThese.append(databaseHash[currentTimepoint].add_suffix("_"+str(count)))
 8       540       1525.0      2.8      0.2          count += 1
 9                                                   
10                                               #Add all of the the target timepoints at the end
11         1          2.0      2.0      0.0      targetCount = 0
12        16         29.0      1.8      0.0      for targetTimepoint in inputs[2]: 
13        15      22245.0   1483.0      2.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
14        15         40.0      2.7      0.0          targetCount += 1
15                                                                      
16         1     145138.0 145138.0     14.6      return pd.concat(concatThese, axis=1)

When I remove the add_suffix() call from the first loop, that line drops to a negligible share of the total time (0.6%), and the total time falls from about 0.99 s to about 0.16 s:

Timer unit: 1e-06 s
Total time: 0.160778 s
File: <ipython-input-11-573f87244998>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def getExampleRow(inputs):
 2                                               #onle input index 2 is an array, the taret Timepoint
 3                                               global databaseHash 
 4         1          3.0      3.0      0.0      concatThese = []
 5         1          2.0      2.0      0.0      count = 0
 6       541        520.0      1.0      0.3      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
 7       540       1005.0      1.9      0.6          concatThese.append(databaseHash[currentTimepoint])
 8       540        522.0      1.0      0.3          count += 1
 9                                                   
10                                               #Add all of the the target timepoints at the end
11         1          1.0      1.0      0.0      targetCount = 0
12        16         24.0      1.5      0.0      for targetTimepoint in inputs[2]: 
13        15      16415.0   1094.3     10.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
14        15         36.0      2.4      0.0          targetCount += 1
15                                                                      
16         1     142250.0 142250.0     88.5      return pd.concat(concatThese, axis=1)

Is there a fast way to make my column names unique for each timepoint's set of columns?
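One possible direction, sketched here under assumptions rather than offered as a verified fix: since add_suffix() returns a renamed copy of the frame on every call, the names could instead be built once with plain string operations and attached to a single DataFrame constructed from the stacked values. The helper `build_row` is hypothetical and not part of the code above. It assumes every frame has exactly one row and the same column labels in the same order, and that the data is numeric (mixed dtypes would be upcast to object by the NumPy concatenation).

```python
import numpy as np
import pandas as pd


def build_row(frames, suffixes):
    """Join one-row frames column-wise, naming all columns in one pass.

    Hypothetical helper: assumes all frames share the same column labels
    and order, and that `suffixes` has one entry per frame.
    """
    base_cols = list(frames[0].columns)
    # Stack the raw values side by side: each frame contributes shape (1, N).
    values = np.concatenate([f.to_numpy() for f in frames], axis=1)
    # Same naming as add_suffix: label + suffix, frame by frame.
    columns = [f"{col}{suffix}" for suffix in suffixes for col in base_cols]
    return pd.DataFrame(values, columns=columns)


# Usage with made-up data: two timepoints, two columns each.
frames = [
    pd.DataFrame({"a": [1.0], "b": [2.0]}),
    pd.DataFrame({"a": [3.0], "b": [4.0]}),
]
row = build_row(frames, ["_0", "_1"])
print(list(row.columns))
# ['a_0', 'b_0', 'a_1', 'b_1']
```

In getExampleRow, the suffix list would be "_0", "_1", ... for the training timepoints followed by the "_target_..." strings. Whether this is actually faster on the real data would need to be confirmed with the same profiler.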

Fastest way to make column names unique when building a DataFrame for time series analysis?
