I have been struggling with this problem for a few hours now and can't figure it out. I would really appreciate any input that could help.
Background
I am trying to automate the data manipulation for my research lab at school using Python. An experiment produces a .csv
file containing 41 rows of data without a header, as shown below.
Sometimes there are multiple runs of the same experiment, which produce .csv
files with the same header; averaging them is needed for accuracy. They look like this, with the same number of rows and headers:
So far I have been able to filter the basenames so that only the .csv
files with the same parameters are grouped together and added to a data frame. However, my issue is that I don't know how to proceed to get an average.
My Current Code and output
Code:
import pandas as pd
import os

dir = "/Users/luke/Desktop/testfolder"
files = os.listdir(dir)
files_of_interests = {}

for filename in files:
    if filename[-4:] == '.csv':
        key = filename[:-5]
        files_of_interests.setdefault(key, [])
        files_of_interests[key].append(filename)
print(files_of_interests)

for key in files_of_interests:
    stack_df = pd.DataFrame()
    print(stack_df)
    for filename in files_of_interests[key]:
        stack_df = stack_df.append(pd.read_csv(os.path.join(dir, filename)))
    print(stack_df)
Output:
Empty DataFrame
Columns: []
Index: []
Unnamed: 0 Wavelength S2c Wavelength.1 S2
0 0 1100 0.000342 1100 0.000304
1 1 1110 0.000452 1110 0.000410
2 2 1120 0.000468 1120 0.000430
3 3 1130 0.000330 1130 0.000306
4 4 1140 0.000345 1140 0.000323
.. ... ... ... ... ...
36 36 1460 0.002120 1460 0.001773
37 37 1470 0.002065 1470 0.001693
38 38 1480 0.002514 1480 0.002019
39 39 1490 0.002505 1490 0.001967
40 40 1500 0.002461 1500 0.001891
[164 rows x 5 columns]
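For context on why the row count multiplies: each run re-contributes the same row labels 0–40, so stacking four 41-row files gives 164 rows with repeated index labels. A toy sketch (stand-in frames, not the actual experiment files) showing the repeated index:

```python
import pandas as pd

# Toy stand-ins for two runs: NOT the actual experiment files.
run1 = pd.DataFrame({"Wavelength": [1100, 1110], "S2c": [0.0002, 0.0004]})
run2 = pd.DataFrame({"Wavelength": [1100, 1110], "S2c": [0.0004, 0.0006]})
stacked = pd.concat([run1, run2])  # index is 0, 1, 0, 1

# Rows that share an index label describe the same wavelength,
# so grouping on the index averages across runs.
mean_df = stacked.groupby(stacked.index).mean()
print(mean_df)
```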
Question Here!
So my question is: how do I get it to append towards the right individually for each S2c
and S2
column?
Explanation:
With multiple .csv files that share the same header names, appending keeps stacking each new file below the previous one, which led to the [164 rows x 5 columns]
output in the previous section. My original idea was to create a new data frame and append only the S2c
and S2
columns from each of those .csv
files, so that instead of stacking on top of one another, they keep being added as new columns towards the right. Afterward, I can do some form of pandas column manipulation to sum them and divide by the number of runs (which is just the number of files, i.e. len(files_of_interests[key])
inside the second for loop).
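The side-by-side idea above could be sketched like this, assuming two hypothetical runs (stand-ins for pd.read_csv on each file in files_of_interests[key], with column names mirroring the real headers):

```python
import pandas as pd

# Hypothetical runs; real data would come from pd.read_csv on each file.
run1 = pd.DataFrame({"Wavelength": [1100, 1110],
                     "S2c": [0.0002, 0.0004], "S2": [0.0001, 0.0002]})
run2 = pd.DataFrame({"Wavelength": [1100, 1110],
                     "S2c": [0.0004, 0.0006], "S2": [0.0003, 0.0004]})
runs = [run1, run2]

# Side by side: keeps the row count, duplicates S2c/S2 towards the right.
wide = pd.concat([df[["S2c", "S2"]] for df in runs], axis=1)

# Average columns that share a name: transpose, group on the name, transpose back.
averaged = wide.T.groupby(level=0).mean().T
result = pd.concat([runs[0][["Wavelength"]], averaged], axis=1)
print(result)
```

The transpose trick avoids summing and dividing by hand: duplicate column names become duplicate row labels, which groupby can average directly.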
What I have tried
I have tried creating an empty data frame with a column taken from np.arange(1100,1500,10)
using pd.DataFrame.from_records()
, then appending S2c
and S2
to that data frame as described in the previous section. The same issue occurred, and in addition it produced a bunch of NaN values, which I am not well equipped to deal with even after searching further.
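The NaN values may come from index alignment: assigning a Series to a DataFrame column matches on index labels, not on position. A minimal sketch (all names here are made up for illustration):

```python
import pandas as pd
import numpy as np

base = pd.DataFrame({"Wavelength": np.arange(1100, 1130, 10)})  # index 0, 1, 2
s2c = pd.Series([0.1, 0.2, 0.3], index=[10, 11, 12])            # different index

base["bad"] = s2c                          # no matching labels -> all NaN
base["good"] = s2c.reset_index(drop=True)  # positional -> 0.1, 0.2, 0.3
print(base)
```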
I have read multiple other questions posted here; many suggested using pd.concat
, but since the answers are tailored to different situations, I couldn't replicate them, nor was I able to understand the documentation for it, so I stopped pursuing that path.
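For reference, the two axes of pd.concat behave like this (a minimal illustration with toy frames):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# axis=0 (default): stack downward, like repeated DataFrame.append
stacked = pd.concat([a, b], ignore_index=True)

# axis=1: place side by side, adding columns towards the right
side = pd.concat([a, b], axis=1)

print(stacked.shape, side.shape)
```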
Thank you in advance for your help!
Additional Info
I am using macOS and ATOM for the code.
The csv files can be found here!
github: https://github.com/teoyi/PROJECT-Automate-Research-Process
Trying @zabop's method
Code:
dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
concat = pd.concat(dflist, axis=1)
concat.to_csv(dir + '/concat.csv')
Output:
(screenshot of the resulting concat.csv, not reproduced here)
Trying @SergeBallesta method
Code:
df = pd.concat([pd.read_csv(os.path.join(dir, filename))
                for key in files_of_interests for filename in files_of_interests[key]])
df = df.groupby(['Unnamed: 0', 'Wavelength', 'Wavelength.1']).mean().reset_index()
df.to_csv(dir + '/try.csv')
print(df)
Output:
(screenshot of the printed DataFrame, not reproduced here)