Here's my attempt:
Example CSV that I load in:
descr,serial,ref,type,val,qty,uom
Product 1,,12345,type 1,,6,PCS
Product 2,,23456,,,4,PCS
Product 3,,66778 MAKER: MANUFACTURER 1,type 2,,4,PCS
Product 4,,88776 MAKER: MANUFACTURER 2,,,2,
Load the data and create a new dataframe named cleaned, which will be processed and massaged to get the desired output.
import pandas as pd
import numpy as np
raw = pd.read_csv("data.csv") # reading in the example file
cleaned = pd.DataFrame() # creating new dataframe
cleaned['ref (int)'] = raw['ref'].str.split(' ').str[0].copy() # creating ref (int) column that is just the first part of the ref column
# moving the rest of the data over
cleaned['description'] = raw['descr']
cleaned['ref_maker'] = raw['ref'].str.split(' ').str[1:].apply(' '.join) # making a new column for the rest of ref description if there is a text part after the integer in the ref column
cleaned['type_full'] = raw['type']
cleaned['qty'] = raw['qty']
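As a side note, the two split calls on ref can be collapsed into a single step. The sketch below is an alternative, not what the code above does: it uses `Series.str.partition`, which splits at the first space only and always returns three columns. Reading ref with `dtype={"ref": str}` is my own assumption, added so the `.str` accessor still works if every ref happens to be purely numeric.

```python
import io
import pandas as pd

# Same sample rows as the example CSV above, inlined so the sketch runs on its own
csv_text = """descr,serial,ref,type,val,qty,uom
Product 1,,12345,type 1,,6,PCS
Product 2,,23456,,,4,PCS
Product 3,,66778 MAKER: MANUFACTURER 1,type 2,,4,PCS
Product 4,,88776 MAKER: MANUFACTURER 2,,,2,
"""

# Assumption: force ref to text so .str works even when no row has a text part
raw = pd.read_csv(io.StringIO(csv_text), dtype={"ref": str})

# partition splits at the first space: column 0 = before, 1 = separator, 2 = after
parts = raw["ref"].str.partition(" ")
ref_int = parts[0]    # leading number, still stored as text
ref_maker = parts[2]  # remainder, or an empty string when there is none
```

Rows without a text part end up with an empty string in ref_maker, the same as with the split/join version, so the later replace of empty strings with NaN still applies.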
Now we have a dataframe (cleaned) that looks like this:
   ref (int)  description              ref_maker  type_full  qty
0      12345    Product 1                            type 1    6
1      23456    Product 2                               NaN    4
2      66778    Product 3  MAKER: MANUFACTURER 1     type 2    4
3      88776    Product 4  MAKER: MANUFACTURER 2        NaN    2
Now we need to clean it up:
cleaned.replace('', np.nan, inplace=True) # replacing empty strings with NaN (np.nan, since the np.NaN alias was removed in NumPy 2.0)
cleaned.set_index(['ref (int)', 'qty'], inplace=True) # fixing ref and qty columns for when it stacks (stacking will help make the multi-lined duplicates you wanted)
cleaned = cleaned.stack().to_frame().reset_index() # stacking the dataframe and then converting it back to a dataframe
(For reference) the .stack() call gives you this, which is almost what you want:
ref (int)  qty
12345      6    description              Product 1
                type_full                   type 1
23456      4    description              Product 2
66778      4    description              Product 3
                ref_maker    MAKER: MANUFACTURER 1
                type_full                   type 2
88776      2    description              Product 4
                ref_maker    MAKER: MANUFACTURER 2
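To see the stacking step in isolation, here is a minimal sketch on a two-row frame (the frame `demo` and the column name passed to `to_frame` are made up for illustration). As far as I know, whether `stack()` drops missing values by itself depends on the pandas version, so the sketch calls `dropna()` explicitly and does not rely on the default.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second product has no type
demo = pd.DataFrame(
    {
        "ref": ["12345", "23456"],
        "qty": [6, 4],
        "description": ["Product 1", "Product 2"],
        "type_full": ["type 1", np.nan],
    }
).set_index(["ref", "qty"])

# stack() turns each remaining column into its own row under the (ref, qty) index;
# dropna() removes the entries that had no value
stacked = demo.stack().dropna()

# Back to a flat dataframe; the former column names land in 'level_2'
long_form = stacked.to_frame("desc").reset_index()
print(long_form)
```

The product without a type contributes one row and the other contributes two, which is exactly the multi-line effect the stack is used for here.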
Now let's do a bit more cleanup:
del cleaned['level_2'] # cleaning up old remnants from the stack (level_2 corresponds to the column names that you don't want in your final output)
cleaned = cleaned.dropna() # deleting rows that have no values (dropna returns a new dataframe, so the result has to be assigned)
cleaned.columns = ['ref', 'qty', 'desc'] # renaming the columns for clarity
Now it looks like this:
     ref  qty                   desc
0  12345    6              Product 1
1  12345    6                 type 1
2  23456    4              Product 2
3  66778    4              Product 3
4  66778    4  MAKER: MANUFACTURER 1
5  66778    4                 type 2
6  88776    2              Product 4
7  88776    2  MAKER: MANUFACTURER 2
The last step is to replace the duplicated values with empty strings so they match the desired output.
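clear_mask = cleaned.duplicated(['ref', 'qty'], keep='first') # boolean series marking rows whose ref and qty repeat an earlier row; we don't want those values to show up again
cleaned['qty'] = cleaned['qty'].astype(object) # qty is numeric; cast to object first, since newer pandas versions warn about (and may reject) writing a string into an integer column
cleaned.loc[clear_mask, 'qty'] = '' # setting duplicates to empty strings
cleaned.loc[clear_mask, 'ref'] = ''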
cols = cleaned.columns.tolist() # rearranging columns so that qty is at the end
cols.append(cols.pop(cols.index('qty')))
cleaned = cleaned[cols]
print(cleaned)
Here's the final output:
     ref                   desc  qty
0  12345              Product 1    6
1                        type 1
2  23456              Product 2    4
3  66778              Product 3    4
4         MAKER: MANUFACTURER 1
5                        type 2
6  88776              Product 4    2
7         MAKER: MANUFACTURER 2
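For convenience, the whole walkthrough can be wrapped into one function. This is only a sketch under a few assumptions of mine: the helper name `flatten_products` is hypothetical, ref is read as text, the ref column is split with `str.partition` where the steps above use split/join, and both ref and qty are cast to object before being blanked. Like the steps above, it assumes each (ref, qty) pair identifies one product.

```python
import io
import pandas as pd

def flatten_products(csv_source):
    """Hypothetical helper: one output row per non-empty detail of each product."""
    raw = pd.read_csv(csv_source, dtype={"ref": str})  # assumption: ref as text
    parts = raw["ref"].str.partition(" ")              # number / separator / remainder
    wide = pd.DataFrame(
        {
            "ref": parts[0],
            "qty": raw["qty"],
            "description": raw["descr"],
            "ref_maker": parts[2].where(parts[2] != ""),  # empty remainder -> missing
            "type_full": raw["type"],
        }
    )
    long_form = (
        wide.set_index(["ref", "qty"])
        .stack()
        .dropna()                   # explicit, in case stack() keeps missing values
        .to_frame("desc")
        .reset_index()
        .drop(columns="level_2")    # former column names, not needed in the output
    )
    # Rows that repeat an earlier (ref, qty) pair get blank ref and qty
    repeated = long_form.duplicated(["ref", "qty"], keep="first")
    long_form[["ref", "qty"]] = long_form[["ref", "qty"]].astype(object)
    long_form.loc[repeated, ["ref", "qty"]] = ""
    return long_form[["ref", "desc", "qty"]]

csv_text = """descr,serial,ref,type,val,qty,uom
Product 1,,12345,type 1,,6,PCS
Product 2,,23456,,,4,PCS
Product 3,,66778 MAKER: MANUFACTURER 1,type 2,,4,PCS
Product 4,,88776 MAKER: MANUFACTURER 2,,,2,
"""

result = flatten_products(io.StringIO(csv_text))
print(result.to_string())
```

With a real file you would pass the path, e.g. flatten_products("data.csv"), and write the result out with to_csv(index=False) if you need it back as CSV.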