You can try the following:
import numpy as np
import pandas as pd

# True on rows whose Task value looks like "Task <number>"
task_mask = df.Task.str.match(r"Task\s+\d")
df.assign(Task = df.Task[task_mask],
          Date = pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()) \
  .replace("", np.nan) \
  .dropna(how='all') \
  .ffill() \
  .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
  .reset_index()
Output:
#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6
Explanation:

1. Filter the Task column, which mixes dates and task ids. One solution is to use a regular expression that matches the task ids; pandas.Series.str.match does the job. The regular expression is quite simple: "Task\s+\d" means Task + any whitespace + a digit.

    task_mask = df.Task.str.match(r"Task\s+\d")

2. From this mask we can extract both the Tasks and the Dates. The tasks are directly available with df.Task[task_mask].

3. The dates are a bit trickier to extract:
   - We use np.where to keep the Task value on rows that are not tasks and to set NaN on the task rows.
   - Then we convert the resulting array into a pd.Series.
   - Finally, we shift all values down by one row using shift. Shifting moves each date onto the row of its first task, which makes the leftover all-NaN rows easy to remove in step 5.

    pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()

4. Replace every empty string with NaN using replace.

5. Drop the empty rows (e.g. the old rows that held only a Date) using dropna with how="all".

6. Fill every NaN value with the previous non-NaN value using ffill.

7. Group by "Task", "ID" and "Date" and aggregate the rows using agg. The aggregation function is based on str.join: lambda x: " ".join(x)

8. Reset the index produced by the groupby using reset_index.
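
To see steps 1 to 3 in isolation, here is a minimal sketch on a toy column. The five sample values are made up for illustration and are not the data from the question; the calls themselves are the same ones used in the answer.

```python
import numpy as np
import pandas as pd

# Toy column shaped like Task: a date header, task rows, blank continuation rows.
s = pd.Series(["Monday, 13 January 2020", "Task 1", "", "Task 2", ""])

# str.match anchors the pattern at the start of each string.
task_mask = s.str.match(r"Task\s+\d")
print(task_mask.tolist())  # [False, True, False, True, False]

# Keep the non-task values (date header and blanks), NaN on the task rows.
not_task = pd.Series(np.where(~task_mask, s, np.nan))

# Move every value down one row: the date now sits at index 1, next to "Task 1".
dates = not_task.shift()
print(dates)
```

After the shift, index 0 (the original date row) holds NaN, so once the empty strings are replaced the whole row is empty and dropna(how="all") can remove it.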
Hope that's clear!
Code + illustration
import numpy as np
import pandas as pd

# Create dataframe
data = [['Monday, 13 January 2020', '', ''],
        ['Task 1', 13588, 'Jack'], ['', '', 'Address 1'], ['', '', 'City 1'],
        ['Task 2', 13589, 'Ammie'], ['', '', 'Address 2'], ['', '', 'City'],
        ['Task 3', 13589, 'Amanda'], ['', '', 'Address 3'], ['', '', 'City 3'],
        ['Tuesday, 14 January 2020', '', ''],
        ['Task 4', 13587, 'Chelsea'], ['', '', 'Address 4'], ['', '', 'City 4'],
        ['Task 5', '13586', 'Ibrahim'], ['', '', 'Address 5'], ['', '', 'City 5'],
        ['Task 6', 13585, 'Kate'], ['', '', 'Address 6'], ['', '', 'City 6']]
df = pd.DataFrame(data)
df.columns = ['Task', 'ID', 'Supervisor']
print(df)
# Step 1
task_mask = df.Task.str.match(r"Task\s+\d")
print(task_mask)
# 0     False
# 1      True
# 2     False
# 3     False
# 4      True
# 5     False
# 6     False
# 7      True
# 8     False
# 9     False
# 10    False
# 11     True
# 12    False
# 13    False
# 14     True
# 15    False
# 16    False
# 17     True
# 18    False
# 19    False
# Name: Task, dtype: bool
# Step 2
print(df.Task[task_mask])
# 1     Task 1
# 4     Task 2
# 7     Task 3
# 11    Task 4
# 14    Task 5
# 17    Task 6
# Name: Task, dtype: object
# Step 3
print(pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
# 0                           NaN
# 1      Monday, 13 January 2020
# 2                           NaN
# 3
# 4
# 5                           NaN
# 6
# 7
# 8                           NaN
# 9
# 10
# 11    Tuesday, 14 January 2020
# 12                          NaN
# 13
# 14
# 15                          NaN
# 16
# 17
# 18                          NaN
# 19
# dtype: object
# (rows shown without a value hold empty strings, not NaN)
# Step 4
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan))
#       Task     ID  Supervisor                      Date
# 0      NaN    NaN         NaN                       NaN
# 1   Task 1  13588        Jack   Monday, 13 January 2020
# 2      NaN    NaN   Address 1                       NaN
# 3      NaN    NaN      City 1                       NaN
# 4   Task 2  13589       Ammie                       NaN
# 5      NaN    NaN   Address 2                       NaN
# 6      NaN    NaN        City                       NaN
# 7   Task 3  13589      Amanda                       NaN
# 8      NaN    NaN   Address 3                       NaN
# 9      NaN    NaN      City 3                       NaN
# 10     NaN    NaN         NaN                       NaN
# 11  Task 4  13587     Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN   Address 4                       NaN
# 13     NaN    NaN      City 4                       NaN
# 14  Task 5  13586     Ibrahim                       NaN
# 15     NaN    NaN   Address 5                       NaN
# 16     NaN    NaN      City 5                       NaN
# 17  Task 6  13585        Kate                       NaN
# 18     NaN    NaN   Address 6                       NaN
# 19     NaN    NaN      City 6                       NaN
# Step 5
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan)
      .dropna(how='all'))
#       Task     ID  Supervisor                      Date
# 1   Task 1  13588        Jack   Monday, 13 January 2020
# 2      NaN    NaN   Address 1                       NaN
# 3      NaN    NaN      City 1                       NaN
# 4   Task 2  13589       Ammie                       NaN
# 5      NaN    NaN   Address 2                       NaN
# 6      NaN    NaN        City                       NaN
# 7   Task 3  13589      Amanda                       NaN
# 8      NaN    NaN   Address 3                       NaN
# 9      NaN    NaN      City 3                       NaN
# 11  Task 4  13587     Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN   Address 4                       NaN
# 13     NaN    NaN      City 4                       NaN
# 14  Task 5  13586     Ibrahim                       NaN
# 15     NaN    NaN   Address 5                       NaN
# 16     NaN    NaN      City 5                       NaN
# 17  Task 6  13585        Kate                       NaN
# 18     NaN    NaN   Address 6                       NaN
# 19     NaN    NaN      City 6                       NaN
# Step 6
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan)
      .dropna(how='all')
      .ffill())
#       Task     ID  Supervisor                      Date
# 1   Task 1  13588        Jack   Monday, 13 January 2020
# 2   Task 1  13588   Address 1   Monday, 13 January 2020
# 3   Task 1  13588      City 1   Monday, 13 January 2020
# 4   Task 2  13589       Ammie   Monday, 13 January 2020
# 5   Task 2  13589   Address 2   Monday, 13 January 2020
# 6   Task 2  13589        City   Monday, 13 January 2020
# 7   Task 3  13589      Amanda   Monday, 13 January 2020
# 8   Task 3  13589   Address 3   Monday, 13 January 2020
# 9   Task 3  13589      City 3   Monday, 13 January 2020
# 11  Task 4  13587     Chelsea  Tuesday, 14 January 2020
# 12  Task 4  13587   Address 4  Tuesday, 14 January 2020
# 13  Task 4  13587      City 4  Tuesday, 14 January 2020
# 14  Task 5  13586     Ibrahim  Tuesday, 14 January 2020
# 15  Task 5  13586   Address 5  Tuesday, 14 January 2020
# 16  Task 5  13586      City 5  Tuesday, 14 January 2020
# 17  Task 6  13585        Kate  Tuesday, 14 January 2020
# 18  Task 6  13585   Address 6  Tuesday, 14 January 2020
# 19  Task 6  13585      City 6  Tuesday, 14 January 2020
# Step 7
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan)
      .dropna(how='all')
      .ffill()
      .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}))
#                                                       Supervisor
# Task   ID    Date
# Task 1 13588 Monday, 13 January 2020      Jack Address 1 City 1
# Task 2 13589 Monday, 13 January 2020       Ammie Address 2 City
# Task 3 13589 Monday, 13 January 2020    Amanda Address 3 City 3
# Task 4 13587 Tuesday, 14 January 2020  Chelsea Address 4 City 4
# Task 5 13586 Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# Task 6 13585 Tuesday, 14 January 2020     Kate Address 6 City 6
# Step 8
df = df.assign(Task = df.Task[task_mask],
               Date = pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()) \
       .replace("", np.nan) \
       .dropna(how='all') \
       .ffill() \
       .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
       .reset_index()
print(df)
#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6
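
If the chain has to be reused, one option is to wrap it in a function. The sketch below is not part of the answer above. The name tidy_tasks and the shortened sample frame are hypothetical. The index=raw.index argument is an added assumption, so that the shifted dates stay aligned when the frame does not have a default 0..n-1 index.

```python
import numpy as np
import pandas as pd


def tidy_tasks(raw: pd.DataFrame) -> pd.DataFrame:
    """Collapse date-header and continuation rows into one row per task.

    Hypothetical helper; expects the columns Task, ID and Supervisor.
    """
    task_mask = raw["Task"].str.match(r"Task\s+\d")
    # Non-task values of the Task column, moved down one row (steps 1 to 3).
    # index=raw.index is an addition to keep alignment on non-default indexes.
    dates = pd.Series(np.where(~task_mask, raw["Task"], np.nan),
                      index=raw.index).shift()
    return (
        raw.assign(Task=raw["Task"][task_mask], Date=dates)
        .replace("", np.nan)   # step 4
        .dropna(how="all")     # step 5
        .ffill()               # step 6
        .groupby(["Task", "ID", "Date"])
        .agg({"Supervisor": lambda x: " ".join(x)})  # step 7
        .reset_index()         # step 8
    )


# Shortened, made-up sample with the same layout as the question's data.
sample = pd.DataFrame(
    [
        ["Monday, 13 January 2020", "", ""],
        ["Task 1", 13588, "Jack"],
        ["", "", "Address 1"],
        ["Tuesday, 14 January 2020", "", ""],
        ["Task 2", 13589, "Ammie"],
        ["", "", "City"],
    ],
    columns=["Task", "ID", "Supervisor"],
)
result = tidy_tasks(sample)
print(result)
```

On this sample the result should contain one row per task, with the Supervisor lines joined by spaces. Depending on the pandas version, replace and ffill may emit a FutureWarning about downcasting on object columns.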