Как изменить порядок строки данных, используя python? - PullRequest
2 голосов
/ 02 мая 2020

У меня есть фрейм данных, в котором данные расположены в другом столбце, и я хочу взять эти даты из этого столбца, создать столбец дат и сохранить их. Вот мои примеры данных.

df=[['Monday, 13 January 2020','',''],['Task 1',13588,'Jack'],['','','Address 1'],['','','City 1'],['Task 2',13589,'Ammie'],['','','Address 2'],['','','City'],['Task 3',13589,'Amanda'],['','','Address 3'],['','','City 3'],['Tuesday, 14 January 2020','',''],['Task 4',13587,'Chelsea'],['','','Address 4'],['','','City 4'],['Task 5','13586','Ibrahim'],['','','Address 5'],['','','City 5'],['Task 6',13585,'Kate'],['','','Address 6'],['','','City 6']]

df=pd.DataFrame(df)
df.columns = ['Task','ID','Supervisor']
df=df.replace(np.nan,'')
df

    Task    ID  Supervisor
0   Monday, 13 January 2020     
1   Task 1  13588   Jack
2           Address 1
3           City 1
4   Task 2  13589   Ammie
5           Address 2
6           City
7   Task 3  13589   Amanda
8           Address 3
9           City 3
10  Tuesday, 14 January 2020        
11  Task 4  13587   Chelsea
12          Address 4
13          City 4
14  Task 5  13586   Ibrahim
15          Address 5
16          City 5
17  Task 6  13585   Kate
18          Address 6
19          City 6

И я хочу получить следующий вывод.

    Date                    Task    ID      Supervisor
0 Monday, 13 January 2020   Task 1  13588   Jack Address 1 City 1
1 Monday, 13 January 2020   Task 2  13589   Ammie Address 2 City
2 Monday, 13 January 2020   Task 3  13589   Amanda Address 3 City 3
3 Tuesday, 14 January 2020  Task 4  13587   Chelsea Address 4 City 4
4 Tuesday, 14 January 2020  Task 5  13586   Ibrahim Address 5 City 5
5 Tuesday, 14 January 2020  Task 6  13585   Kate Address 6 City 6

Вот моя попытка.

def rowMerger(a,b):
    try:
        rule1 = lambda x: x not in ['']
        u = a.loc[a.iloc[:,0].apply(rule1) & a.iloc[:,1].apply(rule1) & a.iloc[:,2].apply(rule1)].index
        print(u)
        findMergerindexs = list(u)
        findMergerindexs.sort()
        a = pd.DataFrame(a)
        tabcolumns = pd.DataFrame(a.columns)
        totalcolumns = len(tabcolumns)
        b = pd.DataFrame(columns = list(tabcolumns))
        if (len(findMergerindexs) > 0):
            for m in range(len(findMergerindexs))
                if not (m == (len(findMergerindexs)-1)): 
                    startLoop = findMergerindexs[m]
                    endLoop = findMergerindexs[m+1]
                else:
                    startLoop = findMergerindexs[m]
                    endLoop = len(a)
                listValues = []
                for i in range(totalcolumns):
                    value = ' '
                    for n in range(startLoop,endLoop):
                        value = value + ' ' + str(a.iloc[n,i])
                    listValues.insert(i,(value.strip()))
                b = b.append(pd.Series(listValues),ignore_index = True)
        else:
            print("File is not having a row for merging instances - Please check the file manually for instance - ")
        return b
    except: 
        print("Error - While merging the rows")
    return b

И этот код дает мне вывод ниже.

rowMerger(df,0)
       0       1             2
0   Task 1  13588   Jack Address 1 City 1
1   Task 2  13589   Ammie Address 2 City
2   Task 3 Tuesday, 14 January 2020 13589   Amanda Address 3 City 3
3   Task 4  13587   Chelsea Address 4 City 4
4   Task 5  13586   Ibrahim Address 5 City 5
5   Task 6  13585   Kate Address 6 City 6

Но проблема в том, что этот код будет объединять только строки. Не знаете, как продублировать даты в соответствующих строках, как показано в желаемом выводе, и поместить их в другой столбец. Может кто-нибудь помочь мне, как этого добиться?

Ответы [ 3 ]

2 голосов
/ 02 мая 2020

Вы можете попробовать следующее:

task_mask = df.Task.str.match("Task\s+\d")
df.assign(Task = df.Task[task_mask],
          Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
        .reset_index()

вывод

#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6

Пояснения :

  1. Фильтруйте столбцы Task: dates и task id.

    • Одним из решений является использование регулярного выражения для сопоставления task id. pandas.Series.str.match делает работу. Используемое регулярное выражение довольно просто: "Task\s+\d" означает Task + любой пробел + число.
task_mask = df.Task.str.match("Task\s+\d")

Из этой маски мы можем извлечь Date и Tasks. Задачи легко доступны из task_mask с df.Task[task_mask]

Dates немного сложнее извлечь.

  • Мы используем np.where для установки Task значения или NaN.
  • Затем мы преобразуем array в pd.Series
  • Наконец, мы сдвигаем все значения на 1, используя shift. Сдвиг строк позволяет легко удалить строки NaN на шаге 5.
pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()

Заменить всю пустую строку на NaN, используя replace

Удалить пустые строки (например, старые строки, где есть только Date ) используя dropna с how="all"

Заполните все NaN значения предыдущим значением не NaN, используя ffill

Группировка по "Task", "ID"a and "Date" и агрегирование строк с использованием agg. Функция агрегирования основана на str.join: lambda x: " ".join(x)

Сбросить индекс с groupby с использованием reset_index.

Надеюсь, это понятно!


Код + иллюстрация

# Create dataframe
data = [['Monday, 13 January 2020', '', ''], ['Task 1', 13588, 'Jack'], ['', '', 'Address 1'], ['', '', 'City 1'], ['Task 2', 13589, 'Ammie'], ['', '', 'Address 2'], ['', '', 'City'], ['Task 3', 13589, 'Amanda'], ['', '', 'Address 3'], ['', '', 'City 3'], [
    'Tuesday, 14 January 2020', '', ''], ['Task 4', 13587, 'Chelsea'], ['', '', 'Address 4'], ['', '', 'City 4'], ['Task 5', '13586', 'Ibrahim'], ['', '', 'Address 5'], ['', '', 'City 5'], ['Task 6', 13585, 'Kate'], ['', '', 'Address 6'], ['', '', 'City 6']]
df = pd.DataFrame(data)
df.columns = ['Task', 'ID', 'Supervisor']
print(df)

# Step 1
task_mask = df.Task.str.match("Task\s+\d")
print(task_mask)
# 0     False
# 1      True
# 2     False
# 3     False
# 4      True
# 5     False
# 6     False
# 7      True
# 8     False
# 9     False
# 10    False
# 11     True
# 12    False
# 13    False
# 14     True
# 15    False
# 16    False
# 17     True
# 18    False
# 19    False
# Name: Task, dtype: bool

# Step 2
print(df.Task[task_mask])
# 1     Task 1
# 4     Task 2
# 7     Task 3
# 11    Task 4
# 14    Task 5
# 17    Task 6
# Name: Task, dtype: object

# Step 3
print(pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift())
# 0                          NaN
# 1      Monday, 13 January 2020
# 2                          NaN
# 3
# 4
# 5                          NaN
# 6
# 7
# 8                          NaN
# 9
# 10
# 11    Tuesday, 14 January 2020
# 12                         NaN
# 13
# 14
# 15                         NaN
# 16
# 17
# 18                         NaN
# 19
# dtype: object

# Step 4
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift())
        .replace("", np.NaN))
#       Task     ID Supervisor                      Date
# 0      NaN    NaN        NaN                       NaN
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2      NaN    NaN  Address 1                       NaN
# 3      NaN    NaN     City 1                       NaN
# 4   Task 2  13589      Ammie                       NaN
# 5      NaN    NaN  Address 2                       NaN
# 6      NaN    NaN       City                       NaN
# 7   Task 3  13589     Amanda                       NaN
# 8      NaN    NaN  Address 3                       NaN
# 9      NaN    NaN     City 3                       NaN
# 10     NaN    NaN        NaN                       NaN
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN  Address 4                       NaN
# 13     NaN    NaN     City 4                       NaN
# 14  Task 5  13586    Ibrahim                       NaN
# 15     NaN    NaN  Address 5                       NaN
# 16     NaN    NaN     City 5                       NaN
# 17  Task 6  13585       Kate                       NaN
# 18     NaN    NaN  Address 6                       NaN
# 19     NaN    NaN     City 6                       NaN

# Step 5:
print(df.assign(Task = df.Task[task_mask],
                Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all'))
#       Task     ID Supervisor                      Date
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2      NaN    NaN  Address 1                       NaN
# 3      NaN    NaN     City 1                       NaN
# 4   Task 2  13589      Ammie                       NaN
# 5      NaN    NaN  Address 2                       NaN
# 6      NaN    NaN       City                       NaN
# 7   Task 3  13589     Amanda                       NaN
# 8      NaN    NaN  Address 3                       NaN
# 9      NaN    NaN     City 3                       NaN
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN  Address 4                       NaN
# 13     NaN    NaN     City 4                       NaN
# 14  Task 5  13586    Ibrahim                       NaN
# 15     NaN    NaN  Address 5                       NaN
# 16     NaN    NaN     City 5                       NaN
# 17  Task 6  13585       Kate                       NaN
# 18     NaN    NaN  Address 6                       NaN
# 19     NaN    NaN     City 6                       NaN

# Step 6:
print(df.assign(Task = df.Task[task_mask],
                 Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
            .replace("", np.NaN) \
            .dropna(how='all') \
            .ffill())
#       Task     ID Supervisor                      Date
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2   Task 1  13588  Address 1   Monday, 13 January 2020
# 3   Task 1  13588     City 1   Monday, 13 January 2020
# 4   Task 2  13589      Ammie   Monday, 13 January 2020
# 5   Task 2  13589  Address 2   Monday, 13 January 2020
# 6   Task 2  13589       City   Monday, 13 January 2020
# 7   Task 3  13589     Amanda   Monday, 13 January 2020
# 8   Task 3  13589  Address 3   Monday, 13 January 2020
# 9   Task 3  13589     City 3   Monday, 13 January 2020
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12  Task 4  13587  Address 4  Tuesday, 14 January 2020
# 13  Task 4  13587     City 4  Tuesday, 14 January 2020
# 14  Task 5  13586    Ibrahim  Tuesday, 14 January 2020
# 15  Task 5  13586  Address 5  Tuesday, 14 January 2020
# 16  Task 5  13586     City 5  Tuesday, 14 January 2020
# 17  Task 6  13585       Kate  Tuesday, 14 January 2020
# 18  Task 6  13585  Address 6  Tuesday, 14 January 2020
# 19  Task 6  13585     City 6  Tuesday, 14 January 2020

# Step 7
print(df.assign(Task = df.Task[task_mask],
                Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}))
#                                                      Supervisor
# Task   ID    Date
# Task 1 13588 Monday, 13 January 2020      Jack Address 1 City 1
# Task 2 13589 Monday, 13 January 2020       Ammie Address 2 City
# Task 3 13589 Monday, 13 January 2020    Amanda Address 3 City 3
# Task 4 13587 Tuesday, 14 January 2020  Chelsea Address 4 City 4
# Task 5 13586 Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# Task 6 13585 Tuesday, 14 January 2020     Kate Address 6 City 6

# Step 8
df = df.assign(Task = df.Task[task_mask],
               Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
        .reset_index()
print(df)

#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6
2 голосов
/ 02 мая 2020

@ Ответ Александра великолепен - это альтернатива, где я избегаю извлечения регулярных выражений и сдвига:

#convert empty cells to null
(df.replace("",np.nan)
 #create a new column containing only the dates
 #we'll use the null cells in Supervisor column to pick out the dates
 .assign(Date = lambda x: x.loc[x.Supervisor.isna(),'Task'])
 .ffill()
 .dropna(subset=['ID'])
 #drop Dates found in the Task column
 .query('Task != Date')
 .groupby(['Date','Task','ID'],as_index=False)
 .Supervisor.agg(' '.join)
)

        Date                    Task      ID      Supervisor
0   Monday, 13 January 2020     Task 1  13588   Jack Address 1 City 1
1   Monday, 13 January 2020     Task 2  13589   Ammie Address 2 City
2   Monday, 13 January 2020     Task 3  13589   Amanda Address 3 City 3
3   Tuesday, 14 January 2020    Task 4  13587   Chelsea Address 4 City 4
4   Tuesday, 14 January 2020    Task 5  13586   Ibrahim Address 5 City 5
5   Tuesday, 14 January 2020    Task 6  13585   Kate Address 6 City 6
1 голос
/ 02 мая 2020

Таким образом, по сути мы используем лямбду, чтобы отделить дату от номера задачи, и pd.Series.fillna(method='ffill'), чтобы заполнить последнюю действительную дату.

Итак, мы собираемся добавить эти строки:

# Split Date column
df['Date'] = df.apply(lambda x: " ".join(x[0].split(' ')[2:]) if len(x[0].split(' ')) > 2 else np.nan,axis=1).fillna(method='ffill')

# Clean Task column
df['Task'] = df.apply(lambda x: " ".join(x[0].split(' ')[:2]) if len(x[0].split(' ')) > 1 else x[0],axis=1)

# Rename and reorder remaining columns
df['ID'] = df[1]
df['Supervisor'] = df[2]
df = df[['Date','Task','ID','Supervisor']]
Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...