Problems trying to create a new identifier column based on three criteria? - PullRequest
0 votes
/ November 11, 2018

I have a data frame with conversations and timestamps like this:

timestamp   userID  textBlob    new_id
2018-10-05 23:07:02 01  a large text blob...
2018-10-05 23:07:13 01  a large text blob...
2018-10-05 23:07:23 01  a large text blob...
2018-10-05 23:07:36 01  a large text blob...
2018-10-05 23:08:02 01  a large text blob...
2018-10-05 23:09:16 01  a large text blob...
2018-10-05 23:09:21 01  a large text blob...
2018-10-05 23:09:39 01  a large text blob...
2018-10-05 23:09:47 01  a large text blob...
2018-10-05 23:10:01 01  a large text blob...
2018-10-05 23:10:11 01  a large text blob...
2018-10-05 23:10:23 01  restart             
2018-10-05 23:10:59 01  a large text blob...
2018-10-05 23:11:03 01  a large text blob...
2018-10-08 23:11:32 02  a large text blob...
2018-10-08 23:12:58 02  a large text blob...
2018-10-08 23:13:16 02  a large text blob...
2018-10-08 23:14:04 02  a large text blob...
2018-10-08 03:38:36 02  a large text blob...
2018-10-08 03:38:42 02  a large text blob...
2018-10-08 03:38:52 02  a large text blob...
2018-10-08 03:38:57 02  a large text blob...
2018-10-08 03:39:10 02  a large text blob...
2018-10-08 03:39:27 02  Restart             
2018-10-08 03:40:47 02  a large text blob...
2018-10-08 03:40:54 02  a large text blob...
2018-10-08 03:41:02 02  a large text blob...
2018-10-08 03:41:12 02  a large text blob...
2018-10-08 03:41:32 02  a large text blob...
2018-10-08 03:41:39 02  a large text blob...
2018-10-08 03:42:20 02  a large text blob...
2018-10-08 03:44:58 02  a large text blob...
2018-10-08 03:45:54 02  a large text blob...
2018-10-08 03:46:06 02  a large text blob...
2018-10-08 05:06:42 03  a large text blob...
2018-10-08 05:06:53 03  a large text blob...
2018-10-08 05:08:49 03  a large text blob...
2018-10-08 05:08:58 03  a large text blob...
2018-10-08 05:58:18 04  a large text blob...
2018-10-08 05:58:26 04  a large text blob...
2018-10-08 05:58:37 04  a large text blob...
2018-10-08 05:58:58 04  a large text blob...
2018-10-08 06:00:31 04  a large text blob...
2018-10-08 06:01:00 04  a large text blob...
2018-10-08 06:01:14 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:06:03 04  a large text blob...
2018-10-08 06:10:00 04  a large text blob...
2018-10-08 09:07:03 04  a large text blob...
2018-10-08 09:09:03 04  a large text blob...
2018-10-09 10:01:00 04  a large text blob...
2018-10-09 10:02:00 04  a large text blob...
2018-10-09 10:03:00 04  a large text blob...
2018-10-09 10:09:00 04  a large text blob...
2018-10-09 10:09:00 05  a large text blob...

I would now like to identify the conversations inside the data frame with an id. The problem is that a user can have several conversations (that is, a userID can have several associated textBlob entries). So I would like to add a new_id column to be able to tell the conversations in the data frame above apart.

To do this, I would like to create the new_id column based on three criteria:

  1. 10-minute periods
  2. the occurrence of a keyword
  3. the user having no more text blobs

The expected output looks like this (*):

timestamp   userID  textBlob    new_id
2018-10-05 23:07:02 01  a large text blob...    001
2018-10-05 23:07:13 01  a large text blob...    001
2018-10-05 23:07:23 01  a large text blob...    001
2018-10-05 23:07:36 01  a large text blob...    001
2018-10-05 23:08:02 01  a large text blob...    001
2018-10-05 23:09:16 01  a large text blob...    001
2018-10-05 23:09:21 01  a large text blob...    001
2018-10-05 23:09:39 01  a large text blob...    001
2018-10-05 23:09:47 01  a large text blob...    001
2018-10-05 23:10:01 01  a large text blob...    001
2018-10-05 23:10:11 01  a large text blob...    001
2018-10-05 23:10:23 01  restart                 001   ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01  a large text blob...    002
2018-10-05 23:11:03 01  a large text blob...    002
2018-10-08 23:11:32 02  a large text blob...    002
2018-10-08 23:12:58 02  a large text blob...    002
2018-10-08 23:13:16 02  a large text blob...    002
2018-10-08 23:14:04 02  a large text blob...    002   --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02  a large text blob...    003
2018-10-08 03:38:42 02  a large text blob...    003
2018-10-08 03:38:52 02  a large text blob...    003
2018-10-08 03:38:57 02  a large text blob...    003
2018-10-08 03:39:10 02  a large text blob...    003
2018-10-08 03:39:27 02  Restart                 003   ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02  a large text blob...    004
2018-10-08 03:40:54 02  a large text blob...    004
2018-10-08 03:41:02 02  a large text blob...    004
2018-10-08 03:41:12 02  a large text blob...    004
2018-10-08 03:41:32 02  a large text blob...    004
2018-10-08 03:41:39 02  a large text blob...    004
2018-10-08 03:42:20 02  a large text blob...    004
2018-10-08 03:44:58 02  a large text blob...    004
2018-10-08 03:45:54 02  a large text blob...    004
2018-10-08 03:46:06 02  a large text blob...    004     ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03  a large text blob...    005
2018-10-08 05:06:53 03  a large text blob...    005
2018-10-08 05:08:49 03  a large text blob...    005
2018-10-08 05:08:58 03  a large text blob...    005     ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04  a large text blob...    006
2018-10-08 05:58:26 04  a large text blob...    006
2018-10-08 05:58:37 04  a large text blob...    006
2018-10-08 05:58:58 04  a large text blob...    006
2018-10-08 06:00:31 04  a large text blob...    006
2018-10-08 06:01:00 04  a large text blob...    006
2018-10-08 06:01:14 04  a large text blob...    006
2018-10-08 06:02:03 04  a large text blob...    006     ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04  a large text blob...    007
2018-10-08 06:06:03 04  a large text blob...    007
2018-10-08 06:10:00 04  a large text blob...    007
2018-10-08 09:07:03 04  a large text blob...    007
2018-10-08 09:09:03 04  a large text blob...    007     ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04  a large text blob...    008
2018-10-09 10:02:00 04  a large text blob...    008
2018-10-09 10:03:00 04  a large text blob...    008
2018-10-09 10:09:00 04  a large text blob...    008     ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05  a large text blob...    010

So far I have tried:

searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))

And

dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times

However, I also need to take the user ID into account, and I end up with several columns. Is there a way to satisfy all three conditions and get the expected output (*)?

Answers [ 2 ]

0 votes
/ November 11, 2018

OK, I assumed the 10-minute period should be counted from the start of the conversation rather than from the immediately preceding message. In that case you will need to iterate over the rows, like this:

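import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])

# True on the row right after a "restart" message (case-insensitive), so the
# keyword row itself stays in the conversation it closes, as in the expected output
restart = df['textBlob'].str.contains('restart', case=False).shift(fill_value=False)
# True where the user differs from the previous row
user_change = df['userID'] != df['userID'].shift()
# Preliminary group number: increases after every restart and on every user change
group = (restart | user_change).cumsum()

current_id = 0
group_prev = group.iloc[0]
start_time = df['timestamp'].iloc[0]
new_ids = []

for grp, timestamp in zip(group, df['timestamp']):
    # New conversation on a restart or user change, or once more than
    # 10 minutes have passed since the current conversation started
    if grp != group_prev or timestamp - start_time > pd.Timedelta(minutes=10):
        current_id += 1
        start_time = timestamp
    group_prev = grp
    new_ids.append(current_id)

# Assign in one step; chained df.new_id.iloc[i] = ... may not write back to df
df['new_id'] = new_ids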
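
The difference between the two readings of the 10-minute rule is easiest to see on a tiny example. The sketch below uses made-up timestamps (a single hypothetical user writing every 4 minutes) and is only an illustration of the idea, not part of the code above: a gap-based rule never splits this series, while a window anchored at the conversation start does.

```python
import pandas as pd

# Hypothetical toy data: one user, messages 4 minutes apart.
ts = pd.Series(pd.to_datetime([
    "2018-10-08 06:00:00",
    "2018-10-08 06:04:00",
    "2018-10-08 06:08:00",
    "2018-10-08 06:12:00",
    "2018-10-08 06:16:00",
]))
limit = pd.Timedelta(minutes=10)

# Reading 1: split when the gap to the previous message exceeds the limit.
gap_ids = (ts.diff() > limit).cumsum()

# Reading 2: split when more than the limit has passed since the conversation
# started. Each start depends on the previous one, hence the loop.
window_ids = []
current_id = 0
start = ts.iloc[0]
for t in ts:
    if t - start > limit:
        current_id += 1
        start = t
    window_ids.append(current_id)

print(gap_ids.tolist())  # [0, 0, 0, 0, 0]
print(window_ids)        # [0, 0, 0, 1, 1]
```

If the second reading is not what you need, the vectorized gap-based rule is enough and no loop is required.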

0 votes
/ November 11, 2018

You are most of the way there. To put it all together, create a boolean mask for each condition, combine the masks, and take the cumulative sum:

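import pandas as pd

# More than 10 minutes since the previous row (timestamp must be datetime64)
mask1 = df['timestamp'].diff() > pd.Timedelta(minutes=10)
# User differs from the previous row; works for numeric and string IDs
mask2 = df['userID'] != df['userID'].shift()
# Previous row was the keyword, in any letter case
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).cumsum()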

# Result:
print(df.to_string(index=False))

timestamp  userID              textBlob  new_id
2018-10-05 23:07:02       1  a_large_text_blob...       1
2018-10-05 23:07:13       1  a_large_text_blob...       1
2018-10-05 23:07:23       1  a_large_text_blob...       1
2018-10-05 23:07:36       1  a_large_text_blob...       1
2018-10-05 23:08:02       1  a_large_text_blob...       1
2018-10-05 23:09:16       1  a_large_text_blob...       1
2018-10-05 23:09:21       1  a_large_text_blob...       1
2018-10-05 23:09:39       1  a_large_text_blob...       1
2018-10-05 23:09:47       1  a_large_text_blob...       1
2018-10-05 23:10:01       1  a_large_text_blob...       1
2018-10-05 23:10:11       1  a_large_text_blob...       1
2018-10-05 23:10:23       1               restart       1
2018-10-05 23:10:59       1  a_large_text_blob...       2
2018-10-05 23:11:03       1  a_large_text_blob...       2
2018-10-08 03:11:32       2  a_large_text_blob...       3
2018-10-08 03:12:58       2  a_large_text_blob...       3
2018-10-08 03:13:16       2  a_large_text_blob...       3
2018-10-08 03:14:04       2  a_large_text_blob...       3
2018-10-08 03:38:36       2  a_large_text_blob...       4
2018-10-08 03:38:42       2  a_large_text_blob...       4
2018-10-08 03:38:52       2  a_large_text_blob...       4
2018-10-08 03:38:57       2  a_large_text_blob...       4
2018-10-08 03:39:10       2  a_large_text_blob...       4
2018-10-08 03:39:27       2               Restart       4
2018-10-08 03:40:47       2  a_large_text_blob...       5
2018-10-08 03:40:54       2  a_large_text_blob...       5
2018-10-08 03:41:02       2  a_large_text_blob...       5
2018-10-08 03:41:12       2  a_large_text_blob...       5
2018-10-08 03:41:32       2  a_large_text_blob...       5
2018-10-08 03:41:39       2  a_large_text_blob...       5
2018-10-08 03:42:20       2  a_large_text_blob...       5
2018-10-08 03:44:58       2  a_large_text_blob...       5
2018-10-08 03:45:54       2  a_large_text_blob...       5
2018-10-08 03:46:06       2  a_large_text_blob...       5
2018-10-08 05:06:42       3  a_large_text_blob...       6
2018-10-08 05:06:53       3  a_large_text_blob...       6
2018-10-08 05:08:49       3  a_large_text_blob...       6
2018-10-08 05:08:58       3  a_large_text_blob...       6
2018-10-08 05:58:18       4  a_large_text_blob...       7
2018-10-08 05:58:26       4  a_large_text_blob...       7
2018-10-08 05:58:37       4  a_large_text_blob...       7
2018-10-08 05:58:58       4  a_large_text_blob...       7
2018-10-08 06:00:31       4  a_large_text_blob...       7
2018-10-08 06:01:00       4  a_large_text_blob...       7
2018-10-08 06:01:14       4  a_large_text_blob...       7
2018-10-08 06:02:03       4  a_large_text_blob...       7
2018-10-08 06:02:03       4  a_large_text_blob...       7
2018-10-08 06:06:03       4  a_large_text_blob...       7
2018-10-08 06:10:00       4  a_large_text_blob...       7
2018-10-08 09:07:03       4  a_large_text_blob...       8
2018-10-08 09:09:03       4  a_large_text_blob...       8
2018-10-09 10:01:00       4  a_large_text_blob...       9
2018-10-09 10:02:00       4  a_large_text_blob...       9
2018-10-09 10:03:00       4  a_large_text_blob...       9
2018-10-09 10:09:00       4  a_large_text_blob...       9
2018-10-09 10:09:00       5  a_large_text_blob...      10
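
For anyone who wants to try the mask approach without the full data set, here is a minimal self-contained sketch. The six rows are made up for illustration (only the column names come from the question), the rows are assumed to be already ordered by user and time, and the zero-padding step is an assumption based on the 001-style ids in the expected output.

```python
import pandas as pd

# Hypothetical sample; column names follow the question.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-05 23:07:02",
        "2018-10-05 23:10:23",
        "2018-10-05 23:10:59",
        "2018-10-05 23:25:00",
        "2018-10-08 03:11:32",
        "2018-10-08 03:12:58",
    ]),
    "userID": ["01", "01", "01", "01", "02", "02"],
    "textBlob": ["hello", "Restart  ", "hi again", "much later", "hey", "bye"],
})

# A new conversation starts when the user differs from the previous row...
new_user = df["userID"] != df["userID"].shift()
# ...or more than 10 minutes passed since the previous row...
long_gap = df["timestamp"].diff() > pd.Timedelta(minutes=10)
# ...or the previous message was the keyword (whitespace and case ignored).
after_restart = (
    df["textBlob"].str.strip().str.lower().eq("restart").shift(fill_value=False)
)

conversation = (new_user | long_gap | after_restart).cumsum()
# Zero-pad to three digits to mimic the ids in the expected output.
df["new_id"] = conversation.astype(str).str.zfill(3)

print(df["new_id"].tolist())
# ['001', '001', '002', '003', '004', '004']
```

Comparing each userID with its shifted neighbour instead of calling diff() means the same code works whether the IDs are stored as numbers or as strings such as "01".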