Если бы у меня были следующие данные:
+---------------+---------------------+---------------------+----------+--------------+
| email | date_opened | order_date | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234 | xyz |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | 2019-01-03 09:21:43 | 1234 | xyz |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ... | ... | ... | ... | ... |
+---------------+---------------------+---------------------+----------+--------------+
Как мне сохранить один order_date
, order_id
, product_name
против самого раннего или минимального date_opened
за email
и заменить все остальные дубликаты order_date
, order_id
и product_names
на NaN
s
Код:
import pandas as pd
import numpy as np
import psycopg2
import pyodbc
dwh_conn = psycopg2.connect(...)
dm_query = ...
dm = pd.read_sql(dm_query, dwh_conn, parse_dates='date_opened', index_col='email')
dfdev_conn = pyodbc.connect(...)
bkgs_query = ...
bkgs = pd.read_sql(bkgs_query, dfdev_conn, parse_dates='order_date', index_col='email')
dm_bkgs = pd.merge(dm, bkgs, how='left', left_index=True, right_index=True)
dm_bkgs['diff_days'] = dm_bkgs['date_opened'] - dm_bkgs['order_date']
dm_bkgs['diff_days'] = dm_bkgs['diff_days']/np.timedelta64(1,'D')
dm_bkgs.index.name = 'email'
dm_bkgs.sort_values(by=['email','diff_days'], inplace=True)
dm_bkgs['order_date'] = np.where(dm_bkgs.duplicated('order_id'), np.NaN, dm_bkgs['order_id'])
dm_bkgs['product_name'] = np.where(dm_bkgs.duplicated('order_id'), np.NaN, dm_bkgs['order_id'])
dm_bkgs['diff_days'] = np.where(dm_bkgs.duplicated('order_id'), np.NaN, dm_bkgs['booking_id'])
dm_bkgs['order_id'] = np.where(dm_bkgs.duplicated('order_id'), np.NaN, dm_bkgs['order_id'])
Мой код несколько работает, но я заметил, что в фрейме данных dm
имеется 1433 строки, а после merge
или объединения число строк увеличивается до 1448. Не уверен, почему это как bkgs
фрейм данных один не имеет дубликатов ...
Также чувствую, что код немного неопрятный ...
Ожидать:
+---------------+---------------------+---------------------+----------+--------------+
| email | date_opened | order_date | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234 | xyz |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ... | ... | ... | ... | ... |
+---------------+---------------------+---------------------+----------+--------------+