Use:
import pandas as pd

# load the data into a DataFrame
url = 'https://raw.githubusercontent.com/fivethirtyeight/checking-our-work-data/master/us_house_elections.csv'
election_sub = pd.read_csv(url, parse_dates=['election_date','forecast_date'])
# drop rows where `candidate` is NaN
election_sub=election_sub.dropna(subset=['candidate'])
# keep rows that match one OR the other datetime
df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
# count the unique datetimes per candidate
s = df.groupby('candidate')['forecast_date'].nunique()
# keep only candidates that have both datetimes (the AND condition)
cand = s.index[s.eq(2)].unique()
print(cand)
Index(['A. Donald McEachin', 'Aaron Andrus', 'Aaron Swisher',
'Abby Finkenauer', 'Abigail Spanberger', 'Adam B. Schiff',
'Adam Kinzinger', 'Adam Smith', 'Adrian Smith', 'Adriano Espaillat',
...
'William Lacy Clay', 'William Tanoos', 'William Timmons',
'Willie Billups', 'Xochitl Torres Small', 'Young Kim', 'Yvette Clarke',
'Yvette Herrell', 'Yvonne Hayes Hinson', 'Zoe Lofgren'],
dtype='object', name='candidate', length=960)
# filter the original data down to those candidates
df = election_sub[election_sub['candidate'].isin(cand)]
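To make the logic easier to follow, here is a minimal sketch of the same steps on a small made-up DataFrame. The names `toy`, `wanted`, `on_dates`, `n_dates`, `both` and `result` are hypothetical and are not part of the election data; the calls are the same `isin`, `groupby` and `nunique` used above.

```python
import pandas as pd

# Hypothetical toy data: candidate "A" has both dates, "B" and "C" have only one.
toy = pd.DataFrame({
    "candidate": ["A", "A", "A", "B", "B", "C"],
    "forecast_date": pd.to_datetime([
        "2018-08-11", "2018-11-06", "2018-09-01",
        "2018-08-11", "2018-09-01",
        "2018-11-06",
    ]),
})

wanted = pd.to_datetime(["2018-08-11", "2018-11-06"])

# OR step: keep only rows that fall on one of the wanted dates.
on_dates = toy[toy["forecast_date"].isin(wanted)]
# Count how many distinct wanted dates each candidate has.
n_dates = on_dates.groupby("candidate")["forecast_date"].nunique()
# AND step: a candidate qualifies only if every wanted date is present.
both = n_dates.index[n_dates.eq(len(wanted))]
# Select all rows of the qualifying candidates, including their other dates.
result = toy[toy["candidate"].isin(both)]
print(result)
```

Only the rows of candidate "A" remain, including the row for 2018-09-01, because the last step filters the original data by candidate and not by date.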
Your solution is also possible if you test whether at least one row matches each of the two conditions. Each `.any()` returns a scalar, so the two results are combined with a plain `and` for the AND condition:
grouped = election_sub.groupby('candidate')
df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())
print(df)
year office state district special election_date forecast_date \
0 2018 House WY 1.0 False 2018-11-06 2018-11-06
1 2018 House WY 1.0 False 2018-11-06 2018-11-06
2 2018 House WY 1.0 False 2018-11-06 2018-11-06
3 2018 House WY 1.0 False 2018-11-06 2018-11-06
4 2018 House WY 1.0 False 2018-11-06 2018-11-06
... ... ... ... ... ... ... ...
282688 2018 House AK 1.0 False 2018-11-06 2018-08-01
282689 2018 House AK 1.0 False 2018-11-06 2018-08-01
282690 2018 House AK 1.0 False 2018-11-06 2018-08-01
282691 2018 House AK 1.0 False 2018-11-06 2018-08-01
282692 2018 House AK 1.0 False 2018-11-06 2018-08-01
forecast_type party candidate projected_voteshare \
0 lite D Greg Hunter 33.29836
1 lite R Liz Cheney 61.18835
2 deluxe D Greg Hunter 31.37998
3 deluxe R Liz Cheney 63.10673
4 classic D Greg Hunter 31.33293
... ... ... ... ...
282688 lite R Don Young 50.74973
282689 deluxe D Alyse S. Galvin 41.49152
282690 deluxe R Don Young 51.96705
282691 classic D Alyse S. Galvin 44.10701
282692 classic R Don Young 49.35155
actual_voteshare probwin probwin_outcome
0 NaN 0.00134 0
1 NaN 0.99866 1
2 NaN 0.00020 0
3 NaN 0.99980 1
4 NaN 0.00032 0
... ... ... ...
282688 NaN 0.76900 1
282689 NaN 0.12776 0
282690 NaN 0.87224 1
282691 NaN 0.28146 0
282692 NaN 0.71854 1
[282240 rows x 14 columns]
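As a small illustration of why `and` is enough here, the sketch below runs the same kind of `filter` on made-up data. The frame `toy` and the helper `has_both` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data: only candidate "A" has rows on both dates.
toy = pd.DataFrame({
    "candidate": ["A", "A", "A", "B", "B", "C"],
    "forecast_date": pd.to_datetime([
        "2018-08-11", "2018-11-06", "2018-09-01",
        "2018-08-11", "2018-09-01",
        "2018-11-06",
    ]),
})

d1 = pd.Timestamp("2018-08-11")
d2 = pd.Timestamp("2018-11-06")

def has_both(group):
    # Each .any() collapses a boolean Series into a single scalar,
    # so the two scalars can be combined with a plain `and`.
    has_d1 = (group["forecast_date"] == d1).any()
    has_d2 = (group["forecast_date"] == d2).any()
    return bool(has_d1 and has_d2)

# filter() calls has_both once per candidate and keeps the groups that return True.
kept = toy.groupby("candidate").filter(has_both)
print(kept)
```

Because `has_both` is an ordinary Python function called once per group, the cost grows with the number of groups.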
EDIT:
The two solutions differ considerably in performance; on this data the `isin`/`nunique` approach is roughly 17 times faster:
In [41]: %%timeit
...: df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
...: #get number of unique datetimes per groups
...: s = df.groupby('candidate')['forecast_date'].nunique()
...: #filter candidates only with both datetimes, like condition AND
...: cand = s.index[s.eq(2)].unique()
...:
...: #filter original data by candidates
...: df = election_sub[election_sub['candidate'].isin(cand)]
...:
61.3 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %%timeit
...: grouped = election_sub.groupby('candidate')
...: df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())
...:
1.07 s ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
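For anyone who wants to repeat the comparison without downloading the data, the following is a rough sketch on synthetic data using the standard `timeit` module. The frame `synth`, its size, the seed, and the helpers `via_isin` and `via_filter` are all assumptions made for illustration; the absolute timings will differ from the ones above and depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the election data (assumed shape, not the real file).
rng = np.random.default_rng(0)
dates = pd.date_range("2018-08-01", "2018-11-06", freq="D")
n_rows = 20_000
synth = pd.DataFrame({
    "candidate": rng.integers(0, 500, size=n_rows).astype(str),
    "forecast_date": dates[rng.integers(0, len(dates), size=n_rows)],
})

wanted = [pd.Timestamp("2018-08-11"), pd.Timestamp("2018-11-06")]

def via_isin(df):
    # Vectorized approach: isin + nunique per candidate.
    sub = df[df["forecast_date"].isin(wanted)]
    n = sub.groupby("candidate")["forecast_date"].nunique()
    return df[df["candidate"].isin(n.index[n.eq(len(wanted))])]

def via_filter(df):
    # Per-group Python callback, evaluated once for every candidate.
    return df.groupby("candidate").filter(
        lambda g: bool((g["forecast_date"] == wanted[0]).any()
                       and (g["forecast_date"] == wanted[1]).any())
    )

fast = via_isin(synth)
slow = via_filter(synth)
# Both approaches should select exactly the same rows.
same_rows = fast.index.sort_values().equals(slow.index.sort_values())
print("same rows:", same_rows)

print("isin/nunique:", timeit.timeit(lambda: via_isin(synth), number=3))
print("groupby.filter:", timeit.timeit(lambda: via_filter(synth), number=3))
```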