The following should do what you're looking for.
Setup:
import numpy as np
import pandas as pd

# Four rows per day: one for each (site, id) combination
dates = np.repeat(pd.date_range('2018-01-01', '2018-01-31'), 4)
np.random.seed(100)
test_df = pd.DataFrame({
    'date': dates,
    'site': ['A', 'A', 'B', 'B']*(dates.shape[0]//4),
    'id': [1, 2, 1, 2]*(dates.shape[0]//4),
    'tracking_val': np.random.choice([0, 1], p=[0.4, 0.6], size=dates.shape)
})
Now perform the (many) groupby aggregations needed to get the desired result:
run_length_dict = {}  # place to hold results
for name, group in test_df.groupby(['site', 'id']):
    # Number all consecutive runs of 1 or 0
    runs = (group['tracking_val']
            .ne(group['tracking_val'].shift())
            .cumsum()
            .rename('run_number'))  # a Series is renamed by passing the new name directly
    # Group each run by its number *and* the tracking value, then find the length of that run
    run_lengths = runs.groupby([runs, group['tracking_val']]).size()
    # One final groupby (this time, on the tracking_val/level=1) to get the count of lengths
    # and push it into the dict with the name of the original group - ("site", "id") tuple
    run_length_dict[name] = run_lengths.groupby(level=1).value_counts()
Result (for each group: the tracking value, the run length, and the number of runs of that length):
{('A', 1): tracking_val
0  1    2
   2    1
   3    1
   4    1
   5    1
1  1    3
   2    3
   6    1
dtype: int64, ('A', 2): tracking_val
0  1    5
   2    2
   3    1
   4    1
1  1    6
   2    1
   3    1
   4    1
dtype: int64, ('B', 1): tracking_val
0  1    6
   2    2
1  2    4
   1    2
   4    2
   3    1
dtype: int64, ('B', 2): tracking_val
0  1    5
   2    2
   3    2
1  1    5
   2    2
   3    1
   4    1
dtype: int64}
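
If it is not obvious what each step inside the loop does, it may help to run the same three steps on a series small enough to check by hand. The sketch below is only an illustration: the toy series `s` and the names `run_number`, `run_lengths` and `length_counts` are made up for this example and are not part of the data above.

```python
import pandas as pd

# Hypothetical toy input: five runs -> 1,1 | 0 | 1,1 | 0,0 | 1
s = pd.Series([1, 1, 0, 1, 1, 0, 0, 1], name='tracking_val')

# Step 1: a new run starts wherever a value differs from the one before it.
# The cumulative sum of those "changed" flags numbers the runs 1, 2, 3, ...
run_number = s.ne(s.shift()).cumsum().rename('run_number')
print(run_number.tolist())  # [1, 1, 2, 3, 3, 4, 4, 5]

# Step 2: length of each run, indexed by (run_number, tracking_val)
run_lengths = s.groupby([run_number, s]).size().rename('run_length')

# Step 3: for each tracking value, count how many runs have each length
length_counts = run_lengths.groupby(level='tracking_val').value_counts()
print(length_counts)
```

Here the value 1 occurs in two runs of length 2 and one run of length 1, while the value 0 occurs in one run of length 1 and one run of length 2, which is what `length_counts` reports. The loop in the answer does the same thing once per (site, id) group.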