"ValueError: The least populated class in y has only 1 member, which is too few", even though those classes have already been removed
June 1, 2018

I'm having trouble using sklearn's StratifiedShuffleSplit on some multilabel data. The following self-contained example illustrates the problem best:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Generate some data
np.random.seed(0)
n_samples = 10
n_features = 40
n_labels = 20

x = np.random.rand(n_samples, n_features)
y = np.zeros((n_samples, n_labels))
for col in range(n_labels):
    n_instances = np.random.randint(5)
    indices = np.random.permutation(n_samples)[:n_instances]
    y[indices,col] = 1

print('Features from training set shape:', x.shape)
print('Labels from training set shape:', y.shape)
print('Are there any labels with fewer than two instances?', np.any(y.sum(axis=0) < 2), '\n')
print(y, '\n')

# Remove labels which are represented fewer than two times in the training set,
# since this messes with StratifiedShuffleSplit below.
label_indices_rm = np.where(y.sum(axis=0) < 2)[0]
y = np.delete(y, label_indices_rm, axis=1)

print(len(label_indices_rm), ' labels had fewer than two instances and were removed.')
print('Features from training set shape:', x.shape)
print('Labels from training set shape:', y.shape)
print('Are there any labels with fewer than two instances?', np.any(y.sum(axis=0) < 2), '\n')
print(y, '\n')

sss = StratifiedShuffleSplit(n_splits=1, train_size=0.5)
indices,_ = sss.split(x, y) # gives the training indices

This produces the following output:

Features from training set shape: (10, 40)
Labels from training set shape: (10, 20)
Are there any labels with fewer than two instances? True 

[[0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1.]
 [0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0.]] 

7  labels had fewer than two instances and were removed.
Features from training set shape: (10, 40)
Labels from training set shape: (10, 13)
Are there any labels with fewer than two instances? False 

[[1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0.]
 [0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0.]
 [0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0.]] 

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-a490e96dd0e0> in <module>()
     32 
     33 sss = StratifiedShuffleSplit(n_splits=1, train_size=0.5)
---> 34 indices,_ = sss.split(x, y) # gives the training indices

~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
   1202         """
   1203         X, y, groups = indexable(X, y, groups)
-> 1204         for train, test in self._iter_indices(X, y, groups):
   1205             yield train, test
   1206 

~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
   1544         class_counts = np.bincount(y_indices)
   1545         if np.min(class_counts) < 2:
-> 1546             raise ValueError("The least populated class in y has only 1"
   1547                              " member, which is too few. The minimum"
   1548                              " number of groups for any class cannot"

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

I have made sure that no label is represented fewer than two times. Why am I still getting this error?
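A diagnostic sketch that may help narrow this down. It assumes that, for a 2-D y, StratifiedShuffleSplit treats each distinct row (that is, each full combination of labels) as one class rather than stratifying column by column. The class_counts check in the traceback is consistent with that, but the exact behavior may depend on the installed scikit-learn version. The matrix below is copied from the filtered output above; the names label_counts, unique_rows and row_counts are only illustrative.

```python
import numpy as np

# Filtered label matrix (10 samples x 13 labels), copied from the output above.
y = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
], dtype=float)

# Per-label view: how often each column (label) is set.
# This is the check performed in the question.
label_counts = y.sum(axis=0)

# Per-row view (assumed to be what the splitter stratifies on):
# every distinct row counts as its own class.
unique_rows, row_counts = np.unique(y, axis=0, return_counts=True)

print('Smallest per-label count:', label_counts.min())
print('Distinct label combinations:', len(unique_rows))
print('Smallest per-combination count:', row_counts.min())
```

If the smallest per-combination count comes out below 2 while every per-label count is at least 2, the per-column filter was not checking the quantity the error message refers to.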

...