Problem preparing non-UTF-8 columns: Found unknown categories ['Fès-Meknès'] during transform
05 February 2020

I tried to prepare the input and output data for a feature selection problem, but I ran into an issue with some columns whose values do not look like valid Unicode:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-89-78f2cf157d88> in <module>
      1 # prepare input data
----> 2 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
      3 # prepare output
      4 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

<ipython-input-86-e63e5d5fad63> in prepare_inputs(X_train, X_test)
      3     oe.fit(X_train)
      4     X_train_enc = oe.transform(X_train)
----> 5     X_test_enc = oe.transform(X_test)
      6     return X_train_enc, X_test_enc
      7 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in transform(self, X)
    812 
    813         """
--> 814         X_int, _ = self._transform(X)
    815         return X_int.astype(self.dtype, copy=False)
    816 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _transform(self, X, handle_unknown)
    105                     msg = ("Found unknown categories {0} in column {1}"
    106                            " during transform".format(diff, i))
--> 107                     raise ValueError(msg)
    108                 else:
    109                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['Fès-Meknès'] in column 4 during transform
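For context, a minimal sketch of how this error can arise and one possible way around it. It assumes scikit-learn 0.24 or newer, where `OrdinalEncoder` accepts `handle_unknown="use_encoded_value"`; the version in the traceback above may be older and lack that option. The tiny arrays are made up for illustration and are not the data from this question.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data (assumption): the test split contains a city
# that never appears in the training split.
X_train = np.array([["Madrid"], ["Rabat"]], dtype=object)
X_test = np.array([["Madrid"], ["Fès-Meknès"]], dtype=object)

# Default behaviour: transform() raises on a category it did not see in fit().
strict = OrdinalEncoder()
strict.fit(X_train)
try:
    strict.transform(X_test)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

# Possible workaround (needs scikit-learn >= 0.24): map unseen categories
# to a sentinel value instead of raising.
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant.fit(X_train)
X_test_enc = tolerant.transform(X_test)
```

Note that the error is about a value missing from the training split, so it can occur with correctly encoded text too; a garbled spelling simply makes such a mismatch more likely.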

Here is an excerpt of the columns:

    Do you agree    Gender  Age     City        Urban/Rural  Output
0   Yes             Female  25-34   Madrid      Urban        Will buy
1   No              Male    18-25   Fès-Meknès  Rural        Won't
2   ...             ...     ...     ...         ...          Undecided
....

Fès-Meknès should be Fès-Meknès.
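As an illustration only, one common way an accented name gets garbled is UTF-8 bytes being decoded as Latin-1 somewhere between the database and the DataFrame. Whether that is what happened here is an assumption; the sketch below just shows the round trip and how such a string could be repaired.

```python
# Assumption: the garbling comes from UTF-8 bytes read as Latin-1.
original = "Fès-Meknès"

# Simulate the damage: encode as UTF-8, then decode with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")

# Reverse the damage: re-encode with the wrong codec, decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

If the values in the table really are damaged this way, applying the same reversal to the affected column before fitting the encoder may be enough; if the damage has another cause, this reversal can itself raise a `UnicodeError`.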

Here is the code I wrote to load the data:

import pandas as pd
import psycopg2

def load_dataset():
    connection = psycopg2.connect(user = "user",
                                  password = "passwd",
                                  host = "host",
                                  port = "5432",
                                  database = "database")

    sql = "select * from capi limit 10;"
    # load the table
    df = pd.read_sql_query(sql, connection)
    # retrieve numpy array
    dataset = df.values

    # split into input (X) and output (y) variables
    cols = df.iloc[:,5:].columns.array
    filtered_cols = ['TL_Segment']
    cols = [col for col in cols if col not in filtered_cols]

    X = df.loc[:, cols]  # independent columns
    X = X.astype(str)
    y = df['TL_Segment']    # target column i.e price range
    return X.values, y.values

... using the correct encoding. I then tried to skip these rows with try/except:

def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    try:
        X_train_enc = oe.transform(X_train)
        try: # imbricated in order not to return nothing in one of the two things returned
            X_test_enc = oe.transform(X_test)
        except ValueError as e:
            print(e)
    except ValueError as e:
        print(e)
    return X_train_enc, X_test_enc

But I still get the following:

Found unknown categories ['Fès-Meknès'] in column 4 during transform
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-126-78f2cf157d88> in <module>
      1 # prepare input data
----> 2 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
      3 # prepare output
      4 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

<ipython-input-124-2376647ab46e> in prepare_inputs(X_train, X_test)
     10     except ValueError as e:
     11         print(e)
---> 12     return X_train_enc, X_test_enc
     13 

UnboundLocalError: local variable 'X_test_enc' referenced before assignment
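The `UnboundLocalError` follows from the control flow rather than from the encoder: when the transform of the test split raises, the assignment never runs, so the name does not exist by the time `return` is reached. A minimal sketch of that pattern, using a hypothetical `fake_transform` stand-in instead of the real encoder so it runs without scikit-learn:

```python
def fake_transform(rows, known):
    """Hypothetical stand-in for an encoder's transform: rejects unseen values."""
    unknown = [r for r in rows if r not in known]
    if unknown:
        raise ValueError(f"Found unknown categories {unknown}")
    return [known.index(r) for r in rows]


def prepare_inputs(train_rows, test_rows):
    known = sorted(set(train_rows))
    train_enc = fake_transform(train_rows, known)
    # Bind the name before the try block so it exists even if transform fails.
    test_enc = None
    try:
        test_enc = fake_transform(test_rows, known)
    except ValueError as exc:
        print(exc)
    return train_enc, test_enc


train_enc, test_enc = prepare_inputs(["Madrid", "Rabat"], ["Madrid", "Fès-Meknès"])
```

Binding the name up front only avoids the second exception. The test split is still not encoded, so the caller has to handle `None`, and the unknown category itself still needs to be dealt with.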

...