Question

У меня есть pandas df, и некоторые столбцы являются списками с данными в них, и я хотел бы закодировать метки в списках.

Я получаю эту ошибку: ValueError: Expected 2D array, got 1D array instead:

from sklearn.preprocessing import OneHotEncoder
mins = pd.read_csv('recipes.csv')

enc = OneHotEncoder(handle_unknown='ignore')

X = mins['Ingredients']

'''
[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
...
[flour, tomatoes, vodka, vodka, mustard]
'''

enc.fit(X)

Я надеюсь получить столбец списков, в котором будет правильно закодированная информация

[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
...
[flour, tomatoes, vodka, vodka, mustard]

[[0, 1, 2, 3, 1]
[0, 1, 4, 3, 1]
...
[4, 1, 3, 3, 9]]

spadarian · Answer 1 · 23 июня 2019

Поскольку вы хотите применить его непосредственно к pandas.DataFrame:

from sklearn.preprocessing import LabelEncoder

# Get a flat list with all the ingredients
all_ingr = mins.Ingredients.apply(pd.Series).stack().values

enc = LabelEncoder()
enc.fit(all_ingr)

mins['Ingredients_enc'] = mins.Ingredients.apply(enc.transform)

amanb · Answer 2 · 23 июня 2019

Чтобы пометить кодированный список списков в серии DataFrame, мы сначала обучаем кодировщик уникальными текстовыми метками, а затем используем от apply до transform каждой текстовой метки для обучаемой целочисленной метки в списке списков.Вот пример:

In [2]: import pandas as pd

In [3]: from sklearn import preprocessing

In [4]: df = pd.DataFrame({"Day":["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"], "Veggies&Drinks":[["lettuce"
   ...: , "tomatoes", "ginger", "vodka", "tomatoes"], ["flour", "vodka", "mustard", "lettuce", "ginger"], ["mustard", "
   ...: tomatoes", "ginger", "vodka", "tomatoes"], ["ginger", "vodka", "lettuce", "tomatoes", "flour"], ["mustard", "le
   ...: ttuce", "ginger", "flour", "tomatoes"]]})

In [5]: df
Out[5]:
         Day                                Veggies&Drinks
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]

In [9]: label_encoder = preprocessing.LabelEncoder()

In [19]: list_of_veggies_drinks = ["lettuce","tomatoes","ginger","vodka","flour","mustard"]

In [20]: label_encoder.fit(list_of_veggies_drinks)
Out[20]: LabelEncoder()

In [21]: integer_encoded = df["Veggies&Drinks"].apply(lambda x:label_encoder.transform(x))

In [22]: integer_encoded
Out[22]:
0    [2, 4, 1, 5, 4]
1    [0, 5, 3, 2, 1]
2    [3, 4, 1, 5, 4]
3    [1, 5, 2, 4, 0]
4    [3, 2, 1, 0, 4]
Name: Veggies&Drinks, dtype: object

In [23]: df["Encoded"] = integer_encoded

In [24]: df
Out[24]:
         Day                                Veggies&Drinks          Encoded
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]  [2, 4, 1, 5, 4]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]  [0, 5, 3, 2, 1]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]  [3, 4, 1, 5, 4]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]  [1, 5, 2, 4, 0]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]  [3, 2, 1, 0, 4]

Как кодировать столбец pandas.DataFrame, содержащий списки, используя Sklearn.preprocessing

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 2 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Как кодировать столбец pandas.DataFrame, содержащий списки, используя Sklearn.preprocessing

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 2 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Похожие темы