Given the two DataFrames shown above,
df
   Department  Country    Age  Grade  Score
0        Math    India  Young      A     97
1        Math    India  Young      B     86
2        Math    India  Young      D     68
3     Science    India  Young      A     92
4     Science    India  Young      B     81
5     Science    India  Young      C     76
6      Social    India  Young      B     88
7      Social    India  Young      D     62
8      Social    India  Young      C     72
input
   Country    Age  Grade  Score
0    India  Young      B     84
1    India  Young      D     65
2    India  Young      A     98
one possible solution is the following.
from collections import OrderedDict
import numpy as np
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Convert the categorical features to numeric values using the scikit-learn package:
df['Country'] = le.fit_transform(df['Country'])
df['Age'] = le.fit_transform(df['Age'])
df['Grade'] = le.fit_transform(df['Grade'])
df
Output:
   Department  Country  Age  Grade  Score
0        Math        0    0      0     97
1        Math        0    0      1     86
2        Math        0    0      3     68
3     Science        0    0      0     92
4     Science        0    0      1     81
5     Science        0    0      2     76
6      Social        0    0      1     88
7      Social        0    0      3     62
8      Social        0    0      2     72
input['Country'] = le.fit_transform(input['Country'])
input['Age'] = le.fit_transform(input['Age'])
input['Grade'] = le.fit_transform(input['Grade'])
input
Output:
   Country  Age  Grade  Score
0        0    0      1     84
1        0    0      2     65
2        0    0      0     98
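Note that calling fit_transform a second time on input learns a new mapping from the labels present in input only: in the output above Grade D is encoded as 2, whereas in df it is encoded as 3. Below is a minimal sketch of one way to keep the codes consistent, by fitting one encoder per column on the reference frame and reusing it on the query frame. The names reference, query and encoders are illustrative and not part of the original answer, and the small frames are stand-ins for the raw data. With a shared mapping the Grade of the second query row becomes 3 rather than 2, which can change which row ends up most similar.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-ins for the raw (un-encoded) frames; names are illustrative.
reference = pd.DataFrame({
    "Country": ["India"] * 4,
    "Age": ["Young"] * 4,
    "Grade": ["A", "B", "C", "D"],
})
query = pd.DataFrame({
    "Country": ["India"] * 3,
    "Age": ["Young"] * 3,
    "Grade": ["B", "D", "A"],
})

encoders = {}
for col in ["Country", "Age", "Grade"]:
    # Fit once on the reference frame, then reuse the same mapping for both.
    encoders[col] = LabelEncoder().fit(reference[col])
    reference[col] = encoders[col].transform(reference[col])
    query[col] = encoders[col].transform(query[col])

print(query["Grade"].tolist())  # [1, 3, 0]
```

LabelEncoder.transform raises a ValueError for a label it did not see during fit, so this sketch assumes every value in the query frame also occurs in the reference frame.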
Define the cosine similarity function:
def cosine_similarity(a, b):
    # Dot product of the two vectors
    nom = np.sum(np.multiply(a, b))
    # Product of the two Euclidean norms
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    sim = nom / denom
    return sim
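To get a feel for the values this function produces, here is a small self-contained check using rows taken from the encoded tables above (the first input row and the second and third df rows). It uses an equivalent NumPy formulation (np.dot and np.linalg.norm) rather than the exact code of the answer, and the variable names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # Same quantity as above: dot product divided by the product of the norms.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Column order: Country, Age, Grade, Score
query_row = [0, 0, 1, 84]
same_grade = [0, 0, 1, 86]
other_grade = [0, 0, 3, 68]

sim_same = cosine_similarity(query_row, same_grade)
sim_other = cosine_similarity(query_row, other_grade)
print(sim_same, sim_other)
```

Both values come out very close to 1, because Score is much larger than the label codes and dominates the direction of each vector; the rows differ only in the later decimal places.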
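# Unique department names, in order of first appearance
dept = list(df['Department'].values)
dept = list(OrderedDict.fromkeys(dept).keys())
results = []
for i in range(len(input)):
    # Similarity of input row i to every row of df
    similarity = []
    for j in range(len(df)):
        a = input.iloc[i]
        b = df.iloc[j, 1:]  # skip the Department column
        c_sim = cosine_similarity(a, b)
        similarity.append(c_sim)
    # Best similarity per department; this assumes each department
    # occupies exactly 3 consecutive rows of df
    max_similarity = []
    for k in range(0, len(df), 3):
        max_3 = max(similarity[k:k+3])
        max_similarity.append(max_3)
    # Department with the highest best similarity
    max_idx = max_similarity.index(max(max_similarity))
    results.append(dept[max_idx])
results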
Output:
['Math', 'Social', 'Math']
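The loop above relies on every department having exactly three consecutive rows. As a hedged alternative, the sketch below computes all similarities at once and groups them by Department, so the group sizes do not matter. The frames df_enc and query_enc are illustrative names; their values are copied from the encoded outputs shown above, so this should reproduce the same result.

```python
import numpy as np
import pandas as pd

# Encoded frames, copied from the outputs above (names are illustrative).
df_enc = pd.DataFrame({
    "Department": ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3,
    "Country": [0] * 9,
    "Age": [0] * 9,
    "Grade": [0, 1, 3, 0, 1, 2, 1, 3, 2],
    "Score": [97, 86, 68, 92, 81, 76, 88, 62, 72],
})
query_enc = pd.DataFrame({
    "Country": [0] * 3,
    "Age": [0] * 3,
    "Grade": [1, 2, 0],
    "Score": [84, 65, 98],
})

features = ["Country", "Age", "Grade", "Score"]
X = df_enc[features].to_numpy(dtype=float)     # shape (9, 4)
Q = query_enc[features].to_numpy(dtype=float)  # shape (3, 4)

# Scale every row to unit length; the dot product of two unit
# vectors is their cosine similarity.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
Q_unit = Q / np.linalg.norm(Q, axis=1, keepdims=True)
sims = pd.DataFrame(Q_unit @ X_unit.T, columns=df_enc.index)  # shape (3, 9)

# Best similarity per department (rows) for each query row (columns),
# then the department with the highest value for each query row.
best = sims.T.groupby(df_enc["Department"], sort=False).max()
results = best.idxmax(axis=0).tolist()
print(results)  # ['Math', 'Social', 'Math']
```

When two departments tie, as Math and Science do for the last query row, idxmax returns the first one in group order, which matches the behaviour of list.index in the loop version.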