I have the following data:
[
{"Q" : "What nationality is Laplace?", "Q_TYPE_COURSE" : ["LOCATION", "DESCRIPTION"], "Q_TYPE_FINE" : ["LOCATION-COUNTRY", "DESCRIPTION-DESCRIPTION"] },
{"Q" : "Who wrote 'Celestial Mechanics'?", "Q_TYPE_COURSE" : ["HUMAN"], "Q_TYPE_FINE" : ["HUMAN-IND"]},
{"Q" : "Who created Laplace's equation?", "Q_TYPE_COURSE" : ["HUMAN"], "Q_TYPE_FINE" : ["HUMAN-IND"]},
{"Q" : "What operator is named after Laplace?", "Q_TYPE_COURSE" : ["ENTITY"], "Q_TYPE_FINE" : ["ENTITY-SYMBOL","ENTITY-WORD","ENTITY-CREATIVE"]},
{"Q" : "Who was one of the first scientists to postulate the existence of black holes?", "Q_TYPE_COURSE" : ["HUMAN"], "Q_TYPE_FINE" : ["HUMAN-IND"]},
{"Q" : "Who was one of Napoleon's examiners while he was in school?", "Q_TYPE_COURSE" : ["HUMAN"], "Q_TYPE_FINE" : ["HUMAN-IND"]},
{"Q" : "Where was Laplace born?", "Q_TYPE_COURSE" : ["LOCATION"], "Q_TYPE_FINE" : ["LOCATION-CITY","LOCATION-STATE","LOCATION-COUNTRY"]},
{"Q" : "Where did Laplace go to school?", "Q_TYPE_COURSE" : ["LOCATION", "ENTITY"], "Q_TYPE_FINE" : ["LOCATION-CITY","LOCATION-STATE","LOCATION-OTHER","ENTITY-OTHER"]},
{"Q" : "What did Laplace think of d'Alembert?", "Q_TYPE_COURSE" : ["DESCRIPTION","HUMAN"], "Q_TYPE_FINE" : ["DESCRIPTION-REASON", "HUMAN-DESCRIPTION"]},
{"Q" : "What did d'Alembert think of Laplace?", "Q_TYPE_COURSE" : ["DESCRIPTION","HUMAN"], "Q_TYPE_FINE" : ["DESCRIPTION-REASON", "HUMAN-DESCRIPTION"]},
{"Q" : "When did Laplace become a member of the Academie Des Sciences?", "Q_TYPE_COURSE" : ["NUMERIC"], "Q_TYPE_FINE" : ["NUMERIC-DATE"]},
{"Q" : "Are Laplace's theories on celestial motion sufficient to describe the stability of the Solar System?", "Q_TYPE_COURSE" : ["DESCRIPTION"], "Q_TYPE_FINE" : ["DESCRIPTION-REASON"]},
{"Q" : "How did Laplace's theory of ocean tides differ from that of Newton or Bernoulli?", "Q_TYPE_COURSE" : ["DESCRIPTION"], "Q_TYPE_FINE" : ["DESCRIPTION-REASON","DESCRIPTION-DESCRIPTION"]},
{"Q" : "What sequence of functions, made by Legendre, did Laplace expand on?", "Q_TYPE_COURSE" : ["ENTITY"], "Q_TYPE_FINE" : ["ENTITY-SYMBOL", "ENTITY-CREATIVE"]},
{"Q" : "What is a potential function?", "Q_TYPE_COURSE" : ["ENTITY","DESCRIPTION"], "Q_TYPE_FINE" : ["ENTITY-SYMBOL","DESCRIPTION-DESCRIPTION"]},
{"Q" : "In what year did Laplace publish his book?", "Q_TYPE_COURSE" : ["NUMERIC"], "Q_TYPE_FINE" : ["NUMERIC-DATE"]},
{"Q" : "What hypothesis was Laplace known for?", "Q_TYPE_COURSE" : ["DESCRIPTION","ENTITY"], "Q_TYPE_FINE" : ["DESCRIPTION-DESCRIPTION","ENTITY-CREATIVE"]},
{"Q" : "What did Laplace do in statistics?", "Q_TYPE_COURSE" : ["DESCRIPTION", "ENTITY"], "Q_TYPE_FINE" : ["DESCRIPTION-DESCRIPTION", "ENTITY-CREATIVE"]},
{"Q" : "Was Laplace involved in politics?", "Q_TYPE_COURSE" : ["DESCRIPTION", "HUMAN"], "Q_TYPE_FINE" : ["DESCRIPTION-DESCRIPTION", "HUMAN-DESCRIPTION"]},
{"Q" : "What are Laplace's thoughts on governance?", "Q_TYPE_COURSE" : ["DESCRIPTION"], "Q_TYPE_FINE" : ["DESCRIPTION-REASON", "DESCRIPTION-DESCRIPTION"]},
{"Q" : "Where did Laplace die?", "Q_TYPE_COURSE" : ["LOCATION"], "Q_TYPE_FINE" : ["LOCATION-CITY", "LOCATION-COUNTRY", "LOCATION-STATE"]},
{"Q" : "What was Laplace's full name?", "Q_TYPE_COURSE" : ["HUMAN"], "Q_TYPE_FINE" : ["HUMAN-TITLE"]}
]
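For reference, this is how the records above can be split into the parallel `questions`/`tags` lists that the classifier below expects. The two records are taken from the data above and embedded inline so the snippet is self-contained (in practice they would come from a file):

```python
import json

# Two of the records above, embedded inline for a self-contained example.
records = json.loads("""
[
 {"Q": "What nationality is Laplace?",
  "Q_TYPE_COURSE": ["LOCATION", "DESCRIPTION"],
  "Q_TYPE_FINE": ["LOCATION-COUNTRY", "DESCRIPTION-DESCRIPTION"]},
 {"Q": "Who wrote 'Celestial Mechanics'?",
  "Q_TYPE_COURSE": ["HUMAN"],
  "Q_TYPE_FINE": ["HUMAN-IND"]}
]
""")

# Only the coarse "Q_TYPE_COURSE" tags are kept, per the note below.
questions = [r["Q"] for r in records]
tags = [r["Q_TYPE_COURSE"] for r in records]
```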
I am ONLY USING the Li and Roth "Q_TYPE_COURSE" tags. I have a feature-extraction class where features are extracted and then converted into vector form using the following two methods. The feature_extractor.create_features method takes a spaCy doc as input and returns a list of string features; the vectorize_features method takes the output of create_features and turns it into a 1D coo_matrix, which is later turned into an array for prediction (where self is the qa_classifier instance):
feature_extractor.create_features(nlp(doc["Q"]), ngram_range=(1,3), lemmatize=True)
self.vectorize_features(features)
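The body of `create_features` is not shown, so here is a purely hypothetical sketch of what such a method might look like, assuming it emits word n-grams over lemmas. It only assumes the input is an iterable of spaCy-like tokens exposing `.text` and `.lemma_` attributes; this is NOT the author's implementation:

```python
# Hypothetical sketch, not the original create_features implementation.
# `doc` is assumed to be an iterable of spaCy-like tokens with
# .text and .lemma_ attributes; the result is a list of n-gram strings.
def create_features(doc, ngram_range=(1, 3), lemmatize=True):
    words = [t.lemma_ if lemmatize else t.text for t in doc]
    lo, hi = ngram_range
    features = []
    for n in range(lo, hi + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i : i + n]))
    return features
```

Because the output is already a list of token strings, the `TfidfVectorizer` below is configured with identity `tokenizer`/`preprocessor` functions.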
Now, here is the definition of the qa_classifier class (assume the ngram ranges and lemmatize values are consistent throughout):
from sklearn import svm
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

class qa_classifier(feature_extractor):
    clfs = []
    mlb = MultiLabelBinarizer()

    def _dummy_fun(s):
        return s

    # Features arrive pre-tokenized as lists of strings, so tokenizer and
    # preprocessor are identity functions and token_pattern is disabled.
    vectorizer = TfidfVectorizer(analyzer="word", tokenizer=_dummy_fun,
                                 preprocessor=_dummy_fun, token_pattern=None,
                                 norm="l2")

    def __init__(self, questions, tags, ngram_range=(2, 2), lemmatize=False):
        # questions is a list of lists of strings; assume that
        # feature_extractor.create_features has already been called on them.
        # tags is a list of lists of tag strings.
        self.ngram_range = ngram_range
        self.lemmatize = lemmatize
        self.q_matrix = qa_classifier.vectorizer.fit_transform(questions)
        self.tags_matrix = qa_classifier.mlb.fit_transform(tags)
        # Train one binary classifier per tag.
        for tag_idx in range(len(qa_classifier.mlb.classes_)):
            clf = svm.LinearSVC()
            clf.fit(self.q_matrix, self.tags_matrix.take(indices=tag_idx, axis=1))
            qa_classifier.clfs.append(clf)

    def vectorize_features(self, features):
        # Reuse the already-fitted vectorizer to turn a new list of string
        # features into a tf-idf vector; returns a 1 x n_features coo_matrix.
        return qa_classifier.vectorizer.transform([features]).tocoo()

    def predict(self, query):
        # query is a spaCy doc
        query_features = feature_extractor.create_features(
            query, ngram_range=self.ngram_range, lemmatize=self.lemmatize
        )  # turns a spaCy doc into a list of string features
        # Note: no transpose here, since vectorize_features already returns
        # a 1 x n_features matrix, which is the row shape clf.predict expects.
        feature_vector = self.vectorize_features(query_features).toarray()
        tags = []
        for tag_idx, clf in enumerate(qa_classifier.clfs):
            if clf.predict(feature_vector)[0] == 1:
                tags.append(qa_classifier.mlb.classes_[tag_idx])
        return tags
So, basically, I have a set of labels, and I use MultiLabelBinarizer to create a binary classifier for each tag, then run each of the classifiers on the document inside the predict method. However, when I run the predict method on the training data shown above, it does not reproduce the tags 100%. What is going on here?
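For anyone reproducing the labeling scheme described above, here is a minimal self-contained illustration of the MultiLabelBinarizer step, using three of the coarse tag lists from the data; each distinct tag becomes one binary column, and those columns are what the per-tag LinearSVC models are fitted against:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three coarse tag lists taken from the data above.
tags = [["LOCATION", "DESCRIPTION"], ["HUMAN"], ["NUMERIC"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

print(mlb.classes_)  # sorted tag vocabulary, one binary column per tag
print(Y)             # one row per question, 1 where the tag applies
```

One detail worth noting: LinearSVC fits a regularized hinge-loss objective (default `C=1.0`), so a fitted model is not guaranteed to classify every training example correctly, which may be relevant to the behaviour described above.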