Sklearn SVM classifier for sentences / questions
11 July 2020

I am new to machine learning and testing, and I found a tutorial online: https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568. I want to classify questions with Sklearn and an SVM. Each question has a level (Bloom's taxonomy), and I want to determine the level of each question when a user uploads a PDF file. So far I have trained the SVM model and made predictions with test_data. My problem now is passing a file / string to the predict function of the Sklearn SVM.

This is part of my dataset:

[image: sample of the dataset]

The algorithm:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import re
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report
from pdftotext import convert_pdf_to_string, convert_pdf_to_long_string, convert_to_string
from flask import jsonify

df = pd.read_csv('../dataset/data_question_levels.csv',encoding='unicode_escape')
df = df[pd.notnull(df['taxonomy'])]

my_tags = ['Remember','Understand','Apply']


def clean_text(text):
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    STOPWORDS = set(stopwords.words('english'))

    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # delete stopwords from text
    return text

df['question'] = df['question'].apply(clean_text)
df['question'].apply(lambda x: len(x.split(' '))).sum()

X = df.question
y = df.taxonomy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

sgd = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
               ])
sgd.fit(X_train, y_train)

y_pred = sgd.predict()  # want to pass a string, or strings extracted from a PDF, here

I used this function to extract the strings from the PDF file:

import string
from io import StringIO

import nltk
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def convert_pdf_to_long_string(file_path):
    output_string = StringIO()
    with open(file_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

        output = tokens(output_string.getvalue())
        without_stop = nltk.tokenize.sent_tokenize(output)
        finals = [k.strip(string.punctuation) for k in without_stop]
        final = [x for x in finals if x]
    return final

This is the output of the function:

[' Database Management Systems Year Semester Consider the following schema from the previous lab sheet', 'Lab Sheet ', 'Physician eid ename position ', 'Department did name head Foreign key head references Physician eid ', 'Works in physician department Foreign key physician references Physician eid Foreign key department references Department did ', 'Patient pid name address phone insuranceId ', 'Appointments appointmentId patient physician startTime endTime room Foreign key physician references Physician eid Foreign key patient references Patient pid ', 'Drugs Code name brand ', 'Prescribes physician patient drug date appointment dose Foreign key physician references Physician eid Foreign key patient references Patient pid Foreign key drug references Drugs Code Answer the following questions with reference to the schema below', 'Use the given data set and check the correctness of your answers in the SQL Server', 'Write a T SQL function to return the number of physicians in a given department', 'Write the command s you would use to print the number of physicians in the Psychiatry using the function you have created in the previous question', 'Write a T SQL function that returns the number of appointments a given physicians have on a given date', 'Write a T SQL procedure that outputs number of drugs prescribed to a given patient by a given doctor', 'Write the command s you would use to print the number of drugs prescribed by the physician named Molly Clock to patient named Dennis Doe using the procedure you have created in the previous question', 'Write a procedure that outputs the patient with the maximum number of appointments with a given physicians', 'Output the name of the patient and the number of appointments']

with sample sizes [17, 26]
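The last steps of convert_pdf_to_long_string (split the text into sentences, strip punctuation, drop empty strings) can be tried without a PDF. The sketch below is only an illustration under stated assumptions: `split_into_sentences` is a hypothetical name, and a naive regular-expression split stands in for `nltk.tokenize.sent_tokenize` so that no NLTK data download is needed. Unlike the function in the question, it also strips surrounding whitespace.

```python
import re
import string


def split_into_sentences(text):
    """Split raw text into sentences and strip surrounding punctuation."""
    # Naive split after '.', '?' or '!' followed by whitespace.
    # This is a stand-in for nltk.tokenize.sent_tokenize, not a replacement for it.
    parts = re.split(r"(?<=[.?!])\s+", text)
    # Strip punctuation and whitespace from both ends of every piece.
    stripped = [part.strip(string.punctuation + string.whitespace) for part in parts]
    # Drop pieces that became empty after stripping.
    return [piece for piece in stripped if piece]


sample = "Write a T-SQL function. Output the name of the patient!  "
sentences = split_into_sentences(sample)
print(sentences)
```

The result is a plain Python list with one string per sentence, which is the shape a text pipeline expects as input.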

When I pass the list returned by convert_pdf_to_long_string to the predict function, I get this error:

Found input variables with inconsistent numbers of samples: [17, 26]
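scikit-learn's input validation raises this message when two inputs that should describe the same samples have different lengths. A likely cause here, assuming the evaluation code from the tutorial still runs after predict, is that the 17 predictions for the PDF sentences are being compared with the 26 labels in y_test. A minimal sketch that reproduces the message with made-up labels (the variable names are hypothetical):

```python
from sklearn.metrics import accuracy_score

# Made-up data: 17 predictions for PDF sentences, 26 true labels from the test split.
pdf_predictions = ["Apply"] * 17
y_test_labels = ["Remember"] * 26

try:
    # The two inputs describe different sample sets, so the metric cannot pair them up.
    accuracy_score(pdf_predictions, y_test_labels)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what happens, the predict call itself succeeded and only the scoring step fails, because sentences taken from a PDF have no true labels to score against.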

When I pass X_test to the predict function, I get the expected results (X_test has shape (26,)).

How can I pass a string to the predict function?
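For context, a fitted scikit-learn Pipeline whose first step is a CountVectorizer accepts an iterable of raw strings in predict and returns one label per string, so a single sentence has to be wrapped in a one-element list. The sketch below illustrates this under assumptions and is not the model from the question: the training sentences and labels are invented, and only the pipeline settings are copied from the code above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Invented toy training data (hypothetical, only to make the sketch runnable).
train_questions = [
    "define database schema",
    "list types of joins",
    "explain how an index speeds up a query",
    "describe the difference between a view and a table",
    "write a function that returns the number of physicians",
    "write a procedure that outputs the number of drugs",
]
train_levels = ["Remember", "Remember", "Understand", "Understand", "Apply", "Apply"]

model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])
model.fit(train_questions, train_levels)

# Strings such as those returned by the PDF extraction function.
pdf_sentences = [
    "Write a T SQL function to return the number of physicians in a given department",
    "Output the name of the patient and the number of appointments",
]

# predict takes a list of strings and returns one label per string.
many_predictions = model.predict(pdf_sentences)

# A single string is wrapped in a list; [0] picks out its label.
single_prediction = model.predict(["Define the term foreign key"])[0]

for sentence, level in zip(pdf_sentences, many_predictions):
    print(level, "<-", sentence)
```

These predictions have no true labels, so they cannot be scored against y_test. Running the extracted sentences through the same clean_text preprocessing used for training before calling predict would probably keep the inputs consistent with what the model saw during fitting.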

...