I am just starting to learn Python, and for this problem I tried to write a procedure for linear regression. To check my code, I tested it on two data sets. The first data set comes from the website I followed for the mathematical steps; running the code on it gives the correct solution. The second data set, which I put together myself, gives an r-squared value greater than 1, even though the scatter plot shows that r-squared should be close to 1 but not greater than 1. I verified this with Excel. My Python code is below. I believe the error is in the part of my code that calculates the r-squared value. Any help would be greatly appreciated. The comments in my code indicate which data set is causing problems and which is not.
"""
Created on Fri Nov 01 2019 1:15:01 PM
Copyright (c) 2019 Deep Sen
Linear Regression Math steps:
https://machinelearningmastery.com/simple-linear-regression-tutorial-for-machine-learning/
I wrote the Python code so that I can understand every step. This is version 1.
Version 2 will have functions, and the user can enter data.
Version 3 will allow the user to point to a CSV file with x,y data.
Version 4 will have a graphical user interface.
"""
# Simple Linear regression
import matplotlib.pyplot as plt
import numpy as np
# Data
# Trial data in the link given above
#depth_x = np.array([1, 2, 4, 3, 5])
#age_y = np.array([1, 3, 3, 2, 5])
# When I use the data above, all my intermediate results match the
# intermediate results given in the link above. An Excel plot of the data
# gives the same slope, intercept and RMSE value, so the mathematics shown is correct.
# But when I use the data from Waltham (see the data block below), I get a
# really weird RMSE value in the two hundreds! What am I missing? The slope
# and intercept values match an Excel plot of the data below.
# Data in Mathematics Tool for Geologist by David Waltham, pg 21, Table 2.2
depth_x = np.array([0.5, 1.3, 2.47, 4.9, 8.2])
age_y = np.array([1020, 2376, 5008, 10203, 15986])
# Simple linear regression model: y = b0 + b1*x
# where x is the independent variable, y is the dependent variable,
# b0 is the intercept and b1 is the slope.
#######################################
### Estimating b1, which represents the slope
# 1) b1 can be estimated by:
#    b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))**2)
#    b1 = sp_xy / sp_xx
# where xi and yi are the ith values of x and y in an array or a list.
# Mean of depth_x and age_y
average_x = np.mean(depth_x)
average_y = np.mean(age_y)
print('Average Depth: ', average_x, 'm' '\nAverage Age:', average_y, 'yr')
# Calculating difference between depth_xi and average_x
diff_x = depth_x - average_x
print('(xi - mean x): ',diff_x)
# Calculating difference between age_yi and average_y
diff_y = age_y - average_y
print('(yi - mean y): ', diff_y)
# Products of the differences
p_xy = diff_x * diff_y
print('Product of difference: ', p_xy)
# Sum of products
sp_xy = np.sum(p_xy)
print('Sum of Products: ', sp_xy)
# Sum of the squared differences between xi and mean x
sp_xx = np.sum(diff_x**2)
print('Sum of squared differences (xi - mean x)**2: ', sp_xx)
# Calculating b1 = sp_xy / sp_xx
b1 = (sp_xy/sp_xx)
print('Slope: ', b1)
#####################################
# Estimating b0 which represents the intercept
# 2) b0 can be estimated by:
#    b0 = mean(y) - b1 * mean(x)
b0 = average_y - b1 * average_x
print('Intercept: ', b0)
######################################
# Calculating the Root Mean Square Error
# Calculating predicted value of y
pred_y = b0 + (b1 * depth_x)
print('Predicted y value: ', pred_y)
# RMSE = sqrt( sum( (pred_y - yi)**2 ) / n )
# Square of difference between pred_y and yi
sqdiff_y = (pred_y - age_y)**2
print('Square of (pred_y - yi): ', sqdiff_y)
#Sum of the Square of difference between pred_y and yi
s_sqdiff_y = np.sum(sqdiff_y)
print('Sum of square of (pred_y - yi): ', s_sqdiff_y)
# Average of the Sum of the Square of difference between pred_y and yi
av_s_sqdiff_y = s_sqdiff_y / np.size(age_y)
print('Average of the Sum of the Square of (pred_y - yi): ', av_s_sqdiff_y)
# Square root of Average of the Sum of the Square of difference between pred_y and yi
rmse = np.sqrt(av_s_sqdiff_y)
print('Root Mean Square Error: ', rmse)
##################################
# Plotting scatter plot of data
plt.scatter(depth_x, age_y, color='m', marker = 'o', s = 30)
# Plotting Linear fit
plt.plot(depth_x, pred_y, color='r')
plt.show()
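For reference, here is a minimal sketch of how the r-squared value could be computed alongside the RMSE, using the same slope and intercept formulas as the script above. The helper names `fit_line` and `fit_stats` are hypothetical and not part of the original code. The sketch assumes the usual definition R² = 1 - SS_res / SS_tot and cross-checks the slope and intercept against `np.polyfit`. Note that RMSE is expressed in the units of y, so a value in the hundreds is not necessarily wrong when the ages run into the thousands.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares intercept and slope, same formulas as the script above."""
    dx = x - np.mean(x)
    dy = y - np.mean(y)
    b1 = np.sum(dx * dy) / np.sum(dx ** 2)  # slope: the ratio is not squared
    b0 = np.mean(y) - b1 * np.mean(x)       # intercept
    return b0, b1


def fit_stats(x, y):
    """Return (b0, b1, rmse, r2) for the fitted line y = b0 + b1*x."""
    b0, b1 = fit_line(x, y)
    residuals = y - (b0 + b1 * x)
    ss_res = np.sum(residuals ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    rmse = np.sqrt(ss_res / np.size(y))      # same units as y
    r2 = 1.0 - ss_res / ss_tot               # assumed definition of r-squared
    return b0, b1, rmse, r2


# Both data sets from the post
trial_x = np.array([1, 2, 4, 3, 5])
trial_y = np.array([1, 3, 3, 2, 5])
depth_x = np.array([0.5, 1.3, 2.47, 4.9, 8.2])
age_y = np.array([1020, 2376, 5008, 10203, 15986])

for label, x, y in (("trial", trial_x, trial_y), ("Waltham", depth_x, age_y)):
    b0, b1, rmse, r2 = fit_stats(x, y)
    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope_check, intercept_check = np.polyfit(x, y, 1)
    print(label, "slope:", b1, "intercept:", b0, "RMSE:", rmse, "R^2:", r2)
    print(label, "polyfit slope:", slope_check, "polyfit intercept:", intercept_check)
```

With this definition, r-squared for a least-squares line with an intercept cannot exceed 1, so a value above 1 would point to a different formula being used somewhere (for example, an extra squaring step) rather than to the data.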