numpy: Why is my SGD so much worse than my linear regression model?

yk9xbfzb · posted 2023-06-29 in: Other

I'm trying to compare linear regression (the normal equation) with SGD, but SGD seems to be way off. Am I doing something wrong?
Here's my code:

import numpy as np
from scipy import stats

x = np.random.randint(100, size=1000)
y = x * 0.10
slope, intercept, r_value, p_value, std_err = stats.linregress(x=x, y=y)
print("slope is %f and intercept is %s" % (slope, intercept))
#slope is 0.100000 and intercept is 1.61435309565e-11
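For reference, the "normal equation" mentioned above can also be solved directly in numpy. A minimal sketch (an illustration, not part of the original post), reusing the x and y defined above:

# Design matrix with a bias column so the intercept is estimated too
X = np.column_stack([x, np.ones_like(x, dtype=float)])

# Normal equation: solve (X^T X) w = X^T y for w = [slope, intercept]
w = np.linalg.solve(X.T @ X, X.T @ y)
print("slope is %f and intercept is %s" % (w[0], w[1]))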

And here's my SGD:

from sklearn import linear_model

x = x.reshape(1000, 1)
clf = linear_model.SGDRegressor()
clf.fit(x, y, coef_init=0, intercept_init=0)

print(clf.intercept_)
print(clf.coef_)

#[  1.46746270e+10]
#[  3.14999003e+10]

I expected coef_ and intercept_ to come out almost identical, since the data is perfectly linear.


093gszye #1

When I try to run this code, I get an overflow error. I suspect you're hitting the same problem, but for some reason it isn't raising an error for you.
If you scale the features down, everything works as expected. Using scipy.stats.linregress:

>>> x = np.random.random(1000) * 10
>>> y = x * 0.10
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x=x, y=y)
>>> print("slope is %f and intercept is %s" % (slope,intercept))
slope is 0.100000 and intercept is -2.22044604925e-15

Using linear_model.SGDRegressor:

>>> clf.fit(x[:,None], y)
SGDRegressor(alpha=0.0001, epsilon=0.1, eta0=0.01, fit_intercept=True,
       l1_ratio=0.15, learning_rate='invscaling', loss='squared_loss',
       n_iter=5, penalty='l2', power_t=0.25, random_state=None,
       shuffle=False, verbose=0, warm_start=False)
>>> print("slope is %f and intercept is %s" % (clf.coef_, clf.intercept_[0]))
slope is 0.099763 and intercept is 0.00163353754797

The slope comes out slightly low, but I'd guess that's because of the regularization.
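The same fix works on the original 0–100 integer data if you standardize the feature first. A minimal sketch using sklearn's StandardScaler in a pipeline (this pipeline illustrates the scaling advice; it is not code from the answer):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x = np.random.randint(100, size=1000).astype(float)
y = x * 0.10

# Standardizing keeps the gradient steps small enough that SGD converges
model = make_pipeline(StandardScaler(), SGDRegressor())
model.fit(x.reshape(-1, 1), y)

sgd = model.named_steps['sgdregressor']
print(sgd.coef_, sgd.intercept_)  # note: these coefficients live in the scaled space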


oyxsuwqo #2

I ran GridSearchCV() over the parameters and found that, apart from fine-tuning the hyperparameters, the main issue is the loss parameter, which defaults to 'squared_error'. Just set it to 'huber' in the SGD model/pipeline, like this: SGDRegressor(loss='huber').
A possible explanation, based on the documentation, is the following:
... 'squared_error' refers to the ordinary least squares fit. 'huber' modifies 'squared_error' to focus less on getting outliers correct by switching from squared to linear loss past a distance of epsilon. ...
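To illustrate what that quote means, here is a minimal NumPy sketch of the Huber loss formula (my paraphrase of the docs, not code from the answer); epsilon=0.1 mirrors SGDRegressor's default:

import numpy as np

def huber_loss(residual, epsilon=0.1):
    # Squared loss inside |residual| <= epsilon, linear loss outside,
    # so large residuals (outliers) influence the fit only linearly.
    r = np.abs(residual)
    return np.where(r <= epsilon,
                    0.5 * r ** 2,
                    epsilon * r - 0.5 * epsilon ** 2)

print(huber_loss(np.array([0.05, 0.1, 1.0, 10.0])))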

import numpy as np
from scipy import stats

np.random.seed(321)
x = np.random.random(1000) * 10
y = x * 0.10

slope, intercept, r_value, p_value, std_err = stats.linregress(x=x, y=y)
print("slope is %f and intercept is %s" % (slope,intercept))
#slope is 0.100000 and intercept is -1.1102230246251565e-16

from sklearn.linear_model import SGDRegressor
x = x.reshape(1000,1)
clf = SGDRegressor(loss='huber',  random_state=123)
clf.fit(x, y)

print("slope is %f and intercept is %s" % (clf.coef_, clf.intercept_[0]))
#slope is 0.099741 and intercept is 0.0017359301382874714

PS: I used GridSearchCV as follows:

from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create the pipeline
sgd_pipeline = Pipeline([('SGD', SGDRegressor())])

# Define the hyperparameter grid
param_grid = {
    'SGD__loss': ['squared_error', 'huber', 'epsilon_insensitive'],
    'SGD__penalty': ['l2', 'l1', 'elasticnet'],
    'SGD__alpha': [0.0001, 0.001, 0.01],
    'SGD__l1_ratio': [0.15, 0.25, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(sgd_pipeline, param_grid, cv=5)
grid_search.fit(x, y)

# Get the best model
best_sgd_reg = grid_search.best_estimator_

# Print the best hyperparameters
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Fit the best model on the training data
best_sgd_reg.fit(x, y)
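One note on this recipe: with GridSearchCV's default refit=True, best_estimator_ has already been refit on the full data, so the final fit call above is strictly redundant.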
