How to map the coefficients obtained from a logistic regression model to feature names in PySpark

slhcrj9b  posted on 2022-12-03  in Spark

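I built a logistic regression model using the pipeline flow described in the Databricks documentation: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
The features (numeric and string features) were encoded using X1 M0 N1 X and then transformed with a standard scaler.
I would like to know how to map the weights (coefficients) obtained from the logistic regression back to the feature names in the original DataFrame.
In other words, how do I get the weight or coefficient that corresponds to each feature from the model?
Thank you.
I tried extracting the features from lrModel.schema, which gives a list of StructField objects showing the features.
I tried extracting the features from the schema and mapping them to the weights, but without success.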

from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="scaledFeatures", maxIter=10)

# Train model with Training Data

lrModel = lr.fit(trainingData)

predictions = lrModel.transform(trainingData)

LRschema = predictions.schema

Expected result of the extraction: a list of tuples (feature weight, feature name)

6tdlim6h  #1

This is not a direct output of LogisticRegression, but it can be obtained with the following function that I use:

from pyspark.sql import SparkSession

def ExtractFeatureCoeficient(model, dataset, excludedCols=None):
    # The positional pairing below is only valid when each remaining column
    # maps to exactly one coefficient, in the same order as the feature vector.
    test = model.transform(dataset)
    weights = model.coefficients
    print('This is model weights: \n', weights)
    weights = [(float(w),) for w in weights]  # convert numpy floats to Python floats, each wrapped in a 1-tuple
    if excludedCols is None:
        excludedCols = ['y', 'classWeights', 'features', 'label', 'rawPrediction', 'probability', 'prediction']
    feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) != len(feature_col):
        # Fail loudly; the original fell through to return an undefined variable
        raise ValueError('Coefficients do not match the remaining features in the model, please check field lists with model.transform(dataset).schema.names')
    spark = SparkSession.builder.getOrCreate()  # used in place of the undefined sqlContext
    weightsDF = spark.createDataFrame(list(zip(weights, feature_col)), schema=["Coeficients", "FeatureName"])
    return weightsDF

results = ExtractFeatureCoeficient(lr_model, trainingData)
results.show()
This produces a Spark DataFrame with the following fields:

+--------------------+--------------------+
|         Coeficients|         FeatureName|
+--------------------+--------------------+
|[0.15834847825223...|    name            |
|               [0.0]|  lat               |
+--------------------+--------------------+
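
The function above pairs coefficients with column names purely by position. A minimal sketch of that idea in plain Python (no Spark required) is shown here; `pair_by_position` and the sample values are hypothetical and only illustrate why the length check matters, since a column that expands into several feature slots would break the one-to-one pairing.

```python
def pair_by_position(weights, names):
    """Pair each weight with the name at the same position.

    Assumes weights and names are in the same order and that every
    name corresponds to exactly one coefficient.
    """
    weights = [float(w) for w in weights]
    names = list(names)
    if len(weights) != len(names):
        raise ValueError(
            f"{len(weights)} coefficients but {len(names)} feature names"
        )
    return list(zip(weights, names))


# Hypothetical values, loosely modelled on the table shown above.
pairs = pair_by_position([0.158, 0.0], ["name", "lat"])
print(pairs)
```

This returns the list of (weight, name) tuples the question asks for, but only under the stated one-column-per-coefficient assumption.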

Alternatively, you can fit a GLM model as follows:

from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family="binomial", link="logit", featuresCol="features", labelCol="label", maxIter=1000, regParam=0.8, weightCol="classWeights")

# Train the model
glmModel = glm.fit(trainingData)

# Then get the summary of the fitted model:

summary = glmModel.summary
print(summary)

This produces the following output:

Coefficients:
        Feature       Estimate Std Error  T Value P Value
    (Intercept)       -1.3079    0.0705 -18.5549  0.0000
    name               0.1248    0.0158   7.9129  0.0000
    lat                0.0239    0.0209   1.1455  0.2520

k75qkfdt  #2

Assuming you have a logistic regression to work with, this pandas workaround will give you the result.

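import pandas as pd
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=50, threshold=0.5)

lr_model = lr.fit(train_set)

print("Intercept: " + str(lr_model.intercept))

# Feature names come from the metadata of the "features" column; only 'numeric' attributes are read here
attrs = pd.DataFrame(train_set.schema["features"].metadata["ml_attr"]["attrs"]["numeric"])
pd.DataFrame({'coefficients': lr_model.coefficients.toArray(), 'feature': list(attrs.sort_values('idx')['name'])})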

k5ifujac  #3

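None of the solutions above seemed to work in my case. My model has a mix of numeric and binary variables. In addition, all data transformations and model validation are chained in one long pipeline, so I can only see the schema in the prediction data. I was able to piece together some code that iterates over the schema and builds a dictionary of all the variable names, and then links it to the coefficients.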

# Extract the coefficients on each of the variables
coeff = mymodel.coefficients.toArray().tolist()

# Loop through the features to extract the original column names. Store in the var_index dictionary
var_index = dict()
attrs_by_type = predictions.schema["features"].metadata["ml_attr"]["attrs"]
for variable_type in ['numeric', 'binary']:
    for variable in attrs_by_type.get(variable_type, []):  # skip a type that is absent instead of raising KeyError
        print("Found variable:", variable)
        idx = variable['idx']
        name = variable['name']
        var_index[idx] = name      # Add the name to the dictionary

# Loop through all of the variables found and print out the associated coefficients
for i in range(len(var_index)):
    print(i, var_index[i], coeff[i])
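
The same idea can be sketched as a small helper that does not depend on Spark, which makes the pairing logic easy to check on its own. The function names, the placeholder for unnamed slots, and the sample metadata below are hypothetical; the sketch assumes the metadata follows the layout used in the code above (attribute groups keyed by type, each entry carrying `idx` and `name`), and it walks whichever groups happen to be present rather than a fixed list.

```python
def names_by_index(ml_attr):
    """Return {slot index: feature name} from an `ml_attr` metadata dict."""
    index_to_name = {}
    # Attributes are grouped by type (e.g. "numeric", "binary");
    # iterate over whichever groups are present.
    for attrs in ml_attr.get("attrs", {}).values():
        for attr in attrs:
            if "name" in attr:  # skip attributes that carry no name
                index_to_name[attr["idx"]] = attr["name"]
    return index_to_name


def coefficients_with_names(coefficients, ml_attr):
    """Return a list of (weight, feature name) tuples in slot order."""
    index_to_name = names_by_index(ml_attr)
    return [
        (float(w), index_to_name.get(i, f"<unnamed_{i}>"))
        for i, w in enumerate(coefficients)
    ]


# Hypothetical metadata, shaped like
# predictions.schema["features"].metadata["ml_attr"] in the code above.
sample_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}],
        "binary": [
            {"idx": 1, "name": "gender_M"},
            {"idx": 2, "name": "gender_F"},
        ],
    },
    "num_attrs": 3,
}

pairs = coefficients_with_names([0.5, -1.2, 0.3], sample_ml_attr)
for weight, name in pairs:
    print(name, weight)
```

With a fitted model, the call would look roughly like `coefficients_with_names(mymodel.coefficients.toArray(), predictions.schema["features"].metadata["ml_attr"])`, which yields the list of (weight, name) tuples the question asks for.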
