Including interaction terms in a pipeline

0vvn1miw  posted on 2021-05-18  in  Spark

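I am trying to build a model with a Pipeline in PySpark. For some categorical variables I would like to include interactions in the pipeline, but I cannot work out the right code.
Here is my current pipeline: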

from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Column lists must be defined before the loop that uses them
cat_var = ['a','b']
int_var = ['c','d']
num_var = ["e", "f", "g", "h"]

stages = []

for categoricalCol in cat_var:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index", stringOrderType='frequencyAsc')
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

# Assemble the encoded categorical columns and the numeric columns into one vector
assemblerInputs = [c + "classVec" for c in cat_var] + num_var
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

I found the code for interaction terms in the documentation.

from pyspark.ml.feature import Interaction

assemblerInputs_int = [c + "classVec" for c in int_var]
interaction = Interaction(inputCols=assemblerInputs_int, outputCol="interactedCol")

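But I really don't know how to stitch these two parts together so that the final features column contains the current variables (cat_var and num_var) as well as the interaction terms.
Thanks.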
