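I am trying to build a model with a pipeline in PySpark. For some categorical variables I would like to include interactions in the pipeline, but I cannot find the right code.
Here is my current pipeline: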
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Column lists must be defined before the loop that uses them
cat_var = ['a', 'b']
int_var = ['c', 'd']
num_var = ["e", "f", "g", "h"]

stages = []
for categoricalCol in cat_var:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index", stringOrderType='frequencyAsc')
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

assemblerInputs = [c + "classVec" for c in cat_var]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
I found the code for interaction terms in the documentation:
from pyspark.ml.feature import Interaction

assemblerInputs_int = [c + "classVec" for c in int_var]
interaction = Interaction(inputCols=assemblerInputs_int, outputCol="interactedCol")
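But I really don't know how to stitch these two parts together so that the final features column contains the current variables (cat_var and num_var) as well as the interaction terms.
Thanks.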