Is it possible to detokenize a DataFrame in PySpark?

ylamdve6 posted on 2021-07-13 in Spark

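I am using app.zelp.com to do NLP. After tokenizing the text and removing stop words, I want to detokenize the remaining words and export them to CSV. Is that possible?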

%python

# Start Spark session

from pyspark import SparkFiles
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StopWords").getOrCreate()

# Load the CSV file

url = "myamazon s3 url"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("myfile.csv"), sep=",", header=True)

# Tokenize DataFrame

review_data = Tokenizer(inputCol="Text", outputCol="Words")

# Transform DataFrame

reviewed = review_data.transform(df)

# Remove stop words

remover = StopWordsRemover(inputCol="Words", outputCol="filtered")
newFrame = remover.transform(reviewed)

final = newFrame.select("filtered")

I want to join the remaining words back together and export them to CSV. Is that possible?

8cdiaqws #1

You could consider using the Spark NLP tokenizer for tokenization, and then using TokenAssembler to assemble the tokens back into text:
https://nlp.johnsnowlabs.com/docs/en/transformers#tokenassembler-重塑数据
Alberto.
