I am doing NLP on app.zelp.com. After tokenizing and removing stop words, I would like to detokenize (rejoin) the remaining words and export them to CSV. Is that possible?
%python
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StopWords").getOrCreate()
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark import SparkFiles
url ="myamazon s3 url"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("myfile.csv"), sep=",", header=True)
# Tokenize DataFrame
review_data = Tokenizer(inputCol="Text", outputCol="Words")
# Transform DataFrame
reviewed = review_data.transform(df)
# Remove stop words
remover = StopWordsRemover(inputCol="Words", outputCol="filtered")
newFrame = remover.transform(reviewed)
final = newFrame.select("filtered")
I would like to join the remaining words back together and export the result to CSV. Is that possible?
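For reference, one way to do this with plain PySpark (no additional libraries) is to join the filtered token array back into a string with concat_ws and write it out with the DataFrame CSV writer. This is a minimal sketch continuing the code above; the output path is a placeholder:

%python
from pyspark.sql import functions as F

# Join the array of remaining tokens back into a single space-separated string
detokenized = newFrame.withColumn("clean_text", F.concat_ws(" ", "filtered"))

# Keep only the rejoined text and write it out as CSV
detokenized.select("clean_text").write.csv(
    "s3://my-bucket/clean_reviews",  # placeholder output path
    header=True,
    mode="overwrite",
)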
1 Answer
You could consider tokenizing with the Spark NLP tokenizer and then using TokenAssembler to assemble the tokens back together:
https://nlp.johnsnowlabs.com/docs/en/transformers#tokenassembler (see the "TokenAssembler: Getting Data Reshaped" section)
Alberto.
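A minimal sketch of that approach, assuming the Spark NLP library (sparknlp) is installed on the cluster and the DataFrame df with its Text column from the question is available; the output column names and CSV path below are placeholders:

%python
import sparknlp
from sparknlp.base import DocumentAssembler, TokenAssembler
from sparknlp.annotator import Tokenizer, StopWordsCleaner
from pyspark.ml import Pipeline

# Get a Spark session with Spark NLP on the classpath (reuses an existing session if present)
spark = sparknlp.start()

# Wrap the raw text column in Spark NLP's document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("Text") \
    .setOutputCol("document")

# Spark NLP's own tokenizer (an annotator, not pyspark.ml.feature.Tokenizer)
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Drop stop words from the token annotations
stop_words = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("clean_tokens")

# Reassemble the remaining tokens back into a single text annotation
token_assembler = TokenAssembler() \
    .setInputCols(["document", "clean_tokens"]) \
    .setOutputCol("clean_text")

pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words, token_assembler])
result = pipeline.fit(df).transform(df)

# Pull the assembled string out of the annotation struct and export it to CSV
result.selectExpr("clean_text.result[0] AS clean_text") \
    .write.csv("s3://my-bucket/clean_reviews", header=True, mode="overwrite")  # placeholder path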