I'm converting the dataset's strings to arrays and then to feature vectors, as shown below:
from pyspark.ml.feature import HashingTF, IDF
# Create a HashingTF object to convert the "text" column to feature vectors
hashing_tf = HashingTF(inputCol="combined_features", outputCol="raw_features")
# Transform the DataFrame to create the raw feature vectors
df = hashing_tf.transform(combarray)
# Create an IDF object to calculate the inverse document frequency for the raw feature vectors
idf = IDF(inputCol="raw_features", outputCol="features")
# Fit the IDF on the DataFrame and transform it to create the final feature vectors
df = idf.fit(df).transform(df)
# View the resulting feature vectors
df.select("features").show(truncate=False)
**Output:**
+-------------------------------------+
|features |
+-------------------------------------+
|(262144,[243082],[7.785305182539862])|
|(262144,[90558],[7.785305182539862]) |
|(262144,[9277],[7.785305182539862]) |
|(262144,[55279],[7.785305182539862]) |
|(262144,[114098],[7.785305182539862])|
|(262144,[106982],[7.785305182539862])|
|(262144,[248513],[7.785305182539862])|
+-------------------------------------+
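As an aside, every vector above has exactly one nonzero entry, all with the same IDF weight. That is the signature of each row reaching HashingTF as a single token (for example, a one-element array holding the entire combined string), so each "document" is one unique term. Tokenizing into individual words first, as the code in the update below also does with RegexTokenizer, gives one term per word. A minimal sketch, assuming the combined-string DataFrame is the `combine` defined in the update:
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
# Split the raw string into words so HashingTF hashes one term per word
# rather than one term per row
tokenizer = Tokenizer(inputCol="combined_features", outputCol="tokens")
tokens_df = tokenizer.transform(combine)
hashing_tf = HashingTF(inputCol="tokens", outputCol="raw_features")
tf_df = hashing_tf.transform(tokens_df)
# The resulting vectors have one entry per distinct word in each row
tfidf_df = IDF(inputCol="raw_features", outputCol="features").fit(tf_df).transform(tf_df)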
How can I compute cosine similarity from my features in PySpark?
**Update**
I combined the data:
from pyspark.sql.functions import concat, lit, col
selected_feature = selected_feature.withColumn('combined_features',
concat(col('genres'),
col('keywords'),
col('tagline'),
col('cast'),
col('director')))
combine = selected_feature.select("combined_features")
The data looks like this:
+--------------------------------------------------+
| combined_features|
+--------------------------------------------------+
|Action Adventure Fantasy Science Fictionculture...|
|Adventure Fantasy Actionocean drug abuse exotic...|
|Action Adventure Crimespy based on novel secret...|
+--------------------------------------------------+
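Note that `concat` joins the columns with no separator, which is why words fuse together in the sample above ("…Fictionculture…"); fused words then hash to spurious tokens. A minimal fix sketch using `concat_ws`, which inserts a separator between columns (and, unlike `concat`, skips null columns instead of turning the whole result null):
from pyspark.sql.functions import concat_ws, col
# Join the text columns with a space so adjacent words stay distinct tokens
selected_feature = selected_feature.withColumn(
    'combined_features',
    concat_ws(' ',
              col('genres'), col('keywords'), col('tagline'),
              col('cast'), col('director')))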
I wrote the code the same as in the answer, and I still get the same error mentioned in the comments:
import pyspark.sql.functions as F
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF
from pyspark.ml.feature import HashingTF, Tokenizer
from sklearn.pipeline import Pipeline  # wrong Pipeline for PySpark stages -- see the answer below
regex_tokenizer = RegexTokenizer(gaps=False, pattern="\w+", inputCol="combined_features", outputCol="tokens")
count_vectorizer = CountVectorizer(inputCol="tokens", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="idf")
tf_idf_pipeline = Pipeline(stages=[regex_tokenizer, count_vectorizer, idf])
combine = tf_idf_pipeline.fit(combine).transform(combine).drop("news", "tokens", "tf")
combine = combarray.crossJoin(combine.withColumnRenamed("idf", "idf2"))
@F.udf(returnType=FloatType())  # FloatType is used here without being imported
def cos_sim(u, v):
    return float(u.dot(v) / (u.norm(2) * v.norm(2)))
df.withColumn("cos_sim", cos_sim(F.col("idf"), F.col("idf2")))  # df is undefined; the cross-joined DataFrame is `combine`
1 Answer
Your code needs several corrections:
- You are importing the wrong `Pipeline`: sklearn's `Pipeline` cannot chain PySpark ML stages. The correct import is `from pyspark.ml import Pipeline`.
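The import fix in isolation:
# Wrong: sklearn's Pipeline expects scikit-learn estimators, not PySpark ML stages
# from sklearn.pipeline import Pipeline
# Correct:
from pyspark.ml import Pipeline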
Here is the working code.
Sample dataset used: