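I am new to PySpark. I want to compare the text of two different DataFrames (both containing news articles) in order to get recommendations.
I can do this quite easily in plain Python: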
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def get_recommendations(title, cosine_sim, indices):
    idx = indices[title]
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    print(sim_scores)
    # Sort the news articles by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the top hit and keep the next 10 most similar articles
    sim_scores = sim_scores[1:11]
    talk_indices = [i[0] for i in sim_scores]
    # Return the 10 selected articles
    return ted['News Data'].iloc[talk_indices]

# det and ted are pandas DataFrames of news articles, loaded elsewhere (not shown)
indices = pd.Series(det.index, index=det['Unnamed: 0']).drop_duplicates()
transcripts = det['News Data']
transcripts2 = ted['News Data']
# Fit the vocabulary and IDF weights on det, then reuse them for ted
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(transcripts)
tfidf_matrixx = tfidf.transform(transcripts2)
# TfidfVectorizer L2-normalizes rows by default, so the linear kernel equals cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrixx)
print(get_recommendations(0, cosine_sim, indices))
When I switched to PySpark, I got very different results from the TF-IDF computation.
This is what I use to compute TF-IDF in PySpark:
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

# sqlContext is assumed to be an existing SQLContext
df1 = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('bbcclear.csv')
df2 = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('yenisafakcategorypredict.csv')
# tokenize
tokenizer = Tokenizer().setInputCol("News Data").setOutputCol("word")
wordsData = tokenizer.transform(df2)
wordsData2 = tokenizer.transform(df1)
# vectorize: fit the vocabulary on df2 and reuse it for df1
vectorizer = CountVectorizer(inputCol='word', outputCol='vectorizer').fit(wordsData)
wordsData = vectorizer.transform(wordsData)
wordsData2 = vectorizer.transform(wordsData2)
# calculate scores
idf = IDF(inputCol="vectorizer", outputCol="tfidf_features")
idf_model = idf.fit(wordsData)
wordsData = idf_model.transform(wordsData)
# Reuse the IDF model fitted on df2 so both DataFrames share the same weights
# (refitting on df1 would put the two sets of vectors on different scales)
wordsData2 = idf_model.transform(wordsData2)
How can I use the TF-IDF vectors obtained above to compute cosine similarity and make recommendations?
1 Answer
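Below is an example of using TF-IDF in Spark, taken from a PoC task of mine. I would strongly recommend an advanced NLP framework such as BERT over TF-IDF if you want meaningful similarity scores.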
Example dataset:
TF-IDF vectorization and cosine similarity computation:
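As a minimal, Spark-free sketch of what this step computes, the plain Python below fits IDF weights on one set of documents, reuses them for a second set, and ranks the second set by cosine similarity against a query document. It uses the smoothed formula documented for Spark ML's `IDF`, log((N + 1) / (df + 1)). The helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`, `recommend`) and the toy documents are hypothetical and chosen only for illustration; this is not the answer's original code.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace (a simplification of a real tokenizer).
    return text.lower().split()


def fit_idf(token_lists):
    # Smoothed IDF as documented for Spark ML: log((N + 1) / (df + 1)).
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Raw term counts times IDF; terms unseen when fitting are dropped.
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    # Cosine similarity of two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query_text, candidates, idf, top_n=10):
    # Score every candidate against the query and keep the top_n by similarity.
    query_vec = tfidf_vector(tokenize(query_text), idf)
    scored = [
        (i, cosine(query_vec, tfidf_vector(tokenize(text), idf)))
        for i, text in enumerate(candidates)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy data standing in for the two news DataFrames.
reference_docs = [
    "stock markets fall as oil prices rise",
    "local team wins football final",
    "new phone released with better camera",
]
candidate_docs = [
    "football team celebrates cup final win",
    "oil prices push stock markets lower",
    "camera review of the new phone",
]

# Fit the IDF weights once on the reference set and reuse them for both sets.
idf = fit_idf([tokenize(doc) for doc in reference_docs])
top = recommend(reference_docs[0], candidate_docs, idf, top_n=2)
print(top)
```

The weights are fitted once and shared, so both sets of vectors live in the same space. In Spark, one possible equivalent (an assumption, not something the answer shows) would be to L2-normalize the `tfidf_features` column with `pyspark.ml.feature.Normalizer` and take dot products over a cross join of the two DataFrames.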