Apache Spark 比较非常大的框架

ykejflvf 于 2023-10-23 发布在 Apache

关注(0)|答案(1)|浏览(133)

我试图比较两个非常大的内存块，每个都有大约10 PB的数据在Spark中。即使在增加内存配置后，执行也会抛出内存问题。有人能提出更好的替代方案来解决这个问题吗？
我使用的方法：
1.为每个嵌套框生成row_hashes

diff = df.select（'row_hash'）-df1.select（'row_hash'）
diff.join（df，df.columns.toSeq，“inner”）

apache-spark

来源：https://stackoverflow.com/questions/77229772/compare-very-large-dataframes

1条答案

按热度按时间

qzwqbdag1#

您可以使用Spark的LSH实现将两个嵌套框散列到具有非常严格的相似性度量的较低维度。
关于如何做到这一点的一些基本代码：

from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

# Create feature vectors
df = df.withColumn("features", Vectors.dense(col("feature_col1"), col("feature_col2")))
df1 = df1.withColumn("features", Vectors.dense(col("feature_col1"), col("feature_col2")))

# Initialize LSH model
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)

# Fit the model and transform data
model = mh.fit(df)
df_hashed = model.transform(df)
df1_hashed = model.transform(df1)

# Approximate join to find potential matches
approx_joined_df = model.approxSimilarityJoin(df_hashed, df1_hashed, 0.1, distCol="JaccardDistance")

# Filter based on a distance threshold for very very similar items
filtered_df = approx_joined_df.filter("JaccardDistance < 0.05")

这种方法肯定比直接比较要好。但是，你可能会发现，对于你所说的那种规模，你肯定也需要使用配置设置。

赞(0）回复(0）举报 2023-10-23

我来回答

Apache Spark 比较非常大的框架

1条答案

相关问题

热门标签

最新问答