How do I find similar rows by matching column values?

Posted by z31licg0 on 2021-05-27 in Spark

I have a dataset like this:

{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}

I have already flattened the data:

+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
|   att-a|   att-b|   att-c|   att-d|   att-e|   att-f|   att-g|   att-h|   att-i|   att-j|   customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
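
The flattening step itself is not shown in the question. As a rough sketch only, assuming the JSON was loaded with `spark.read.json` so that `attributes` is inferred as a struct column, it could look like the following (`FlattenSketch` and `flatten` are hypothetical names, not from the original post):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object FlattenSketch {
  // Assumption: `attributes` is a struct column, as inferred by spark.read.json.
  // "attributes.*" expands every struct field into its own top-level column,
  // and `customer` is kept as the last column, matching the table above.
  def flatten(raw: DataFrame): DataFrame =
    raw.select(col("attributes.*"), col("customer"))
}
```

If `attributes` were a map column instead of a struct, the star expansion would not apply and the keys would have to be selected explicitly.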

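I want to complete the compareColumns function. It should compare the columns of two DataFrames (userDF and flattedDF) and return a new DataFrame like the sample output.
How can I do this? For example, should I compare every row of flattedDF against userDF column by column and increment a counter on each match, e.g. att-a with att-a, att-b with att-b?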

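import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep only the row(s) belonging to the given customer.
def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame =
  dataFrame.filter(col("customer") === customerID)

// Incomplete: so far this only extracts the reference customer's row.
def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
  val userDF = dataFrame.transform(getCustomer(customerID))
  userDF.printSchema()
  userDF
}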

Sample output:

+-------------+------------------+
| customer    | similarity_score |
+-------------+------------------+
| customer-1  | -1               |
| customer-12 | 2                |
| customer-3  | 2                |
| customer-44 | 5                |
| customer-5  | 1                |
| customer-6  | 10               |
+-------------+------------------+

customer-1 is the reference customer itself, so it gets -1 and should be ignored.

Thanks.
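
One possible way to complete `compareColumns`, offered as a minimal sketch and not a definitive answer: pull the reference customer's row to the driver, then build one 0/1 expression per attribute column and sum them. It assumes every column other than `customer` is a string attribute and that the customer ID occurs exactly once; the object name `SimilaritySketch` is hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

object SimilaritySketch {
  def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
    // Assumption: everything except `customer` is an attribute to compare.
    val attributeCols = dataFrame.columns.filter(_ != "customer")

    // Assumption: the ID exists; head() throws NoSuchElementException otherwise.
    val reference = dataFrame.filter(col("customer") === customerID).head()

    // For each attribute, 1 if the row's value equals the reference value, else 0.
    // Backticks keep names such as "att-a" from being misparsed.
    val matchCount = attributeCols
      .map { name =>
        val refValue = reference.getAs[String](name)
        when(col(s"`$name`") === lit(refValue), 1).otherwise(0)
      }
      .foldLeft(lit(0))(_ + _)

    // The reference customer is marked with -1, as in the sample output.
    dataFrame.select(
      col("customer"),
      when(col("customer") === customerID, -1)
        .otherwise(matchCount)
        .as("similarity_score")
    )
  }
}
```

With the four sample rows from the question, `flattedDF.transform(SimilaritySketch.compareColumns("customer-2"))` should give customer-4 a score of 2, because only att-a and att-d match. Since the reference row is collected once and turned into literals, no join between the two DataFrames is needed.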
