Apache Spark: building a GraphX graph from a matrix

tkclm6bt  posted on 2023-10-23  in Apache

I have a matrix in a CSV file that looks like this:

   A  B  C  D
A  0  3  2  5
B -1  0  2  9
C -1 -1  0  8
D -1 -1 -1  0

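I want to build a graph in Spark using GraphX. I already have the vertices from another file, and now I am trying to create the edges from the values in the matrix, but I am stuck.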

val Vertices: RDD[(VertexId, String)] = data.map(_.split(",")).map { arr =>
      val id = arr(0)
      val place = arr(1)
      (id.toLong, place)
}

val edges: RDD[Edge[Double]] = edgesData.map(_.split(",")).map { arr =>
      val place1 = arr(0).toLong
      val place2 =
}

How can I create the edges from the matrix in the CSV file?

nnvyjq4y  #1

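GraphX requires a VertexId for every vertex, which is a unique Long identifier. With only the RDD API this takes a bit of gymnastics, but here is one way to do it.
I don't have the exact structure of your CSV file, so this is what I used: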

> cat matrix.csv 
,A,B,C,D
A,0,3,2,5
B,-1,0,2,9
C,-1,-1,0,8
D,-1,-1,-1,0

In the code below, comments mark the lines you should adapt, since your file may be slightly different.

import org.apache.spark.graphx.{Edge, VertexId}
import org.apache.spark.rdd.RDD

// sc is the SparkContext (predefined in spark-shell)
val data: RDD[String] = sc.textFile("matrix.csv")
// getting the list of vertex names based on the CSV header
// adapt this line to your file structure
val vertex_names: Array[String] = data.first.split(",").tail

val vertices: RDD[(VertexId, String)] = data
    // removing the header, adapt this line to your file structure
    .filter(! _.startsWith(","))
    .zipWithIndex
    .map{ case (line, id) => id -> line.split(",")(0) }.cache()

val vert_index: RDD[(String, VertexId)] = vertices.map(_.swap)

// and now the gymnastics: we create an RDD of edges keyed by vertex name,
// then join with vert_index twice to replace the names by their VertexId
// note: every cell is kept, including the -1 and 0 entries
val edges: RDD[Edge[Double]] = data
    .filter(! _.startsWith(","))
    .map(_.split(","))
    .map(arr => arr.head -> arr.tail.map(_.toDouble))
    .map{ case (rowName, weights) => rowName -> vertex_names.zip(weights) }
    .flatMapValues(_.toSeq) // this line generates one row per edge
    // key = row name, so this join resolves the source id
    .join(vert_index).map{ case (_, ((colName, weight), srcId)) => colName -> (weight, srcId) }
    // key = column name, so this join resolves the destination id
    .join(vert_index).map{ case (_, ((weight, srcId), dstId)) => Edge(srcId, dstId, weight) }
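
The row-to-edges expansion can also be checked on plain Scala collections, without a cluster. The following is only a minimal sketch under assumptions that the question does not state: that -1 means "no edge" and that the zeros on the diagonal should not become self-loops. `WeightedEdge`, `MatrixEdges` and `parse` are hypothetical names; `WeightedEdge` merely stands in for GraphX's `Edge` so that the sketch runs without Spark. Ids here come from the header position (A -> 0, B -> 1, ...), not from `zipWithIndex` over the rows.

```scala
// Hypothetical stand-in for org.apache.spark.graphx.Edge, so the sketch runs without Spark.
final case class WeightedEdge(srcId: Long, dstId: Long, attr: Double)

object MatrixEdges {
  // Turns the lines of the CSV adjacency matrix into one WeightedEdge per usable cell.
  def parse(lines: Seq[String]): Seq[WeightedEdge] = {
    // The header is ",A,B,C,D": drop the empty first cell to get the column names.
    val names = lines.head.split(",").tail.map(_.trim)
    // Assign ids by header position: A -> 0, B -> 1, ...
    val idOf: Map[String, Long] =
      names.zipWithIndex.map { case (name, i) => name -> i.toLong }.toMap

    for {
      line <- lines.tail
      cells = line.split(",").map(_.trim)
      (dstName, raw) <- names.zip(cells.tail).toSeq
      weight = raw.toDouble
      // Assumption: -1 means "no edge"; diagonal cells are skipped to avoid self-loops.
      if weight != -1.0 && cells.head != dstName
    } yield WeightedEdge(idOf(cells.head), idOf(dstName), weight)
  }
}
```

For the sample matrix.csv above, this keeps the six cells above the diagonal. In the Spark version, the same condition could be applied as a filter right after the flatMapValues step, and the resulting vertices and edges RDDs can then be passed to Graph(vertices, edges).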
