spark中的并行fp增长

i7uaboj4 于 2021-05-27 发布在 Spark

关注(0)|答案(1)|浏览(514)

我试图理解fptree类的“add”和“extract”方法：(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/fpgrowth.scala).
“summaries”变量的用途是什么？
小组名单在哪里？我假设是这样的，对吗

val numParts = if (numPartitions > 0) numPartitions else data.partitions.length
val partitioner = new HashPartitioner(numParts)

对于{a，b，c}，{a，b}，{b，c}这三个经常发生的事务，摘要将包含什么？

def add(t: Iterable[T], count: Long = 1L): FPTree[T] = {
  require(count > 0)
  var curr = root
  curr.count += count
  t.foreach { item =>
    val summary = summaries.getOrElseUpdate(item, new Summary)
    summary.count += count
    val child = curr.children.getOrElseUpdate(item, {
      val newNode = new Node(curr)
      newNode.item = item
      summary.nodes += newNode
      newNode
    })
    child.count += count
    curr = child
  }
  this
}

def extract(
    minCount: Long,
    validateSuffix: T => Boolean = _ => true): Iterator[(List[T], Long)] = {
  summaries.iterator.flatMap { case (item, summary) =>
    if (validateSuffix(item) && summary.count >= minCount) {
      Iterator.single((item :: Nil, summary.count)) ++
        project(item).extract(minCount).map { case (t, c) =>
          (item :: t, c)
        }
    } else {
      Iterator.empty
    }
  }
}

scala apache-spark fpgrowth

来源：https://stackoverflow.com/questions/63173902/parallel-fp-growth-in-spark

1条答案

按热度按时间

hwamh0ep1#

经过一点实验，它是非常直接的：
1+2）分区确实是组的代表。它也是条件交易的计算方式：

private def genCondTransactions[Item: ClassTag](
      transaction: Array[Item],
      itemToRank: Map[Item, Int],
      partitioner: Partitioner): mutable.Map[Int, Array[Int]] = {
    val output = mutable.Map.empty[Int, Array[Int]]
    // Filter the basket by frequent items pattern and sort their ranks.
    val filtered = transaction.flatMap(itemToRank.get)
    ju.Arrays.sort(filtered)
    val n = filtered.length
    var i = n - 1
    while (i >= 0) {
      val item = filtered(i)
      val part = partitioner.getPartition(item)
      if (!output.contains(part)) {
        output(part) = filtered.slice(0, i + 1)
      }
      i -= 1
    }
    output
  }

摘要只是保存事务中项目计数的一个助手，extract/project将使用上/下递归和相关fp树（project）生成fi，同时检查摘要是否需要遍历该路径。节点'a'的摘要将有{b:2，c:1}，节点'a'的子节点是'b'和'c'。

赞(0）回复(0）举报 2021-05-27

我来回答

spark中的并行fp增长

1条答案

相关问题

热门标签

最新问答