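I am trying to understand the “add” and “extract” methods of the FPTree class: (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/fpgrowth.scala).
What is the purpose of the “summaries” variable?
Where is the group list? I assume it is the following; is that right?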
val numParts = if (numPartitions > 0) numPartitions else data.partitions.length
val partitioner = new HashPartitioner(numParts)
What will summaries contain for, say, the three transactions {a,b,c}, {a,b}, {b,c}, where all items are frequent?
def add(t: Iterable[T], count: Long = 1L): FPTree[T] = {
  require(count > 0)
  var curr = root
  curr.count += count
  t.foreach { item =>
    val summary = summaries.getOrElseUpdate(item, new Summary)
    summary.count += count
    val child = curr.children.getOrElseUpdate(item, {
      val newNode = new Node(curr)
      newNode.item = item
      summary.nodes += newNode
      newNode
    })
    child.count += count
    curr = child
  }
  this
}

def extract(
    minCount: Long,
    validateSuffix: T => Boolean = _ => true): Iterator[(List[T], Long)] = {
  summaries.iterator.flatMap { case (item, summary) =>
    if (validateSuffix(item) && summary.count >= minCount) {
      Iterator.single((item :: Nil, summary.count)) ++
        project(item).extract(minCount).map { case (t, c) =>
          (item :: t, c)
        }
    } else {
      Iterator.empty
    }
  }
}
1 Answer
After a bit of experimenting, it turns out to be pretty straightforward:
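1+2) The partitioner is indeed what represents the groups. It is also what drives the computation of the conditional transactions.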
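To make the link between partitions and groups concrete, here is a minimal sketch of how one transaction could be split into per-group conditional transactions. It is not Spark's actual code: the object name `CondTransactionsSketch`, the `getPartition` helper (an assumed stand-in for a hash partitioner), the `String` item type, and the example ranks are all assumptions made for illustration.

```scala
import scala.collection.mutable

object CondTransactionsSketch {
  // Assumed stand-in for a hash partitioner: non-negative hash modulo group count.
  def getPartition(rank: Int, numParts: Int): Int =
    java.lang.Math.floorMod(rank.hashCode, numParts)

  // For one transaction, emit at most one prefix per group: the longest
  // rank-ordered prefix whose last item belongs to that group.
  def genCondTransactions(
      transaction: Seq[String],
      itemToRank: Map[String, Int],
      numParts: Int): Map[Int, Seq[Int]] = {
    val output = mutable.Map.empty[Int, Seq[Int]]
    // Drop infrequent items, replace the rest by their rank, sort by rank.
    val filtered = transaction.flatMap(item => itemToRank.get(item)).sorted
    var i = filtered.length - 1
    while (i >= 0) {
      val part = getPartition(filtered(i), numParts)
      if (!output.contains(part)) {
        output(part) = filtered.slice(0, i + 1)
      }
      i -= 1
    }
    output.toMap
  }

  // Hypothetical ranks: b is the most frequent item, then a, then c.
  val itemToRank: Map[String, Int] = Map("b" -> 0, "a" -> 1, "c" -> 2)

  // With 2 groups: group 0 gets ranks (0, 1, 2), group 1 gets ranks (0, 1).
  val cond: Map[Int, Seq[Int]] =
    genCondTransactions(Seq("a", "b", "c"), itemToRank, numParts = 2)
}
```

Each group only receives the prefix ending at its own last item, so every group can mine its suffix items without seeing the rest of the data.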
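summaries is just a helper that stores, for each item, its total count over the transactions added to the tree, together with the tree nodes that carry that item (see summary.count and summary.nodes in add). extract and project generate the frequent itemsets (FIs) recursively: project builds the conditional FP-tree for a suffix item, extract recurses into it, and summaries is checked first to decide whether that path needs to be traversed at all. For the three transactions above, inserted in the order written, summaries would hold a: 2, b: 3, c: 2; node 'a' has one child 'b', and 'c' appears under 'b' on both branches.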
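The behaviour of summaries can be checked with a small, Spark-free sketch. The `add` and `extract` bodies follow the code in the question (without `validateSuffix`), while `project`, the `Option`-based `Node`, and the names `FPTreeSketch` and `MiniFPTree` are assumptions written for this example, so treat it as an approximation rather than the library's implementation.

```scala
import scala.collection.mutable

object FPTreeSketch {
  final class Node[T](val item: Option[T], val parent: Option[Node[T]]) {
    var count: Long = 0L
    val children: mutable.Map[T, Node[T]] = mutable.Map.empty
  }

  // Per-item bookkeeping: total count plus every tree node carrying the item.
  final class Summary[T] {
    var count: Long = 0L
    val nodes: mutable.ListBuffer[Node[T]] = mutable.ListBuffer.empty
  }

  class MiniFPTree[T] {
    val root = new Node[T](None, None)
    val summaries: mutable.Map[T, Summary[T]] = mutable.Map.empty

    def add(t: Iterable[T], count: Long = 1L): this.type = {
      require(count > 0)
      var curr = root
      curr.count += count
      t.foreach { item =>
        val summary = summaries.getOrElseUpdate(item, new Summary[T])
        summary.count += count
        val child = curr.children.getOrElseUpdate(item, {
          val newNode = new Node[T](Some(item), Some(curr))
          summary.nodes += newNode
          newNode
        })
        child.count += count
        curr = child
      }
      this
    }

    // Assumed projection: rebuild a tree from the paths above each `suffix` node.
    def project(suffix: T): MiniFPTree[T] = {
      val tree = new MiniFPTree[T]
      summaries.get(suffix).foreach { summary =>
        summary.nodes.foreach { node =>
          var path = List.empty[T]
          var curr = node.parent
          while (curr.exists(_.item.isDefined)) {
            path = curr.get.item.get :: path
            curr = curr.get.parent
          }
          tree.add(path, node.count)
        }
      }
      tree
    }

    def extract(minCount: Long): List[(List[T], Long)] =
      summaries.toList.flatMap { case (item, summary) =>
        if (summary.count >= minCount) {
          (List(item), summary.count) ::
            project(item).extract(minCount).map { case (t, c) => (item :: t, c) }
        } else {
          Nil
        }
      }
  }

  val tree = new MiniFPTree[String]
  tree.add(List("a", "b", "c")).add(List("a", "b")).add(List("b", "c"))

  // item -> (count, number of nodes): a -> (2, 1), b -> (3, 2), c -> (2, 2)
  val counts: Map[String, (Long, Int)] =
    tree.summaries.map { case (k, s) => k -> (s.count, s.nodes.size) }.toMap

  // Frequent itemsets with minCount = 2, each listed suffix first.
  val itemsets: Set[(List[String], Long)] = tree.extract(2L).toSet
}
```

With `minCount = 2` the sketch yields a, b, c, (b, a) and (c, b); the pair a, c occurs only once and is pruned because its count in the projected tree's summaries is below the threshold.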