I have an RDD that looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
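I want to compute the mean and the maximum of the last value in each row (1000, 2000, etc.), separately for each distinct value of the second entry (City/Metro). I am using the following code to collect the "City" values: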
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
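However, I get an error, probably because of the string values in that column ("Unkown", for example).
How can I filter out the rows that contain strings or empty values (that is, keep only the rows whose last value can be converted to a number) and then compute the maximum and the mean?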
1 Answer
7nbnzgx91#
Try this.
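Here is a minimal sketch of one way to do it; it is not necessarily the code originally posted with this answer. The idea is to keep only the rows whose fourth field parses as a finite number, then accumulate (sum, count, max) per category with aggregateByKey. The helper names (to_float, has_numeric_last, add_value, merge, stats_by_category, run_example) are made up for illustration, and the sketch assumes the last row of the sample really has only three fields.

```python
import math

# Accumulator layout: (running sum, count, running max)
ZERO = (0.0, 0, float("-inf"))


def to_float(value):
    """Return value as a finite float, or None if it cannot be converted."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # float() also accepts "nan" and "inf"; treat those as non-numeric here
    return number if math.isfinite(number) else None


def has_numeric_last(row):
    """True if the row has a fourth field and that field is numeric."""
    return len(row) > 3 and to_float(row[3]) is not None


def add_value(acc, value):
    """Fold one value into a (sum, count, max) accumulator."""
    return (acc[0] + value, acc[1] + 1, max(acc[2], value))


def merge(left, right):
    """Combine two (sum, count, max) accumulators from different partitions."""
    return (left[0] + right[0], left[1] + right[1], max(left[2], right[2]))


def stats_by_category(rdd):
    """Return {category: (mean, max)} using only rows with a numeric last field."""
    pairs = rdd.filter(has_numeric_last).map(lambda row: (row[1], float(row[3])))
    totals = pairs.aggregateByKey(ZERO, add_value, merge)
    return totals.mapValues(lambda acc: (acc[0] / acc[1], acc[2])).collectAsMap()


def run_example():
    """Run the sketch on the sample data from the question (needs pyspark)."""
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rows = [
        ["3331/587", "Metro", "1235", "1000"],
        ["1234/232", "City", "8479", "2000"],
        ["5987/215", "Metro", "1111", "Unkown"],
        ["8794/215", "Metro", "1112", "1000"],
        ["1254/951", "City", "6598", "XXXX"],
        ["1584/951", "City", "1548", "Unkown"],
        ["1833/331", "Metro", "1009", "2000"],
        ["2213/987", "City", "1197"],
    ]
    print(stats_by_category(sc.parallelize(rows)))
```

Calling run_example() in an environment with pyspark installed should report a mean of about 1333.33 and a max of 2000 for Metro, and a mean and max of 2000 for City, since "XXXX", "Unkown" and the short row are dropped. If you only need one category, the same filter works with your original line: rdd.filter(lambda row: row[1] == 'City' and has_numeric_last(row)).map(lambda x: float(x[3])), after which .mean() and .max() can be called on the result.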