PySpark DataFrame filter or include based on list

Posted by ebdffaop on 2023-10-15 in Spark


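I am trying to filter a DataFrame in PySpark using a list. I want to either filter out records based on the list, or include only the records whose value appears in the list. The code below does not work: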

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

It gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

pyspark

Source: https://stackoverflow.com/questions/40421845/pyspark-dataframe-filter-or-include-based-on-list

3 Answers

wa7juj8i1#

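The error is saying that "df.score in l" cannot be evaluated: df.score gives you a Column, and "in" is not defined on that Column type. Use "isin" instead.
The code should look like this: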

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records by scores by list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records with these scores in list l
records = df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

Note that where() is an alias for filter(), so the two are interchangeable.
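To illustrate why the original expression fails, here is a toy sketch in plain Python. ToyColumn is a hypothetical stand-in written for this example, not PySpark's real Column class; it only mimics the two behaviors that matter here: comparison builds a new expression instead of returning True or False, and converting an expression to bool raises.

```python
class ToyColumn:
    """Toy stand-in for a lazily evaluated column expression (not PySpark's Column)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparison builds a new expression instead of returning True/False.
        return ToyColumn(f"({self.expr} = {other!r})")

    def __bool__(self):
        # An expression has no single truth value until it is evaluated per row.
        raise ValueError("Cannot convert column into bool")

    def isin(self, values):
        # Membership is expressed as another expression, so nothing is coerced to bool.
        return ToyColumn(f"({self.expr} IN ({', '.join(map(repr, values))}))")


score = ToyColumn("score")
l = [10, 18, 20]

try:
    # List membership compares each item with ==, then calls bool() on the result.
    score in l
    error = None
except ValueError as exc:
    error = str(exc)

# isin() stays in expression land and never needs a Python bool.
membership = score.isin(l)
print(error)
print(membership.expr)
```

The same reasoning applies to "and", "or" and "not", which is why the error message points to '&', '|' and '~' instead.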

Answered 2023-10-15

vh0rcniy2#

Building on @user3133475's answer, you can also call isin() on col() like this:

from pyspark.sql.functions import col

l = [10,18,20]
df.filter(col("score").isin(l))

Answered 2023-10-15

qco9c6ql3#

I found that a join-based implementation is much faster than where for large DataFrames:

from pyspark.sql import SparkSession

def filter_spark_dataframe_by_list(df, column_name, filter_list):
    """ Returns subset of df where df[column_name] is in filter_list """
    spark = SparkSession.builder.getOrCreate()
    # A list of plain values with an atomic dataType yields a single column named "value"
    filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
    return df.join(filter_df, df[column_name] == filter_df["value"])

Answered 2023-10-15
