PySpark: iterate over rows and drop rows that contain a specified value

Posted by rt4zxlrg on 2023-08-02 in Spark

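I have a dataframe like this:
| Column B |
| ------------ |
| [{id: 1000, abbreviatedId: 1, name: "John", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "China", Capital: "Bejing"}, otherId: 400, language: "Cantonese", species: 23409, creature: "Human"}] |
| [{id: 2000, abbreviatedId: 2, name: "James", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "Russia", Capital: "Moscow"}, otherId: 500, language: "Russian", species: 12308, creature: "Human"}] |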
Before writing to an external location, how can I iterate over the rows of the dataframe and drop every row that has country: "China"?
I tried

if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
    df.write.format("delta").mode("overwrite").save("file://path/")

and/or

for row in df.rdd.collect():
    if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
      df.drop(row)

df.write.format("delta").mode("overwrite").save("file://path/")

bqucvtff #1

You can loop over the rows, look up the continent in each row, and then check the country inside it.
Here is sample code:

import pandas as pd

# Assuming df is a pandas DataFrame
# (a Spark DataFrame would first need df.toPandas())

# Collect the index labels of the rows to drop instead of mutating df while iterating
rows_to_drop = []
for index, row in df.iterrows():
    # Access the value in Column B, which contains a list of dictionaries
    items = row['Column B']

    # Mark the row if any dictionary in the list has country "China"
    if any(item['continent']['country'] == "China" for item in items):
        rows_to_drop.append(index)

# Drop the marked rows in one step
df = df.drop(rows_to_drop)

# Write the filtered DataFrame to an external location
# Example: writing to a CSV file
df.to_csv('output.csv', index=False)

Hope this helps.

8i9zcol2 #2

One way is to use the exists array function.

from pyspark.sql.functions import expr
from pyspark.sql import Row

df = spark.createDataFrame([
    [
      [
        Row(**{"id": 1000, "abbreviatedId": 1, "name": "John", "planet": "Earth", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 33, "country": "China", "Capital": "Bejing"}), "otherId": 400, "language": "Cantonese", "species": 23409, "creature": "Human"}), 
        Row(**{"id": 1001, "abbreviatedId": 2, "name": "Alex", "planet": "Mars", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 34, "country": "Japan", "Capital": "Tokyo"}), "otherId": 400, "language": "Japanese", "species": 23409, "creature": "Human"})
    ]
]], ["b"])

df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))

The syntax Row(**dict) creates a Row instance through argument unpacking.
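
To make the semantics of the filter predicate concrete, here is a minimal plain-Python sketch of what `not exists(b, x -> x.continent.country == 'China')` evaluates per row. This is not Spark code: the `rows` sample data and the `keep_row` helper are hypothetical names introduced only for illustration, and the nested dictionaries stand in for the structs in column `b`.

```python
# Plain-Python model of: not exists(b, x -> x.continent.country == 'China')
# Each "row" is a dict whose "b" value is a list of nested dicts (hypothetical sample data).
rows = [
    {"b": [{"id": 1000, "continent": {"country": "China"}},
           {"id": 1001, "continent": {"country": "Japan"}}]},
    {"b": [{"id": 2000, "continent": {"country": "Russia"}}]},
]

def keep_row(row, country="China"):
    """Return True when no element of row['b'] has the given country."""
    return not any(x["continent"]["country"] == country for x in row["b"])

# Keep only the rows in which no array element matches
kept = [row for row in rows if keep_row(row)]
print([[x["id"] for x in row["b"]] for row in kept])  # prints [[2000]]
```

Note that the predicate works on whole rows: the first row is dropped even though it also holds a "Japan" element, because one matching element is enough for exists to be true. If the goal were to remove only the matching elements and keep the rest of the array, a different operation would be needed.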
