PySpark: iterate over rows and drop rows with a specified value

Posted by rt4zxlrg on 2023-08-02 in Spark


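I have a DataFrame like this:
| Column B |
| -------- |
| [{id: 1000, abbreviatedId: 1, name: "John", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "China", Capital: "Bejing"}, otherId: 400, language: "Cantonese", species: 23409, creature: "Human"}] |
| [{id: 2000, abbreviatedId: 2, name: "James", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "Russia", Capital: "Moscow"}, otherId: 500, language: "Russian", species: 12308, creature: "Human"}] |

Before writing to an external location, how can I iterate over the DataFrame's rows and drop every row that has country: "China"?
I tried: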

if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
    df.write.format("delta").mode("overwrite").save("file://path/")

and/or

for row in df.rdd.collect():
    if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
      df.drop(row)

df.write.format("delta").mode("overwrite").save("file://path/")


pyspark

Source: https://stackoverflow.com/questions/76657664/pyspark-iterate-rows-and-drop-rows-with-specified-value

2 Answers


bqucvtff #1

You can loop over the rows, look up the continent in each row, and then check the country inside it.
Here is some sample code:

import pandas as pd

# Assumes df is a pandas DataFrame; iterrows() is not available on a pyspark.sql.DataFrame

# Iterate through the rows of the DataFrame
for index, row in df.iterrows():
    # Access the value in Column B, which contains the dictionary
    dict_value = row['Column B']
    
    # Check if the 'country' key in the dictionary is "China"
    if dict_value[0]['continent']['country'] == "China":
        # Drop the row if the condition is met
        df.drop(index, inplace=True)

# After iterating through all the rows, write the DataFrame to an external location
# Example: Writing to a CSV file
df.to_csv('output.csv', index=False)

Hope this helps.
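The pandas loop above inspects only the first element of each array (`dict_value[0]`). As a minimal sketch of the same row-wise check in plain Python, assuming each row is a list of dicts shaped like the question's data, the version below drops a row when any element matches. The names `has_country` and `drop_rows_with_country` are hypothetical, and the sample rows are abbreviated from the question.

```python
from typing import Any, Dict, List

Element = Dict[str, Any]


def has_country(elements: List[Element], country: str) -> bool:
    """Return True if any element's continent.country equals `country`."""
    return any(
        element.get("continent", {}).get("country") == country
        for element in elements
    )


def drop_rows_with_country(rows: List[List[Element]], country: str) -> List[List[Element]]:
    """Build a new list of rows, keeping only rows with no matching element."""
    return [row for row in rows if not has_country(row, country)]


# Abbreviated sample data (assumption: same nesting as "Column B" in the question)
rows = [
    [{"id": 1000, "name": "John", "continent": {"id": 33, "country": "China"}}],
    [{"id": 2000, "name": "James", "continent": {"id": 33, "country": "Russia"}}],
]

kept = drop_rows_with_country(rows, "China")
```

Building a new list rather than deleting entries while looping avoids mutating the collection that is being iterated.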

2023-08-02

8i9zcol2 #2

One approach is to use the exists array function.

from pyspark.sql.functions import expr
from pyspark.sql import Row

df = spark.createDataFrame([
    [
      [
        Row(**{"id": 1000, "abbreviatedId": 1, "name": "John", "planet": "Earth", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 33, "country": "China", "Capital": "Bejing"}), "otherId": 400, "language": "Cantonese", "species": 23409, "creature": "Human"}), 
        Row(**{"id": 1001, "abbreviatedId": 2, "name": "Alex", "planet": "Mars", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 34, "country": "Japan", "Capital": "Tokyo"}), "otherId": 400, "language": "Japanese", "species": 23409, "creature": "Human"})
    ]
]], ["b"])

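# Keep only rows whose array b has no element with continent.country == 'China'
filtered_df = df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
filtered_df.show(truncate=False)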

The syntax Row(**dict) creates a Row instance through argument unpacking.
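To connect this filter with the write step from the question, here is a minimal sketch. The helper names `country_absent_condition` and `write_without_country` are hypothetical. The Delta format and overwrite mode are taken from the question's own write call, and the country value is assumed to contain no quotes or backslashes.

```python
def country_absent_condition(array_col: str, country: str) -> str:
    """Build a Spark SQL predicate that is true when no element of
    `array_col` has continent.country equal to `country`."""
    # Assumption: reject values that would need escaping in a SQL string literal
    if "'" in country or "\\" in country:
        raise ValueError("country must not contain quotes or backslashes")
    return f"not exists({array_col}, x -> x.continent.country == '{country}')"


def write_without_country(df, array_col: str, country: str, path: str) -> None:
    """Filter `df` with the predicate above and write the result as Delta."""
    # Imported here so the string helper above has no Spark dependency
    from pyspark.sql.functions import expr

    filtered = df.filter(expr(country_absent_condition(array_col, country)))
    filtered.write.format("delta").mode("overwrite").save(path)
```

With the DataFrame from this answer, a call might look like `write_without_country(df, "b", "China", "file://path/")`. The filter runs on the executors, so no rows need to be collected to the driver.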

2023-08-02
