pandas: How to compare rows from the current dataset with the previous dataset in Palantir Foundry

Posted by t40tm48m on 2023-06-28 in Other

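Is there a way in a Foundry code repository to find the row count of the raw dataset and compare it with the clean dataset? I'm trying to add a data expectation as one of the health checks to see whether the raw dataset is less than 1% of the previous dataset and, if so, abort the build.
Any help is appreciated.
There is no direct health check option that lets me do this, and Palantir support told me to try a data expectation check through Python transform code. But I'm not an expert in Python data frame concepts.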

pandas

Source: https://stackoverflow.com/questions/76528006/how-to-compare-rows-from-current-dataset-with-previous-dataset-in-palantir-found

1 Answer

cigdeys31#

Try this code:

import sys

import pandas as pd

# Step 1: Read the raw dataset and the clean dataset into data frames
raw_df = pd.read_csv("raw_dataset.csv")
clean_df = pd.read_csv("clean_dataset.csv")

# Step 2: Calculate the row count for both datasets
raw_row_count = len(raw_df)
clean_row_count = len(clean_df)

# Step 3: Express the raw row count as a percentage of the comparison row count
# (here the clean dataset stands in for the "previous" dataset from the question)
if clean_row_count == 0:
    # Avoid dividing by zero when there is nothing to compare against
    sys.exit("Aborting build: the comparison dataset has no rows.")
raw_percentage = raw_row_count / clean_row_count * 100

# Step 4: Compare the percentage with the threshold and take the appropriate action
threshold = 1.0  # Adjust this threshold as needed

if raw_percentage < threshold:
    # sys.exit with a message writes it to stderr and exits with status 1
    sys.exit("Aborting build: Raw dataset is less than 1% of the previous dataset.")
else:
    print("Data expectation check passed.")

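Remember to replace "raw_dataset.csv" and "clean_dataset.csv" with the actual paths to your raw and clean datasets.
Make sure you have the necessary permissions and access to the datasets in your Foundry environment to perform these operations.
Note that this code assumes the datasets are in CSV format. If your datasets are in a different format, you may need to adjust the pd.read_csv() calls accordingly.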
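
If the check needs to be reused, or if failing with an exception suits the build better than exiting the process, the comparison could be wrapped in a small helper. This is only a sketch: the names check_row_share and RowCountCheckError are made up for illustration and are not part of pandas or Foundry, and it assumes "less than 1%" means the current row count divided by the previous row count.

```python
class RowCountCheckError(Exception):
    """Raised when the current row count is too small relative to the previous one."""


def check_row_share(current_rows: int, previous_rows: int, min_percent: float = 1.0) -> float:
    """Return current_rows as a percentage of previous_rows.

    Raises RowCountCheckError if that percentage is below min_percent,
    or if previous_rows is zero (nothing to compare against).
    """
    if current_rows < 0 or previous_rows < 0:
        raise ValueError("row counts must be non-negative")
    if previous_rows == 0:
        raise RowCountCheckError("previous dataset has no rows to compare against")

    share = current_rows / previous_rows * 100
    if share < min_percent:
        raise RowCountCheckError(
            f"current dataset has {share:.2f}% of the previous row count "
            f"(minimum allowed: {min_percent}%)"
        )
    return share
```

With pandas data frames, the counts would come from len(), for example check_row_share(len(raw_df), len(clean_df)). An uncaught exception makes a Python process exit with a non-zero status; whether that is enough to abort the build in your Foundry setup is an assumption worth confirming.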

Answered 2023-06-28
