python-3.x: What is an efficient way to find missing rows in a DataFrame and set NaN for the columns?

Posted by gzszwxb4 on 2022-12-01 in Python


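Suppose I have a DataFrame whose first column is a datetime and whose other columns hold the data for that datetime (data is collected once per hour, so the first column of each row is one hour later than the previous row). In this DataFrame, the data for some datetimes is missing. I want to create a new DataFrame in which each missing row is filled in with the corresponding datetime and NaN in the other columns.
I tried reading the DataFrame from a CSV as the first DataFrame, then building an empty second DataFrame in a loop that generates a datetime for every hour in chronological order. I then copy the data from the first DataFrame into the second, and if the first has no data for a given datetime, I put NaN in that row.
This works for me, but it is very slow: it takes 3 days to run on 70,000 rows. I assume there is an efficient way to do this.
I think there is a better approach, like this one, but I need it for datetimes.
I'm looking for an answer similar to "Replacing one data frame value from another based on timestamp Criterion", but using only the datetime.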

python-3.x

Source: https://stackoverflow.com/questions/74617766/what-is-the-efficient-way-to-find-missing-rows-of-a-dataframe-and-put-nan-for-co

2 Answers


abithluo1#

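I think you can create a df that has the timestamps as its index.
You can then use pd.date_range to create a complete hourly datetime range (from the minimum to the maximum).
Next, run Index.difference to efficiently find any timestamps missing from the original DataFrame -> these become the index of a new df that holds the missing rows.
Finally, fill the missing columns with NaN.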

import pandas as pd
import numpy as np

# name of your datetime column
datetime_col = 'datetime'
 
# mock up some data
data = {
    datetime_col: [
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00'],
    'extra_col1': ['b', 'c', 'd', 'e'],
    'extra_col2': ['g', 'h', 'i', 'j'],
}

df = pd.DataFrame(data)
 
# Setting the Date values as index
df = df.set_index(datetime_col)
 
# to_datetime() method converts string
# format to a DateTime object
df.index = pd.to_datetime(df.index)
 
# create df of missing dates from the sequence
# starting from min datetime, to max, with hourly intervals
new_df = pd.DataFrame(
    pd.date_range(
        start=df.index.min(),
        end=df.index.max(),
        freq='h'  # 'H' is deprecated in recent pandas releases
    ).difference(df.index)
)

# you will need to add these columns to your df
missing_columns = [col for col in df.columns if col!=datetime_col]

# add null data
new_df[missing_columns] = np.nan

# fix column names
new_df.columns = [datetime_col] + missing_columns

new_df
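
If the goal is the complete hourly frame rather than only the missing rows, the same full range can be passed to DataFrame.reindex, which adds a NaN row for every timestamp that was absent. This is only a minimal sketch reusing the mock data above. It is an alternative to the snippet above, not part of the original answer, and the names `full_index` and `complete_df` are illustrative. It also assumes the timestamps are unique, since reindex raises an error on duplicate index labels.

```python
import pandas as pd

datetime_col = 'datetime'

# same mock data as above
data = {
    datetime_col: [
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00'],
    'extra_col1': ['b', 'c', 'd', 'e'],
    'extra_col2': ['g', 'h', 'i', 'j'],
}

df = pd.DataFrame(data)
df[datetime_col] = pd.to_datetime(df[datetime_col])
df = df.set_index(datetime_col)

# every hour between the first and last observed timestamp
full_index = pd.date_range(
    start=df.index.min(),
    end=df.index.max(),
    freq='h',
    name=datetime_col,
)

# rows missing from df appear with NaN in every data column
complete_df = df.reindex(full_index).reset_index()

print(complete_df)
```

For a regular hourly series with unique timestamps, `df.asfreq('h')` should give an equivalent result in one call.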

Answered 2022-12-01

vd2z7a6w2#

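I'm not sure I fully understand your requirements, in particular the frequency of the datetimes you are trying to fill in, but assuming it is hourly, you can try the following:
1. Use the pd.date_range(start_date, end_date, freq='H') function from pandas to create a pandas DataFrame containing every hourly timestamp you need, including the missing ones (one column, with the same name as the first column of the initial DataFrame). https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
2. Use the pd.merge(initial_df, complete_df, how='outer') function to perform an outer merge between the two DataFrames. If I'm not mistaken, for every date that is absent from the initial DataFrame, all the other columns should be filled with NA by default.
Below is a reproducible example that reuses Matt's mock data: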

import pandas as pd
import numpy as np
 
# mock up some data
data = {
    'date': [
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00'],
    'extra_col1': ['b', 'c', 'd', 'e'],
    'extra_col2': ['g', 'h', 'i', 'j'],
}

df = pd.DataFrame(data)
 
# Use to_datetime() method to convert string
# format to a DateTime object
df['date'] = pd.to_datetime(df['date'])
 
# Create a df with every timestamp in the sequence,
# from the min datetime to the max, at hourly intervals
new_df = pd.DataFrame(
    {'date': pd.date_range(
        start=df['date'].min(),
        end=df['date'].max(),
        freq='h'  # 'H' is deprecated in recent pandas releases
    )}
)

# Use the merge function to perform an outer merge on the date column,
# then sort the rows chronologically
result_df = pd.merge(df, new_df, on='date', how='outer')
result_df.sort_values(by='date', ascending=True, inplace=True)
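
To check which rows the outer merge actually added, pd.merge accepts indicator=True, which appends a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The following is a hedged sketch on the same mock data (one data column only, for brevity); the names `complete_df` and `filled` are illustrative and not from the answer above.

```python
import pandas as pd

# same mock data as above, with a single data column
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00']),
    'extra_col1': ['b', 'c', 'd', 'e'],
})

# every hourly timestamp between the first and last observation
complete_df = pd.DataFrame({
    'date': pd.date_range(df['date'].min(), df['date'].max(), freq='h')
})

# the outer merge keeps every timestamp; _merge records where each row came from
result_df = (
    pd.merge(df, complete_df, on='date', how='outer', indicator=True)
    .sort_values('date')
    .reset_index(drop=True)
)

# rows present only in the complete range are the ones that were missing
filled = result_df[result_df['_merge'] == 'right_only']

print(filled['date'].tolist())
```

Dropping the `_merge` column afterwards with `result_df.drop(columns='_merge')` leaves the same frame the plain outer merge would produce.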

Answered 2022-12-01
