pandas: return a set number of characters after splitting a DataFrame column

lpwwtiir  posted on 2023-03-28 in Other

I have a DataFrame in which one column contains strings in the following format:
[{"assignedAgentId": "15", "assignedFullName": "Virtual Assistant", "interactionTime": "2023-03-15 18:39:42.443+0000", "interactiveSequence": 1, "dialogId": "fcf12345-e2d-4e2c-be5t-ed2bcf1831gh"}, {"assignedAgentId": "16", "assignedAgentFullName": "S", "interactionTime": "2023-03-16 08:00:50.500+0000", "dialogId": "fcf12345-e2d-4e2c-be5t-ed2bcf1831gh"}]
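I would like to extract the first 19 characters of the value that follows the last occurrence of "interactionTime" into a new column, so the example above would return:
2023-03-16 08:00:50
I know I can split a string from the right using: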

result=line.rsplit('interactionTime',1)
print('This is the result:', result[1])

However, I am not sure how to return the first 19 characters that follow and store them in a new DataFrame column named "Interaction Time".
I tried:

df2=df1['interaction Time'].str.split("InteractionTime", n=1, expand=True)

However, this returns a column of NaN.


2ic8powd1#

You can do it like this.

import pandas as pd

data = ['2023-03-15 18:39:42.443+0000', '2023-03-16 08:00:50.500+0000']
df = pd.DataFrame(data, columns=['interactionTime'])
df2 = pd.DataFrame()

# Keep only the first 19 characters (YYYY-MM-DD HH:MM:SS) of each timestamp string
df2['InteractionTime'] = df['interactionTime'].str[:19]

print(df2)

Output

       InteractionTime
0  2023-03-15 18:39:42
1  2023-03-16 08:00:50

uxhixvfz2#

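The column appears to hold a perfectly valid JSON list.
You can load the JSON into a Python structure (a list of dicts), take the last item (index [-1]), and then take the first 19 characters of its interactionTime value ([:19]):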

import json

import pandas as pd

df = pd.DataFrame({
    "ID": [1],
    "weird_col": [
        '[{"assignedAgentId": "15", "assignedFullName": "Virtual Assistant", "interactionTime": "2023-03-15 18:39:42.443+0000", "interactiveSequence": 1, "dialogId": "fcf12345-e2d-4e2c-be5t-ed2bcf1831gh"}, {"assignedAgentId": "16", "assignedAgentFullName": "S", "interactionTime": "2023-03-16 08:00:50.500+0000", "dialogId": "fcf12345-e2d-4e2c-be5t-ed2bcf1831gh"}]'
    ],
})

df['dates'] = df['weird_col'].apply(
    lambda x: json.loads(x)[-1]['interactionTime'][:19]
)
print(df['dates'])

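That said, once time zones, offsets, and the like have been handled and you have a perfectly good datetime object, are you sure you want to throw away all that useful information? You know your use case best, of course, but it seems a shame when you could simply use pd.to_datetime: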

df['dates'] = df['weird_col'].apply(
    lambda x: pd.to_datetime(json.loads(x)[-1]['interactionTime'])
)
print(df['dates'])
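
If some cells are empty, malformed, or already parsed into Python lists, the lambda above would raise an exception. The following is only a minimal sketch of a more defensive variant. It assumes the column is called weird_col as in this answer; the helper name last_interaction_time is hypothetical and not part of pandas.

```python
import json

import pandas as pd


def last_interaction_time(cell):
    """Return the last 'interactionTime' in a cell as a Timestamp, or NaT."""
    try:
        # Accept either a JSON string or an already-parsed list of dicts.
        items = json.loads(cell) if isinstance(cell, str) else cell
        return pd.to_datetime(items[-1]["interactionTime"])
    except (TypeError, ValueError, KeyError, IndexError):
        # Covers None, invalid JSON, empty lists, and missing keys.
        return pd.NaT


df = pd.DataFrame({
    "weird_col": [
        '[{"interactionTime": "2023-03-15 18:39:42.443+0000"}, '
        '{"interactionTime": "2023-03-16 08:00:50.500+0000"}]',
        [{"interactionTime": "2023-03-16 08:00:50.500+0000"}],
        "not json",
        None,
    ]
})
df["dates"] = df["weird_col"].apply(last_interaction_time)
print(df["dates"])
```

Rows that cannot be parsed end up as NaT, so they can be found afterwards with df["dates"].isna().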

ehxuflar3#

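Consider parsing them as datetime instances!

  • No need to discard sub-second data in case you need it later
  • Handles multiple time zones (potentially very important)
  • Easy to control the display format
  • Lets you do offset arithmetic and build further calculations on it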
>>> df["interactionTime"] = pd.to_datetime(df["interactionTime"])

Custom display format

>>> df["interactionTime"].dt.strftime("%Y-%m-%d %H:%M:%S")  # custom display
0    2023-03-15 18:39:42
1    2023-03-16 08:00:50
Name: interactionTime, dtype: object

An example of how convenient it is to compute the number of seconds elapsed since a start date

>>> (df["interactionTime"] - pd.to_datetime("2023-03-01 00:00:00+0000")).dt.total_seconds()
0    1276782.443
1    1324850.500
Name: interactionTime, dtype: float64
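
The time zone and arithmetic points from the list above can be sketched in the same way. This is an illustration only: the fixed UTC+01:00 target zone is an arbitrary assumption, and the names local and gap are made up for the example.

```python
from datetime import timedelta, timezone

import pandas as pd

df = pd.DataFrame({
    "interactionTime": [
        "2023-03-15 18:39:42.443+0000",
        "2023-03-16 08:00:50.500+0000",
    ]
})
df["interactionTime"] = pd.to_datetime(df["interactionTime"])

# Convert the UTC timestamps to a fixed UTC+01:00 offset (arbitrary choice).
local = df["interactionTime"].dt.tz_convert(timezone(timedelta(hours=1)))

# Time elapsed between consecutive interactions; the first row has no predecessor.
gap = df["interactionTime"].diff()

print(local)
print(gap)
```

Neither operation is possible on the truncated 19-character strings without parsing them again first.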

kupeojn64#

Rather than extracting the first 19 characters, you should consider converting the time format, as follows:

import json
from io import StringIO

import pandas as pd

datas = [{"assignedAgentId": "15", "assignedFullName": "Virtual Assistant", 
         "interactionTime": "2023-03-15 18:39:42.443+0000", "interactiveSequence": 1, 
         "dialogId": "fcf12345-e2d-4e2c-be5t-ed2bcf1831gh"}, 
        {"assignedAgentId": "16", "assignedAgentFullName": "S", 
         "interactionTime": "2023-03-16 08:00:50.500+0000", 
         "dialogId": "fcf12345-e2d-4e2c-be5t-ed2bcf1831gh"}]

# Refactoring to 'records' json format
datas = [{k:[v] for k,v in d.items()} for d in datas]
# List of dataframes (json.dumps emits valid JSON; StringIO avoids passing a literal string to read_json)
dfx = [pd.read_json(StringIO(json.dumps(dat)), orient='records') for dat in datas]
# Convert to specified format "%Y-%m-%d %H:%M:%S"
l=[]
for df in dfx:
    df["Interaction Time"] = pd.to_datetime(df.loc[:, "interactionTime"]).dt.strftime('%Y-%m-%d %H:%M:%S')
    l.append(df)

r = pd.concat(l, axis=0).reset_index(drop=True)
# Display the two r-columns
print(r[['interactionTime','Interaction Time']])

Output

                interactionTime     Interaction Time
0  2023-03-15 18:39:42.443+0000  2023-03-15 18:39:42
1  2023-03-16 08:00:50.500+0000  2023-03-16 08:00:50

bakd9h0s5#

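Thanks to everyone who provided an answer, especially @BorrajaX, who pointed out the JSON format.
Using his approach gave me a 'string indices must be integers' error, but it did help me produce the following, which achieves what I needed: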

import json

df["interactions"] = df["interactions"].apply(lambda x: json.dumps(x))
df["interactions"] = df["interactions"].apply(lambda x: json.loads(x))

# find the last instance of interactionTime and collect the first 19 characters of its value
new_column_values = []
for interaction in df['interactions']:
    interaction_str = json.dumps(interaction)  # convert interaction object to a string
    interaction_time_index = interaction_str.rfind('interactionTime')  # find the last index of interactionTime
    if interaction_time_index != -1:
        # skip past `interactionTime": "` (19 characters) to reach the start of the value
        start = interaction_time_index + len('interactionTime": "')
        new_column_values.append(interaction_str[start:start + 19])  # keep YYYY-MM-DD HH:MM:SS
    else:
        new_column_values.append('')  # if interactionTime not found, set value to empty string

# create a new column in the dataframe with the extracted values
df.loc[:, 'new_column'] = new_column_values
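
The loop above could probably also be written as one vectorized pattern match. This is a sketch under two assumptions: each cell of the interactions column is a single-line JSON string, and the timestamp always starts with the 19 characters wanted. The variable name pattern is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "interactions": [
        '[{"assignedAgentId": "15", "interactionTime": "2023-03-15 18:39:42.443+0000"}, '
        '{"assignedAgentId": "16", "interactionTime": "2023-03-16 08:00:50.500+0000"}]',
        '[{"assignedAgentId": "17"}]',  # no interactionTime key at all
    ]
})

# The greedy ".*" consumes as much as possible, so the capture group lands on
# the LAST "interactionTime" in the string; "(.{19})" keeps 19 characters.
pattern = r'.*"interactionTime"\s*:\s*"(.{19})'

# expand=False returns a Series; rows without a match become NaN, which is
# replaced by an empty string to mirror the loop's fallback.
df["new_column"] = df["interactions"].str.extract(pattern, expand=False).fillna("")
print(df["new_column"])
```

If the cells hold parsed lists rather than strings, the .str methods return NaN for them, so the cells would first need converting with json.dumps.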
