pandas: How can I create a dictionary whose values come from a DataFrame more quickly?

Posted by iyfamqjs on 2022-12-21 in Other

I have this DataFrame:
| event | attendees | duration |
| --- | --- | --- |
| meeting | [{"id": 1, "email": "email1"}, {"id": 2, "email": "email2"}] | 3600 |
| lunch | [{"id": 2, "email": "email2"}, {"id": 3, "email": "email3"}] | 7200 |

I'm trying to turn it into this dictionary:

{
    'email1': {
       'num_events_with': 1,
       'duration_of_events': 3600,
    },
    'email2': {
       'num_events_with': 2,
       'duration_of_events': 10800,
    },
    'email3': {
       'num_events_with': 1,
       'duration_of_events': 7200,
    },
}

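My real case differs, though: the DataFrame has thousands of rows, and the dictionary I build uses several columns to compute the values for each key, so I need access to the information related to each user's email while creating the dictionary.
The purpose of the dictionary is to describe the people the user has been in events with. The first key, for example, says that the user was in 1 event with email1, and that event lasted 3600 seconds.
Here is how I currently build this dictionary: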

# need to sort because I use diff() later
df.sort_values(by='startTime', ascending=True, inplace=True)

# a list of all contacts (emails) that have been in events with user
contacts = contacts_in_events
contact_info_dict = {}
df['attendees_str'] = df['attendees'].astype(str)

for contact in contacts:
    temp_df = df[df['attendees_str'].str.contains(contact)]
    duration_of_events = temp_df['duration'].sum()
    num_events_with = len(temp_df.index)

    contact_info_dict[contact] = {
        'duration_of_events': duration_of_events,
        'num_events_with': num_events_with
    }

But this is too slow. Is there another, faster way to do this?
Here is one record from the actual DataFrame, as produced by `to_dict('records')`:

{
 'creator': {
            'displayName': None,
            'email': 'creator of event',
            'id': None,
            'self': None
  },
  'start': {
            'date': None,
            'dateTime': '2022-09-13T12:30:00-04:00',
            'timeZone': 'America/Toronto'
  },
  'end': {
          'date': None,
          'dateTime': '2022-09-13T13:00:00-04:00',
          'timeZone': 'America/Toronto'
   },
  'attendees': [
       {
        'comment': None,
        'displayName': None,
        'email': 'email1@email.com',
        'responseStatus': 'accepted'
       },
       {
        'comment': None,
        'displayName': None,
        'email': 'email2@email.com',
        'responseStatus': 'accepted'
       }
  ],
  'summary': 'One on One Meeting',
  'description': '...',
  'calendarType': 'work',
  'startTime': Timestamp('2022-09-13 16:30:00+0000', tz='UTC'),
  'endTime': Timestamp('2022-09-13 17:00:00+0000', tz='UTC'),
  'eventDuration': 1800.0,
  'dowStart': 1.0,
  'endStart': 1.0,
  'weekday': True,
  'startTOD': 59400,
  'endTOD': 61200,
  'day': Period('2022-09-13', 'D')
}

Answer 1 (bksxznpy)

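Use explode to turn the attendees column into one row per attendee, then json_normalize to expand each attendee dict into columns, aggregate the data with groupby.agg, and convert the result with to_dict: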

import pandas as pd

out = (df
   # one row per (event, attendee) pair
   .explode('attendees', ignore_index=True)
   # expand each attendee dict into columns such as 'id' and 'email'
   .pipe(lambda d: d.join(pd.json_normalize(d.pop('attendees'))))
   .groupby('email')
   .agg(num_events_with=('email', 'count'),
        duration_of_events=('duration', 'sum'))
   .to_dict(orient='index')
)

Output:

{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
 'email2': {'num_events_with': 2, 'duration_of_events': 10800},
 'email3': {'num_events_with': 1, 'duration_of_events': 7200}}
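The chain in this answer can be easier to follow when split into named steps. The following is a minimal sketch on the two-row sample from the question. The intermediate names (`exploded`, `attendee_cols`, `flat`, `stats`) are introduced here for illustration only, and the count is taken on the `event` column rather than `email`, which gives the same result as long as `event` has no missing values.

```python
import pandas as pd

# Sample data modelled on the question's two-row table (not the real data).
df = pd.DataFrame({
    'event': ['meeting', 'lunch'],
    'attendees': [
        [{'id': 1, 'email': 'email1'}, {'id': 2, 'email': 'email2'}],
        [{'id': 2, 'email': 'email2'}, {'id': 3, 'email': 'email3'}],
    ],
    'duration': [3600, 7200],
})

# Step 1: one row per (event, attendee) pair, with a fresh 0..n-1 index.
exploded = df.explode('attendees', ignore_index=True)

# Step 2: expand each attendee dict into its own columns ('id', 'email').
attendee_cols = pd.json_normalize(exploded['attendees'].tolist())
flat = exploded.drop(columns='attendees').join(attendee_cols)

# Step 3: aggregate per email, then convert to a nested dict keyed by email.
stats = flat.groupby('email').agg(
    num_events_with=('event', 'count'),
    duration_of_events=('duration', 'sum'),
)
out = stats.to_dict(orient='index')
print(out)
```

Because each event is visited once during the explode and once during the aggregation, the work no longer scales with the number of contacts the way the per-contact `str.contains` loop in the question does.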

Answer 2 (uxhixvfz)

Example

import pandas as pd
col = ['event', 'attendees', 'duration']
data = [['meeting', [{"id": 1, "email": "email1"}, {"id": 2, "email": "email2"}], 3600],
        ['lunch', [{"id": 2, "email": "email2"}, {"id": 3, "email": "email3"}], 7200]]
df = pd.DataFrame(data, columns=col)

Code

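# one row per (event, attendee) pair
df1 = df.explode('attendees')
# pull the 'email' value out of each attendee dict to use as the group key
grouper = df1['attendees'].str['email']
col1 = ['num_events_with', 'duration_of_events']
# the string 'sum' is used because newer pandas versions warn when the builtin sum is passed
out = (df1.groupby(grouper)['duration'].agg(['count', 'sum']).T.set_axis(col1).to_dict())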

out

{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
 'email2': {'num_events_with': 2, 'duration_of_events': 10800},
 'email3': {'num_events_with': 1, 'duration_of_events': 7200}}

If you need a one-liner, use the following:

(df.explode('attendees').assign(attendees=lambda x: x['attendees'].str['email'])
 .groupby('attendees')['duration'].agg(['count', 'sum'])
 .T.set_axis(['num_events_with', 'duration_of_events']).to_dict())
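The record shown in the question names the duration column `eventDuration` rather than `duration`, and its attendee dicts carry extra keys. The following is a rough sketch of how the same explode-and-group idea might be adapted to data of that shape. The three records are invented for illustration, and it is an assumption that events without attendees hold `None` or an empty list and that every attendee dict has an `email` key.

```python
import pandas as pd

# Hypothetical records, trimmed from the shape of the question's to_dict('records') output.
records = [
    {'summary': 'One on One Meeting',
     'attendees': [{'email': 'email1@email.com', 'responseStatus': 'accepted'},
                   {'email': 'email2@email.com', 'responseStatus': 'accepted'}],
     'eventDuration': 1800.0},
    {'summary': 'Focus time', 'attendees': None, 'eventDuration': 3600.0},
    {'summary': 'Lunch',
     'attendees': [{'email': 'email2@email.com', 'responseStatus': 'accepted'}],
     'eventDuration': 2700.0},
]
df = pd.DataFrame(records)

# One row per attendee; rows for events with no attendees become missing and are dropped.
df1 = df.explode('attendees').dropna(subset=['attendees'])

# Use each attendee's email as the group key.
emails = df1['attendees'].str['email'].rename('email')

out = (df1.groupby(emails)['eventDuration']
          .agg(num_events_with='count', duration_of_events='sum')
          .to_dict(orient='index'))
print(out)
```

If only a known list of contacts matters, the resulting dictionary can be filtered by key afterwards instead of scanning the DataFrame once per contact.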
