numpy: How to expand the rows of a DataFrame based on a column?

nukf8bse  posted on 2023-02-04  in Other

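I am trying to write a program that creates multiple rows and columns for each row, based on the value in one of its columns.
This is my data: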

import numpy as np
import pandas as pd

# Load into `df`, the name used by all of the snippets that follow
df = pd.read_excel("test data.xlsx")

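| Id | # of weeks | Manhours | Start date | End date | Starting Year | Starting Week period |
| --- | --- | --- | --- | --- | --- | --- |
| aaa | 2 | 10 | 15/01/2023 | 29/01/2023 | 2023 | 3 |
| bbb | 3 | 12 | 12/02/2023 | 05/03/2023 | 2023 | 7 |

The table needs to be expanded so that each row is repeated once per week, according to its number of weeks. I also need to add a column for the manhours per week and a column that counts the weeks for each Id.
The result should look like this:
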
| Id | # of weeks | Manhours | Start date | End date | Starting Year | Starting Week period | Week Count | Labor | Week # |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aaa | 2 | 10 | 15/01/2023 | 29/01/2023 | 2023 | 3 | 1 | 5 | 3 |
| aaa | 2 | 10 | 15/01/2023 | 29/01/2023 | 2023 | 3 | 2 | 5 | 4 |
| bbb | 3 | 12 | 12/02/2023 | 05/03/2023 | 2023 | 7 | 1 | 4 | 7 |
| bbb | 3 | 12 | 12/02/2023 | 05/03/2023 | 2023 | 7 | 2 | 4 | 8 |
| bbb | 3 | 12 | 12/02/2023 | 05/03/2023 | 2023 | 7 | 3 | 4 | 9 |

I have been able to get the table into the required shape by doing the following:

# Expand the number of rows by the number of weeks for each job record
df = df.loc[df.index.repeat(df["# of weeks"])].reset_index(drop=True)
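
The expansion step can be tried without the Excel file. The following is a minimal sketch, assuming a small hand-built frame that mirrors the sample rows (dates left out) and assuming that manhours are spread evenly over the weeks of a record; the names `expanded` and `Labor` are chosen here for illustration only.

```python
import pandas as pd

# Sample rows mirroring the question's table (date columns omitted for brevity).
df = pd.DataFrame({
    "Id": ["aaa", "bbb"],
    "# of weeks": [2, 3],
    "Manhours": [10, 12],
})

# Repeat each row once per week, then renumber the index.
expanded = df.loc[df.index.repeat(df["# of weeks"])].reset_index(drop=True)

# Running week counter within each Id, starting at 1.
expanded["Week Count"] = expanded.groupby("Id").cumcount() + 1

# Assumption: manhours are spread evenly across the weeks of a record.
expanded["Labor"] = expanded["Manhours"] / expanded["# of weeks"]
```

`Index.repeat` accepts an array of per-row counts, so each row is duplicated by its own `# of weeks` value.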

However, there are still some problems.
I added the following columns:

# Add column for cumulative number of weeks for each expanded job record row    
df['Week Count'] = df.groupby(['Id']).cumcount() + 1 

# Add column for year for each job record row
df['Year'] = np.where((df['Starting Week period'] + df['Week Count']-1) > 52,
                   (df['Starting Year'] + 1),
                    df['Starting Year'])

# Add column for the week number for the calendar year for each job record row
df['Week #'] = np.where((df['Starting Week period'] + df['Week Count']-1) > 52,
                   (df['Starting Week period'] + df['Week Count']-53),
                    df['Starting Week period'] + df['Week Count']-1)

# Add a column Period which concatenates the Year and Week # columns 
df['Period'] = df['Year'].astype(str) + "-" + df['Week #'].astype(str)

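This causes problems, because the Year and Week # columns only reset correctly when a record runs past a single calendar-year boundary. If a record runs across two or more calendar years, they do not reset.
I tried the following: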

# Add column for number of week for each expanded job record row
df['Week Count'] = df.groupby(['Id']).cumcount() + 1 

# Add column for year for each job record row
from math import floor
df['Year'] = np.where((df['Starting Week period'] + df['Week Count']-1) > 52,
                   df['Starting Year'] + floor((df['Starting Week period'] + df['Week Count'])/52),
                   df['Starting Year'])

# Add column for the number of week for the calendar year for each job record row
df['Week #'] = np.where((df['Starting Week period'] + df['Week Count']-1) > 52,
                   (df['Starting Week period'] + df['Week Count']-53),
                    df['Starting Week period'] + df['Week Count']-1)

# Add leading 0 to the Week # Column
df['Week #'] = df['Week #'].astype(str).str.pad(2, side = 'left', fillchar = '0')

# Add a column Period which concatenates the Year and Week # columns
df['Period'] = df['Year'].astype(str) + "-" + df['Week #'].astype(str)

However, this gives me the following error:

TypeError                                 Traceback (most recent call last)
Cell In[6], line 7
      4 # Add column for year for each job record row
      5 from math import floor
      6 df['Year'] = np.where((df['Starting Week period'] + df['Week Count']-1) > 52,
----> 7                        df['Starting Year'] + floor((df['Starting Week period'] + df['Week Count'])/52), 
      8                        df['Starting Year'])
     10 # Add column for the number of week for the calendar year for each job record row
     11 df['Week #'] = np.where((df['Starting Week period'] + df['Week Count']-1) > 52,
     12                        (df['Starting Week period'] + df['Week Count']-53),
     13                         df['Starting Week period'] + df['Week Count']-1)

File /opt/anaconda3/lib/python3.9/site-packages/pandas/core/series.py:191, in _coerce_method.<locals>.wrapper(self)
    189 if len(self) == 1:
    190     return converter(self.iloc[0])
--> 191 raise TypeError(f"cannot convert the series to {converter}")

TypeError: cannot convert the series to <class 'float'>

v8wbuo2f1#

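You are applying math.floor, which expects a single number, to a pandas Series; these are different types.
I suggest using .astype(int) instead. For the non-negative values here it rounds the same way as math.floor.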

df['Year'] = np.where((df['Starting Week period'] + df['Week Count']-1) > 52,
               df['Starting Year'] + ((df['Starting Week period'] + df['Week Count']) / 52).astype(int),
               df['Starting Year'])
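
To see the difference in isolation, here is a minimal sketch with made-up values (the `weeks` Series is hypothetical and not taken from the question's data). It shows that `math.floor` rejects a multi-element Series, while `np.floor` and `.astype(int)` both work element by element.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the behaviour.
weeks = pd.Series([3, 55, 107])

# math.floor needs a single number, so a multi-element Series raises TypeError.
try:
    math.floor(weeks / 52)
    raised = False
except TypeError:
    raised = True

# Both of these operate element by element instead.
via_numpy = np.floor(weeks / 52)        # result stays float64
via_astype = (weeks / 52).astype(int)   # truncates toward zero, integer dtype
```

Note that `.astype(int)` truncates toward zero, so it only matches `math.floor` when the values are non-negative, as week numbers are.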

You can also use the NumPy library to apply other kinds of rounding:

import numpy as np
df['column'].apply(np.ceil)
df['column'].apply(np.floor)

In your case, though, you still have to apply .astype(int), because applying the NumPy function does not change the dtype of the Series:

>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
...                   columns=['dogs', 'cats'])
>>> df['dogs'].apply(np.floor)
0    0.0
1    0.0
2    0.0
3    0.0
Name: dogs, dtype: float64

and that will affect your result:

    Id  # of weeks  Manhours  ......    Year Week #     Period
0  aaa           2        10  ......  2023.0     03  2023.0-03
1  aaa           2        10  ......  2023.0     04  2023.0-04
2  bbb           3        12  ......  2023.0     07  2023.0-07
3  bbb           3        12  ......  2023.0     08  2023.0-08
4  bbb           3        12  ......  2023.0     09  2023.0-09
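
One way to keep Year as an integer and also handle records that cross several year boundaries is to use floor division and modulo instead of np.where. This is a sketch rather than a definitive fix: it assumes every year has exactly 52 weeks, as the question's code does (ISO calendars occasionally have 53), and the `ccc` record and the `offset` helper are hypothetical.

```python
import pandas as pd

WEEKS_PER_YEAR = 52  # assumption carried over from the question

# Hypothetical record that spans more than two calendar years.
df = pd.DataFrame({
    "Id": ["ccc"] * 4,
    "Starting Year": [2023] * 4,
    "Starting Week period": [51] * 4,
    "Week Count": [1, 2, 3, 106],
})

# Zero-based number of weeks elapsed since week 1 of the starting year.
offset = df["Starting Week period"] + df["Week Count"] - 2

# Floor division and modulo keep the integer dtype and wrap at every year boundary.
df["Year"] = df["Starting Year"] + offset // WEEKS_PER_YEAR
df["Week #"] = offset % WEEKS_PER_YEAR + 1

# Zero-pad the week number when building the period label.
df["Period"] = df["Year"].astype(str) + "-" + df["Week #"].astype(str).str.zfill(2)
```

Because no float division is involved, the labels come out as `2023-51` rather than `2023.0-51`.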

Hope this helps!
