How to convert multi-row header values into column values in pandas

dced5bon  posted on 2023-03-21  in  Other

How can I convert the customer names in the table from row values into a column?
Data dict
{'Invoice No': {'Ketan patel': nan, '03/25/2022': 175264.0, '03/24/2022': 175034.0, '03/22/2022': 174548.0, 'Almenda sarah': nan, '03/31/2022': 176323.0, '03/29/2022': 175934.0, 'Hassan ': nan, 'Lara Brian ': nan, '03/28/2022': 175668.0, '03/23/2022': 174854.0},
 'Sales Amount': {'Ketan patel': nan, '03/25/2022': 477600.0, '03/24/2022': 16800.0, '03/22/2022': 315000.0, 'Almenda sarah': nan, '03/31/2022': 350200.0, '03/29/2022': 263400.0, 'Hassan ': nan, 'Lara Brian ': nan, '03/28/2022': 232700.0, '03/23/2022': 319600.0},
 'Delivery Charges': {'Ketan patel': nan, '03/25/2022': 0.0, '03/24/2022': 0.0, '03/22/2022': 0.0, 'Almenda sarah': nan, '03/31/2022': 0.0, '03/29/2022': 0.0, 'Hassan ': nan, 'Lara Brian ': nan, '03/28/2022': 0.0, '03/23/2022': 0.0},
 'Total Sales': {'Ketan patel': nan, '03/25/2022': 477600.0, '03/24/2022': 16800.0, '03/22/2022': 315000.0, 'Almenda sarah': nan, '03/31/2022': 350200.0, '03/29/2022': 263400.0, 'Hassan ': nan, 'Lara Brian ': nan, '03/28/2022': 232700.0, '03/23/2022': 319600.0}}

wswtfjt7 1#

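As @Marcelo Paco mentioned, your dict does not produce a DataFrame that matches the first image. Either way, here is an option using join and ffill to get your expected output (*starting from the data shown in the first image/spreadsheet*):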

# Does the column `Invoice Date` hold a date?
m = pd.to_datetime(df["Invoice Date"], errors="coerce").notna()

out = df.join(df["Invoice Date"].where(~m).rename("Customer Name").ffill()).loc[m]
# you can add .reset_index(drop=True) if you don't care about the index

Output:

print(out)

   Invoice Date  Invoice No  Sales Amount  Delivery Charges  Total Sales  Customer Name
1    03/25/2022    175228.0       50500.0               0.0      50500.0    Ketan patel
2    03/24/2022    175034.0       16800.0               0.0      16800.0    Ketan patel
3    03/22/2022    174548.0      315000.0               0.0     315000.0    Ketan patel
5    03/31/2022    176323.0      350200.0               0.0     350200.0  Almenda sarah
6    03/29/2022    175928.0      121700.0               0.0     121700.0  Almenda sarah
8    03/29/2022    175934.0      263400.0               0.0     263400.0         Hassan
10   03/28/2022    175668.0      232700.0               0.0     232700.0     Lara Brian
11   03/25/2022    175264.0      477600.0               0.0     477600.0     Lara Brian
12   03/23/2022    174854.0      319600.0               0.0     319600.0     Lara Brian
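
To see what the mask and the forward fill do step by step, here is a minimal, self-contained sketch. It assumes a trimmed-down frame in the spreadsheet's shape (only a few of the rows shown above), and it passes an explicit `format` to `pd.to_datetime`, which is my addition rather than part of the answer, so that names are reliably coerced to `NaT`.

```python
import numpy as np
import pandas as pd

# Trimmed sample in the spreadsheet's shape: customer names and dates
# share the "Invoice Date" column (assumption: dates are MM/DD/YYYY strings).
df = pd.DataFrame({
    "Invoice Date": ["Ketan patel", "03/25/2022", "03/24/2022", "Hassan", "03/29/2022"],
    "Invoice No": [np.nan, 175228.0, 175034.0, np.nan, 175934.0],
})

# True where the cell parses as a date, False where it is a customer name.
m = pd.to_datetime(df["Invoice Date"], format="%m/%d/%Y", errors="coerce").notna()

# Keep only the names, then carry each name down over the rows below it.
names = df["Invoice Date"].where(~m).ffill().rename("Customer Name")

# Attach the names and keep only the real invoice rows.
out = df.join(names).loc[m].reset_index(drop=True)
print(out)
```

Each invoice row picks up the closest customer name above it, and the name rows themselves are dropped by `.loc[m]`.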

nhaq1z21 2#

You gave the following DataFrame, which differs slightly from the screenshot (Hassan has no values):

               Invoice No  Sales Amount  Delivery Charges  Total Sales
Ketan patel           NaN           NaN               NaN          NaN
03/25/2022       175264.0      477600.0               0.0     477600.0
03/24/2022       175034.0       16800.0               0.0      16800.0
03/22/2022       174548.0      315000.0               0.0     315000.0
Almenda sarah         NaN           NaN               NaN          NaN
03/31/2022       176323.0      350200.0               0.0     350200.0
03/29/2022       175934.0      263400.0               0.0     263400.0
Hassan                NaN           NaN               NaN          NaN
Lara Brian            NaN           NaN               NaN          NaN
03/28/2022       175668.0      232700.0               0.0     232700.0
03/23/2022       174854.0      319600.0               0.0     319600.0

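Proposal

The trick is to build gkey using cumsum - 1.
This gkey then corresponds to a position in the names list, which lets us pick the name we want.
The agg function fills the gkey column by indexing into the names list.
Finally, for readability, I rename the gkey column to Customer name.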

import pandas as pd
import numpy as np
import time

d = {'Invoice No': {'Ketan patel': np.nan, '03/25/2022': 175264.0, '03/24/2022': 175034.0, '03/22/2022': 174548.0, 'Almenda sarah': np.nan, '03/31/2022': 176323.0, '03/29/2022': 175934.0, 'Hassan ': np.nan, 'Lara Brian ': np.nan, '03/28/2022': 175668.0, '03/23/2022': 174854.0}, 'Sales Amount': {'Ketan patel': np.nan, '03/25/2022': 477600.0, '03/24/2022': 16800.0, '03/22/2022': 315000.0, 'Almenda sarah': np.nan, '03/31/2022': 350200.0, '03/29/2022': 263400.0, 'Hassan ': np.nan, 'Lara Brian ': np.nan, '03/28/2022': 232700.0, '03/23/2022': 319600.0}, 'Delivery Charges': {'Ketan patel': np.nan, '03/25/2022': 0.0, '03/24/2022': 0.0, '03/22/2022': 0.0, 'Almenda sarah': np.nan, '03/31/2022': 0.0, '03/29/2022': 0.0, 'Hassan ': np.nan, 'Lara Brian ': np.nan, '03/28/2022': 0.0, '03/23/2022': 0.0}, 'Total Sales': {'Ketan patel': np.nan, '03/25/2022': 477600.0, '03/24/2022': 16800.0, '03/22/2022': 315000.0, 'Almenda sarah': np.nan, '03/31/2022': 350200.0, '03/29/2022': 263400.0, 'Hassan ': np.nan, 'Lara Brian ': np.nan, '03/28/2022': 232700.0, '03/23/2022': 319600.0}}

df = pd.DataFrame(d)

start = time.time()

df['gkey'] = df['Invoice No'].isna().cumsum()-1
# names = ['Ketan patel', 'Almenda sarah', 'Hassan ', 'Lara Brian ']
names = df.index[df['Invoice No'].isna()].tolist()

df['gkey'] = df.agg({'gkey': lambda x: names[x]})

df = df.dropna(axis=0).rename(columns={'gkey':'Customer name'})

end = time.time()

print("speed : ", end-start, " seconds \n")

print(df)

Result

speed :  0.012795677185058594  seconds 

            Invoice No  Sales Amount  ...  Total Sales  Customer name
03/25/2022    175264.0      477600.0  ...     477600.0    Ketan patel
03/24/2022    175034.0       16800.0  ...      16800.0    Ketan patel
03/22/2022    174548.0      315000.0  ...     315000.0    Ketan patel
03/31/2022    176323.0      350200.0  ...     350200.0  Almenda sarah
03/29/2022    175934.0      263400.0  ...     263400.0  Almenda sarah
03/28/2022    175668.0      232700.0  ...     232700.0    Lara Brian 
03/23/2022    174854.0      319600.0  ...     319600.0    Lara Brian 

[7 rows x 5 columns]
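
The cumsum step in this proposal can be traced on a toy Series. This is only an illustrative sketch: the labels `A`, `B`, `C`, `d1`, `d2`, `d3` are made up, and the name lookup uses `Series.map` in place of the `agg` call from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN marks a "customer name" row, numbers mark invoices.
s = pd.Series(
    [np.nan, 1.0, 2.0, np.nan, np.nan, 3.0],
    index=["A", "d1", "d2", "B", "C", "d3"],
    name="Invoice No",
)

is_name = s.isna()

# Each name row bumps the running count, so every row gets the 0-based
# position of the most recent name above it (or of itself).
gkey = is_name.cumsum() - 1
print(gkey.tolist())      # [0, 0, 0, 1, 2, 2]

# The index labels of the name rows, in order of appearance.
names = s.index[is_name].tolist()

# Look up each row's name by position (Series.map swapped in for agg).
customer = gkey.map(lambda i: names[i])
print(customer.tolist())  # ['A', 'A', 'A', 'B', 'C', 'C']
```

A name with no invoices under it (`B` here, like Hassan in the question) only labels its own row, which is then removed by `dropna`.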

ix0qys7i 3#

You can add a helper function to check whether a string is a date:

from dateutil.parser import parse

def is_date(x):
    # Return True if dateutil can parse x as a date, False otherwise
    try:
        parse(x)
        return True
    except (ValueError, OverflowError, TypeError):
        return False

Now we can manipulate df to get the desired result:

# Get the index labels as a list
labels = df.index.to_list()
# Get names of original columns
lst = df.columns.to_list()
# Helper lists: positions and labels of the rows that are not dates
idx = []
names = []
for i, label in enumerate(labels):
    # Add to helper lists if it's not a date (i.e. it's a customer name)
    if not is_date(label):
        idx.append(i)
        names.append(label)
# Create an empty (object dtype) column for Customer Name
df['Customer Name'] = None
col = df.columns.get_loc('Customer Name')
# Each name covers the rows from its own position up to the next name
# (or up to the end of the frame for the last name).
# Positional .iloc assignment avoids chained-assignment problems.
for start, end, name in zip(idx, idx[1:] + [len(df)], names):
    df.iloc[start:end, col] = name
# Drop rows that have NaN in original columns
df.dropna(subset=lst, inplace=True)

Our df now looks like this:

            Invoice No  Sales Amount  Delivery Charges  Total Sales  Customer Name
03/25/2022    175264.0      477600.0               0.0     477600.0    Ketan patel
03/24/2022    175034.0       16800.0               0.0      16800.0    Ketan patel
03/22/2022    174548.0      315000.0               0.0     315000.0    Ketan patel
03/31/2022    176323.0      350200.0               0.0     350200.0  Almenda sarah
03/29/2022    175934.0      263400.0               0.0     263400.0  Almenda sarah
03/28/2022    175668.0      232700.0               0.0     232700.0    Lara Brian 
03/23/2022    174854.0      319600.0               0.0     319600.0    Lara Brian

Note: the Data dict you provided does not contain the same data as the picture you showed. The result above reflects the Data dict, not the picture.
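
If the loops feel heavy, the same `is_date` helper can probably be combined with `where` and `ffill` instead. The following is a rough sketch under my own assumptions, not the answer's method: it uses a trimmed-down version of the question's dict (two columns, six rows) and strips trailing spaces from the names.

```python
from dateutil.parser import parse

import numpy as np
import pandas as pd


def is_date(x):
    """Return True if dateutil can parse x as a date."""
    try:
        parse(x)
        return True
    except (ValueError, OverflowError, TypeError):
        return False


# Trimmed-down version of the question's dict (assumption: same layout).
d = {
    "Invoice No": {"Ketan patel": np.nan, "03/25/2022": 175264.0, "03/24/2022": 175034.0,
                   "Hassan ": np.nan, "Lara Brian ": np.nan, "03/28/2022": 175668.0},
    "Total Sales": {"Ketan patel": np.nan, "03/25/2022": 477600.0, "03/24/2022": 16800.0,
                    "Hassan ": np.nan, "Lara Brian ": np.nan, "03/28/2022": 232700.0},
}
df = pd.DataFrame(d)

# Boolean array: True for index labels that look like dates.
is_dt = np.array([is_date(label) for label in df.index])

# Keep the labels only on name rows, tidy them, and carry them downwards.
labels = df.index.to_series()
df["Customer Name"] = labels.where(~is_dt).str.strip().ffill()

# Keep only the dated invoice rows.
out = df[is_dt]
print(out)
```

This avoids the slice bookkeeping, since `ffill` works out on its own where each customer's block ends.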
