Slicing a numpy/pandas array based on None values in one of the rows/columns

e37o9pze, posted on 2023-11-15 in Other

I have the code below, in which I am trying to slice a data array into multiple DataFrames and then concatenate those DataFrames together.

import pandas as pd, numpy as np

arr=[['id',          'aaa',   'bbb',   None,    'ccc',   None, None],
     ['period',      'd',     'd',     None,    'd',     None, None],
     ['date',        'price', 'price', 'volume','price','volume', 'mktcap'],
     ['01/03/2001',  103.1,   103.2,   10000,   103.4,   20000, 1000000],
     ['01/04/2001',  104.1,   104.2,   11000,   104.4,   30000, 1000000],
     ['01/05/2001',  105.1,   105.2,   12000,   105.4,   40000, 1000000],
      ]

data=np.array(arr)

all_ts =[]
for col in range(1,data.shape[1]):
   id = data[0][col]
   per = data[1][col]
   if id is None:
      continue

   #from the 3rd row, take the date column and the current col and produce a dataframe
   ts_data = data[2:,[0, col]]
   cols = ts_data[:1,][0]         
   ts_data = pd.DataFrame(ts_data[1:], columns=cols)
   ts_data = ts_data[cols].dropna()
   ts_data['id'] = id
   ts_data['period'] = per
   all_ts.append(ts_data)

df = pd.concat(all_ts) 
df

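The code above only produces DataFrames with the columns id, period, date, price, because whenever it hits a None I `continue` (I do not yet know how to pick up the columns that sit under a None).
In the end I want three DataFrames:
1. The first with columns: id, period, date, price
2. The second with columns: id, period, date, price, volume
3. The third with columns: id, period, date, price, volume, mktcap

So basically, starting from the 3rd row, I want to take the first column (the date column) plus the "field" columns and produce a DataFrame. The part I am struggling with is that when a following field has None above it (i.e. in the id or period row), it belongs to the preceding id, so several fields should be put together in the same DataFrame.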

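**Edit/Update:** I can do this with the adjustment below, but it does not feel very pythonic/numpy-like. Note the extra variable `select_cols`, which keeps accumulating column indices across the None columns and is reset after each DataFrame is built. I think numpy must offer a better way to do this; I am just not sure what that would be.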

all_ts =[]
select_cols = [0]
for col in range(1,data.shape[1]):
   id = data[0][col]
   per = data[1][col]
   select_cols += [col]
   if id is None:
      continue

   #from the 3rd row, take the date column and the current col and produce a dataframe
   ts_data = data[2:, select_cols]
   cols = ts_data[:1,][0]         
   ts_data = pd.DataFrame(ts_data[1:], columns=cols)
   ts_data = ts_data[cols].dropna()
   ts_data['id'] = id
   ts_data['period'] = per
   all_ts.append(ts_data)
   select_cols=[0]

df = pd.concat(all_ts) 
df

tf7tbtn2


Just for fun, I did come up with this approach using functional-style pandas method chaining. Is it more pythonic? Hard to say...

data = np.array(arr)
df = (pd.DataFrame(data[3:,1:],
                   # build column index from the first three rows, forward filled to remove None values
                   columns=pd.DataFrame(data[:3,1:]).ffill(axis=1).values.tolist(),
                   # row index is the first value in each row of actual data
                   index=pd.Index(data[3:,0], name=data[2,0])
                  )
    # melt the column index
    .melt(ignore_index=False)
    # and rename the columns back to their original names
    .rename(columns={ f'variable_{i}' : arr[0] for i, arr in enumerate(data[:2]) })
    .reset_index()
    # now pivot the values based on the data columns
    .pivot(index=['date'] + list(data[:2,0]), columns='variable_2', values='value')
    .reset_index()
    # get rid of the unnecessary column index name
    .rename_axis('', axis=1)
    # and sort on id
    .sort_values('id')
)

Output for the sample data:

         date   id period   mktcap  price volume
0  01/03/2001  aaa      d      NaN  103.1    NaN
3  01/04/2001  aaa      d      NaN  104.1    NaN
6  01/05/2001  aaa      d      NaN  105.1    NaN
1  01/03/2001  bbb      d      NaN  103.2  10000
4  01/04/2001  bbb      d      NaN  104.2  11000
7  01/05/2001  bbb      d      NaN  105.2  12000
2  01/03/2001  ccc      d  1000000  103.4  20000
5  01/04/2001  ccc      d  1000000  104.4  30000
8  01/05/2001  ccc      d  1000000  105.4  40000
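
If you would rather have the three separate DataFrames described in the question, here is a minimal sketch that finds the group boundaries with numpy instead of accumulating `select_cols` in the loop. The helper name `split_by_id` is made up for illustration, and it assumes the layout from the question: row 0 holds ids, row 1 periods, row 2 field names, column 0 the dates, and a None id means "same id as the column to the left".

```python
import numpy as np
import pandas as pd

arr = [['id',         'aaa',   'bbb',   None,     'ccc',   None,     None],
       ['period',     'd',     'd',     None,     'd',     None,     None],
       ['date',       'price', 'price', 'volume', 'price', 'volume', 'mktcap'],
       ['01/03/2001', 103.1,   103.2,   10000,    103.4,   20000,    1000000],
       ['01/04/2001', 104.1,   104.2,   11000,    104.4,   30000,    1000000],
       ['01/05/2001', 105.1,   105.2,   12000,    105.4,   40000,    1000000]]


def split_by_id(data):
    """Hypothetical helper: return one DataFrame per non-None id column group."""
    header, body = data[:3], data[3:]
    # Positions (in the full array) of the columns where a new id starts.
    starts = np.flatnonzero(pd.notna(header[0, 1:])) + 1
    # Cut the data column indices 1..n-1 into runs, one run per id.
    groups = np.split(np.arange(1, data.shape[1]), starts[1:] - 1)

    frames = []
    for cols in groups:
        first = cols[0]
        frame = pd.DataFrame(body[:, cols], columns=header[2, cols])
        # Prepend date, period and id so the order is id, period, date, fields.
        frame.insert(0, header[2, 0], body[:, 0])
        frame.insert(0, header[1, 0], header[1, first])
        frame.insert(0, header[0, 0], header[0, first])
        frames.append(frame)
    return frames


data = np.array(arr, dtype=object)
frames = split_by_id(data)
df = pd.concat(frames, ignore_index=True)
```

`frames` holds the three DataFrames, and `pd.concat` stacks them into one frame with NaN wherever an id has no such field. All columns come out with object dtype, so you may want to convert the numeric ones afterwards.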
