csv: Creating new DataFrame columns from a single column with mixed values and types

Posted by qnyhuwrf on 2023-07-31 in Other

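I'm trying to create new columns that have the fish species names as column names and the integer counts as values, keeping the index so that I can join the DataFrames afterwards.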

import pandas as pd
df = pd.read_csv("fishCounts.csv",index_col=0)
countsdf = df[["Fish Count"]].copy()
countsdf.head()
    
Fish Count
0   38 Sand Bass, 16 Sculpin, 10 Blacksmith
1   138 Sculpin, 28 Sand Bass
2   150 Sculpin Released, 102 Sculpin, 40 Sanddab
3   156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3 ...
4   161 Sculpin

countsdf.columns = ["fish"]
countsdf.fish = countsdf.fish.str.split(", ", expand=False)
countsdf.head()

fish
0   [38 Sand Bass, 16 Sculpin, 10 Blacksmith]
1   [138 Sculpin, 28 Sand Bass]
2   [150 Sculpin Released, 102 Sculpin, 40 Sanddab]
3   [156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3...
4   [161 Sculpin]

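This is where I get stuck. Should I iterate over the DataFrame rows? Build a list of dictionaries? Could I import the data differently to make this easier?
Edit: this is the result I'm after.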

   Sand Bass  Sculpin  Blacksmith  Sculpin Released  Sanddab  Black Croaker
0         38       16          10
1         28      138
2                 102                           150       40
3         29      156                                                     5
4                 161

Answer #1 (eni9jsuy)

Similar to @Manakin's answer.
Split Fish Count into a list:

df['Fish Count']=df['Fish Count'].str.split(',')

Explode, so that each fish entry gets its own row while keeping its id (the original index):

df2=df.explode('Fish Count')


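Create a dictionary. Here I use a dict comprehension to derive the keys and values after splitting each Fish Count entry on the whitespace that follows a digit: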

{i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}


Result:

{'38': 'Sand Bass',
 ' 16': 'Sculpin',
 ' 10': 'Blacksmith',
 '138': 'Sculpin',
 ' 28': 'Sand Bass',
 '150': 'Sculpin Released',
 ' 102': 'Sculpin',
 ' 40': 'Sanddab',
 '156': 'Sculpin',
 ' 29': 'Sand Bass',
 ' 5': 'Black Croaker',
 '161': 'Sculpin'}


If needed, you can print it as a DataFrame:

print(pd.DataFrame.from_dict({i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}, orient='index'))

                     0
38           Sand Bass
 16            Sculpin
 10         Blacksmith
138            Sculpin
 28          Sand Bass
150   Sculpin Released
 102           Sculpin
 40            Sanddab
156            Sculpin
 29          Sand Bass
 5       Black Croaker
161            Sculpin
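
Because the dictionary above is keyed by count, two entries with the same count would overwrite each other, and the original row id is lost. As a rough sketch of the "list of dictionaries" idea raised in the question (not part of this answer), each cell can be parsed into a {species: count} dict and the list handed to pd.DataFrame. The helper name parse_counts is made up for illustration, the sample rows are the five untruncated rows that appear elsewhere in this thread, and the sketch assumes every entry starts with an integer followed by a space.

```python
import pandas as pd

# Sample rows as they appear in this thread (assumed representative of the CSV).
df = pd.DataFrame({"Fish Count": [
    "38 Sand Bass, 16 Sculpin, 10 Blacksmith",
    "138 Sculpin, 28 Sand Bass",
    "150 Sculpin Released, 102 Sculpin, 40 Sanddab",
    "156 Sculpin, 29 Sand Bass, 5 Black Croaker",
    "161 Sculpin",
]})


def parse_counts(cell):
    """Hypothetical helper: '38 Sand Bass, 16 Sculpin' -> {'Sand Bass': 38, 'Sculpin': 16}."""
    result = {}
    for item in cell.split(","):
        # Split only on the first space so multi-word species names stay whole.
        number, species = item.strip().split(" ", 1)
        result[species] = int(number)
    return result


# One dict per row; reusing df.index keeps the rows aligned for a later join.
records = [parse_counts(cell) for cell in df["Fish Count"]]
wide = pd.DataFrame(records, index=df.index)
print(wide)
```

Species that do not occur in a row come out as NaN, so the count columns are floats unless they are filled and converted afterwards.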

Answer #2 (mfpqipee)

We can use str.split, str.extract and stack:

s = df['Fish Count'].str.split(',',expand=True).stack()
s.str.extract(r'(\d+)(\D+)')

which yields:

0                  1
0 0   38          Sand Bass
  1   16            Sculpin
  2   10         Blacksmith
1 0  138            Sculpin
  1   28          Sand Bass
2 0  150   Sculpin Released
  1  102            Sculpin
  2   40            Sanddab
3 0  156            Sculpin
  1   29          Sand Bass
  2    5      Black Croaker
  3    3                ...
4 0  161            Sculpin


From there it's up to you which format you want or need.

s.str.extract(r'(\d+)(\D+)').groupby(level=[1]).agg(list)

                          0                                                  1
0  [38, 138, 150, 156, 161]  [ Sand Bass,  Sculpin,  Sculpin Released,  Scu...
1         [16, 28, 102, 29]       [ Sculpin,  Sand Bass,  Sculpin,  Sand Bass]
2               [10, 40, 5]            [ Blacksmith,  Sanddab,  Black Croaker]
3                       [3]                                             [ ...]


s.str.extract(r'(\d+)(\D+)').unstack(1)

     0                                 1                                  
     0    1    2    3                  0           1               2     3
0   38   16   10  NaN          Sand Bass     Sculpin      Blacksmith   NaN
1  138   28  NaN  NaN            Sculpin   Sand Bass             NaN   NaN
2  150  102   40  NaN   Sculpin Released     Sculpin         Sanddab   NaN
3  156   29    5    3            Sculpin   Sand Bass   Black Croaker   ...
4  161  NaN  NaN  NaN            Sculpin         NaN             NaN   NaN


s.str.extract(r'(\d+)(\D+)').values

array([['38', ' Sand Bass'],
       ['16', ' Sculpin'],
       ['10', ' Blacksmith'],
       ['138', ' Sculpin'],
       ['28', ' Sand Bass'],
       ['150', ' Sculpin Released'],
       ['102', ' Sculpin'],
       ['40', ' Sanddab'],
       ['156', ' Sculpin'],
       ['29', ' Sand Bass'],
       ['5', ' Black Croaker'],
       ['3', ' ...'],
       ['161', ' Sculpin']], dtype=object)


You could turn it into a dict.

# Actually I'd use fish: num, because keys in a dict must be unique
# (sorry, I've already closed my IDE).
{num: fish for num, fish in s.str.extract(r'(\d+)(\D+)').values}

{'38': ' Sand Bass',
 '16': ' Sculpin',
 '10': ' Blacksmith',
 '138': ' Sculpin',
 '28': ' Sand Bass',
 '150': ' Sculpin Released',
 '102': ' Sculpin',
 '40': ' Sanddab',
 '156': ' Sculpin',
 '29': ' Sand Bass',
 '5': ' Black Croaker',
 '3': ' ...',
 '161': ' Sculpin'}
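
To get from these number/species pairs to one column per species, one option is str.extractall followed by unstack. This is only a sketch under assumptions, not something from the answer above: the group names num and species are my own choice, the regex assumes each entry looks like "<integer> <name>" with entries separated by commas, and unstack assumes a species appears at most once per row.

```python
import pandas as pd

# Sample rows as they appear in this thread (assumed representative of the CSV).
df = pd.DataFrame({"Fish Count": [
    "38 Sand Bass, 16 Sculpin, 10 Blacksmith",
    "138 Sculpin, 28 Sand Bass",
    "150 Sculpin Released, 102 Sculpin, 40 Sanddab",
    "156 Sculpin, 29 Sand Bass, 5 Black Croaker",
    "161 Sculpin",
]})

# extractall returns one row per regex match, indexed by (original row, match number).
pairs = df["Fish Count"].str.extractall(r"(?P<num>\d+)\s+(?P<species>[^,]+)")
pairs["num"] = pairs["num"].astype(int)
pairs["species"] = pairs["species"].str.strip()

wide = (
    pairs.reset_index(level="match", drop=True)  # keep only the original row index
         .set_index("species", append=True)["num"]
         .unstack("species")                     # one column per species
)
print(wide)
```

The resulting frame keeps the original row index, so it can be joined back onto df directly.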

Answer #3 (gg58donl)

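First you need to explode the lists you made; then you can extract with a regex twice, once to match the number and once to match the text.
With the data: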

data = '38 Sand Bass, 16 Sculpin, 10 Blacksmith\n138 Sculpin, 28 Sand Bass\n150 Sculpin Released, 102 Sculpin, 40 Sanddab\n156 Sculpin, 29 Sand Bass, 5 Black Croaker\n161 Sculpin'
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])

run:

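countsdf = df['Fish Count'].str.split(', ')
# Series.explode takes no column name; ignore_index=True gives the 0..11 index shown below
countsdf = countsdf.explode(ignore_index=True).rename('fish').to_frame()
# first run of digits is the count, first run of letters and spaces is the species
countsdf['count'] = countsdf.fish.str.extract('([0-9]+)')
countsdf['species'] = countsdf.fish.str.extract('([a-zA-Z]+[ a-zA-Z]*)')
countsdf.drop('fish', axis=1, inplace=True)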


Output:

count           species
0     38         Sand Bass
1     16           Sculpin
2     10        Blacksmith
3    138           Sculpin
4     28         Sand Bass
5    150  Sculpin Released
6    102           Sculpin
7     40           Sanddab
8    156           Sculpin
9     29         Sand Bass
10     5     Black Croaker
11   161           Sculpin
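
The output above is in long form with a fresh 0..11 index and string counts. If the goal is the wide table from the question, joined back onto the original rows, a possible continuation is to keep the original index while exploding, convert the counts to integers, and reshape with pivot_table. This is a sketch rather than part of the answer: the names parts, wide and joined and the index label row are illustrative, and aggfunc="sum" is my assumption for how a species listed twice in one row should be handled.

```python
import pandas as pd

data = ('38 Sand Bass, 16 Sculpin, 10 Blacksmith\n138 Sculpin, 28 Sand Bass\n'
        '150 Sculpin Released, 102 Sculpin, 40 Sanddab\n'
        '156 Sculpin, 29 Sand Bass, 5 Black Croaker\n161 Sculpin')
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])

# Explode WITHOUT ignore_index so every entry remembers which row it came from.
fish = df['Fish Count'].str.split(', ').explode()

# Named groups become the column names of the extracted frame.
parts = fish.str.extract(r'(?P<count>\d+)\s+(?P<species>.+)')
parts['count'] = parts['count'].astype(int)

# Move the original row id into a column, then pivot to one column per species.
wide = (
    parts.rename_axis('row')
         .reset_index()
         .pivot_table(index='row', columns='species', values='count', aggfunc='sum')
)

# Index values match df's index, so a plain join lines the rows up again.
joined = df.join(wide)
print(joined)
```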

Answer #4 (hzbexzde)

I used @Manakin's answer to get this MultiIndexed DataFrame:

0                  1
0 0   38          Sand Bass
  1   16            Sculpin
  2   10         Blacksmith
1 0  138            Sculpin
  1   28          Sand Bass
2 0  150   Sculpin Released
  1  102            Sculpin
  2   40            Sanddab
3 0  156            Sculpin
  1   29          Sand Bass
  2    5      Black Croaker
4 0  161            Sculpin

Then I renamed the columns, stripped the leading and trailing whitespace from "species", switched the column order, and set the index names.

s.columns = ['num','species']
s.species = s.species.str.strip()
s = s.reindex(['species','num'],axis=1)
s.index.names = ['a','b']
s.head()

        species  num
a b
0 0   Sand Bass   38
  1     Sculpin   16
  2  Blacksmith   10
1 0     Sculpin  138
  1   Sand Bass   28


Then I flattened the index with reset_index, set a new index on a and species, and dropped the b column.

s_flat = s.reset_index()
s_reindexed = s_flat.set_index(['a','species'])
s_reindexed = s_reindexed.drop(columns='b')
s_reindexed.head()

              num
a species
0 Sand Bass    38
  Sculpin      16
  Blacksmith   10
1 Sculpin     138
  Sand Bass    28


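Finally, I unstacked and dropped the extra column index level. I also ended up with an empty (NaN-labelled) column that I had to drop as well.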

import numpy as np

s_reindexed = s_reindexed.unstack(1)
s_reindexed.columns = s_reindexed.columns.droplevel(0)
# drop the column whose label is NaN
s_reset = s_reindexed.drop(columns=np.nan)
s_reset.head()

species     Albacore    Barracuda   Barracuda Released  Bat Ray Released    Black Croaker   Black Seabass Released  Blacksmith  Blue Perch  Bluefin Tuna    Bocaccio ...
a                                                                                   
0                NaN          NaN                  NaN               NaN              NaN                      NaN          10         NaN           NaN         NaN ...
1                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...
2                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...
3                NaN          NaN                  NaN               NaN                5                      NaN         NaN           3           NaN         NaN ...
4                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...
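
One plausible source of the NaN-labelled column is an entry that the (\d+)(\D+) pattern cannot match, which str.extract reports as NaN in both columns. The sketch below illustrates that guess and one way around it: drop the unmatched entries before reshaping, then fill the gaps with 0 so the counts stay integers. The "No fish reported" row is a made-up example, not taken from the real CSV, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Fish Count": [
    "38 Sand Bass, 16 Sculpin, 10 Blacksmith",
    "138 Sculpin, 28 Sand Bass",
    "No fish reported",  # hypothetical entry with no number in it
]})

s = df["Fish Count"].str.split(",", expand=True).stack()
pairs = s.str.extract(r"(\d+)(\D+)")
pairs.columns = ["num", "species"]
pairs["species"] = pairs["species"].str.strip()

# Entries the regex could not match have NaN species; remove them before unstacking
# so that no NaN-labelled column is created.
pairs = pairs.dropna(subset=["species"])

wide = (
    pairs.droplevel(1)                         # back to the original row index
         .set_index("species", append=True)["num"]
         .astype(int)
         .unstack("species")
         .reindex(df.index)                    # bring back rows that had no matches
         .fillna(0)
         .astype(int)
)
print(wide)
```

Whether 0 is the right stand-in for "species not listed" depends on how the counts are used later; leaving NaN in place keeps "not reported" distinct from "zero caught".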
