In a pandas DataFrame, replace a single row with multiple rows by splitting one column of that row [duplicate]

Posted by wwwo4jvm on 2023-09-29 in Other

This question already has an answer here:

Split (explode) pandas dataframe string entry to separate rows (27 answers)
Closed 6 days ago.

I have a dataframe that looks like this:

key     term        notes   Source
156349471   Aasdasd     Bleen   20623750
213740505   dfgdfgdfg   Blox    33052911
171645239   rtertertert sdffd   15805072|24361871|28885000
156134219   cvdv        dsfsdf  20305092|21259293|21905055|23136149
205936689   ddfg        dfsewr  34480604
205947819   xvcbfghf    svdst   34480604
213902333   jfghd       xcvsd   35020164
156133836   cvbcvb      xcvsfg  21907755|30098279
156349486   cvbcvb      xcv 24880025
156134727   dfgdfgdfg   sdfgdfs 24001450

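What I am trying to do is build a new dataframe from this one in which every row with multiple entries in the Source column, separated by "|", is converted into multiple rows, with the rest of the row left unchanged. So this row:
171645239 rtertertert sdffd 15805072|24361871|28885000
would become: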

171645239   rtertertert sdffd   15805072
171645239   rtertertert sdffd   24361871
171645239   rtertertert sdffd   28885000

So for the whole example above, the 10 rows would become 16 rows.
This is the code I tried:

new_data = []
for _, row in master_df.iterrows():
    for src in row['Source'].split('|'):
        new_data.append([row['key', 'term', 'notes', 'Source'], src])
        new_df = pd.DataFrame(new_data, columns=['key', 'term', 'notes', 'Source', 'src'])

print(new_df)

This is the error I get:

File "notations.py", line 70, in <module>
    new_data.append([row['key', 'term', 'notes', 'Source'], src])
                     ~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "XX\venv\Lib\site-packages\pandas\core\series.py", line 1072, in __getitem__
    return self._get_with(key)
           ^^^^^^^^^^^^^^^^^^^
  File "XX\venv\Lib\site-packages\pandas\core\series.py", line 1082, in _get_with
    return self._get_values_tuple(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "XX\venv\Lib\site-packages\pandas\core\series.py", line 1126, in _get_values_tuple
    raise KeyError("key of type tuple not found and not a MultiIndex")
KeyError: 'key of type tuple not found and not a MultiIndex'

This code works:

import pandas as pd

diddly = {
    'A': ['gone1', 'gone2'],
    'B': ['PMID1|PMID2', 'PMID3|PMID4']
}
df = pd.DataFrame(diddly)

print(diddly)

new_data = []
for _, row in df.iterrows():
    for pmid in row['B'].split('|'):
        new_data.append([row['A'], pmid])
        new_df = pd.DataFrame(new_data, columns=['Gone', 'PMID'])

print(new_df)

Output:

Gone   PMID
0  gone1  PMID1
1  gone1  PMID2
2  gone2  PMID3
3  gone2  PMID4

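So I wonder whether the error comes from my dataframe having more than two columns in the failing case, but I am no expert.
Any help would be appreciated!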

h79rfbju1#

Code

Use the following code:

df.assign(Source=df['Source'].str.split('|')).explode('Source')

Output:

key         term        notes   Source
0   156349471   Aasdasd     Bleen   20623750
1   213740505   dfgdfgdfg   Blox    33052911
2   171645239   rtertertert sdffd   15805072
2   171645239   rtertertert sdffd   24361871
2   171645239   rtertertert sdffd   28885000
3   156134219   cvdv        dsfsdf  20305092
3   156134219   cvdv        dsfsdf  21259293
3   156134219   cvdv        dsfsdf  21905055
3   156134219   cvdv        dsfsdf  23136149
4   205936689   ddfg        dfsewr  34480604
5   205947819   xvcbfghf    svdst   34480604
6   213902333   jfghd       xcvsd   35020164
7   156133836   cvbcvb      xcvsfg  21907755
7   156133836   cvbcvb      xcvsfg  30098279
8   156349486   cvbcvb      xcv     24880025
9   156134727   dfgdfgdfg   sdfgdfs 24001450

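The exploded rows keep the index label of the row they came from, which is why 2, 3 and 7 repeat above. If you want a fresh index, call reset_index(drop=True) on the result.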
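
As a minimal sketch of the whole pipeline including the index reset, assuming a three-row subset of the question's data (the names `small_df` and `result` are made up for illustration):

```python
import pandas as pd

# Three rows taken from the question's data, for illustration only.
small_df = pd.DataFrame({
    'key': [156349471, 171645239, 156133836],
    'term': ['Aasdasd', 'rtertertert', 'cvbcvb'],
    'notes': ['Bleen', 'sdffd', 'xcvsfg'],
    'Source': ['20623750', '15805072|24361871|28885000', '21907755|30098279'],
})

# Split each Source string into a list, then emit one row per list element.
result = (
    small_df
    .assign(Source=small_df['Source'].str.split('|'))
    .explode('Source')
    .reset_index(drop=True)  # replace the repeated index labels with 0..n-1
)

print(result)
```

The exploded Source values are still strings; if numeric IDs are needed, `result['Source'].astype(int)` should convert them.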

Example

import pandas as pd
data = {'key': [156349471, 213740505, 171645239, 156134219, 205936689, 205947819, 213902333, 156133836, 156349486, 156134727], 
        'term': ['Aasdasd', 'dfgdfgdfg', 'rtertertert', 'cvdv', 'ddfg', 'xvcbfghf', 'jfghd', 'cvbcvb', 'cvbcvb', 'dfgdfgdfg'], 
        'notes': ['Bleen', 'Blox', 'sdffd', 'dsfsdf', 'dfsewr', 'svdst', 'xcvsd', 'xcvsfg', 'xcv', 'sdfgdfs'], 
        'Source': ['20623750','33052911', '15805072|24361871|28885000', '20305092|21259293|21905055|23136149', '34480604', 
                   '34480604', '35020164', '21907755|30098279', '24880025', '24001450']}
df = pd.DataFrame(data)
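
For reference, the loop in the question appears to fail because `row['key', 'term', 'notes', 'Source']` passes a single tuple as the key to a Series, which pandas only accepts when the index is a MultiIndex. The number of columns is not the cause. Selecting several labels at once needs a list, e.g. `row[['key', 'term', 'notes']]`. A corrected loop could look like the sketch below, assuming two rows of the sample data (the names `sample_df`, `rows` and `loop_df` are illustrative). It also builds the DataFrame once after the loop instead of on every iteration.

```python
import pandas as pd

# Two rows taken from the question's data, for illustration only.
sample_df = pd.DataFrame({
    'key': [213740505, 156133836],
    'term': ['dfgdfgdfg', 'cvbcvb'],
    'notes': ['Blox', 'xcvsfg'],
    'Source': ['33052911', '21907755|30098279'],
})

rows = []
for _, row in sample_df.iterrows():
    for src in row['Source'].split('|'):
        # Index the row one label at a time; a comma-separated key is a tuple.
        rows.append([row['key'], row['term'], row['notes'], src])

# Build the frame once, after all rows have been collected.
loop_df = pd.DataFrame(rows, columns=['key', 'term', 'notes', 'Source'])
print(loop_df)
```

On larger frames the `explode` approach is usually preferable, since `iterrows` processes one row at a time in Python.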
