numpy: Vectorize or speed up a for loop in pandas for data transformation

I have a DataFrame in the following format:

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]}) 

  Parent_username  Parent_age Child1_name  Child1_age Child2_name  Child2_age
0            Bob1        38.0        Mike         2.0         Pat         4.0
1           Ron23         NaN        John         NaN      Dennis         NaN
2          Lisa00        40.0        Curt         1.0        None         NaN
3            Joe_        26.0       Kelly         2.0        None         NaN

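As you can see above, each row corresponds to one parent (a unique ID), and each parent can have multiple children. There can be many children, but I have listed only 2, and each child can have many attributes, but in this example there are only 2 (name and age). The child attribute columns all follow the same naming convention.
I want to turn it into this: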

df2 = pd.DataFrame({'Child_name': ['Mike', 'Pat', 'John', 'Dennis', 'Curt', 'Kelly'],
                    'Child_number': [1, 2, 1, 2, 1, 1],
                    'Child_age': [2, 4, None, None, 1, 2],
                    'Parent_username': ['Bob1', 'Bob1', 'Ron23', 'Ron23', 'Lisa00', 'Joe_'],
                    'Parent_age': [38, 38, None, None, 40, 26]})

  Child_name  Child_number  Child_age Parent_username  Parent_age
0       Mike             1        2.0            Bob1        38.0
1        Pat             2        4.0            Bob1        38.0
2       John             1        NaN           Ron23         NaN
3     Dennis             2        NaN           Ron23         NaN
4       Curt             1        1.0          Lisa00        40.0
5      Kelly             1        2.0            Joe_        26.0

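Each row corresponds to one child, and Child_number indicates whether it is the first child, the second child, and so on.
To speed things up, I preallocate space for df2 by creating an empty DataFrame of the right size instead of concatenating. I first loop over df to count how many children each parent has, which gives the number of rows needed in df2.
Then I build index dictionaries that map each child/parent to a row in df2. My reasoning was that, since dictionary lookups are fast, this would beat calling where() to find the row in df2 every time. This step also uses for loops.
Those steps don't actually take long. However, the for loop that actually copies the data from df into df2 takes a long time: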

for index in df.index:
    for col in df.columns:
        # copy df.loc[index, col] into the corresponding position in df2 using DataFrame.loc
        ...

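I'm really hoping there is a faster way to do this. I don't know much about vectorization, and I'm not sure whether it even works for string columns.
Please advise. Thanks!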

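Your code is slow because it processes one element at a time. You can speed it up by processing one column at a time. The code below finds all the child-name columns, selects the rows where each one has a value (i.e. is not null), and operates on all of those rows at once.
I've also added a way to list all the attributes up front, so you don't have to rename them individually by hand.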

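import pandas as pd

# All child-name columns, e.g. 'Child1_name', 'Child2_name'
cnames = [c for c in df.columns if c.startswith('Child') and c.endswith('_name')]
cattrs = ['_name', '_age']                  # per-child attribute suffixes
newnames = ['Child' + a for a in cattrs]    # 'Child_name', 'Child_age'
parent_cols = ['Parent_username', 'Parent_age']
dflist = []

for childcol in cnames:
    cid = childcol.split('_')[0]            # e.g. 'Child1'
    cnum = int(cid[len('Child'):])          # also works for 'Child10' and above
    attrs = [cid + a for a in cattrs]       # this child's attribute columns

    # Keep only the rows where this child exists; .copy() avoids SettingWithCopyWarning
    cdf = df.loc[df[childcol].notna(), attrs + parent_cols].copy()
    cdf['Child_number'] = cnum

    cdf = cdf.rename(columns=dict(zip(attrs, newnames)))
    dflist.append(cdf)

# Rows come out grouped by child number: all first children, then all second children
newdf = pd.concat(dflist).reset_index(drop=True)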
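
As a possible alternative to the loop above, the same reshape can probably be expressed with `pd.wide_to_long`. This is only a sketch, not part of the original answer: it assumes every child column is named `Child<number>_<attribute>`, and the helper column `parent_row` is a name introduced here purely to restore the parent-by-parent row order shown in the desired output.

```python
import re
import pandas as pd

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]})

# wide_to_long expects <stub><suffix> column names, so move the child number
# to the end: 'Child1_name' -> 'Child_name1', 'Child2_age' -> 'Child_age2'.
wide = df.rename(columns=lambda c: re.sub(r'^Child(\d+)_(\w+)$', r'Child_\2\1', c))

# Keep each parent's original row position (hypothetical helper column)
# so the original parent order can be restored after reshaping.
wide = wide.rename_axis('parent_row').reset_index()

long = pd.wide_to_long(wide,
                       stubnames=['Child_name', 'Child_age'],
                       i='parent_row',
                       j='Child_number')

result = (long.reset_index()
              .dropna(subset=['Child_name'])   # drop children that do not exist
              .sort_values(['parent_row', 'Child_number'])
              [['Child_name', 'Child_number', 'Child_age',
                'Parent_username', 'Parent_age']]
              .reset_index(drop=True))
print(result)
```

Because the attribute names are taken from the column names, extra attributes should only require adding their stubs (for example a hypothetical `'Child_school'`) to `stubnames`. Whether this is faster than the column-wise loop depends on the data, so it is worth timing both on the real DataFrame.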
