Pandas feature crossing

qfe3c7zg, posted on 2023-03-11 in Other

I have a pandas DataFrame with two columns:

col_A     col_B
 0         1
 0         0
 0         1
 0         1
 1         0
 1         0
 1         1

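I want to create a new column for each combination of values in col_A and col_B, similar to get_dummies(), the only difference being that here I am working with a combination of columns.
Example output, for the column where col_A is 0 and col_B is 1: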

col_A_0_col_B_1

   1
   0
   1
   1
   0
   0
   0

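I am currently using iterrows() to go through every row, check the values, and then set the new column.
Is there a shorter, more idiomatic pandas way to achieve this?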

daolsyd0 #1

Chain the boolean masks and convert the result to integers:

df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
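The same mask can be built for every combination of values instead of one hard-coded pair, which is closer to what the question asks for. A minimal sketch, assuming the sample frame from the question and the `col_A_0_col_B_1` naming scheme; the helper name `add_cross_columns` is hypothetical:

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({"col_A": [0, 0, 0, 0, 1, 1, 1],
                   "col_B": [1, 0, 1, 1, 0, 0, 1]})


def add_cross_columns(frame, col_x, col_y):
    """Hypothetical helper: one 0/1 column per (col_x, col_y) value pair."""
    out = frame.copy()
    # Every pairing of the distinct values found in the two columns
    pairs = product(sorted(frame[col_x].unique()), sorted(frame[col_y].unique()))
    for x, y in pairs:
        mask = (frame[col_x] == x) & (frame[col_y] == y)
        out[f"{col_x}_{x}_{col_y}_{y}"] = mask.astype(int)
    return out


crossed = add_cross_columns(df, "col_A", "col_B")
print(crossed)
```

Each row ends up with exactly one of the four new columns set to 1, as with get_dummies() on a single column.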

For better performance, compare the underlying NumPy arrays:

df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)

Performance depends on the number of rows and on the distribution of 0 and 1 values.
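One way to check the difference on your own data is a small `timeit` comparison. The sketch below assumes synthetic random 0/1 columns with 100,000 rows; the function names are illustrative, and the timings will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: the row count and the 0/1 distribution are arbitrary choices
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(100_000, 2)),
                  columns=["col_A", "col_B"])


def with_series(frame):
    # Comparison on pandas Series, result is a Series
    return ((frame["col_A"] == 0) & (frame["col_B"] == 1)).astype(int)


def with_arrays(frame):
    # Comparison on the underlying NumPy arrays, result is an ndarray
    return ((frame["col_A"].values == 0) & (frame["col_B"].values == 1)).astype(int)


# Both variants must agree before their timings are worth comparing
same = np.array_equal(with_series(df).to_numpy(), with_arrays(df))

for func in (with_series, with_arrays):
    seconds = timeit.timeit(lambda: func(df), number=50)
    print(f"{func.__name__}: {seconds / 50 * 1e3:.3f} ms per call")
```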

6psbrbz9 #2

You can use np.where (with numpy imported as np):

df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)

uurity8g #3

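First create the column and fill it with a default value, e.g. 0 for False:
df['col_A_0_col_B_1'] = 0
Then use loc to select the rows where col_A == 0 and col_B == 1, and assign 1 to the new column for those rows:
df.loc[(df.col_A == 0) & (df.col_B == 1), 'col_A_0_col_B_1'] = 1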

zc0qhyus #4

If I understand correctly, you could do something like this:

import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]

df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)

Output:

   col_A  col_B  col_A_0_col_B_1
0      0      1                1
1      0      0                0
2      0      1                1
3      0      1                1
4      1      0                0
5      1      0                0
6      1      1                0

Or, as an alternative:

df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)

mnowg1ta #5

You can use the pandas ~ operator for boolean NOT, together with 1 and 0 standing for True and False.

df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
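Note that on integer columns `~` is a bitwise NOT rather than a logical one: `~0` is `-1` and `~1` is `-2`. The expression above still gives the right answer for columns that hold only 0 and 1, because `-1 & b` is `b` and `-2 & b` is `0` when `b` is 0 or 1, but it does not extend to other values. A sketch that makes the logical intent explicit by casting to bool first, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({"col_A": [0, 0, 0, 0, 1, 1, 1],
                   "col_B": [1, 0, 1, 1, 0, 0, 1]})

# Bitwise form from the answer: correct only because both columns are 0/1
bitwise = ~df["col_A"] & df["col_B"]

# Explicit logical form: cast to bool, negate, combine, cast back to int
logical = (~df["col_A"].astype(bool) & df["col_B"].astype(bool)).astype(int)

print(bitwise.tolist())
print(logical.tolist())
```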
h43kikqp #6

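I was looking for something in pandas similar to the TensorFlow "crossed_column" used in Google's introductory ML course, but could not find it. The function below adds one-hot encoded feature crosses to a DataFrame. The selected columns must already be ordinal encoded / factorized.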

from itertools import product

import numpy as np
import pandas as pd


def cross_category_features(
    df: pd.DataFrame,
    cross: list[str],
    remove_originals: bool = True
) -> pd.DataFrame:
    """
    Add one-hot feature crosses to df based on the columns in cross.
    The columns must have already been factorized / ordinal encoded.

    :param df: The data to add feature crosses to
    :param cross: The columns to cross. Columns must be int categorical 0 to n-1
    :param remove_originals: If True, remove the original columns from the data

    :return: The data with the feature crosses added
    """
    def set_hot_index(row):
        # Work on a copy so the row handed in by apply() is not mutated
        row = row.copy()
        # Mixed-radix position of this row's value combination
        hot_index = int((row[cross].to_numpy() * offsets).sum())
        # Positional access: the cross columns sit after the original ones
        row.iloc[hot_index + org_col_len] = 1
        return row

    org_col_len = df.shape[1]
    str_values = [[col + str(val) for val in sorted(df[col].unique())]
                  for col in cross]
    cross_names = ["_".join(x) for x in product(*str_values)]

    # Reuse df's index so that concat aligns the new columns row by row
    cross_features = pd.DataFrame(
        data=np.zeros((df.shape[0], len(cross_names)), dtype="int64"),
        columns=cross_names,
        index=df.index)
    df = pd.concat([df, cross_features], axis=1)

    max_vals = (df[cross].max(axis=0) + 1).to_numpy()
    offsets = np.array([np.prod(max_vals[i+1:]) for i in range(len(max_vals))],
                       dtype="int64")
    # apply() returns a new frame, so the result has to be kept
    df = df.apply(set_hot_index, axis=1)

    if remove_originals:
        df = df.drop(columns=cross)

    return df
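
A row-wise `apply` can be slow on large frames. The same mixed-radix index can be computed for all rows at once with NumPy. This is only a sketch under the same assumption as the function above (each crossed column holds integer codes 0 to n-1); the name `cross_one_hot` is hypothetical, and the column names follow the `col + str(val)` scheme used above.

```python
from itertools import product

import numpy as np
import pandas as pd


def cross_one_hot(df, cross):
    """Hypothetical helper: one-hot columns for every combination of codes in `cross`."""
    codes = df[cross].to_numpy()
    # Number of categories per crossed column (codes are assumed to be 0..n-1)
    sizes = tuple(int(n) for n in codes.max(axis=0) + 1)
    # Flat position of each row's combination, first column most significant
    flat = np.ravel_multi_index(tuple(codes.T), sizes)
    one_hot = np.zeros((len(df), int(np.prod(sizes))), dtype="int64")
    one_hot[np.arange(len(df)), flat] = 1
    names = ["_".join(f"{col}{val}" for col, val in zip(cross, combo))
             for combo in product(*(range(n) for n in sizes))]
    return pd.DataFrame(one_hot, columns=names, index=df.index)


df = pd.DataFrame({"col_A": [0, 0, 0, 0, 1, 1, 1],
                   "col_B": [1, 0, 1, 1, 0, 0, 1]})
crosses = cross_one_hot(df, ["col_A", "col_B"])
print(pd.concat([df, crosses], axis=1))
```

The helper returns only the cross columns, so whether to keep or drop the originals is left to the caller.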
