regex: pandas Series replace using a dictionary with regular-expression keys

n3schb8v  posted on 2023-10-22  in Other

Suppose there is a DataFrame defined as

import re
import pandas as pd
df = pd.DataFrame({'Col_1': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', '0'], 
                   'Col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', '0']})

which looks like

Col_1 Col_2
0      A     a
1      B     b
2      C     c
3      D     d
4      E     e
5      F     f
6      G     g
7      H     h
8      I     i
9      J     j
10     0     0

I want to replace the values in Col_1 using a dictionary defined as

repl_dict = {re.compile('[ABH-LP-Z]'): 'DDD',
             re.compile('[CDEFG]'): 'BBB WTT',
             re.compile('[MNO]'): 'AAA WTT',
             re.compile('[0-9]'): 'CCC'}

I expect to get a new DataFrame in which Col_1 looks like this

Col_1
0       DDD
1       DDD
2   BBB WTT
3   BBB WTT
4   BBB WTT
5   BBB WTT
6   BBB WTT
7       DDD
8       DDD
9       DDD
10      CCC

I simply used df['Col_1'].replace(repl_dict, regex=True), but it does not produce what I expect. What I get is:

Col_1
0     BBB WTTBBB WTTBBB WTT
1     BBB WTTBBB WTTBBB WTT
2                   BBB WTT
3                   BBB WTT
4                   BBB WTT
5                   BBB WTT
6                   BBB WTT
7     BBB WTTBBB WTTBBB WTT
8     BBB WTTBBB WTTBBB WTT
9     BBB WTTBBB WTTBBB WTT
10                      CCC

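I would be very grateful if someone could tell me why df.replace() does not work for me here, and how to correctly replace multiple values to get the expected output.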


72qzrwbm  #1

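Use anchors (i.e. ^ and $) so that each pattern has to match the whole value. The output in the question is consistent with the patterns being applied one after another: 'A' first becomes 'DDD', and then every 'D' matches [CDEFG] and is replaced by 'BBB WTT'. With anchors, 'DDD' no longer matches ^[CDEFG]$: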

repl_dict = {re.compile('^[ABH-LP-Z]$'): 'DDD',
             re.compile('^[CDEFG]$'): 'BBB WTT',
             re.compile('^[MNO]$'): 'AAA WTT',
             re.compile('^[0-9]+$'): 'CCC'}

Using df['Col_1'].replace(repl_dict, regex=True) then produces:

0         DDD
1         DDD
2     BBB WTT
3     BBB WTT
4     BBB WTT
5     BBB WTT
6     BBB WTT
7         DDD
8         DDD
9         DDD
10        CCC
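
To make the effect of the anchors easier to see, here is a minimal sketch that imitates the chained replacement with plain re.sub. It assumes the patterns are applied one after another to the running result, which is inferred from the output shown in the question rather than from pandas internals. apply_in_order is a hypothetical helper, not a pandas function.

```python
import re

def apply_in_order(value, patterns):
    """Apply each (pattern, replacement) pair in turn to the running result.

    Assumption: this mimics the chained behaviour seen in the question;
    it is an illustration, not how pandas is implemented.
    """
    for pattern, replacement in patterns:
        value = re.sub(pattern, replacement, value)
    return value

# Patterns from the question (no anchors).
unanchored = [('[ABH-LP-Z]', 'DDD'), ('[CDEFG]', 'BBB WTT'),
              ('[MNO]', 'AAA WTT'), ('[0-9]', 'CCC')]

# Patterns from this answer (anchored to the whole value).
anchored = [('^[ABH-LP-Z]$', 'DDD'), ('^[CDEFG]$', 'BBB WTT'),
            ('^[MNO]$', 'AAA WTT'), ('^[0-9]+$', 'CCC')]

print(apply_in_order('A', unanchored))  # BBB WTTBBB WTTBBB WTT
print(apply_in_order('A', anchored))    # DDD
```

Without anchors, the 'DDD' produced by the first pattern is matched again by [CDEFG]. With anchors, a three-character result can no longer match a pattern that requires exactly one character.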

hwazgwia  #2

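A more realistic case might be that you want to reclassify entries according to patterns such as the following.
Consider a DataFrame "x" like this: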

column
0       good farmer
1        bad farmer
2         ok farmer
3  worker did wrong
4      worker fired
5      worker hired
6   heavy duty work
7   light duty work

Then consider the following code:

x['column_reclassified'] = x['column'].replace(
    to_replace=[
        '^.*(farmer).*$',
        '^.*(worker).*$',
        '^.*(duty).*$'
    ],
    value=[
        'FARMER',
        'WORKER',
        'DUTY'
    ],
    regex=True
)

It will produce the following output:

column column_reclassified
0       good farmer              FARMER
1        bad farmer              FARMER
2         ok farmer              FARMER
3  worker did wrong              WORKER
4      worker fired              WORKER
5      worker hired              WORKER
6   heavy duty work                DUTY
7   light duty work                DUTY
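
If you would rather classify each entry exactly once, so that a replacement can never be matched again by a later pattern, one possible alternative is a small first-match-wins function. This is only a sketch: RULES and reclassify are hypothetical names introduced here, and the choice to leave unmatched entries unchanged is an assumption.

```python
import re

# (pattern, label) pairs, checked in order; the first hit wins.
RULES = [
    (re.compile(r'farmer'), 'FARMER'),
    (re.compile(r'worker'), 'WORKER'),
    (re.compile(r'duty'), 'DUTY'),
]

def reclassify(text, rules=RULES):
    """Return the label of the first rule whose pattern occurs in text.

    Entries that match no rule are returned unchanged (an assumption
    made for this sketch).
    """
    for pattern, label in rules:
        if pattern.search(text):
            return label
    return text

entries = ['good farmer', 'worker fired', 'heavy duty work', 'something else']
print([reclassify(e) for e in entries])
# ['FARMER', 'WORKER', 'DUTY', 'something else']
```

With the DataFrame above, x['column'].map(reclassify) should give the same column_reclassified values, since Series.map accepts a function and calls it once per entry.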

Hope this helps too.
