Python: remove text and character values from a DataFrame column

Posted by qcbq4gxm on 2023-05-27 in Python

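I have a "weight" column in my DataFrame, but in the CSV file it contains a lot of unwanted text. I need to remove the letters and every other character except the dot (.) from the column. Example: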

import pandas as pd

df  = pd.DataFrame(
    [
        (1, '+9.1A', 100),
        (2, '-1A', 121),
        (3, '5B', 312),
        (4, '+1D', 567),
        (5, '+1C', 123),
        (6, '-2E', 101),
        (7, '+3T', 231),
        (8, '5A', 769),
        (9, '+5B', 907),
        (10, 'text', 15),
    ],
    columns=['colA', 'weight', 'colC']
)
print(df)

The expected result is:

Real data:

df  = pd.DataFrame(
    [
        (0,68),
        (1,67),
        (2,68.1),
        (3,97.1),
        (4,113.9),
        (5,114),
        (6,112),
        (7,111.8),
        (8,111),
        (9,110.8),
        (10,111.2),
        (11,),
        (12,111.5),
        (13,'Not Appropriate at t'),

    ],
    columns=['colA', 'weight']
)
print(df)
Answer 1, by lymnna71

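You can use pandas.Series.str.extract:

# Raw string avoids invalid-escape warnings for \d and \.
# expand=False returns a Series; the extracted values are still strings.
df["weight"] = df["weight"].str.extract(r"(\d+\.?\d*)", expand=False)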

df

#   colA weight  colC
#0     1    9.1   100
#1     2      1   121
#2     3      5   312
#3     4      1   567
#4     5      1   123
#5     6      2   101
#6     7      3   231
#7     8      5   769
#8     9      5   907
#9    10    NaN    15
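
Note that `str.extract` returns strings, so the cleaned column still has object dtype. If numeric values are needed, one option is to pass the result through `pd.to_numeric`. The following is a minimal sketch on a few of the question's sample values; the names `weights`, `digits`, and `cleaned` are illustrative only.

```python
import pandas as pd

weights = pd.Series(['+9.1A', '-1A', '5B', 'text'], name='weight')

# Extract the first run of digits with an optional decimal part.
# expand=False returns a Series rather than a one-column DataFrame.
digits = weights.str.extract(r'(\d+\.?\d*)', expand=False)

# Rows without any digits are NaN after extract and stay NaN here.
cleaned = pd.to_numeric(digits)
print(cleaned.tolist())
```

The sign is dropped because the pattern only matches digits and a dot, which is what the question asks for.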

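For the real data example, the column must first be converted to a str column, because the .str accessor only works on string values:

df["weight"] = df["weight"].astype("str")

df["weight"] = df["weight"].str.extract(r"(\d+\.?\d*)", expand=False)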

df

#    colA weight
#0      0     68
#1      1     67
#2      2   68.1
#3      3   97.1
#4      4  113.9
#5      5    114
#6      6    112
#7      7  111.8
#8      8    111
#9      9  110.8
#10    10  111.2
#11    11    NaN
#12    12  111.5
#13    13    NaN
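
When the column already holds real numbers mixed with a few text cells, as in the real data above, a different technique may be enough: `pd.to_numeric` with `errors='coerce'`, which is a swapped-in alternative and not what the answer uses. A minimal sketch, assuming a shortened version of the real data (the `real` frame is illustrative):

```python
import pandas as pd

real = pd.DataFrame(
    {
        'colA': [0, 1, 2, 3],
        'weight': [68, 68.1, None, 'Not Appropriate at t'],
    }
)

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
real['weight'] = pd.to_numeric(real['weight'], errors='coerce')
print(real)
```

This does not help with values such as '+9.1A' from the first example, because a string with trailing letters cannot be parsed as a number and would become NaN. The regex approach is still needed there.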
Answer 2, by p3rjfoxz

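You can use a regex together with apply to strip those parts from the column:

import re

import numpy as np

def filter_number(x):
    # To keep a leading + or - sign, use this pattern instead:
    # number = re.search(r'([+-]?\d+\.?\d*)', str(x))
    # Without the sign; str(x) also handles cells that are not strings.
    number = re.search(r'(\d+\.?\d*)', str(x))
    if number:
        return float(number.group(1))
    # No digits found in this cell.
    return np.nan

df["weight"] = df["weight"].apply(filter_number)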
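
As a quick check of how this approach behaves, here is a self-contained sketch applied to a few values taken from the question. It wraps the input in `str()`, an assumption added here so that float cells from the real data do not raise a TypeError; the `sample` Series is illustrative.

```python
import re

import numpy as np
import pandas as pd


def filter_number(x):
    # str(x) lets the function accept floats and None as well as strings.
    match = re.search(r'(\d+\.?\d*)', str(x))
    if match:
        return float(match.group(1))
    return np.nan


sample = pd.Series(['+9.1A', '-2E', 113.9, 'text'])
result = sample.apply(filter_number)
print(result.tolist())
```

Unlike `str.extract`, this returns floats directly, at the cost of a Python-level function call per row.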
