pandas Python: Removing duplicate decimal points in a DataFrame

vaqhlq81  posted on 2023-02-27  in  Python

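I have a dataframe of manually entered data points that should ideally contain only numbers. However, there are many data quality issues: some values contain two or more decimal points, as shown below: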

    A              B
0   54.6464        46.8484
1   64.68461       65.4
2   95.79527       65.644
3   484.644.161    45.45
4   71.257.9       21.1
5   12.8           10.8
6   9.6            12.5
7   312.4          12.787.57.674

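Normally, if only a few values in a small dataframe were affected, I would fix them by hand. With a larger dataframe that becomes very tedious. I want to remove every decimal point after the first one, so that I get the following result: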

    A           B
0   54.646400   46.848400
1   64.684610   65.400000
2   95.795270   65.644000
3   484.644161  45.450000
4   71.257900   21.100000
5   12.800000   10.800000
6   9.600000    12.500000
7   312.400000  12.787577

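I tried to get rid of the extra decimal point by truncating the strings to a fixed length, but the extra point shows up at unpredictable positions, so the following logic does not work well here: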

df['A'] = df['A'].str.slice(0,4)
df['B'] = df['B'].str.slice(0,4)

o0lyfsai1#

Use a regex substitution with a callable replacement:

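import re
import pandas as pd

pat = re.compile(r'^(\d+\.)([\d.]+)')  # precompiled pattern: integer part plus first '.', then the rest
repl = lambda m: m.group(1) + m.group(2).replace(".", "")  # drop every later '.'
# regex=True is needed for a compiled pattern or callable repl (the default is False since pandas 2.0)
df.A = pd.to_numeric(df.A.str.replace(pat, repl, regex=True))
df.B = pd.to_numeric(df.B.str.replace(pat, repl, regex=True))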
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
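
To see what the pattern and the replacement function do to individual strings, here is a minimal sketch using only the standard `re` module. The `samples` list is made up for illustration from values in the question.

```python
import re

# Same pattern as in the answer above: integer part plus first '.', then the rest
pat = re.compile(r'^(\d+\.)([\d.]+)')

def repl(m):
    # Keep the integer part and the first '.', drop any later '.' characters
    return m.group(1) + m.group(2).replace('.', '')

samples = ['54.6464', '484.644.161', '12.787.57.674', '312.4']
cleaned = [pat.sub(repl, s) for s in samples]
print(cleaned)  # ['54.6464', '484.644161', '12.78757674', '312.4']
```

A string without any decimal point, such as '42', does not match the pattern and passes through unchanged.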

cwxwcias2#

Very similar to Leonid's answer, but without using .apply. I am not sure which one is better.

import pandas as pd

data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}

df = pd.DataFrame(data=data)

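for key in df:
    # keep the text before the first '.', then append the remaining pieces joined without separators
    df[key] = [x.split('.')[0] + '.' + ''.join(x.split('.')[1:]) for x in df[key].tolist()]
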
print(df)
            A            B
0     54.6464      46.8484
1    64.68461         65.4
2    95.79527       65.644
3  484.644161        45.45
4     71.2579         21.1
5        12.8         10.8
6         9.6         12.5
7       312.4  12.78757674

wljmcqd83#

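Here is a solution that uses a regular expression to remove the extra decimal points instead of the function proposed by @Leonid Astrain; it reads like a shortened version: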

import pandas as pd

# create a sample dataframe
data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}
df = pd.DataFrame(data)

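# use regex to remove every decimal point after the first one:
# the first branch re-inserts the prefix up to the first '.', the second deletes any later '.'
df = df.replace(r'^([^.]*\.)|\.', r'\1', regex=True)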

# convert columns to float
df['A'] = df['A'].astype(float)
df['B'] = df['B'].astype(float)

print(df)

The output will be:

            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
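
When removing decimal points with a single regex, it matters which point the pattern leaves in place. The sketch below uses plain `re` on a few made-up sample strings to compare a lookahead pattern, which keeps the last point, with an alternation pattern, which keeps the first one as the question asks. Neither pattern is claimed to be the only way to do this.

```python
import re

samples = ['484.644.161', '12.787.57.674', '65.4']

# Lookahead: removes every '.' that is followed by another '.', so the LAST one survives
keep_last = [re.sub(r'\.(?=.*\.)', '', s) for s in samples]

# Alternation: the first branch matches the prefix up to the first '.' and puts it back
# via \1; the second branch matches any later '.' with an empty group 1, deleting it
keep_first = [re.sub(r'^([^.]*\.)|\.', r'\1', s) for s in samples]

print(keep_last)   # ['484644.161', '1278757.674', '65.4']
print(keep_first)  # ['484.644161', '12.78757674', '65.4']
```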

yebdmbv44#

I would do it like this:

def rectify_decimal(string):
    parts = string.split('.')
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    else:
        return parts[0]

df['A'] = df['A'].apply(rectify_decimal)
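
The line above cleans only column A and leaves the values as strings. A possible extension, sketched here on a small made-up dataframe, applies the same function to every column and converts the result to floats. The name `cleaned` and the sample data are assumptions for illustration.

```python
import pandas as pd

def rectify_decimal(string):
    # Keep the first '.' and concatenate everything after it
    parts = string.split('.')
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    return parts[0]

df = pd.DataFrame({
    'A': ['484.644.161', '71.257.9', '12.8'],
    'B': ['45.45', '21.1', '12.787.57.674'],
})

# Clean each column element-wise, then convert the cleaned strings to floats
cleaned = df.apply(lambda col: pd.to_numeric(col.map(rectify_decimal)))
print(cleaned.dtypes)
```

This assumes every cell is a string; missing values or non-string cells would need to be handled before calling `split`.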

whlutmcx5#

If a decimal point is always present:

df['A'].str.split('.').str[0] + '.' + df['A'].str.split('.').str[1:].str.join('')
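
If some values might have no decimal point at all, the expression above would append a trailing '.' to them. One possible variant, sketched here with `Series.str.partition` on a made-up series, splits at the first point only and leaves such values untouched.

```python
import pandas as pd

s = pd.Series(['484.644.161', '12.8', '42'])

# partition splits at the FIRST '.', giving three columns: before, separator, after.
# When there is no '.', the separator and the remainder are empty strings.
parts = s.str.partition('.')
fixed = parts[0] + parts[1] + parts[2].str.replace('.', '', regex=False)
print(fixed.tolist())  # ['484.644161', '12.8', '42']
```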

aemubtdh6#

You can use some string operations:

def convert(sr):
    return (sr.str.split('.', n=1, expand=True)
              .pipe(lambda x: x[0] + '.' + x[1].str.replace('.', '', regex=False))
              .astype(float))

df = df.apply(convert)
print(df)

# Output
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
