numpy: Extracting numeric values

Posted by wgx48brx on 2023-10-19 in Other


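I need to convert a column ('Price') whose entries may contain more than one numeric price (e.g., USD 500-750). I need to extract these numbers and take their average, excluding calendar years (e.g., 2019). Sometimes the value includes a string (e.g., USD 500), sometimes it is blank, and sometimes it is only a number (e.g., 500).
Currently, I am doing this: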

s=df['Price'].str.findall('\d+')
df['Price'] = s.apply(lambda x: np.mean([float(i) for i in x if int(i)<2000]))

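However, this only works when the value contains a string (e.g., USD 500). It does not work for blank values or for values that are only a number (e.g., 500).
How can I modify it so that it works for any type?
Thanks
Edit: example: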

df = pd.DataFrame({
    "Price": ["Above Market Price", "USD200K as an initial price", "USD310K to 360K.", 300000, "", "150,000"]
})

The first three work; the last three do not.
If I only had the first three, the output would be:

df = pd.DataFrame({
    "Output": ["", 200, 335]
})

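But once the last three are added, it stops working.
The final expected output is below. I don't need the last three to be divided by 3; I only need to get past the hurdle that the code cannot handle blanks or values that are only numbers. At the moment it only works when the value contains a string: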

df = pd.DataFrame({
    "Output": ["", 200, 335, 300, "", 150]
})

numpy

Source: https://stackoverflow.com/questions/76967949/extracting-numeric-values

2 Answers


eni9jsuy1#

Use a chain of substitutions (str.replace) followed by regular-expression matching (str.findall):

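import numpy as np
import pandas as pd

def extract_prices(prices):
    prices = (prices.astype(str).str.replace(',', '', regex=False)  # drop thousands separators
             .replace(r'(?<=\d)K', '000', regex=True).str.findall(r'\d+')  # expand "K", collect digit runs
             .str.join(' ').str.split(' ', expand=True).replace('', np.nan)  # one column per number
             .astype(float).div(1000))  # express values in thousands
    # Keep values below 2000 (in thousands) and average across each row.
    # Note: a bare year such as "2019" becomes 2.019 here and is not excluded.
    prices = prices.where(prices < 2000).mean(axis=1)
    return prices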

print(extract_prices(df['Price']))

0      NaN
1    200.0
2    335.0
3    300.0
4      NaN
5    150.0
dtype: float64
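
The `astype(str)` call at the start of the chain is what lets this handle the numeric and blank rows. A minimal sketch of the difference, using a shortened sample column (the variable names `raw` and `as_text` are only for illustration): on a mixed-type column the `.str` methods return NaN for non-string elements, which is why the original `apply` fails on a value like 300000.

```python
import numpy as np
import pandas as pd

# Shortened sample: a string, a plain integer, a blank, and a comma-grouped number.
prices = pd.Series(["USD200K as an initial price", 300000, "", "150,000"])

# Without astype(str): the integer row comes back as NaN, not as a list,
# so iterating over it in a lambda raises a TypeError.
raw = prices.str.findall(r"\d+")

# With astype(str): every row is a string, so every row yields a list
# (possibly empty for the blank row).
as_text = prices.astype(str).str.findall(r"\d+")

print(raw.tolist())
print(as_text.tolist())
```

The comma-grouped row still splits into two digit runs here, which is why both answers strip commas before matching.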


2nbm6dog2#

Here is a vectorized approach that uses regular-expression extraction. I am assuming the comma is a thousands separator.

import re

import pandas as pd

df = pd.DataFrame({
    "Price": [
        "Above Market Price",            # excluded
        "USD200K as an initial price",   # 200000
        "USD310K to 360K.",              # mean of (310000, 360000)
        300000,                          # 300000
        "",                              # excluded
        "150,000",                       # 150000
        "2019",                          # excluded
    ]
})

terms = (
    df['Price'].astype(str)
    .str.replace(',', '')  # Interpret comma as digit group separator
    .str.extractall(pat=r'(\d+k?)', flags=re.IGNORECASE)
    [0].str.lower()        # Normalize case for 'K'
    .str.replace('k', '000')
    .astype(float)
)
terms = terms[(terms < 2000) | (terms > 2200)].groupby(level=0).mean()
print(terms)

The result keeps the original index for rows where values were extracted, and has gaps for rows where nothing was extracted.
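
If a result with one entry per original row is needed, the gaps can be filled with NaN by reindexing against the source column. This is a sketch rather than part of the answer: the helper name `mean_prices` and the `Output` column are assumptions made for illustration, and the values stay in raw units (not thousands).

```python
import re

import pandas as pd


def mean_prices(prices: pd.Series) -> pd.Series:
    """Hypothetical helper: per-row mean price, NaN where nothing was extracted."""
    terms = (
        prices.astype(str)
        .str.replace(",", "", regex=False)          # comma as digit group separator
        .str.extractall(r"(\d+k?)", flags=re.IGNORECASE)[0]
        .str.lower()                                # normalize case for 'K'
        .str.replace("k", "000", regex=False)
        .astype(float)
    )
    # Drop values that look like calendar years, then average per original row.
    kept = terms[(terms < 2000) | (terms > 2200)]
    # reindex restores the rows that had no extracted value, as NaN.
    return kept.groupby(level=0).mean().reindex(prices.index)


df = pd.DataFrame({
    "Price": [
        "Above Market Price",
        "USD200K as an initial price",
        "USD310K to 360K.",
        300000,
        "",
        "150,000",
        "2019",
    ]
})
df["Output"] = mean_prices(df["Price"])
print(df)
```

Assigning the unreindexed Series to a column would align on the index as well; the explicit `reindex` just makes the NaN-filling step visible.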

