Best way to clean a Pandas column

gblwokeq  posted on 2023-01-07  in Other

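I have been trying to clean a particular column of a dataset. I call .apply() many times to strip out any symbols that may appear in the column's string values.
For each symbol, the call looks like this: .apply(lambda x: x.replace("", ""))
Although my code works, it is quite long and not very clean. I would like to know whether there is a shorter and/or better way to clean the column.
Here is my code: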

df_reviews = pd.read_csv("reviews.csv")
df_reviews = df_reviews.rename(columns={"Unnamed: 0" : "index", "0" : "Name"})
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]

df_reviews['name'] = df_reviews['name'].apply(lambda x: x.replace("Review", "")).apply(lambda x: x.replace(":", "")).apply(lambda x: x.replace("'", "")).apply(lambda x: x.replace('"', "")).apply(lambda x: x.replace("#", ""))\
                                .apply(lambda x: x.replace("{", "")).apply(lambda x: x.replace("}", "")).apply(lambda x: x.replace("_", "")).apply(lambda x: x.replace(":", ""))


df_reviews['name'] = df_reviews['name'].str.strip()

As you can see, the many .apply() calls make it hard to see exactly what is being removed from the "name" column.
Could anyone help me?
Regards

Answer #1 (lvjbypge)

You can also use a regex:

df_reviews['name'] = df_reviews['name'].str.replace('Review|[:\'"#{}_]', "", regex=True)

Regex pattern:

'Review|[:\'"#{}_]'
  • Review: matches the literal word "Review"
  • |: or
  • [:\'"#{}_]: any single character listed inside the square brackets []
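
To see what the pattern matches, here is a minimal sketch using Python's built-in re module, whose syntax pandas uses when regex=True. The sample strings are invented for illustration and do not come from the original dataset.

```python
import re

# Same pattern as above: the literal word "Review", or any one character from the set
PATTERN = r"""Review|[:'"#{}_]"""

# Hypothetical sample values, invented for this illustration
samples = ['Review: "Alice"', "{#Bob_}", " 'Carol' "]

# Remove every match, then trim the leftover whitespace
cleaned = [re.sub(PATTERN, "", s).strip() for s in samples]
print(cleaned)  # ['Alice', 'Bob', 'Carol']
```

Keeping all the single characters in one character class means a new symbol can be added in one place instead of adding another chained call.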

Note:

If you want to remove all punctuation, you can use this instead:

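import re
import string

# re.escape keeps characters such as \, ], ^ and - literal inside the character class
df_reviews['name'] = df_reviews['name'].str.replace(f'Review|[{re.escape(string.punctuation)}]', "", regex=True)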

This removes the following characters:

!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
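
One caveat worth checking: characters such as \, ], ^ and - have a special meaning inside a character class, so interpolating string.punctuation unescaped may not behave as intended. The sketch below compares the unescaped and escaped forms with the standard re module; the sample string is made up for illustration.

```python
import re
import string

# Hypothetical sample value containing a backslash and other punctuation
sample = r"Review: C:\temp [v1.0]!"

# Unescaped interpolation: "\]" is read as an escaped "]", so "\" itself is not in the set
unescaped = re.sub(f"Review|[{string.punctuation}]", "", sample).strip()

# re.escape makes every punctuation character a literal member of the set
escaped = re.sub(f"Review|[{re.escape(string.punctuation)}]", "", sample).strip()

print(unescaped)  # C\temp v10
print(escaped)    # Ctemp v10
```

In this sample only the backslash survives the unescaped version; escaping the characters avoids depending on the order in which they happen to appear.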

Answer #2 (n6lpvg4x)

Try this:

df['name'] = df['name'].str.replace('Review|:|\'|"|#|_', "", regex=True).str.strip()
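
Whether str.replace interprets the pattern as a regular expression depends on its regex argument, and the default changed to False in pandas 2.0. Here is a small sketch on an invented Series, passing the flag explicitly both ways so the result does not depend on the installed version:

```python
import pandas as pd

# Hypothetical sample values, invented for this illustration
names = pd.Series(['Review "Alice"', "#Bob_"])

# regex=False: the whole pattern is searched for as one literal string, so nothing matches
literal = names.str.replace('Review|["#_]', "", regex=False)

# regex=True: the pattern is an alternation of "Review" and a character class
as_regex = names.str.replace('Review|["#_]', "", regex=True).str.strip()

print(literal.tolist())   # ['Review "Alice"', '#Bob_']
print(as_regex.tolist())  # ['Alice', 'Bob']
```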

Answer #3 (zf2sa74q)

import pandas as pd

REMOVE_CHARS = ["Review", ":", "#", "{", "}", "_", "'", '"']

def process_name(name: str) -> str:
    # str.replace is a no-op when the substring is absent, so no try/except is needed
    for removal_char in REMOVE_CHARS:
        name = name.replace(removal_char, "")
    return name

def clean_code(df_reviews: pd.DataFrame) -> pd.DataFrame:
    # Rename `Unnamed: 0` to `index` and `0` to `Name`
    df_reviews = df_reviews.rename(columns={"Unnamed: 0": "index", "0": "Name"})
    # `Name` holds text separated by ":"; `expand=True` splits it into separate
    # columns, and we keep only the first one (column 0)
    df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
    # Remove every entry of REMOVE_CHARS from `name`, then trim whitespace
    df_reviews['name'] = df_reviews['name'].apply(process_name)
    df_reviews['name'] = df_reviews['name'].str.strip()
    return df_reviews

if __name__ == "__main__":
    df_reviews = pd.read_csv("reviews.csv")
    df_reviews = clean_code(df_reviews)
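
To try this cleanup without reviews.csv, here is a self-contained sketch that applies the same loop-based replacement to a small invented DataFrame. The sample rows, the helper name strip_tokens, and the column layout are assumptions modelled on the question's code, not real data.

```python
import pandas as pd

REMOVE_CHARS = ["Review", ":", "#", "{", "}", "_", "'", '"']

def strip_tokens(name: str) -> str:
    # Hypothetical helper mirroring process_name: drop each token if present
    for token in REMOVE_CHARS:
        name = name.replace(token, "")
    return name

# Invented rows shaped like the question's raw "Name" column
df = pd.DataFrame({"Name": ["Review 'Alice': great", '{"Bob_"}: ok']})

# Keep the text before the first ":", remove the tokens, then trim whitespace
df["name"] = df["Name"].str.split(":", expand=True)[0].apply(strip_tokens).str.strip()

print(df["name"].tolist())  # ['Alice', 'Bob']
```

Keeping the tokens in a single list makes it easy to see what is removed, which was the readability concern raised in the question.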
