Iterating through columns in a DataFrame, how do I turn the row values from object to string so I can use regex?

Posted by 7fhtutme on 2023-03-20 in Other

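I have a DataFrame with 54k+ rows and 31 columns; the last 10 columns are essays that I want to investigate.

The regex I want to run is meant to strip punctuation.

Running this on the first entry works: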

essay0 = okcupid.essay0.dropna()
essay0 = essay0.astype('string')  # this is the only way I could find to convert to string

essay_master = essay0
#print(essay_master[0].title())
essay_master = re.sub(r'[\.\?\!\,\:\;\(\)\"]', '', essay_master[0])
print(essay_master)

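But trying to build an approach that handles every row in all of the columns is giving me trouble. The code below is my attempt so far.

The question is: why does it work in the code above but not in the loop below? How do I convert the object values to strings so that the regex works?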

for col in okcupid[['essay0','essay1']]:
    col = okcupid[col] #col is the iterator and so acts as the index for which we are acting upon
    col.dropna(inplace=True) 
    #col = pd.DataFrame(data=col) # dont think this is needed...
    col.astype('string').dtypes
    #col.convert_dtypes(convert_string=True) # doesnt work
    print(col.dtypes) # still an object
    col = col.apply(lambda x: re.sub(r'[\.\?\!\,\:\;\(\)\"]', '', col)) # need string not object
    #for i, row in col.iterrows():
     #   lambda x: re.sub(r'[\.\?\!\,\:\;\(\)\"]', '', row) # this runs but doesnt seem to work on the rows...

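The dropna line works, so I expected the astype line (with and without dtypes at the end) to work as well, but it doesn't. I tried the convert_dtypes line, which didn't work either, along with a bunch of other things, and I'm completely stuck!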

regex

Source: https://stackoverflow.com/questions/75758775/iterating-through-columns-in-a-dataframe-how-do-i-turn-the-row-values-from-obje

3 Answers

q5lcpyga1#

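You should use native pandas methods as much as possible. Most of them implicitly handle NaN/None/etc., and they are much faster than custom functions used with .apply. Here you could try the following: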

for col in okcupid.filter(regex=r"essay\d+", axis=1).columns:  # raw string so \d is not treated as a string escape
    okcupid[col] = okcupid[col].str.replace("[.?!,:;()\"]", "", regex=True)

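Instead of .filter you could of course also use for col in ["essay0","essay1"]. But since you have 10 essay columns, using .filter probably makes your code more concise. Instead of re.sub you should use .str.replace, which does essentially the same thing but takes care of NaN/None and is faster. A side note on the regex: inside square brackets [] you don't need to escape any of these characters, except the " here, because I use it as the string delimiter.
If you apply this to a small sample DataFrame with a numbers column and two essay columns, you get: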

numbers essay0 essay1
0        1    a b      a
1        2   None    b c
2        3    cde      d
3        4     fg   None
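
As a minimal sketch of the approach in this answer, the snippet below builds a small DataFrame and runs the same loop over it. The input values are hypothetical, chosen only so that the cleaned columns match the result shown above; they are not the answerer's original sample.

```python
import pandas as pd

# Hypothetical sample data (an assumption, not the original sample)
okcupid = pd.DataFrame({
    "numbers": [1, 2, 3, 4],
    "essay0": ["a. b", None, "c,d:e", "f(g)"],
    "essay1": ["a!", "b? c", "d;", None],
})

# Strip punctuation from every column whose name matches essay<digits>.
# Missing values are skipped by the .str accessor and stay missing.
for col in okcupid.filter(regex=r"essay\d+", axis=1).columns:
    okcupid[col] = okcupid[col].str.replace(r'[.?!,:;()"]', "", regex=True)

print(okcupid)
```

The numbers column is untouched because .filter only selects the column names that match the pattern.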

2023-03-20

svmlkihl2#

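You don't have to use a for loop to iterate over the rows of a column. You can use the apply function. You first need to define a function that strips the punctuation. To convert the values in a column to strings, you can use astype: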

import re
import pandas as pd

def punct_strip(value):
    # Missing values (NaN/None/<NA>) are returned unchanged; note that
    # `value == np.nan` is always False, so pd.isna is used instead
    if pd.isna(value):
        return value
    return re.sub(r'[\.\?\!\,\:\;\(\)\"]', '', value)

okcupid['essay0'] = okcupid['essay0'].astype('string')  # StringDtype keeps missing values as <NA>
okcupid['essay0'] = okcupid['essay0'].apply(punct_strip)
okcupid['essay1'] = okcupid['essay1'].astype('string')
okcupid['essay1'] = okcupid['essay1'].apply(punct_strip)

The apply function passes the value of each row in the selected column to the given function. If you want to learn more about apply, see the pandas documentation.
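
One detail worth checking when the function passed to apply has to handle missing values: NaN never compares equal to anything, including itself, so an equality test against np.nan does not detect it. The sketch below uses pd.isna instead. The helper name strip_punct and the sample values are hypothetical, used only for illustration.

```python
import re

import numpy as np
import pandas as pd

# NaN is not equal to itself, so `value == np.nan` is always False
print(np.nan == np.nan)
print(pd.isna(np.nan))

def strip_punct(value):
    # Hypothetical helper: pass missing values through unchanged
    if pd.isna(value):
        return value
    return re.sub(r'[.?!,:;()"]', "", value)

# apply calls strip_punct once per value in the Series
essays = pd.Series(["Hi, there!", np.nan, "(ok)"])
cleaned = essays.apply(strip_punct)
print(cleaned.tolist())
```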

2023-03-20

yqlxgs2m3#

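Check the pandas documentation on working with text data. The dedicated pandas type for strings is StringDtype.
Also, you don't seem to be saving the result of col.astype in any variable.
Also take a look at the documentation for the Series.astype and DataFrame.astype methods.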
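
To illustrate the point about the unsaved result, here is a minimal sketch with a hypothetical Series standing in for one essay column. astype returns a new object rather than modifying the original, so the result has to be assigned. Note also that the lambda in the question passes col (the whole Series) to re.sub instead of x (a single value), which fails regardless of the dtype.

```python
import pandas as pd

# Hypothetical stand-in for one essay column, with a missing value
raw = pd.Series(["Hello, world!", None, "What?"], dtype="object")

raw.astype("string")          # result discarded: raw is unchanged
print(raw.dtype)              # still object

col = raw.astype("string")    # keep the converted Series
print(col.dtype)

# The .str accessor applies the regex per value and leaves missing values missing
cleaned = col.str.replace(r'[.?!,:;()"]', "", regex=True)
print(cleaned.tolist())
```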

2023-03-20
