pandas: extract specific words from a DataFrame - Python

zdwk9cvp  posted on 2022-12-28 in Python

1. The first DataFrame I have looks like this:
| String1 |
| --- |
| Table 671usa50452.tab has been created as of the process date (12-19-22). |
| Table 643usa50552.tab has been created as of the process date (12-19-22). |
| Table 681usa50532.tab has been created as of the process date (12-19-22). |
| Table 621usa56452.tab has been created as of the process date (12-19-22). |
| Table 547usa67452.tab has been created as of the process date (12-19-22). |
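I want to extract every account containing 'usa', together with the date given in each row, so that I end up with something like this:
| String1 | Account | Date |
| --- | --- | --- |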
| Table 671usa50452.tab has been created as of the process date (12-19-22). | 671usa50452 | 12-19-22 |
| Table 643usa50552.tab has been created as of the process date (12-19-22). | 643usa50552 | 12-19-22 |
| Table 681usa50532.tab has been created as of the process date (12-19-22). | 681usa50532 | 12-19-22 |
| Table 621usa56452.tab has been created as of the process date (12-19-22). | 621usa56452 | 12-19-22 |
| Table 547usa67452.tab has been created as of the process date (12-19-22). | 547usa67452 | 12-19-22 |
I have been trying to do this, but the extracted information never ends up in the columns of the new DataFrame.
2. The second DataFrame is similar:
| String2 |
| --- |
| 3203usa34088:assetUSA1/asd011245 |
| 3203usa34088:assetUSA2/ghf023345 |
| 3203usa34088:assetUSA3/hgf012735 |
| 3203usa34088:assetUSA4/wet012455 |
| 3203usa34088:assetUSA5/nbj012245 |
I would like to get the following:
| String2 | Account2 |
| --- | --- |
| 3200usa34088:assetUSA1/asd011245 | 3200usa34088 |
| 3201usa34088:assetUSA2/ghf023345 | 3201usa34088 |
| 3202usa34088:assetUSA3/hgf012735 | 3202usa34088 |
| 3203usa34088:assetUSA4/wet012455 | 3203usa34088 |
| 3204usa34088:assetUSA5/nbj012245 | 3204usa34088 |

bwntbbo3 #1

For the first DataFrame, we can use str.extract as follows:

df["Account"] = df["String1"].str.extract(r'(\w+)\.tab\b')
df["Date"] = df["String1"].str.extract(r'\((\d{2}-\d{2}-\d{2})\)')

For the second DataFrame:

df["Account2"] = df["String2"].str.extract(r'^(\w+)')

roqulrg3 #2

I think this works:

# Pandas lib
import pandas as pd

# -------------------------------------------------------------- FIRST DATAFRAME

# Assuming the DataFrame is loaded from an Excel file
df1 = pd.read_excel("First_df.xlsx")

# First case:
list_account = []
list_date = []
for string in df1['String1']:
    if "usa" in string:
        new_string = string.split()
        newnew_string = new_string[1].split(".")      # "671usa50452.tab" -> ["671usa50452", "tab"]
        date_string = new_string[10].split("(")       # "(12-19-22)." -> ["", "12-19-22)."]
        datedate_string = date_string[1].split(")")   # take the part after "(", then cut at ")"

        list_account.append(newnew_string[0])
        list_date.append(datedate_string[0])

df_output = pd.DataFrame({'Account': list_account})
df_output['Date'] = list_date

# -------------------------------------------------------------- SECOND DATAFRAME

df2 = pd.read_excel("Second_df.xlsx")

list_account2 = []

for string in df2['String2']:
    if "usa" in string:
        new_string = string.split()
        list_account2.append(new_string[0])
        
df_output2 = pd.DataFrame({'Account2': list_account2})
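
The loops above build separate output frames, so rows skipped by the "usa" check no longer line up with the original data. As a possible alternative, here is a rough sketch that keeps everything in one frame by using named capture groups with str.extract and joining the result back on the index. The combined pattern and the second, non-matching row are my own illustration and not part of the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "String1": [
        "Table 671usa50452.tab has been created as of the process date (12-19-22).",
        "A row that does not mention any table",  # hypothetical non-matching row
    ]
})

# Named groups become the column names of the extracted DataFrame.
# The account must contain "usa" and be followed by ".tab"; the date sits in parentheses.
pattern = r"(?P<Account>\w*usa\w*)\.tab\b.*\((?P<Date>\d{2}-\d{2}-\d{2})\)"
extracted = df1["String1"].str.extract(pattern)

# join aligns on the index, so non-matching rows keep NaN instead of being dropped.
df_output = df1.join(extracted)
print(df_output)
```

If only the matching rows are wanted, df_output.dropna(subset=["Account"]) removes the rest afterwards.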

lfapxunr #3

Answer for the first use case:

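l = []
l2 = []
rows = df.string1.tolist()  # convert once instead of on every iteration
for i in range(len(df)):
    l.append(rows[i].split(" ")[1])   # second token, e.g. "671usa50452.tab"
    s = rows[i].split(" ")[10]        # last token, e.g. "(12-19-22)."
    l2.append(s[s.find("(")+1:s.find(")")])

df['Account'] = l
df['Date'] = l2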

Output:

                                             string1          Account      Date
0  Table 671usa50452.tab has been created as of t...  671usa50452.tab  12-19-22
1  Table 643usa50552.tab has been created as of t...  643usa50552.tab  12-19-22
2  Table 681usa50532.tab has been created as of t...  681usa50532.tab  12-19-22
3  Table 621usa56452.tab has been created as of t...  621usa56452.tab  12-19-22
4  Table 547usa67452.tab has been created as of t...  547usa67452.tab  12-19-22

For the second one:

l3 = []
for i in range(len(df)):
    l3.append(df.string2.tolist()[i].split(" ")[0])  # first whitespace-separated token
df['Account2'] = l3
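
The same token-based idea can probably be written without explicit loops by using the vectorized .str accessor. This is only a sketch on two sample rows from the question, assuming the column is named "string1" as in the output above; like the loop, it keeps the ".tab" suffix on the account.

```python
import pandas as pd

df = pd.DataFrame({
    "string1": [
        "Table 671usa50452.tab has been created as of the process date (12-19-22).",
        "Table 643usa50552.tab has been created as of the process date (12-19-22).",
    ]
})

# Split every row on single spaces; each element becomes a list of tokens.
tokens = df["string1"].str.split(" ")

# Token 1 is the table name, e.g. "671usa50452.tab".
df["Account"] = tokens.str[1]

# Token 10 is "(12-19-22)."; strip the surrounding parentheses and the period.
df["Date"] = tokens.str[10].str.strip("().")

print(df[["Account", "Date"]])
```

The second case follows the same shape with tokens.str[0]. Positional tokens are fragile, though: if the sentence wording ever changes, index 10 will point at a different word, which is where a regular expression tends to hold up better.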
