Extracting text from a text file into a pandas DataFrame with regex in Python?

y4ekin9u  posted on 2023-04-13  in  Python

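I have ATM log data that contains the transaction log details for each customer. I am trying to extract the customer data from the log file, and I am having trouble extracting the Trx_datetime field from the text file.
My sample data: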

[01012020 101221 168][1][INFO]> -Cash Withdraw Initiated -------------
[01012020 101221 177][1][INFO]> -----Amount : 2500
[01012020 101221 187][21][INFO]> ----AUX NO : :xx:A00000XXX0200101101DD1-02
[01012020 101221 196][21][INFO]> ----AUX NO : :xx:A00000XXX200101101DD1-03
[01012020 101221 205][21][INFO]> ----AUX NO : :xx:A00000942020010XXXX221-04
[01012020 101222 487][1][INFO]> ---- Image Capture (TRX_RESPONSE_WITHDRAW)
[01012020 101222 560][1][INFO]> -----Withdraw Status : OK
[01012020 101222 567][1][INFO]> -----Account         : 60700XXXXXXXX
[01012020 101222 574][1][INFO]> -----Action Code     :na
[01012020 101222 580][1][INFO]> -----Response        : 000
[01012020 101222 587][1][INFO]> -----Trace ID        : 000000
[01012020 101222 595][1][INFO]> -----EOD ID          : 
[01012020 101222 602][1][INFO]> -----BATCH ID        : 
[01012020 101222 609][1][INFO]> -----TRX NO          : 
[01012020 101222 615][1][INFO]> ---Cash Withdraw Initiated Completed
[01012020 101222 757][1][INFO]> ---Send Online Data
[01012020 101222 763][1][INFO]> -----ARC           : 3030
[01012020 101222 770][1][INFO]> -----Trx DateTime  : 11/1/2020
[01012020 101222 777][1][INFO]> -----Online Status : Online_Perfoamed
[01012020 101223 091][1][INFO]> -EMV Transaction Completed------------
[01012020 101223 099][1][INFO]> --- Status  : Success
[01012020 101223 108][1][INFO]> --- Message : Approved
[01012020 101941 893][1][INFO]> -Cash Withdraw Initiated -------------
[01012020 101941 900][1][INFO]> -----Amount : 30000
[01012020 101941 910][15][INFO]> ----AUX NO : :xx:A00000942xxxxxxxxx1941-02
[01012020 101941 919][15][INFO]> ----AUX NO : :xx:A000009420200XXXXXXXXX-03
[01012020 101941 928][15][INFO]> ----AUX NO : :xx:A000009xxxxxxxxx11xx41-04
[01012020 101943 317][1][INFO]> ---- Image Capture (TRX_RESPONSE_WITHDRAW)
[01012020 101943 406][1][INFO]> -----Withdraw Status : OK
[01012020 101943 415][1][INFO]> -----Account         : 6075XXXXXXXXX8
[01012020 101943 422][1][INFO]> -----Action Code     :na
[01012020 101943 429][1][INFO]> -----Response        : 000
[01012020 101943 436][1][INFO]> -----Trace ID        : 165870
[01012020 101943 442][1][INFO]> -----EOD ID          : 
[01012020 101943 449][1][INFO]> -----BATCH ID        : 
[01012020 101943 456][1][INFO]> -----TRX NO          : 
[01012020 101943 463][1][INFO]> ---Cash Withdraw Initiated Completed
[01012020 101943 605][1][INFO]> ---Send Online Data
[01012020 101943 613][1][INFO]> -----ARC           : 3030
[01012020 101943 619][1][INFO]> -----Trx DateTime  : 1/1/2020
[01012020 101943 628][1][INFO]> -----Online Status : Online_Perfoamed
[01012020 101943 972][1][INFO]> -EMV Transaction Completed------------
[01012020 101943 979][1][INFO]> --- Status  : Success
[01012020 101943 986][1][INFO]> --- Message : Approved
[01012020 102838 263][1][INFO]> -Cash Withdraw Initiated -------------
[01012020 102838 271][1][INFO]> -----Amount : 5000
[01012020 102838 281][10][INFO]> ----AUX NO : :xx:A000009420XXXXXXXXXXXX-02
[01012020 102838 290][10][INFO]> ----AUX NO : :xx:A00000942XXXXXXXXXXXXX-03
[01012020 102838 298][10][INFO]> ----AUX NO : :xx:A00000942XXXXXXXXXXXXX-04
[01012020 102839 660][1][INFO]> ---- Image Capture (TRX_RESPONSE_WITHDRAW)
[01012020 102839 735][1][INFO]> -----Withdraw Status : OK
[01012020 102839 742][1][INFO]> -----Account         : 106XXXXXXXXX
[01012020 102839 748][1][INFO]> -----Action Code     :na
[01012020 102839 755][1][INFO]> -----Response        : 000
[01012020 102839 762][1][INFO]> -----Trace ID        : 167030
[01012020 102839 768][1][INFO]> -----EOD ID          : 
[01012020 102839 777][1][INFO]> -----BATCH ID        : 
[01012020 102839 783][1][INFO]> -----TRX NO          : 
[01012020 102839 790][1][INFO]> ---Cash Withdraw Initiated Completed
[01012020 102839 931][1][INFO]> ---Send Online Data
[01012020 102839 940][1][INFO]> -----ARC           : 3030
[01012020 102839 947][1][INFO]> -----Trx DateTime  : 11/12/2020
[01012020 102839 953][1][INFO]> -----Online Status : Online_Perfoamed
[01012020 102840 273][1][INFO]> -EMV Transaction Completed------------
[01012020 102840 280][1][INFO]> --- Status  : Success
[01012020 102840 325][1][INFO]> --- Message : Approved

I tried this code:

import re
import pandas as pd

# Extract the required data from the text file using regular expressions
amounts = [int(m) for m in re.findall(r'Amount\s*:\s*(\d+)', text)]
withdraw_statuses = re.findall(r'Withdraw\s+Status\s*:\s*(\w+)', text)
accounts = re.findall(r'Account\s*:\s*(\d+)', text)
#trace_ids = [int(m) for m in re.findall(r'Trace\s+ID\s*:\s*(\d+)', text)]
trace_ids =  re.findall(r'Trace\s+ID\s*:\s*(\d+)', text)
#trx_datetimes = re.findall(r'Trx\s+DateTime\s\s:\s*(.+)', text)
trx_datetimes = re.findall(r'Trx\s+DateTime\s\s:\s\d{1,2}\/\d{1,2}\/\d{4}\s+', text)
#trx_datetimes = re.findall(r'Trx\s+DateTime\s\s:\s(\d{1,2}\/\d{1,2}\/\d{4}\s+\d{1,2}:\d{1,2}:\d{1,2}\s+(?:AM|PM))', text)
online_statuses = re.findall(r'Online\s+Status\s*:\s*(.+)', text)
statuses = re.findall(r'Status\s\s:\s*(.+)', text)
messages = re.findall(r'Message\s*:\s*(.+)', text)

# Create a list of dictionaries to store the extracted data for each transaction
data_list = []
for i in range(len(amounts)):
    data_dict = {
        'amount': amounts[i],
        'withdraw_status': withdraw_statuses[i],
        'account': accounts[i],
        'trace_id': trace_ids[i],
        'trx_datetime': trx_datetimes[i],
        'online_status': online_statuses[i],
        'status': statuses[i],
        'message': messages[i],
    }
    data_list.append(data_dict)

# Create a pandas dataframe from the list of dictionaries
dff = pd.DataFrame(data_list)
dff['trx_datetime'] = pd.to_datetime(df['trx_datetime'])
dff['upload_datetime'] = pd.Timestamp('now')
dff

My output is:

trx_datetime has a null value in the second row; only the first value is captured. How can I capture all of the trx_datetime values in the DataFrame?

Answer #1 (axzmvihb)

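Your code works fine, except that you forgot the capture group for trx_datetimes. Without a capture group, re.findall returns the entire match (the "Trx DateTime" label included) instead of just the date. The to_datetime line also refers to df where it should refer to dff: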

#                                          HERE --v                     --v
trx_datetimes = re.findall(r'Trx\s+DateTime\s\s:\s(\d{1,2}\/\d{1,2}\/\d{4})\s+', text)
...
#                    dff and not df --v
dff['trx_datetime'] = pd.to_datetime(dff['trx_datetime'])

Output:

>>> dff
   amount withdraw_status account trace_id trx_datetime     online_status   status   message            upload_datetime
0    2500              OK   60700   000000   2020-11-01  Online_Perfoamed  Success  Approved 2023-04-02 19:36:26.989106
1   30000              OK    6075   165870   2020-01-01  Online_Perfoamed  Success  Approved 2023-04-02 19:36:26.989106
2    5000              OK     106   167030   2020-11-12  Online_Perfoamed  Success  Approved 2023-04-02 19:36:26.989106
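
To see why the capture group matters, here is a minimal sketch that runs both versions of the pattern on two lines copied from the sample log. The variable names `sample`, `without_group` and `with_group` are made up for this illustration and are not part of the original code.

```python
import re

# Two "Trx DateTime" lines copied from the sample log in the question.
sample = (
    "[01012020 101222 770][1][INFO]> -----Trx DateTime  : 11/1/2020\n"
    "[01012020 101943 619][1][INFO]> -----Trx DateTime  : 1/1/2020\n"
)

# No capture group: findall returns the whole match, so each item still
# carries the label and the trailing newline.
without_group = re.findall(r'Trx\s+DateTime\s\s:\s\d{1,2}/\d{1,2}/\d{4}\s+', sample)

# One capture group: findall returns only the text inside the parentheses.
with_group = re.findall(r'Trx\s+DateTime\s\s:\s(\d{1,2}/\d{1,2}/\d{4})\s+', sample)

print(without_group)
print(with_group)
```

Note that the trailing `\s+` requires at least one whitespace character after the date, so a date sitting at the very end of the file with no newline would not match. That never happens in the sample, where "Trx DateTime" is always followed by more lines.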

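As a possible alternative, and not something the original code does, you could parse the log one transaction at a time instead of building parallel lists. With parallel lists, a single transaction that lacks a field shifts every later value into the wrong row. The sketch below assumes that each transaction starts with a "Cash Withdraw Initiated ---" line; `FIELDS` and `parse_transactions` are hypothetical names, and the sample is shortened, with the second transaction's "Trx DateTime" line removed on purpose.

```python
import re

# Shortened sample based on the question's log. The second transaction
# deliberately has no "Trx DateTime" line.
text = """\
[01012020 101221 168][1][INFO]> -Cash Withdraw Initiated -------------
[01012020 101221 177][1][INFO]> -----Amount : 2500
[01012020 101222 560][1][INFO]> -----Withdraw Status : OK
[01012020 101222 770][1][INFO]> -----Trx DateTime  : 11/1/2020
[01012020 101223 108][1][INFO]> --- Message : Approved
[01012020 101941 893][1][INFO]> -Cash Withdraw Initiated -------------
[01012020 101941 900][1][INFO]> -----Amount : 30000
[01012020 101943 406][1][INFO]> -----Withdraw Status : OK
[01012020 101943 986][1][INFO]> --- Message : Approved
"""

# Hypothetical field table: column name -> pattern with one capture group.
FIELDS = {
    'amount': r'Amount\s*:\s*(\d+)',
    'withdraw_status': r'Withdraw\s+Status\s*:\s*(\w+)',
    'trx_datetime': r'Trx\s+DateTime\s*:\s*(\d{1,2}/\d{1,2}/\d{4})',
    'message': r'Message\s*:\s*(.+)',
}

def parse_transactions(log_text):
    """Return one dict per transaction; a missing field becomes None."""
    # Split at each start marker. The "... Initiated Completed" line does not
    # match because the marker must be followed by dashes. The first chunk is
    # whatever precedes the first transaction, so it is dropped.
    blocks = re.split(r'Cash Withdraw Initiated -+', log_text)[1:]
    records = []
    for block in blocks:
        record = {}
        for name, pattern in FIELDS.items():
            match = re.search(pattern, block)
            record[name] = match.group(1) if match else None
        records.append(record)
    return records

records = parse_transactions(text)
for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame(records)` should give one row per transaction, and `pd.to_datetime` turns the `None` entries into `NaT`. Also keep in mind that `pd.to_datetime` reads `11/1/2020` month-first by default; if your log writes the day first, pass `dayfirst=True`.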