Python: convert a text file into a pandas DataFrame with multi-line text

r1zhe5dt, posted on 2023-04-10 in Python

I have a protocol dump in a plain text file, in the following format:

Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)
Bluetooth HCI H4
    [Direction: Sent (0x00)]
    HCI Packet Type: ACL Data (0x02)
0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................
0010  00 00 00                                          ...
Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)
Bluetooth HCI H4
    [Direction: Rcvd (0x01)]
    HCI Packet Type: HCI Event (0x04)
0000  04 13 05 01 0b 00 01 00                           ........
Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)
Bluetooth HCI H4
    [Direction: Rcvd (0x01)]
    HCI Packet Type: ACL Data (0x02)
0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G..
0010  00 00 00 01 02 00 04                              .......

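In this simplified example, the frame numbers (380, 381, and so on) are part of the first line of each frame in the text. I want to convert this into a pandas DataFrame of the following form: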

FrameNumber                                   Details                                  
|---------------------------------------------------------------------------------------|
|            | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|            | Bluetooth HCI H4                                                         |
|   380      |     [Direction: Sent (0x00)]                                             |
|            |     HCI Packet Type: ACL Data (0x02)                                     |
|            | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|            | 0010  00 00 00                                                           |
|---------------------------------------------------------------------------------------|
|            | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|            | Bluetooth HCI H4                                                         |
|   381      |     [Direction: Rcvd (0x01)]                                             |
|            |     HCI Packet Type: HCI Event (0x04)                                    |
|            | 0000  04 13 05 01 0b 00 01 00                           ........         |
|---------------------------------------------------------------------------------------|
|            | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|            | Bluetooth HCI H4                                                         |
|   382      |     [Direction: Rcvd (0x01)]                                             |
|            |     HCI Packet Type: ACL Data (0x02)                                     |
|            | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|            | 0010  00 00 00 01 02 00 04                              .......          |
+---------------------------------------------------------------------------------------+

I tried using pandas read_csv(), but with my limited knowledge of multi-line regular expression matching I could not solve it. Can anyone help me with a simple solution?

Answer #1 (wfauudbj)

Another solution, using the re module:

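import re
import pandas as pd

all_data = []
with open("data.txt", "r") as f_in:
    # Group 1 is the whole frame block, group 2 is the frame number.
    # re.M lets ^ match at the start of every line; re.S lets . span newlines.
    # The lookahead ends a block just before the next "Frame N" line or at end of input.
    for (g, n) in re.findall(
        r"^(Frame (\d+).*?)\s*(?=^Frame \d+|\Z)", f_in.read(), flags=re.M | re.S
    ):
        all_data.append({"FrameNumber": int(n), "Details": g})

df = pd.DataFrame(all_data)
print(df)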

Prints:

|    |   FrameNumber | Details                                                                  |
|---:|--------------:|:-------------------------------------------------------------------------|
|  0 |           380 | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Sent (0x00)]                                             |
|    |               |     HCI Packet Type: ACL Data (0x02)                                     |
|    |               | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|    |               | 0010  00 00 00                                          ...              |
|  1 |           381 | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Rcvd (0x01)]                                             |
|    |               |     HCI Packet Type: HCI Event (0x04)                                    |
|    |               | 0000  04 13 05 01 0b 00 01 00                           ........         |
|  2 |           382 | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Rcvd (0x01)]                                             |
|    |               |     HCI Packet Type: ACL Data (0x02)                                     |
|    |               | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|    |               | 0010  00 00 00 01 02 00 04                              .......          |
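To try the regex approach above without a data.txt file, here is a minimal sketch that runs the same pattern on an in-memory string. The names SAMPLE, FRAME_RE and frames_to_df are illustrative and not part of the original answer, and the sample is just the first two frames from the question.

```python
import re
import pandas as pd

# First two frames of the dump from the question, held in memory (assumed sample).
SAMPLE = """\
Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)
Bluetooth HCI H4
    [Direction: Sent (0x00)]
    HCI Packet Type: ACL Data (0x02)
0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................
0010  00 00 00                                          ...
Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)
Bluetooth HCI H4
    [Direction: Rcvd (0x01)]
    HCI Packet Type: HCI Event (0x04)
0000  04 13 05 01 0b 00 01 00                           ........
"""

# Same pattern as in the answer: group 1 = whole block, group 2 = frame number.
FRAME_RE = re.compile(r"^(Frame (\d+).*?)\s*(?=^Frame \d+|\Z)", flags=re.M | re.S)


def frames_to_df(text: str) -> pd.DataFrame:
    """Build one row per frame; Details keeps the block as a multi-line string."""
    rows = [
        {"FrameNumber": int(number), "Details": block}
        for block, number in FRAME_RE.findall(text)
    ]
    return pd.DataFrame(rows, columns=["FrameNumber", "Details"])


df = frames_to_df(SAMPLE)
print(df["FrameNumber"].tolist())
print(df.loc[0, "Details"])
```

Because the block is captured verbatim, the indentation of the nested lines is preserved inside each Details string.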
Answer #2 (btqmn9zl)

Using extract and groupby:

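import pandas as pd

# Read the dump line by line into a single "Details" column.
df = pd.read_fwf("input2.txt", header=None, names=["Details"])

# Label each header line with "Frame N", then forward-fill onto the lines that follow.
df["FrameNumber"] = (df["Details"].str.extract(r"(Frame \d+)", expand=False)
                         .where(df["Details"].str.startswith(r"Frame")).ffill())

# Join the lines of each frame into one multi-line string.
out = df.groupby("FrameNumber", as_index=False).agg("\n".join)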

Output:

+---------------+--------------------------------------------------------------------------+
| FrameNumber   | Details                                                                  |
|---------------+--------------------------------------------------------------------------|
| Frame 380     | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Sent (0x00)]                                                 |
|               | HCI Packet Type: ACL Data (0x02)                                         |
|               | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|               | 0010  00 00 00                                          ...              |
| Frame 381     | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Rcvd (0x01)]                                                 |
|               | HCI Packet Type: HCI Event (0x04)                                        |
|               | 0000  04 13 05 01 0b 00 01 00                           ........         |
| Frame 382     | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Rcvd (0x01)]                                                 |
|               | HCI Packet Type: ACL Data (0x02)                                         |
|               | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|               | 0010  00 00 00 01 02 00 04                              .......          |
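In the output above, FrameNumber holds strings such as "Frame 380" and the indentation of the nested lines is gone. If an integer frame number and the original indentation are wanted, one possible variation (a sketch, not the answerer's code) reads the lines with plain Python and applies the same extract, ffill and groupby steps. The names SAMPLE and dump_to_df are hypothetical.

```python
import io
import pandas as pd

# First two frames of the dump from the question (assumed sample).
SAMPLE = """\
Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)
Bluetooth HCI H4
    [Direction: Sent (0x00)]
    HCI Packet Type: ACL Data (0x02)
0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................
0010  00 00 00                                          ...
Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)
Bluetooth HCI H4
    [Direction: Rcvd (0x01)]
    HCI Packet Type: HCI Event (0x04)
0000  04 13 05 01 0b 00 01 00                           ........
"""


def dump_to_df(lines) -> pd.DataFrame:
    """Group dump lines by frame; `lines` is any iterable of text lines."""
    # Strip only the trailing newline so leading indentation survives; skip blank lines.
    details = pd.Series([ln.rstrip("\n") for ln in lines if ln.strip()], name="Details")
    # Digits on "Frame N:" header lines, NaN elsewhere, then forward-fill.
    frame = details.str.extract(r"^Frame (\d+):", expand=False).ffill()
    keep = frame.notna()  # ignore any lines that precede the first frame header
    key = frame[keep].astype(int).rename("FrameNumber")
    return details[keep].groupby(key, sort=False).agg("\n".join).reset_index()


out = dump_to_df(io.StringIO(SAMPLE))
print(out["FrameNumber"].tolist())
print(out.loc[0, "Details"])
```

For a file on disk, passing an open file object (for example the result of open("input2.txt")) instead of io.StringIO should work the same way, since both yield text lines.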
