从txt文件中提取数据并导入CSV

q8l4jmvw  于 2022-12-15  发布在  其他
关注(0)|答案(1)|浏览(186)

我有一个txt文件中的数据,格式如下。

ScanHeader # 1
position = 1, start_mass= 2.000000, end_mass = 535.010058
start_time = 0.034048, end_time = 0.000000, packet_type = 24
num_readings = 114, integ_intens = 14276257.301926, data packet pos = 1026
uScanCount = 0, PeakIntensity = 6799450.500000, PeakMass = 18.045876
Scan Segment = 0, Scan Event = 0
Precursor Mass 
Collision Energy 
Isolation width 
Polarity positive, Cenrtoid Data, Full Scan Type, MS Scan
SourceFragmentation Any, Type Ramp, Values = 0, Mass Ranges = 0
Turbo Scan Any, IonizationMode ElectronImpact, Corona Any
Detector Any, Value = 0.00, ScanTypeIndex = -1
DataPeaks

Packet # 0, intensity = 3691.226074, mass/position = 2.112536
saturated = 0, fragmented = 0, merged = 0

Packet # 1, intensity = 42881.203125, mass/position = 3.466080
saturated = 0, fragmented = 0, merged = 0

Packet # 2, intensity = 3006256.000000, mass/position = 4.184193
saturated = 0, fragmented = 0, merged = 0

理想情况下,输出应该是如下所示的csv文件:

我试过使用regex和read_csv选项,但似乎都没有给予我想要的输出。我得到的最接近的是regex,在那里我设法提取了所有需要的数据,但我很难将其放入dataframe。代码如下所示:

from tabulate import tabulate
import re 

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data = re.findall(r'\d*last_scan = \d*\d.\d*', newfile.read())

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data1 = re.findall(r'\d* start_time = \d*\d.\d*', newfile.read())

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data2 = re.findall(r'\d* end_time = \d*\d.\d*', newfile.read())

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data3 = re.findall(r'\d*low_mass = \d*\d.\d*', newfile.read())

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data4 = re.findall(r'\d*high_mass = \d*\d.\d*', newfile.read())
    
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data5 = re.findall(r'\d*ScanHeader # \d', newfile.read())

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data6 = re.findall(r'\d*Packet # \d*', newfile.read())

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data7 = re.findall(r'\d* intensity = \d*\d.\d*', newfile.read())

with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data8 = re.findall(r'\d* mass/position = \d*\d.\d*', newfile.read())

import pandas as pd

data = {'Scanheader': [data5],
        'Packet Number': [data6],
        'Intensity': [data7], 
        'Mass/Position': [data8]
        }

df = pd.DataFrame(data) 
df.to_csv('2020-06-23-Didecylamine-deriv-0,1uL.csv', index=False)

这段代码的输出如下所示:

我知道有很多方法可以让这个代码变得简单一些,但是我还是一个初学者,还没有找到任何方法可以让它变得更简单。任何提示都将非常感谢:)

pgx2nnw8

pgx2nnw81#

您应该只打开文件一次。
您可以首先使用re标志re.MULTILINE + re.DOTALL匹配所有ScanHeaders的整个文本。
迭代这些匹配项并提取Header #和time。
最后,迭代数据包(在前一个匹配中找到的)以提取其他列:

data = []

scanHeader_pattern = re.compile(r'ScanHeader.*?(?=ScanHeader|\Z)', flags= re.MULTILINE + re.DOTALL)
packet_pattern = re.compile(r'Packet.*?(?=Packet|\Z)', flags= re.MULTILINE + re.DOTALL)

header_nb_pattern = re.compile(r'ScanHeader # (\d+)')
time_pattern = re.compile(r'start_time = (\d+\.\d+)', re.MULTILINE)

packet_nb_pattern = re.compile(r'Packet # (\d+)')
intensity_pattern = re.compile(r'intensity = (\d+\.\d+)')
mass_pos_pattern = re.compile(r'mass/position = (\d+\.\d+)')

for sh in re.findall(scanHeader_pattern, newfile.read()):
    h_nb = int(re.search(header_nb_pattern, sh).group(1))
    t = float(re.search(time_pattern, sh).group(1))
    
    for p in re.findall(packet_pattern, sh):
        p_nb = int(re.search(packet_nb_pattern, p).group(1))
        intensity = float(re.search(intensity_pattern, p).group(1))
        mass_pos = float(re.search(mass_pos_pattern, p).group(1))

        data.append(
            {'Scanheader': h_nb,
            'Packet Number': p_nb,
            'Time': t,
            'Intensity': intensity, 
            'Mass/Position': mass_pos
            }
        )

df = pd.DataFrame(data)

相关问题