I have data in a txt file, in the following format.
ScanHeader # 1
position = 1, start_mass= 2.000000, end_mass = 535.010058
start_time = 0.034048, end_time = 0.000000, packet_type = 24
num_readings = 114, integ_intens = 14276257.301926, data packet pos = 1026
uScanCount = 0, PeakIntensity = 6799450.500000, PeakMass = 18.045876
Scan Segment = 0, Scan Event = 0
Precursor Mass
Collision Energy
Isolation width
Polarity positive, Cenrtoid Data, Full Scan Type, MS Scan
SourceFragmentation Any, Type Ramp, Values = 0, Mass Ranges = 0
Turbo Scan Any, IonizationMode ElectronImpact, Corona Any
Detector Any, Value = 0.00, ScanTypeIndex = -1
DataPeaks
Packet # 0, intensity = 3691.226074, mass/position = 2.112536
saturated = 0, fragmented = 0, merged = 0
Packet # 1, intensity = 42881.203125, mass/position = 3.466080
saturated = 0, fragmented = 0, merged = 0
Packet # 2, intensity = 3006256.000000, mass/position = 4.184193
saturated = 0, fragmented = 0, merged = 0
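Ideally, the output should be a csv file like the following:
I have tried both regex and the read_csv option, but neither gives me the output I want. The closest I got was with regex, where I managed to extract all the data I need, but I am struggling to put it into a dataframe. The code looks like this: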
from tabulate import tabulate
import re
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data = re.findall(r'\d*last_scan = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data1 = re.findall(r'\d* start_time = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data2 = re.findall(r'\d* end_time = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data3 = re.findall(r'\d*low_mass = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data4 = re.findall(r'\d*high_mass = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data5 = re.findall(r'\d*ScanHeader # \d', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data6 = re.findall(r'\d*Packet # \d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data7 = re.findall(r'\d* intensity = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data8 = re.findall(r'\d* mass/position = \d*\d.\d*', newfile.read())
import pandas as pd
data = {'Scanheader': [data5],
        'Packet Number': [data6],
        'Intensity': [data7],
        'Mass/Position': [data8]
        }
df = pd.DataFrame(data)
df.to_csv('2020-06-23-Didecylamine-deriv-0,1uL.csv', index=False)
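The output of this code looks like this:
I know there are many ways to make this code simpler, but I am still a beginner and have not found any of them yet. Any tips would be greatly appreciated :)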
1 Answer
pgx2nnw8 #1
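You should open the file only once.
You can first use re with the flags re.MULTILINE + re.DOTALL to match the entire text of each ScanHeader. Iterate over these matches and extract the Header # and the times.
Finally, iterate over the packets (found within the preceding match) to extract the other columns: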
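The code from the original answer is not preserved here, so the following is only a minimal sketch of the approach described, not the answerer's exact code. It assumes the layout shown in the question. The names SCAN_RE, TIME_RE, PACKET_RE, parse_scans and rows_to_csv, and the column names, are hypothetical, and the second scan in SAMPLE is invented purely to show that several headers are handled. With a real file you would read the text once with open(...).read() and pass it to parse_scans.

```python
import csv
import io
import re

# One match per scan: the header number plus everything up to the next header.
SCAN_RE = re.compile(
    r"^ScanHeader # (?P<scan>\d+)(?P<body>.*?)(?=^ScanHeader # |\Z)",
    re.MULTILINE | re.DOTALL,
)
TIME_RE = re.compile(r"start_time = (?P<start>[\d.]+), end_time = (?P<end>[\d.]+)")
PACKET_RE = re.compile(
    r"Packet # (?P<num>\d+), intensity = (?P<intensity>[\d.]+), "
    r"mass/position = (?P<mass>[\d.]+)"
)

# Column names are an assumption; the desired CSV layout is not shown in the post.
COLUMNS = ["ScanHeader", "start_time", "end_time", "Packet", "Intensity", "Mass/Position"]

# Shortened sample. Scan 1 is taken from the question; scan 2 is made up.
SAMPLE = """\
ScanHeader # 1
position = 1, start_mass= 2.000000, end_mass = 535.010058
start_time = 0.034048, end_time = 0.000000, packet_type = 24
DataPeaks
Packet # 0, intensity = 3691.226074, mass/position = 2.112536
saturated = 0, fragmented = 0, merged = 0
Packet # 1, intensity = 42881.203125, mass/position = 3.466080
saturated = 0, fragmented = 0, merged = 0
ScanHeader # 2
start_time = 0.050000, end_time = 0.000000, packet_type = 24
DataPeaks
Packet # 0, intensity = 10.500000, mass/position = 2.200000
saturated = 0, fragmented = 0, merged = 0
"""


def parse_scans(text):
    """Return one dict per packet, carrying its scan number and scan times."""
    rows = []
    for scan in SCAN_RE.finditer(text):
        body = scan.group("body")
        times = TIME_RE.search(body)  # None if the scan has no time line
        for packet in PACKET_RE.finditer(body):
            rows.append({
                "ScanHeader": int(scan.group("scan")),
                "start_time": float(times.group("start")) if times else None,
                "end_time": float(times.group("end")) if times else None,
                "Packet": int(packet.group("num")),
                "Intensity": float(packet.group("intensity")),
                "Mass/Position": float(packet.group("mass")),
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text using only the standard library."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(parse_scans(SAMPLE)))
```

Because each row is a plain dict, the list can also be handed to pandas if you prefer a dataframe: pd.DataFrame(rows) builds one column per key, and df.to_csv(path, index=False) writes it out as in the question.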