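Today I am using the tsfresh library to handle my time series dataset and perform time series classification.
I am following this tutorial to adapt the code to my data. I have implemented some of the steps so far, but I get an error when splitting the data: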
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Import the data path (my time series path)
data_path = 'PATH'
#Import the csv containing the label (in my case "Reussite_Sevrage")
target_df = pd.read_csv("PATH.csv",encoding="ISO-8859-1", dtype={'ID': 'str'})
# Delete the useless lines (containing nan values in the end of the dataset)
target_df = target_df.iloc[0:57,:]
# Definition of the labels
labels = target_df['Reussite_sevrage']
# Definition of the df containing the IDs
sequence_ids=target_df['ID']
#Splitting the data
train_ids, test_ids, train_labels, test_labels = train_test_split(sequence_ids, labels, test_size=0.2)
#Create the X_train and X_test dataframe
X_train = pd.DataFrame()
X_test = pd.DataFrame()
# Now we will loop through the training sequence IDs and the testing sequence IDs.
# For each of these sequence IDs, we will read the corresponding time series data CSV file and add it to the main dataframe.
# We will also add a column for the sequence number and a step column which contains integers representing the time step in the sequence
for i, sequence in enumerate(train_ids):
    inputfile = 'PATH'/ f"{sequence}.txt"
    if inputfile.exists():
        df = pd.read_csv(os.path.join(data_path, 'PAD/', "%s.txt" % sequence),
                         delimiter='\t',  # columns are separated by tabs
                         header=None,  # there's no header information
                         #parse_dates=[[0, 1]],  # the first and second columns should be combined and converted to datetime objects
                         #infer_datetime_format=True,
                         decimal=",")
        df = df.iloc[:,1]
        df = df.to_frame(name ='values')
        df.insert(0, 'sequence', i)
        df['step'] = np.arange(df.shape[0])  # creates a range of integers from 0 to the number of measurements minus one
        X_train = pd.concat([X_train, df])
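I added a condition inside the loop so that only files that actually exist are checked and processed. Missing data is represented by missing files. If I leave this condition out, the loop stops as soon as it hits a missing file.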
inputfile = PATH / f"{sequence}.txt"
if inputfile.exists():
But I get the following error: unsupported operand type(s) for /: 'str' and 'str'
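I don't know whether the error is caused by dtype={'ID': 'str'} in the data loading step, but I need it because the IDs are formatted as 0001, 0002, 0003, ... Without it, the IDs would be converted to 1, 2, 3, ...
sequence_ids, train_ids, train_labels, test_ids and test_labels are pandas Series, and sequence is a string.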
Can you think of a way to solve this problem?
Thank you very much.
1 Answer
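I suggest using the pathlib module to handle file paths. You can import it with
from pathlib import Path
and inputfile would then be
Path(data_path) / f"PAD/{sequence}.txt"
This creates a Path object pointing to the sequence file. You should now be able to call the exists() method on it. Final code: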
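A minimal sketch of how the loading loop might look with this change. The error in the question comes from applying / to two plain strings; once the left operand is a Path, / joins path components instead. The helper name load_sequences and its subdir parameter are assumptions made for this illustration and do not come from the question. The read_csv options are copied from the question's loop.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def load_sequences(data_path, sequence_ids, subdir="PAD"):
    """Read one tab-separated file per ID and stack them in long format.

    IDs whose file does not exist are skipped. The function name and the
    `subdir` parameter are hypothetical, chosen only for this sketch.
    """
    frames = []
    for i, sequence in enumerate(sequence_ids):
        # Path / str works; str / str raises the TypeError from the question.
        inputfile = Path(data_path) / subdir / f"{sequence}.txt"
        if not inputfile.exists():
            continue  # a missing file means missing data for this ID
        df = pd.read_csv(inputfile, delimiter="\t", header=None, decimal=",")
        df = df.iloc[:, 1].to_frame(name="values")  # keep only the second column
        df.insert(0, "sequence", i)
        df["step"] = np.arange(len(df))  # time step within the sequence
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=["sequence", "values", "step"])
    # Concatenate once at the end instead of growing a DataFrame in the loop.
    return pd.concat(frames, ignore_index=True)
```

With this helper, the two dataframes could be built as X_train = load_sequences(data_path, train_ids) and X_test = load_sequences(data_path, test_ids). Because the sequence number comes from enumerate, skipped IDs leave gaps in the numbering, so the labels would need to be restricted to the IDs whose files were found.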