pandas: How do I create a dataset from three files on disk using the Python datasets library?

Posted by idfiyjo8 on 2023-02-17 in Python

I have three files on disk, train.xlsx, validation.xlsx, and test.xlsx, and I need a dataset from the datasets library that contains all three. Here is my code:

from google.colab import drive
from datasets import Dataset
import pandas as pd
drive.mount('/content/drive')
train_data = pd.read_excel('/content/drive/My Drive/NLP-Datasets/Question2_Data/train.xlsx')
validation_data = pd.read_excel('/content/drive/My Drive/NLP-Datasets/Question2_Data/valid.xlsx')
test_data = pd.read_excel('/content/drive/My Drive/NLP-Datasets/Question2_Data/test.xlsx')

print(train_data.shape)
print(validation_data.shape)
print(test_data.shape)

Now I need a single dataset whose keys map to the corresponding files: dataset['train'], dataset['validation'], and dataset['test']. Can anyone help?

vddsk6oq (answer #1)

Try this:

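# Convert each DataFrame to a list of rows
train_data = train_data.values.tolist()
validation_data = validation_data.values.tolist()
test_data = test_data.values.tolist()
d = {'train_data': train_data,
     'validation_data': validation_data,
     'test_data': test_data}
# Wrap each list in a Series so splits of different lengths can share one frame
# (shorter columns are padded with NaN); plain lists of unequal length raise ValueError
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})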

Note that .values.tolist() is only suitable if each DataFrame has a single column. Otherwise, select the column explicitly, e.g. train_data['COLUMN'].values.tolist()
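
To make the difference concrete, here is a minimal sketch of what each call returns. The toy frames and the column names `text` and `label` are assumptions for illustration, not the asker's data. Note that calling `.values.tolist()` on a whole DataFrame yields one inner list per row, even when there is only one column, while selecting a column first yields a flat list.

```python
import pandas as pd

# Hypothetical toy frames standing in for the spreadsheets
single = pd.DataFrame({"text": ["a", "b", "c"]})
multi = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})

# Whole-frame conversion: one inner list per row
rows_single = single.values.tolist()
rows_multi = multi.values.tolist()

# Column selection first: a flat list of that column's values
flat_text = multi["text"].values.tolist()

print(rows_single)  # [['a'], ['b'], ['c']]
print(rows_multi)   # [['a', 0], ['b', 1]]
print(flat_text)    # ['a', 'b']
```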

zynd9foi (answer #2)

Try this:

import pandas as pd
import os

from google.colab import drive
drive.mount('/content/drive')

os.chdir('/content/drive/My Drive/NLP-Datasets/Question2_Data/')

train_data = 'train.xlsx'
validation_data = 'valid.xlsx'
test_data = 'test.xlsx'

paths = [train_data, validation_data , test_data ]
dfs = {p: pd.read_excel(p) for p in paths}
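
The dictionary above is keyed by file name ('train.xlsx', 'valid.xlsx', 'test.xlsx'), whereas the question asks for the keys 'train', 'validation', and 'test'. One possible way to derive those keys is sketched below. The `STEM_TO_SPLIT` mapping and the `split_name` helper are hypothetical names introduced only for this illustration.

```python
from pathlib import Path

# Hypothetical mapping from file stem to the split name the question asks for
STEM_TO_SPLIT = {"train": "train", "valid": "validation", "test": "test"}

def split_name(path):
    """Return the split name for a file such as 'valid.xlsx'."""
    return STEM_TO_SPLIT[Path(path).stem]

paths = ["train.xlsx", "valid.xlsx", "test.xlsx"]
keys = {p: split_name(p) for p in paths}
print(keys)
```

With such a helper, the comprehension could be written as `{split_name(p): pd.read_excel(p) for p in paths}` so that the resulting dictionary is keyed by split name.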

Update: you can create a dataset from the three files on disk with the Python datasets library as follows:

from google.colab import drive
from datasets import Dataset, DatasetDict
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

# Load train, validation, and test data
train_data = pd.read_excel('/content/drive/My Drive/NLP-Datasets/Question2_Data/train.xlsx')
validation_data = pd.read_excel('/content/drive/My Drive/NLP-Datasets/Question2_Data/valid.xlsx')
test_data = pd.read_excel('/content/drive/My Drive/NLP-Datasets/Question2_Data/test.xlsx')

# Convert each DataFrame to a dict of column lists
train_dict = train_data.to_dict(orient='list')
validation_dict = validation_data.to_dict(orient='list')
test_dict = test_data.to_dict(orient='list')

# Build one Dataset per split and group them in a DatasetDict, so that
# dataset['train'], dataset['validation'], and dataset['test'] each resolve to a split.
# (Dataset.from_dict expects column names as keys, not split names.)
dataset = DatasetDict({
    'train': Dataset.from_dict(train_dict),
    'validation': Dataset.from_dict(validation_dict),
    'test': Dataset.from_dict(test_dict)
})

# Print the shapes of the data
print(dataset['train'].shape)
print(dataset['validation'].shape)
print(dataset['test'].shape)
