“for”循环:创建一个新列,该列考虑来自多个csv文件的新数据

vulvrdjw  于 2021-09-29  发布在  Java
关注(0)|答案(2)|浏览(319)

我想自动化一个过程,分配几个文件的标签。意外地,有人创建了许多文件(csv),如下所示:
文件名1: test_1.csv ```
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1

文件名2: `test_2.csv` ```
Node Target Char1 Var2 Start
1      2     23.1  No    1
1      3     12.4  No    1
1      4     52.1  Yes   1
1      12    14.5  No    1
2      1     23.1  No    0
2      41    12.4  Yes   0
3      15    8.2   No    0
3      12    63.1  No    0

文件名3: test_3.csv ```
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1
2 1 23.1 No 0
2 41 12.4 Yes 0
3 15 8.2 No 0
3 12 63.1 No 0
41 2 12.4 Yes 0
15 3 8.2 No 0
15 8 12.2 No 0
12 3 63.1 No 0

从我所看到的,csv文件是创建的,包括以前运行的数据。我想添加一个列,该列考虑到它来自的数据集,没有重复项,也就是说,只考虑在下一个数据集中添加了什么。例如,这意味着要有一个包含所有数据的唯一csv文件:
文件名全部: `test_all.csv` ```
Node Target Char1 Var2 Start  File
1      2     23.1  No    1      1
1      3     12.4  No    1      1
1      4     52.1  Yes   1      1
1      12    14.5  No    1      1
2      1     23.1  No    0      2
2      41    12.4  Yes   0      2
3      15    8.2   No    0      2
3      12    63.1  No    0      2
41      2     12.4  Yes   0      3
15      3     8.2   No    0      3
15      8     12.2  No    0      3
12      3     63.1  No    0      3

我正在考虑计算数据集之间的差异(以行为单位),并在此基础上添加一个新列。但是,我正在一个接一个地这样做,这是不可行的,因为我有,例如:

test_1.csv, test_2.csv, test_3.csv, ... , test_7.csv
filex_1.csv, filex_2.csv, ..., filex_7.csv
name_1.csv, name_2.csv, ..., name_7.csv

等等
后缀 _x17 :唯一的更改是文件名(例如。, filex, test, name, 和许多其他人)。
你能给我一些关于如何以一种更简单、更快的方式运行它的提示吗?例如,使用一个for循环,它考虑后缀并根据每个文件的新信息创建一个新列?如果你需要的话,我很乐意提供更多的信息和细节。

ztigrdn8

ztigrdn81#

你可以试试这样做。


# Importing libraries.

import os  # Misc OS interfaces.
import pandas as pd  # Data manipulation library.

# Constants.

PATH_DATA_FOLDER = ''  # Specify your data folder location.

# Let's get your filenames and only leave unique ones.

list_files = os.listdir(PATH_DATA_FOLDER)
list_filenames = list(pd.unique([file.split('_')[0] for file in list_files]))

# Now, when we have our filenames, we can loop through them, read files and build dataframes.

for filename in list_filenames:
    # Get list of columns using the first data file available and append the `File` column.
    list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, filename + '_1.csv')).columns) + ['ID', 'File']
    # Create a dataframe which we will fill with data from different data files.
    df_final = pd.DataFrame(columns=list_columns)
    # Loop through files of same type (test, filex, name...).
    # Here we will loop through indices from 1 to 7.
    # You might also calculate these values dynamically. 
    for x in range(1, 8):
        # Reading a data file.
        df = pd.read_csv(os.path.join(PATH_DATA_FOLDER, filename + '_{}.csv'.format(x)))
        # Filling the `File` column with the file index. 
        df['File'] = x
        # Creating an ID column to track duplicates in different files.
        df['ID'] = range(0, len(df))
        # Appending our final dataframe
        df_final = df_final.append(df)
    # Resseting the dataframe indices.
    # Removing duplicates using the initial columns (without the `File` column).
    df_final = df_final.reset_index(drop=True).drop_duplicates(subset=list_columns[0:-1])
    # Dropping the unused ID column.
    df_final = df_final.drop(['ID'], axis=1)
    # Printing out the dataframe.
    print(df_final)
x8goxv8g

x8goxv8g2#

你可以通过 pd.concatkeys -参数(文档)。

frames = [df1, df2, ...] # your dataframes
file_names = ['file1', 'file2', ...] # the file names

df = pd.concat(frames, keys=file_names)
输出
Node  Target  Char1 Var2  Start
file1 0      1       2   23.1   No      1
      1      1       3   12.4   No      1
      2      1       4   52.1  Yes      1
      3      1      12   14.5   No      1
file2 0      1       2   23.1   No      1
      1      1       3   12.4   No      1
      2      1       4   52.1  Yes      1
      3      1      12   14.5   No      1
      4      2       1   23.1   No      0
      5      2      41   12.4  Yes      0
      6      3      15    8.2   No      0
      7      3      12   63.1   No      0
file3 0      1       2   23.1   No      1
      1      1       3   12.4   No      1
      2      1       4   52.1  Yes      1
      3      1      12   14.5   No      1
      4      2       1   23.1   No      0
      5      2      41   12.4  Yes      0
      6      3      15    8.2   No      0
      7      3      12   63.1   No      0
      8     41       2   12.4  Yes      0
      9     15       3    8.2   No      0
      10    15       8   12.2   No      0
      11    12       3   63.1   No      0

为了在文件中保留重复项,我们可以临时将级别1索引设置为列,以便 drop_duplicates 仅在跨文件复制时匹配。

df = df.reset_index(level=1).drop_duplicates()

# get rid of the extra column

df = df.drop('level_1', axis=1)

# Set the file name index as new column

df = df.reset_index().rename(columns={'index':'File'})
输出
File  Node  Target  Char1 Var2  Start
0   file1     1       2   23.1   No      1
1   file1     1       3   12.4   No      1
2   file1     1       4   52.1  Yes      1
3   file1     1      12   14.5   No      1
4   file2     2       1   23.1   No      0
5   file2     2      41   12.4  Yes      0
6   file2     3      15    8.2   No      0
7   file2     3      12   63.1   No      0
8   file3    41       2   12.4  Yes      0
9   file3    15       3    8.2   No      0
10  file3    15       8   12.2   No      0
11  file3    12       3   63.1   No      0

相关问题