Loop that reads specific columns from CSV files and skips files where a column is missing

ntjbwcob  posted on 2023-09-28 in Other

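I have a few hundred CSV files containing different kinds of environmental data. For example, their headers all have the columns [sensor_id], [sensor_type], [timestamp], [long], [lat]. Some have an additional column [temperature], while others may have [pressure].
I want to write a script that reads only the temperature data and skips any CSV that lacks that column. Unfortunately, it throws an error whenever it cannot find the temperature column in a file.
These are the things I have tried: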

file_list<-list.files(path='path', full.names = TRUE, pattern="*.csv")

cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")

#attempt 1 - this leads to an undefined columns selected error

collate1 <- lapply(file_list, function(x) read.csv(x, sep=";", header=TRUE)[cind])

#attempt 2 - I try to skip files that lead to an error, but it still has an undefined columns selected error

collate2 <- tryCatch(lapply(file_list, function(x) read.csv(x, sep=";", header=TRUE)[cind]), 
        warning = function(e) print(paste('no temperature')))

#attempt 3 - I try some kind of if statement but I don't quite understand the output

collate3 <- lapply(file_list, function(x){
  df <- read.csv(x, sep=";", header=TRUE)
  df_new <- df[,c("sensor_id", "sensor_type", "lat", "lon", "timestamp")]
  if("temperature" %in% colnames(df)    ){
    return(df_new$temperature == df$temperature)
  } else {return(NA)
  }
})

collate3 <- collate3 [sapply(is.na, collate3 )]

There is a very large amount of data, so I would love to find something that works quickly.
Thanks

bweufnob #1

Read only the first row of each file to find out which files you need.

files_to_read_in  <- lapply(file_list, \(f) {
    dat  <- read.csv(f, nrow = 1, sep = ";")
    if(all(cind %in% names(dat))) {
        file_to_read  <- f    
    } else {
        file_to_read  <- NULL
    }
    file_to_read
}
)  |> unlist()

Then read in those files:

collate1 <- lapply(files_to_read_in, function(f) read.csv(f, sep=";", header=TRUE)[cind])
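
The two steps above can also be folded into a single helper. The following is only a minimal sketch: `read_if_complete` is a hypothetical name, and the two temporary files and their values are made up purely so the example runs on its own.

```r
# Hypothetical helper: returns the requested columns, or NULL if any is missing.
read_if_complete <- function(f, cols, sep = ";") {
  # Peek at a single data row; that is enough to learn the column names
  header <- names(read.csv(f, sep = sep, nrows = 1, check.names = FALSE))
  if (!all(cols %in% header)) return(NULL)
  read.csv(f, sep = sep, header = TRUE, check.names = FALSE)[cols]
}

# Demo with two throwaway files (made-up data, for illustration only)
tmp <- tempfile()
dir.create(tmp)
write.table(data.frame(sensor_id = 1:2, temperature = c(20.5, 21.0)),
            file.path(tmp, "a.csv"), sep = ";", row.names = FALSE)
write.table(data.frame(sensor_id = 3:4, pressure = c(1001, 1002)),
            file.path(tmp, "b.csv"), sep = ";", row.names = FALSE)

file_list <- list.files(tmp, full.names = TRUE, pattern = "\\.csv$")
cind <- c("sensor_id", "temperature")

collated <- lapply(file_list, read_if_complete, cols = cind)
collated <- Filter(Negate(is.null), collated)  # drop the skipped files
```

Here only `a.csv` survives the filter, because `b.csv` has no temperature column.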

yptwkmov #2

Method 1: read all the files, then keep only the list elements in which "temperature" exists:

file_list <- list.files(path = 'path', full.names = TRUE, pattern = "*.csv")
collate1 <- lapply(file_list, read.csv2)
collate1 <- collate1[sapply(collate1, function(z) "temperature" %in% names(z))]
cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")
collate1 <- lapply(collate1, function(z) z[cind])
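
If the end goal is one table rather than a list, the filtered data frames can be stacked, since they now all share the same columns. This is a small sketch under that assumption; the list below is a made-up stand-in for the filtered `collate1`, and `all_temps` is a hypothetical name.

```r
# Stand-in for the filtered list from Method 1 (made-up values)
collate1 <- list(
  data.frame(sensor_id = 1:2, temperature = c(20.5, 21.0)),
  data.frame(sensor_id = 3L, temperature = 19.8)
)

# Row-bind every data frame in the list into a single data frame
all_temps <- do.call(rbind, collate1)
```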

Method 2: if there are too many files, you can read just the first few rows of each to decide which files to read in full.

file_list <- list.files(path = 'path', full.names = TRUE, pattern = "*.csv")
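# name each element by its file path so that names(collate1) below returns the paths
collate1 <- lapply(setNames(nm = file_list), read.csv2, nrows = 2)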
collate1 <- collate1[sapply(collate1, function(z) "temperature" %in% names(z))]
cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")
collate1 <- lapply(names(collate1), function(z) read.csv2(z)[cind])
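
For comparison with attempt 2 in the question: "undefined columns selected" is an error, not a warning, and the `tryCatch` there wraps the whole `lapply`, so a `warning` handler never sees it. A possible variant is to catch errors per file, as sketched below. `read_or_null` is a hypothetical helper and the temporary files are made up for the demo. Note that this reads each file in full before failing and silently hides every other kind of read error too, so the header checks above are probably the better fit for large data.

```r
# Hypothetical helper: NULL whenever reading or column selection fails.
read_or_null <- function(f, cols, sep = ";") {
  tryCatch(
    read.csv(f, sep = sep, header = TRUE)[cols],
    error = function(e) NULL  # e.g. "undefined columns selected"
  )
}

# Demo with two throwaway files (made-up data, for illustration only)
demo_dir <- tempfile()
dir.create(demo_dir)
write.table(data.frame(sensor_id = 1:2, temperature = c(20.5, 21.0)),
            file.path(demo_dir, "a.csv"), sep = ";", row.names = FALSE)
write.table(data.frame(sensor_id = 3:4, pressure = c(1001, 1002)),
            file.path(demo_dir, "b.csv"), sep = ";", row.names = FALSE)

demo_files <- list.files(demo_dir, full.names = TRUE, pattern = "\\.csv$")
wanted <- c("sensor_id", "temperature")

# Try every file, then drop the ones that failed
collate2 <- Filter(Negate(is.null),
                   lapply(demo_files, read_or_null, cols = wanted))
```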
