Loops to read specific columns from CSVs, skip when absent

Posted by ntjbwcob on 2023-09-28 in Other


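I have a few hundred CSV files containing different types of environmental data. Their headers all have, for example, the columns [sensor_id], [sensor_type], [timestamp], [long], [lat]. Some have an additional column [temperature], others may have [pressure].
I want to write a script that reads only the temperature data and skips any CSV where that column is absent. Unfortunately, it throws an error when it cannot find the temperature column in some of the CSVs.
These are the things I have tried: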

file_list<-list.files(path='path', full.names = TRUE, pattern="*.csv")

cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")

#attempt 1 - this leads to an undefined columns selected error

collate1 <- lapply(file_list, function(x) read.csv(x, sep=";", header=TRUE)[cind])

#attempt 2 - I try to skip files that lead to an error, but it still has an undefined columns selected error

collate2 <- tryCatch(lapply(file_list, function(x) read.csv(x, sep=";", header=TRUE)[cind]), 
        warning = function(e) print(paste('no temperature')))

#attempt 3 - I try some kind of if statement but I don't quite understand the output

collate3 <- lapply(file_list, function(x){
  df <- read.csv(x, sep=";", header=TRUE)
  df_new <- df[,c("sensor_id", "sensor_type", "lat", "lon", "timestamp")]
  if("temperature" %in% colnames(df)    ){
    return(df_new$temperature == df$temperature)
  } else {return(NA)
  }
})

collate3 <- collate3 [sapply(is.na, collate3 )]

There is a very large amount of data, so I would love to find something that works quickly.
Thanks

csv

Source: https://stackoverflow.com/questions/76906153/loops-to-read-specific-columns-from-csvs-skip-when-absent

2 answers


bweufnob1#

Read the first line of each file to find out which files are needed.

# Keep only the files whose header contains every column in cind
files_to_read_in  <- lapply(file_list, \(f) {
    dat  <- read.csv(f, nrows = 1, sep = ";")
    if(all(cind %in% names(dat))) {
        file_to_read  <- f    
    } else {
        file_to_read  <- NULL
    }
    file_to_read
}
)  |> unlist()

Then read in those files:

collate1 <- lapply(files_to_read_in, function(f) read.csv(f, sep=";", header=TRUE)[cind])
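The two steps above can be tried end to end on throwaway data. This is only a minimal sketch: the demo files, their contents, and the names `demo_dir`, `has_all_cols` and `combined` are made up for illustration, and stacking the results with `rbind` is an assumption about what you want to do with them afterwards.

```r
# Hypothetical demo data: two small semicolon-separated files in a fresh temp directory
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
writeLines(c("sensor_id;sensor_type;lat;lon;timestamp;temperature",
             "1;DHT22;48.1;11.5;2023-01-01T00:00:00;3.5",
             "1;DHT22;48.1;11.5;2023-01-01T00:05:00;3.7"),
           file.path(demo_dir, "a.csv"))
writeLines(c("sensor_id;sensor_type;lat;lon;timestamp;pressure",
             "2;BMP180;48.2;11.6;2023-01-01T00:00:00;1012.3"),
           file.path(demo_dir, "b.csv"))

file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")

# Step 1: read only the header and one data row of each file and test for the columns
has_all_cols <- vapply(file_list, function(f) {
  all(cind %in% names(read.csv(f, sep = ";", nrows = 1)))
}, logical(1))

# Step 2: fully read only the files that passed, keeping just the wanted columns
collated <- lapply(file_list[has_all_cols], function(f) read.csv(f, sep = ";")[cind])

# Optional (an assumption about the next step): stack everything into one data frame
combined <- do.call(rbind, collated)
```

Here `b.csv` has no temperature column, so it is never read in full.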

Posted 2023-09-28

yptwkmov2#

Method 1: read all the files, then filter the list down to those where "temperature" is present:

file_list <- list.files(path = 'path', full.names = TRUE, pattern = "*.csv")
# read.csv2 defaults to sep = ";" (and dec = ",")
collate1 <- lapply(file_list, read.csv2)
# Drop the data frames that have no "temperature" column
collate1 <- collate1[sapply(collate1, function(z) "temperature" %in% names(z))]
cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")
# Keep only the columns listed in cind
collate1 <- lapply(collate1, function(z) z[cind])
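A related variant, closer to attempt 2 in the question, is to let the column selection fail and catch the failure. That attempt likely did not help for two reasons: "undefined columns selected" is an error rather than a warning, and the `tryCatch` wrapped the whole `lapply`, so one bad file would abort everything. The sketch below wraps each file separately. The helper name `read_temperature` and the demo files are hypothetical, and note that this handler swallows every error, including unreadable files, which may or may not be what you want.

```r
# Hypothetical helper: returns the selected columns, or NULL if anything fails
read_temperature <- function(f, cols) {
  tryCatch(
    read.csv(f, sep = ";", header = TRUE)[cols],
    # A missing column raises an error, so the handler must be `error`, not `warning`
    error = function(e) NULL
  )
}

# Hypothetical demo data in a fresh temp directory
demo_dir <- tempfile("csv_trycatch")
dir.create(demo_dir)
writeLines(c("sensor_id;sensor_type;lat;lon;timestamp;temperature",
             "1;DHT22;48.1;11.5;2023-01-01T00:00:00;3.5"),
           file.path(demo_dir, "a.csv"))
writeLines(c("sensor_id;sensor_type;lat;lon;timestamp;pressure",
             "2;BMP180;48.2;11.6;2023-01-01T00:00:00;1012.3"),
           file.path(demo_dir, "b.csv"))

file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")

# One tryCatch per file, then drop the NULL entries for files that failed
results <- lapply(file_list, read_temperature, cols = cind)
results <- Filter(Negate(is.null), results)
```

Like Method 1, this still reads every file in full, so it does not save any time on files that end up being discarded.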

Method 2: if there are too many files, you can first read just a few rows of each to decide which files to read in full.

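file_list <- list.files(path = 'path', full.names = TRUE, pattern = "*.csv")
# Read only the first rows; name each element by its file path so it can be re-read below
collate1 <- lapply(setNames(file_list, file_list), read.csv2, nrows = 2)
collate1 <- collate1[sapply(collate1, function(z) "temperature" %in% names(z))]
cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")
# names(collate1) now holds the paths of the files that have a temperature column
collate1 <- lapply(names(collate1), function(z) read.csv2(z)[cind])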
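Since speed matters in the question, the header check can probably be made lighter still by reading only the first line as plain text instead of parsing rows into a data frame. A minimal sketch, assuming every file starts with a single semicolon-separated header line; the helper `header_names` and the demo file are hypothetical, and no timings are claimed here.

```r
# Hypothetical helper: return the column names from the first line of a file
header_names <- function(f, sep = ";") {
  scan(f, what = character(), sep = sep, nlines = 1, quiet = TRUE)
}

# Hypothetical demo file
demo_file <- tempfile(fileext = ".csv")
writeLines(c("sensor_id;sensor_type;lat;lon;timestamp;temperature",
             "1;DHT22;48.1;11.5;2023-01-01T00:00:00;3.5"),
           demo_file)

file_list <- demo_file
cind <- c("sensor_id", "sensor_type", "lat", "lon", "timestamp", "temperature")

# Keep only the files whose header line contains every wanted column
keep <- vapply(file_list, function(f) all(cind %in% header_names(f)), logical(1))
files_to_read_in <- file_list[keep]
```

Unlike `read.csv`, `scan` does not pass the names through `make.names`, so headers containing spaces or other special characters would need to be compared in their raw form.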

Posted 2023-09-28
