Unable to count how many times a value occurs in a column of a csv file, R

Posted by cmssoen2 on 2023-09-28 in Other

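I have multiple large csv files that all have the same columns. I read them into a list in R called listFinal.data. I need to count the occurrences of different items in certain columns. For example, there is a column named "Biological.Stage". The values in that column can be "none", "incubation", "symptomatic", or "convalescence". I need to count how many times each of these stages occurs. This works fine. However, I have another column named "Incapacitation.Status" whose values can be either "Not Incapacitated" or "Incapacitated". I also need to count how many times each of these occurs. I use the same approach, but I get an error that says: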

Error in incapacitated_count[[l]] <- nrow(listFinal.data[[l]][listFinal.data[[l]]$Incapacitation.Status ==  : object 'incapacitated_count' not found

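This doesn't seem like it should be difficult, but I've been going around in circles all day and can't get it to work. Can someone point out my mistake before I go crazy?
I edited my post to show my code more completely, rather than just the problem loop, as suggested by a kind person who was trying to help me. I tried to avoid posting too much code because I know it's probably not very good. Please keep in mind that I'm self-taught and don't do a lot of coding.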

library(tidyverse)
library(ggplot2)
library(rlist)
library(magrittr)
library(DT)
library(readr)
library(dplyr)
 

options(shiny.maxRequestSize = 200*1024^2)

wd <<- choose.dir(caption = "Select top level folder where your data is located")
setwd(wd)

#List the full path and filename of all files in the working directory and sub-directories that starts
#with "BioModel_" and ends with ".csv"
out_files <- list.files(pattern = "^BioModel(.*)csv$", recursive = TRUE)

# create an empty list that will serve as a container to receive the incoming files
list.data<-list()
listForceFinal.data<-list()
listForceInitial.data<-list()
outputInitial.data<-list()
listOcularFinal.data<-list()
listInhalationFinal.data<-list()
listSurfaceFinal.data<-list()
listChemFinal.data<-list()
listBioFinal<-list()
listFatalities.data<-list()
listFinal.data<-list()
listContaminatedVeh.data<-list()
none_count<-list()
minor_count<-list()
major_count<-list()
lethal_count<-list()
incubation_count<-list()
symptomatic_count<-list()
convalescence_count<-list()
incubation_count<-list()
vehicle_count<-list()
incapacitation_count<-list()
# 
 num_reps <- length(out_files)

# create a loop to read in your data

for (i in 1:length(out_files))
{
  list.data[[i]]<-read.csv(out_files[i], check.names = TRUE)
}

# # create a loop to get the final contamination level and remove all others
for (h in 1:length(list.data))
  listFinal.data[[h]] <- list.data[[h]] %>%
  group_by(Actor.ID, Actor.Name) %>%
  slice(n())

output <- plyr::ldply(listFinal.data, function(x) x %>%
  group_by(Entity.Type) %>%
  summarise(n=n()))

listContaminatedVeh.data <- output %>%
  filter(Entity.Type != "Lifeform") %>%
  group_by(Entity.Type) %>%
  summarise(Mean.Final = mean(n)) %>%
  rename("Platform Type" = Entity.Type, Mean = Mean.Final)

#Contaminated Entities

#create a loop to sum up the total number of contaminated and incapacitated entities in each rep
for (l in 1:length(listFinal.data)){
  incubation_count[[l]] <- nrow(listFinal.data[[l]][listFinal.data[[l]]$Biological.Stage == "incubation",])
  symptomatic_count[[l]] <- nrow(listFinal.data[[l]][listFinal.data[[l]]$Biological.Stage == "symptomatic",])
  convalescence_count[[l]] <- nrow(listFinal.data[[l]][listFinal.data[[l]]$Biological.Stage == "convalescence",])
  incapacitated_count[[l]] <- nrow(listFinal.data[[l]][listFinal.data[[l]]$Incapacitation.Status == "Incapacitated",])
}

# calculate the average number of entities with biological contamination / incapacitation across all reps
result.incubationAvg <- mean(as.numeric(incubation_count))
result.incubationAvg <- round(result.incubationAvg)

result.symptomaticAvg <- mean(as.numeric(symptomatic_count))
result.symptomaticAvg <- round(result.symptomaticAvg)

result.convalescenceAvg <- mean(as.numeric(convalescence_count))
result.convalescenceAvg <- round(result.convalescenceAvg)

result.incapacitationAvg <- mean(as.numeric(incapacitation_count))
result.incapacitationAvg <- round(result.incapacitationAvg)

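This is a condensed version of my data file. It is a large file, so I removed the columns that are irrelevant to the problem and included only a small portion of the rows.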

df <- read.table(text = "
Time Stamp,Biological Stage,Contaminant Type,Incapacitation Status,Concentration of Agent  
13206478,symptomatic,biological,Incapacitated,8.38E-05  
13087148,none,biological,Not Incapacitated,0  
12966365,none,biological,Not Incapacitated,0  
13207078,none,biological,Not Incapacitated,8.38E-05  
", header = TRUE, sep = ",")

Answer 1 (ui7jx7zq)

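I find that dplyr and tidyr (both included in tidyverse) make it convenient to work with data in a way that is both concise and readable. Here, I use count to find how many times each distinct value occurs in Biological.Stage, and then I use complete to extend the count table so that it includes the specified values whether or not they are present in the data. You could also do this in base R with a factor and table, but this looks simpler to me.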

library(tidyverse)
df |> 
  count(Biological.Stage) |>
  complete(Biological.Stage = c("none", "incubation", "symptomatic", "convalescence"), 
           fill = list(n = 0))

# A tibble: 4 × 2
  Biological.Stage     n
  <chr>            <int>
1 convalescence        0
2 incubation           0
3 none                 3
4 symptomatic          1
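
For comparison, the base R route mentioned above might look like the following sketch. It rebuilds the question's four sample rows by hand (an assumption, to keep the example self-contained); the names `stages` and `stage_counts` are illustrative only.

```r
# Rebuild the two relevant columns of the question's sample data by hand.
df <- data.frame(
  Biological.Stage = c("symptomatic", "none", "none", "none"),
  Incapacitation.Status = c("Incapacitated", "Not Incapacitated",
                            "Not Incapacitated", "Not Incapacitated")
)

# Declaring every possible stage as a factor level makes table()
# report a zero for stages that never occur in the data.
stages <- c("none", "incubation", "symptomatic", "convalescence")
stage_counts <- table(factor(df$Biological.Stage, levels = stages))
print(stage_counts)
```

The counts come back in the order of the declared levels rather than alphabetically, which can be handy when the stages have a natural progression.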

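Alternatively, if values that are not present in the data don't need to appear in the counts, you can count all columns at once with: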

df |>
  pivot_longer(Biological.Stage:Incapacitation.Status) |>
  count(name, value)

# A tibble: 5 × 3
  name                  value                 n
  <chr>                 <chr>             <int>
1 Biological.Stage      none                  3
2 Biological.Stage      symptomatic           1
3 Contaminant.Type      biological            4
4 Incapacitation.Status Incapacitated         1
5 Incapacitation.Status Not Incapacitated     3

Answer 2 (mbskvtky)

library(tidyverse)

df <- df |> 
  mutate(Biological.Stage = factor(Biological.Stage, levels = c("none", "incubation", "symptomatic", "convalescence")),
          Incapacitation.Status =  Incapacitation.Status  == "Incapacitated") # if you only have two categories, which are basically just true or false, you may as well use TRUE and FALSE and get it over with

# to get the frequency for each column individually:
map(df, ~table(.x))

Answer 3 (iezvtpos)

The first part just generates data; I think this is the point where your problem starts:

Getting to the listFinal.data starting point

Biological.State <- sample(c("none", "incubation", "symptomatic", "convalescence"),
                           size = 1000,
                           replace = TRUE)

Incapacitation.Status <- sample(c("Not Incapacitated", "Incapacitated"),
                                size = 1000,
                                replace = TRUE)

df_start <- data.frame(Biological.State, Incapacitation.Status)

chunk <- 50
n <- nrow(df_start)
r <- rep(1:ceiling(n/chunk), each = chunk)[1:n]
listFinal.data <- split(df_start, r)

Everything above is just me creating listFinal.data, as if you had pulled 20 csv files into a list.

Solution

library(tidyverse)

df_FinalData <- bind_rows(listFinal.data) # Takes all the individual list elements and puts them in a single dataframe (again)

> head(df_FinalData)
  Biological.State Incapacitation.Status
1       incubation     Not Incapacitated
2             none         Incapacitated
3    convalescence         Incapacitated
4    convalescence     Not Incapacitated
5      symptomatic     Not Incapacitated
6       incubation         Incapacitated

> df_FinalData %>% count(Incapacitation.Status)

  Incapacitation.Status   n
1         Incapacitated 502
2     Not Incapacitated 498

> df_FinalData %>% count(Biological.State)

  Biological.State   n
1    convalescence 236
2       incubation 242
3             none 281
4      symptomatic 241
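
The counts above pool all list elements together. If an average per rep is wanted instead, as in the question's loop, one possible sketch keeps an identifier for each list element while binding. The three small data frames and the names `rep_id`, `per_rep`, and `mean_counts` are hypothetical stand-ins, not taken from the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for listFinal.data: three small "reps".
listFinal.data <- list(
  data.frame(Incapacitation.Status = c("Incapacitated", "Not Incapacitated", "Not Incapacitated")),
  data.frame(Incapacitation.Status = c("Incapacitated", "Incapacitated", "Not Incapacitated")),
  data.frame(Incapacitation.Status = c("Not Incapacitated", "Not Incapacitated"))
)

# .id records which list element each row came from.
per_rep <- bind_rows(listFinal.data, .id = "rep_id") %>%
  count(rep_id, Incapacitation.Status) %>%
  # Add explicit zero rows for rep/status combinations that never occur,
  # so reps without any "Incapacitated" rows still enter the average.
  complete(rep_id, Incapacitation.Status, fill = list(n = 0))

# Average count of each status across reps.
mean_counts <- per_rep %>%
  group_by(Incapacitation.Status) %>%
  summarise(Mean = mean(n))
print(mean_counts)
```

Without the complete step, a rep in which a status never occurs contributes no row at all, so the mean would be taken over fewer reps than actually exist.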

Answer 4 (r7s23pms)

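I was able to use everyone's suggestions and come up with something that works with the way my data is arranged, which is a list of data frames. I appreciate everyone's help, because I don't believe I could have figured it out without it. I didn't really understand how the data was arranged or how to access it, but you all helped me see that, and that I needed to use the apply functions.
For completeness, here is my code: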

incapacitation_count <- plyr::ldply(listFinal.data, function(x) x %>%
                             group_by(Incapacitation.Status) %>%
                             summarise(n=n()))
listIncapStat.data <- incapacitation_count %>%
  group_by(Incapacitation.Status) %>%
  summarise(Mean.Final = mean(n)) %>%
  rename("Incapacitation Status" = Incapacitation.Status, Mean = Mean.Final)
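
A note on the original error, for anyone landing here with the same message: the question's setup code creates an empty list named incapacitation_count, while the loop assigns into incapacitated_count, which was never created. R raises "object not found" when `[[<-` is used on a name that does not exist. The sketch below avoids pre-created containers altogether by letting vapply build the result; the two small data frames are hypothetical stand-ins for the real listFinal.data.

```r
# Hypothetical stand-in for listFinal.data: two small "reps".
listFinal.data <- list(
  data.frame(Incapacitation.Status = c("Incapacitated", "Not Incapacitated")),
  data.frame(Incapacitation.Status = c("Incapacitated", "Incapacitated",
                                       "Incapacitated", "Not Incapacitated"))
)

# vapply returns one integer per list element, so there is no
# separately initialised list whose name could be mistyped.
incapacitated_count <- vapply(
  listFinal.data,
  function(d) sum(d$Incapacitation.Status == "Incapacitated", na.rm = TRUE),
  integer(1)
)

# Average number of incapacitated entities across reps, rounded as in the question.
result.incapacitationAvg <- round(mean(incapacitated_count))
print(result.incapacitationAvg)
```

Unlike the summarise-based version, this returns 0 for a rep with no incapacitated entities, so every rep is included in the mean.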
