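I have a data frame with six columns: Age, Sex, MoB, YoB, MoI, and YoI. It holds patient data, where the suffix "B" refers to the patient's birth date (month and year) and the suffix "I" refers to the disease incidence date (month and year). Some rows have no Age, so for rows without a valid Age value I want to compute Age as the difference between YoI and YoB. This is the script I have written so far: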
# 1. >>> INIT
# 1.1 Initialization
# install.packages("dplyr");
library("dplyr");
message("\nRunning data sanity for fixing missing age\n");
# 1.2 Definition of target files
base_path <- dirname(sys.frame(1)$ofile);
dataset_file_in <- paste(base_path, "/", "dataset_missing_age.csv", sep="");
dataset_file_out <- paste(base_path, "/", "dataset_fixed.csv", sep="");
# 1.3 Provide feedback in the console
message(paste("Base path :", base_path));
message(paste("Target dataset :", dataset_file_in));
message(paste("Output dataset :", dataset_file_out));
# 2. >>> READ FILE
# 2.1 Read file from CSV file (assuming semicolon separator)
dataframe_csv <- read.csv(dataset_file_in, sep=";");
message(paste("Dataset rows :", nrow(dataframe_csv)));
message(paste("Dataset cols :", length(dataframe_csv), "\n"));
# 3. >>> FIX AGE (only where missing)
# 3.1 Add a "helper" column adjusting age with respect to the
# condition "Is Incidence date AFTER Birthday date?".
# Example:
# if patient was born in April and got sick in May -> age = YoI - YoB
# if patient was born in April and got sick in March -> age = YoI - YoB - 1
dataframe_csv <- dataframe_csv %>%
  mutate(IafterB = case_when(
    as.numeric(MoB) >= as.numeric(MoI) ~ 0,
    as.numeric(MoB) < as.numeric(MoI) ~ -1
  ));
# 3.2 Mark with the value "-1" the rows where Age is undefined
dataframe_csv$Age[is.na(dataframe_csv$Age)] <- -1;
# 3.3 Define age formula (to improve code readability and maintainability)
calculateAge <- function(YoB, YoI, IafterB) {
  # WE HAVE MANY WEIRD BUGS HERE !!!
  # I'll list some expressions I tried so far, providing an example of the result
  # 1) This calculates: 2006 - 1940 as 68 (instead of 66). Why?
  age_calculated <- YoI - YoB;
  # 2) This works fine, but I don't like the idea of removing 2 years without knowing why
  age_calculated <- YoI - YoB - 2;
  # 3) This calculates: 2006 - 1940 as 68 (instead of 66) as in case 1,
  #    and also ignores the "IafterB" value. Why?
  age_calculated <- YoI + IafterB - YoB;
  # 4) This always returns 0, which confirms that column "IafterB" is ignored. Why?
  age_calculated <- IafterB;
  return (age_calculated);
}
# 3.4 Executes actual datasanity, replacing missing age with calculated one
# (should be rows from 13 to 19 included).
dataframe_csv$Age <- sub("-1", calculateAge(dataframe_csv$YoB, dataframe_csv$YoI, dataframe_csv$IafterB), dataframe_csv$Age);
# 3.5 Print result of the above datasanity
message("Whole dataframe after fixing missing Age:\n");
print(dataframe_csv);
message();
# 4. >>> PRODUCING OUTPUT FILE
# 4.1 Save current dataset object to the output file (dataset_file_out)
write.csv2(dataframe_csv, dataset_file_out, row.names = FALSE, quote=FALSE);
This is the test data file:
Age;Sex;MoB;YoB;MoI;YoI
49;X ;8;1960;5;2010
49;*;8;1960;5;2010
67;1;1;1938;2;2006
;1;3;1940;8;2006
;1;4;1940;9;2006
;1;6;1940;10;2006
;1;8;1940;6;2006
;1;10;1940;2;2006
;1;11;1940;2;2006
;1;12;1940;2;2006
67;1;11;1940;2;2006
67;9;10;1938;2;2006
67;1;10;1938;2;9999
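To reproduce this without the CSV on disk, the sample rows above can be read from an inline string. This is only a sketch: the names csv_text and sample_df are made up here, and it assumes the same semicolon separator the script uses.

```r
# Inline copy of the sample rows shown above (semicolon-separated).
# csv_text and sample_df are hypothetical names, not part of the script.
csv_text <- "Age;Sex;MoB;YoB;MoI;YoI
49;X ;8;1960;5;2010
49;*;8;1960;5;2010
67;1;1;1938;2;2006
;1;3;1940;8;2006
;1;4;1940;9;2006
;1;6;1940;10;2006
;1;8;1940;6;2006
;1;10;1940;2;2006
;1;11;1940;2;2006
;1;12;1940;2;2006
67;1;11;1940;2;2006
67;9;10;1938;2;2006
67;1;10;1938;2;9999
"

# read.csv() accepts the data through its `text` argument instead of a file path.
sample_df <- read.csv(text = csv_text, sep = ";")

# Empty Age fields are read as NA; list the affected rows.
missing_rows <- which(is.na(sample_df$Age))
print(missing_rows)
```

The empty Age fields come in as NA, which is what the is.na() check in step 3.2 of the script relies on.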
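I have tried many different formulas, but each one has a different problem. If I calculate the age as follows:
age_calculated <- YoI - YoB;
I get a wrong value (for example, 2006 - 1940 gives 68 instead of 66!). If I calculate the age as follows:
age_calculated <- YoI - YoB - 2;
I get the correct value, but I don't understand why. If I calculate the age with the following formula (the one I actually want to use):
age_calculated <- YoI + IafterB - YoB;
the column "IafterB" is ignored. To confirm that this column is ignored, I also tried the following (a deliberately wrong formula, used only to check whether the column "IafterB" is taken into account):
age_calculated <- IafterB;
This sets all missing Age values to "0", which proves that the "-1" values in the column "IafterB" are ignored.
What am I doing wrong?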
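A likely cause, sketched here with made-up vectors (ages and calculated are not from the script): sub() is vectorised over its x argument but not over replacement. When replacement has more than one element, base R uses only the first one and emits a warning, so every "-1" would be replaced by the value computed for the first row of the data frame.

```r
# Hypothetical stand-ins for dataframe_csv$Age and the calculateAge() result.
ages <- c("49", "-1", "-1")
calculated <- c(50, 66, 67)

# sub() uses only calculated[1] as the replacement (base R warns about this;
# the warning is suppressed here only to keep the output clean).
fixed <- suppressWarnings(sub("-1", calculated, ages))
print(fixed)
# → "49" "50" "50"

# The result is a character vector, so Age would no longer be numeric.
print(class(fixed))
# → "character"
```

In the sample file the first row has MoB 8 and MoI 5, so the script's case_when gives IafterB = 0 for it, which would be consistent with formula 4 always returning 0. The exact wrong age seen with the other formulas would depend on the first row of the file actually used.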
1 answer
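Use coalesce. (I put the result in a separate Age2 column only for demonstration, so that the two variables sit side by side; using just Age = coalesce(...) works too and is probably easier.) These two statements are effectively equivalent, which helps to show what coalesce does: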
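A minimal sketch of how this could look, not necessarily the answerer's exact code. The toy data frame df, the column Age3, and the if_else form are assumptions for illustration. The IafterB rule here follows the question's comment (incidence month before birth month subtracts one year), which is the opposite direction of the comparison in the question's case_when.

```r
library(dplyr)

# Toy data (hypothetical), using the question's column names.
df <- data.frame(
  Age = c(67L, NA, NA),
  MoB = c(1L, 3L, 8L),
  YoB = c(1938L, 1940L, 1940L),
  MoI = c(2L, 8L, 6L),
  YoI = c(2006L, 2006L, 2006L)
)

fixed <- df %>%
  mutate(
    # -1 when the incidence month falls before the birth month, else 0.
    # Equal months are treated as "birthday already passed" (an assumption,
    # since the day of the month is not available).
    IafterB = if_else(MoI < MoB, -1L, 0L),
    # coalesce() keeps Age where present and fills NA row by row.
    Age2 = coalesce(Age, YoI - YoB + IafterB),
    # Effectively the same thing, spelled out with if_else().
    Age3 = if_else(is.na(Age), YoI - YoB + IafterB, Age)
  )

print(fixed)
```

Unlike the sub() call in the question, both forms work element by element, so each missing Age gets the value computed from its own row, and the column stays numeric.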