R calculates a wrong mathematical subtraction when updating a dataframe with the "sub" function

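I have a dataframe with 6 columns: Age, Sex, MoB, YoB, MoI and YoI. The dataframe contains some patient data, where the suffix "B" refers to the patient's birthday (month and year) and the suffix "I" refers to the date of disease incidence (month and year). Since some rows may be missing the age, for the rows without a valid Age value I want to calculate Age as the difference between YoI and YoB. This is the script I have written so far: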

# 1. >>> INIT

# 1.1 Initialization
# install.packages("dplyr");
library("dplyr");
message("\nRunning data sanity for fixing missing age\n");

# 1.2 Definition of target files
base_path <- dirname(sys.frame(1)$ofile);
dataset_file_in <- paste(base_path, "/", "dataset_missing_age.csv", sep="");
dataset_file_out <- paste(base_path, "/", "dataset_fixed.csv", sep="");

# 1.3 Provide feedback in the console
message(paste("Base path                     :", base_path));
message(paste("Target dataset                :", dataset_file_in));
message(paste("Output dataset                :", dataset_file_out));

# 2. >>> READ FILE

# 2.1 Read file from CSV file (assuming semicolon separator)
dataframe_csv <- read.csv(dataset_file_in, sep=";");
message(paste("Dataset rows                  :", nrow(dataframe_csv)));
message(paste("Dataset cols                  :", length(dataframe_csv), "\n"));

# 3. >>> FIX AGE (only where missing)

# 3.1 Add a "helper" column adjusting age with respect to
# the condition "Is Incidence date AFTER Birthday date?".
# Example:
# if patient was born in April and got sick in May -> age = YoI - YoB
# if patient was born in April and got sick in March -> age = YoI - YoB - 1
dataframe_csv <- dataframe_csv %>% 
    mutate(IafterB = case_when(
        as.numeric(MoB) >= as.numeric(MoI) ~ 0,
        as.numeric(MoB) < as.numeric(MoI) ~ -1
    ));

# 3.2 Mark with the value "-1" the rows where Age is undefined
dataframe_csv$Age[is.na(dataframe_csv$Age)] <- -1;
    
# 3.3 Define age formula (to improve code readability and maintainability)
calculateAge <- function(YoB, YoI, IafterB) {
  # WE HAVE MANY WEIRD BUGS HERE !!!
  # I'll list some expressions I tried so far, providing an example of the result
  # 1) This calculates: 2006 - 1940 as 68 (instead of 66). Why?
  age_calculated <- YoI - YoB;
  # 2) This works fine, but I don't like the idea to remove 2 years without knowing why
  age_calculated <- YoI - YoB - 2;
  # 3) This calculates: 2006 - 1940 as 68 (instead of 66) as case 1,
  # and also ignores the "IafterB" value. Why?
  age_calculated <- YoI + IafterB - YoB;
  # 4) This return always 0, which confirm that column "IafterB" is ignored. Why?
  age_calculated <- IafterB;
  return (age_calculated);
}

# 3.4 Executes actual datasanity, replacing missing age with calculated one
# (should be rows from 13 to 19 included).
dataframe_csv$Age <- sub("-1", calculateAge(dataframe_csv$YoB, dataframe_csv$YoI, dataframe_csv$IafterB), dataframe_csv$Age);

# 3.5 Print result of the above datasanity
message("Whole dataframe after fixing missing Age:\n");
print(dataframe_csv);
message();

# 4. >>> PRODUCING OUTPUT FILE

# 4.1 Save current dataset object to the output file
write.csv2(dataframe_csv, dataset_file_out, row.names = FALSE, quote=FALSE);

This is the test data file:

Age;Sex;MoB;YoB;MoI;YoI
49;X ;8;1960;5;2010
49;*;8;1960;5;2010
67;1;1;1938;2;2006
;1;3;1940;8;2006
;1;4;1940;9;2006
;1;6;1940;10;2006
;1;8;1940;6;2006
;1;10;1940;2;2006
;1;11;1940;2;2006
;1;12;1940;2;2006
67;1;11;1940;2;2006
67;9;10;1938;2;2006
67;1;10;1938;2;9999

I have tried many different formulas, but each one has a different problem. If I calculate the age as follows:

age_calculated <- YoI - YoB;

I get a wrong value (for example, 2006 - 1940 gives 68 instead of 66!). If I calculate the age as follows:

age_calculated <- YoI - YoB - 2;

I get the correct value, but I don't understand why. If I calculate the age with the following formula (this is the one I want to use):

age_calculated <- YoI + IafterB - YoB;

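the column "IafterB" is ignored. To confirm that this column is ignored, I also tried the following (a wrong formula, used only to check whether the column "IafterB" is taken into account):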

age_calculated <- IafterB;

This sets all the missing Age values to "0", which shows that the "-1" values in column "IafterB" are ignored.
What am I doing wrong?

Use coalesce:

dat %>%
  mutate(Age2 = coalesce(Age, YoI - YoB))
#    Age Sex MoB  YoB MoI  YoI Age2
# 1   49  X    8 1960   5 2010   49
# 2   49   *   8 1960   5 2010   49
# 3   67   1   1 1938   2 2006   67
# 4   NA   1   3 1940   8 2006   66
# 5   NA   1   4 1940   9 2006   66
# 6   NA   1   6 1940  10 2006   66
# 7   NA   1   8 1940   6 2006   66
# 8   NA   1  10 1940   2 2006   66
# 9   NA   1  11 1940   2 2006   66
# 10  NA   1  12 1940   2 2006   66
# 11  67   1  11 1940   2 2006   67
# 12  67   9  10 1938   2 2006   67
# 13  67   1  10 1938   2 9999   67

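(I put the result in a separate Age2 only for demonstration, so that the two variables sit side by side; just using Age = coalesce(...) works and is probably simpler.)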
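As a minimal sketch of that in-place form, the month rule described in the question's comments (subtract one year when the onset month falls before the birth month) can be folded into the same coalesce call. The column name `before_birthday` and the small data frame `dat` are made up for illustration and only mirror a few rows of the test file. Treating equal months as "birthday already reached" is an assumption, since the day is unknown. Note also that the question's case_when seems to apply the -1 in the opposite direction from its own comment.

```r
library(dplyr)

# A few rows shaped like the question's test file (illustrative values only)
dat <- data.frame(
  Age = c(49, NA, NA, 67),
  MoB = c(8, 3, 10, 11),
  YoB = c(1960, 1940, 1940, 1940),
  MoI = c(5, 8, 2, 2),
  YoI = c(2010, 2006, 2006, 2006)
)

fixed <- dat %>%
  mutate(
    # TRUE when the onset month is earlier than the birth month
    before_birthday = MoI < MoB,
    # Keep an existing Age; otherwise use the year difference,
    # minus 1 when the birthday had not yet been reached
    Age = coalesce(Age, YoI - YoB - as.integer(before_birthday))
  )

print(fixed$Age)
# [1] 49 66 65 67
```

Because the calculation happens inside mutate, each row uses its own YoI, YoB and month values, so no placeholder such as -1 is needed.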
The following two statements are effectively equivalent, which helps in understanding what coalesce does:

coalesce(Age, YoI - YoB)
if_else(is.na(Age), YoI - YoB, Age)
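As for why the original approach misbehaves, a likely explanation is that `sub()` is a text substitution function: its `replacement` argument is not vectorized, so when it receives a whole column only the first element is used (R emits a warning), and the result comes back as character. That would account for every missing age receiving the same value, presumably computed from the first row of the full dataset, and for `IafterB` always showing up as 0. A small base-R sketch with made-up values:

```r
age  <- c(-1, 67, -1)   # -1 marks a missing age, as in the question
calc <- c(68, 70, 66)   # per-row calculated ages (made-up values)

# Only calc[1] is used as the replacement; R warns about the extra elements
res <- suppressWarnings(sub("-1", calc, age))
print(res)
# [1] "68" "67" "68"

# An element-wise replacement keeps each row's own value and stays numeric
fixed <- ifelse(age == -1, calc, age)
print(fixed)
# [1] 68 67 66
```

Subtracting 2 only appeared to work because it happened to cancel the offset between the first row's result and the expected value; the subtraction itself was never wrong.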
