Efficient and accurate age calculation (in years, months, or weeks) in R, given a birth date and an arbitrary date

new9mtju  posted on 2023-09-27  in Other

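I am facing the common task of calculating age (in years, months, or weeks) given a date of birth and an arbitrary date. The catch is that I often have to do this over a very large number of records (> 300 million), so performance is a key issue here.
After a quick search on SO and Google I found 3 alternatives: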

  • A common arithmetic procedure (/365.25) (link)
  • The functions new_interval() and duration() from the lubridate package (link)
  • The function age_calc() from the eeptools package (link, link, link)

Here is my toy code:

# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01", 
                       "1962-12-30", "1962-12-31", "1963-01-01", 
                       "2000-06-16", "2000-06-17", "2000-06-18", 
                       "2007-03-18", "2007-03-19", "2007-03-20", 
                       "1968-02-29", "1968-02-29", "1968-02-29"))

# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31", 
                       "2015-12-31", "2015-12-31", "2015-12-31", 
                       "2050-06-17", "2050-06-17", "2050-06-17",
                       "2008-03-19", "2008-03-19", "2008-03-19", 
                       "2015-02-28", "2015-03-01", "2015-03-02"))

# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate-birthdate)/365.25

# Use the package lubridate
library(lubridate)
# interval() is used here because newer lubridate releases deprecate
# new_interval() in favor of it
interval(start = birthdate, end = givendate) / 
                     duration(num = 1, units = "years")

# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")

Let's talk about accuracy later and focus on performance first. Here is the code:

# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
library(ggplot2)  # autoplot() on a microbenchmark object needs ggplot2
mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
                                     duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    times = 1000
)

# And examine the results
mbm
autoplot(mbm)

Here are the results:

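The bottom line: the lubridate and eeptools functions perform much worse than the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough, and I cannot afford the few mistakes it will make.
"Because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person's age, stated according to common usage; common usage meaning that a person's age should always be an integer that increases exactly on a birthday." (link)
As I read in some posts, lubridate and eeptools make no such mistakes (though I have not looked at the code or read more about these functions to learn which method they use), which is why I wanted to use them, but their performance does not work for my real application.
Is there an efficient and accurate way to calculate age?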
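The inaccuracy of the /365.25 approach can be seen on a single pair of dates. The following is a minimal base-R sketch using one of the toy birth dates; the variable names are illustrative only. On the person's 37th birthday the quotient lands just below 37, so flooring it reports 36.

```r
birth <- as.Date("1978-12-31")
given <- as.Date("2015-12-31")      # exactly the 37th birthday

# Elapsed days, converted from difftime to a plain number
days <- as.numeric(given - birth)

# Average-year arithmetic: 365.25 days per year
approx_age <- days / 365.25

# Whole years according to this method
floor(approx_age)
```

The shortfall comes from the interval containing 9 leap days rather than the 9.25 that the 365.25-day average assumes over 37 years.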

Edit

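Oops, it seems lubridate makes mistakes too. Based on this toy example, it apparently makes even more mistakes than the arithmetic method (see rows 3, 6, 9, 12). (Am I doing something wrong?)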

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df

    birthdate  givendate arithmetic lubridate eeptools
1  1978-12-30 2015-12-31         37        37       37
2  1978-12-31 2015-12-31         36        37       37
3  1979-01-01 2015-12-31         36        37       36
4  1962-12-30 2015-12-31         53        53       53
5  1962-12-31 2015-12-31         52        53       53
6  1963-01-01 2015-12-31         52        53       52
7  2000-06-16 2050-06-17         50        50       50
8  2000-06-17 2050-06-17         49        50       50
9  2000-06-18 2050-06-17         49        50       49
10 2007-03-18 2008-03-19          1         1        1
11 2007-03-19 2008-03-19          1         1        1
12 2007-03-20 2008-03-19          0         1        0
13 1968-02-29 2015-02-28         46        47       46
14 1968-02-29 2015-03-01         47        47       47
15 1968-02-29 2015-03-02         47        47       47

clj7thdc  #1

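The reason lubridate appears to make the mistakes above is that you are calculating a duration (the exact amount of time that elapses between two instants, where 1 year = 31536000 s, i.e. 365 days) rather than a period (the change in clock time between two instants).
To get the change in clock time (in years, months, days, etc.) you need to use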

as.period(interval(start = birthdate, end = givendate))

which gives the following output

"37y 0m 1d 0H 0M 0S"   
 "37y 0m 0d 0H 0M 0S"   
 "36y 11m 30d 0H 0M 0S" 
 ...
 "46y 11m 30d 1H 0M 0S" 
 "47y 0m 0d 1H 0M 0S"   
 "47y 0m 1d 1H 0M 0S"

To extract only the years, you can use the following

as.period(interval(start = birthdate, end = givendate))$year
#or
lubridate::year(as.period(interval(start = birthdate, end = givendate)))

 [1] 37 37 36 53 53 52 50 50 49  1  1  0 46 47 47

Note that, sadly, this appears to be even slower than the methods above!

> mbm
Unit: microseconds
       expr       min        lq       mean    median         uq        max neval cld
 arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a 
  lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b
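The duration-versus-period distinction described in this answer can be checked with base R arithmetic alone. The sketch below assumes a fixed 365-day year, as in the duration definition quoted above, and applies it to row 3 of the toy data; the variable names are illustrative.

```r
birth <- as.Date("1979-01-01")
given <- as.Date("2015-12-31")      # one day before the 37th birthday

# Exact elapsed time in days
elapsed_days <- as.numeric(given - birth)

# Duration-style years: every year is treated as exactly 365 days
fixed_years <- elapsed_days / 365

# Whole years according to the fixed-length definition
floor(fixed_years)
```

The nine leap days inside the interval push the quotient past 37 even though the 37th birthday has not yet arrived, which is why a calendar-aware period is needed for ages.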

im9ewurl  #2

OK, I found this function in another post:

age <- function(from, to) {
    from_lt = as.POSIXlt(from)
    to_lt = as.POSIXlt(to)

    age = to_lt$year - from_lt$year

    ifelse(to_lt$mon < from_lt$mon |
               (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
           age - 1, age)
}

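It was posted by @Jim, who said: "The following function takes a vector of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers."
It is indeed simpler, and it does the trick I was looking for. On average it is actually faster than the arithmetic method (about 75% faster) in the benchmark below.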

mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    age = age(from = birthdate, to = givendate),
    times = 1000
)
mbm
autoplot(mbm)

And at least in my examples it makes no mistakes (and it should not in any example; it is a very simple function built on ifelse).

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years"),
    age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df

    birthdate  givendate arithmetic lubridate eeptools age
1  1978-12-30 2015-12-31         37        37       37  37
2  1978-12-31 2015-12-31         36        37       37  37
3  1979-01-01 2015-12-31         36        37       36  36
4  1962-12-30 2015-12-31         53        53       53  53
5  1962-12-31 2015-12-31         52        53       53  53
6  1963-01-01 2015-12-31         52        53       52  52
7  2000-06-16 2050-06-17         50        50       50  50
8  2000-06-17 2050-06-17         49        50       50  50
9  2000-06-18 2050-06-17         49        50       49  49
10 2007-03-18 2008-03-19          1         1        1   1
11 2007-03-19 2008-03-19          1         1        1   1
12 2007-03-20 2008-03-19          0         1        0   0
13 1968-02-29 2015-02-28         46        47       46  46
14 1968-02-29 2015-03-01         47        47       47  47
15 1968-02-29 2015-03-02         47        47       47  47

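I do not consider it a complete solution, because I also want age in months and weeks, and this function is specific to years. I am posting it here because it solves the problem for years. I will not accept it because:
1. I will wait for @Jim to post it as an answer.
2. I will wait to see whether someone comes up with a complete solution (efficient, accurate, and producing age in years, months, or weeks as desired).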
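One way the same POSIXlt idea might be carried over to months and weeks is sketched below. The function names age_months and age_weeks are hypothetical, and the month rule is an assumption: a month counts as completed once the day of the month of the birth date is reached, so someone born on the 31st completes no month in a shorter month. Weeks are counted as completed 7-day blocks of elapsed days.

```r
# Completed months between two dates (hypothetical helper, assumed convention)
age_months <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)
  # Calendar months between the two dates, ignoring the day of the month
  months <- 12L * (to_lt$year - from_lt$year) + (to_lt$mon - from_lt$mon)
  # Drop one month where the day of the month has not been reached yet
  months - (to_lt$mday < from_lt$mday)
}

# Completed weeks between two dates (hypothetical helper)
age_weeks <- function(from, to) {
  (unclass(as.Date(to)) - unclass(as.Date(from))) %/% 7
}

m <- age_months(as.Date(c("2007-03-19", "2007-03-20")),
                as.Date(c("2008-03-19", "2008-03-19")))
w <- age_weeks(as.Date("2015-01-01"), as.Date("2015-01-15"))
```

Like the year version, age_months still pays for as.POSIXlt, so it would be expected to share that function's cost on long vectors; age_weeks works on the raw day counts only.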


0g0grzrc  #3

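I was going to leave this in the comments, but I think it deserves a separate answer. As @Molx points out, your "arithmetic" method is not as simple as it seems. Take a look at the code for -.Date, most importantly: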

return(difftime(e1, e2, units = "days"))

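So the "arithmetic" method on objects of class Date is really a wrapper for the difftime function. And what about difftime? It too carries a lot of overhead if what you are after is raw speed.
The key is that Date objects are stored as the number of days since (or until) Jan. 1, 1970 (though they are not actually stored as integer, hence the birth of the IDate class in data.table), so we can just subtract these and be done with it. To avoid calling the -.Date method, though, we have to unclass our inputs: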

(unclass(givendate) - unclass(birthdate)) / 365.25

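In terms of bang for your buck, this approach is far faster than even @Jim's age approach (roughly 60 times faster by median in the benchmark further down).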
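The storage claim is easy to check in base R. This small sketch strips the class from a few dates around the epoch and inspects the underlying type; the variable names are illustrative.

```r
d <- as.Date(c("1970-01-01", "1970-01-02", "1969-12-31"))

# Underlying values: days relative to 1970-01-01 (negative before the epoch)
raw_days <- unclass(d)

# Dates parsed from character are stored as doubles, not integers
storage <- typeof(d)

# Plain numeric subtraction, with no dispatch to the -.Date method
plain_diff <- unclass(d)[2] - unclass(d)[3]
```

Because the subtraction happens on bare numbers, no difftime object is built, which is where the saving over the classed arithmetic comes from.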
Here is some scaled-up test data:

set.seed(20349)
NN <- 1e6
birthdate <- as.Date(sprintf('%d-%02d-%02d',
                             sample(1901:2030, NN, TRUE),
                             sample(12, NN, TRUE),
                             sample(28, NN, TRUE)))

#average 30 years, most data between 20 and 40 years
givendate <- birthdate + as.integer(rnorm(NN, mean = 10950, sd = 1000))

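(eeptools is excluded because it is almost impossibly slow: a glance at the code of age_calc shows that it goes so far as to create a sequence of dates for every pair of dates (O(n^2)-ish), not to mention the liberal use of ifelse)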

microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  age = age(from = birthdate, to = givendate),
  fastar = (unclass(givendate) - unclass(birthdate)) / 365.25,
  overlaps = get_age(birthdate, givendate),
  times = 50)
# Unit: milliseconds
#        expr        min         lq      mean     median         uq      max neval  cld
#  arithmetic  28.153465  30.384639  62.96118  31.492764  34.052991 180.9556    50  b  
#   lubridate  94.327968  97.233009 157.30420 102.751351 240.717065 265.0283    50   c 
#         age 338.347756 479.598513 483.84529 483.580981 488.090832 770.1149    50    d
#      fastar   7.740098   7.831528  11.02521   7.913146   8.090902 153.3645    50 a   
#    overlaps 316.408920 458.734073 459.58974 463.806255 470.320072 769.0929    50    d

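So we have also highlighted the folly of benchmarking on small-scale data.
The big cost of @Jim's method is that as.POSIXlt gets increasingly expensive as the vectors grow.
The problem of inaccuracy remains, but unless that accuracy is paramount, the unclass approach seems to be unparalleled.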
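If the as.POSIXlt cost is the bottleneck, one possible mitigation (an assumption of this sketch, not something benchmarked in this thread) is to decompose each distinct date only once and look the parts up by position. The function name age_unique is hypothetical; the comparison logic is the same as in @Jim's age function.

```r
age_unique <- function(from, to) {
  # Work on the raw day counts so unique() and match() see plain numbers
  from_n <- unclass(as.Date(from))
  to_n <- unclass(as.Date(to))

  # Convert every distinct day to calendar fields exactly once
  u <- unique(c(from_n, to_n))
  lt <- as.POSIXlt(as.Date(u, origin = "1970-01-01"))

  # Positions of each input date in the table of distinct dates
  i <- match(from_n, u)
  j <- match(to_n, u)

  age <- lt$year[j] - lt$year[i]
  # TRUE where this year's birthday has not been reached yet
  not_yet <- lt$mon[j] < lt$mon[i] |
    (lt$mon[j] == lt$mon[i] & lt$mday[j] < lt$mday[i])
  age - not_yet
}

bd <- as.Date(c("1978-12-31", "1979-01-01", "1968-02-29", "1968-02-29"))
gd <- as.Date(c("2015-12-31", "2015-12-31", "2015-02-28", "2015-03-01"))
res <- age_unique(bd, gd)
```

Whether this helps depends on how many distinct dates the data contains: with hundreds of millions of records drawn from a few tens of thousands of calendar days, the expensive conversion would run on a much shorter vector, at the price of the unique() and match() calls.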


v64noz0r  #4

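I have been chipping away at this and finally have something that is a) perfectly accurate* (in contrast to all of the other options presented so far) and b) reasonably fast (see my benchmarks in the other answer). It relies on a bunch of arithmetic I did by hand and the wonderful foverlaps function from the data.table package.
The essence of the approach is to work from the integer representation of Dates and to recognize that every birth date falls into one of four types of 1461-day (= 365 * 4 + 1) cycle, depending on which of the four years following your birthday is the one that takes 366 days to come around.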
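The 1461-day figure and the idea of a position within the cycle can be checked with base R alone. This is a small sketch with illustrative variable names; it only covers years without a skipped century leap day.

```r
# Days in the four calendar years 1970-1973 (1972 is the leap year)
cycle_len <- as.numeric(as.Date("1974-01-01") - as.Date("1970-01-01"))

# Position of a date within its 1461-day cycle, counted from 1970-01-01
pos_1970 <- unclass(as.Date("1970-03-01")) %% 1461
pos_1974 <- unclass(as.Date("1974-03-01")) %% 1461

c(cycle_len = cycle_len, pos_1970 = pos_1970, pos_1974 = pos_1974)
```

Dates exactly four years apart share the same position in the cycle, which is what lets a small lookup table of day ranges stand in for calendar logic.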
The function is as follows:

library(data.table)
get_age <- function(birthdays, ref_dates){
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., 58 is Feb. 28, 1970, and 59 is Mar. 1, 1970. I made the
                 #  judgment call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]],])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
}

Comparing on your main example:

toy_df <- data.frame(
  birthdate = birthdate,
  givendate = givendate,
  arithmetic = as.numeric((givendate - birthdate) / 365.25),
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years"),
  mine = get_age(birthdate, givendate)
)

toy_df
#     birthdate  givendate arithmetic lubridate   eeptools       mine
# 1  1978-12-30 2015-12-31 37.0020534 37.027397 37.0027397 37.0027322 #eeptools wrong: will be 366 days until 12/31/16, so fraction is 1/366
# 2  1978-12-31 2015-12-31 36.9993155 37.024658 37.0000000 37.0000000
# 3  1979-01-01 2015-12-31 36.9965777 37.021918 36.9972603 36.9972603
# 4  1962-12-30 2015-12-31 53.0020534 53.038356 53.0027397 53.0027322 #same problem
# 5  1962-12-31 2015-12-31 52.9993155 53.035616 53.0000000 53.0000000
# 6  1963-01-01 2015-12-31 52.9965777 53.032877 52.9972603 52.9972603
# 7  2000-06-16 2050-06-17 50.0013689 50.035616 50.0000000 50.0027397 #eeptools wrong: not exactly the birthday
# 8  2000-06-17 2050-06-17 49.9986311 50.032877 50.9972603 50.0000000 #eeptools wrong: _is_ exactly the birthday
# 9  2000-06-18 2050-06-17 49.9958932 50.030137 49.9945205 49.9972603 #eeptools wrong: fraction should be 364/365
# 10 2007-03-18 2008-03-19  1.0047912  1.005479  1.0027322  1.0027397 #eeptools wrong: 2/29 already passed, only 365 days until 3/19/2009
# 11 2007-03-19 2008-03-19  1.0020534  1.002740  1.0000000  1.0000000
# 12 2007-03-20 2008-03-19  0.9993155  1.000000  0.9966839  0.9972678 #eeptools wrong: we passed 2/29, so should be 365/366
# 13 1968-02-29 2015-02-28 46.9979466 47.030137 46.9977019 46.9972603 #my judgment: birthday occurs on 3/1 for 2/29 babies, so 364/365 the way there
# 14 1968-02-29 2015-03-01 47.0006845 47.032877 47.0000000 47.0000000
# 15 1968-02-29 2015-03-02 47.0034223 47.035616 47.0027397 47.0027322

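This approach could readily be extended to handle months and weeks. Months would be a bit long-winded (you would have to specify 4 years' worth of month lengths), so I did not bother; weeks are easy (weeks are unaffected by leap-year considerations, so we can just divide by 7).
I also made a lot of progress on doing this with base functionality, but a) it was quite ugly (it needs a non-linear transformation of 0-1460 to avoid nested ifelse statements, etc.) and b) in the end a for loop (in the form of apply over the whole list of dates) was unavoidable, so I decided it would slow things down too much. (The transformation is x1 = (unclass(birthdays) - 59) %% 1461; x2 = x1 * (729 - x1) / 402232 + x1, for posterity)
I have added this function to my package.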

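  * "Perfectly accurate" applies to date ranges that do not need to consider non-leap centuries; I believe an extension to handle such dates should not be too troublesome, though.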
