R: Replace missing values (NA) with the most recent non-NA value, by group

gdx19jrr, posted on 2023-04-03 in Other

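I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame of houses and purchase prices. Here is an example: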

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    NA
1            1998    120
1            1999    NA
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    NA
3            1995    NA
3            1996    44
3            1997    NA
3            1998    NA
3            1999    NA

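I would like to create a data frame that looks like this, with each NA replaced by the most recent earlier price for the same house (NAs before a house's first known price stay NA):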

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    100
1            1998    120
1            1999    120
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    30
3            1995    NA
3            1996    44
3            1997    44
3            1998    44
3            1999    44

Here is some data in the right format:

# Number of houses
N = 15

# Data frame
df = data.frame(houseID = rep(1:N,each=10), year=1995:2004, price =ifelse(runif(10*N)>0.15, NA,exp(rnorm(10*N))))

Is there a dplyr way to do this?

xfyts7mz  #1

tidyr::fill now makes this very easy:

library(dplyr)
library(tidyr)
# or library(tidyverse)

df %>% group_by(houseID) %>% fill(price)
# Source: local data frame [15 x 3]
# Groups: houseID [3]
# 
#    houseID  year price
#      (int) (int) (int)
# 1        1  1995    NA
# 2        1  1996   100
# 3        1  1997   100
# 4        1  1998   120
# 5        1  1999   120
# 6        2  1995    NA
# 7        2  1996    NA
# 8        2  1997    NA
# 9        2  1998    30
# 10       2  1999    30
# 11       3  1995    NA
# 12       3  1996    44
# 13       3  1997    44
# 14       3  1998    44
# 15       3  1999    44
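
One thing worth keeping in mind with this approach is that fill() works in row order, so unsorted input should be sorted by year within each house first. The following is a minimal sketch under that assumption; the names `shuffled` and `filled` are made up for illustration, and the data is the 15-row example from the question typed in by hand.

```r
library(dplyr)
library(tidyr)

# The 15-row example from the question
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

# Shuffle the rows to mimic unsorted input (hypothetical scenario)
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

filled <- shuffled %>%
  arrange(houseID, year) %>%              # restore chronological order per house
  group_by(houseID) %>%
  fill(price, .direction = "down") %>%    # "down" is the default direction
  ungroup()

filled$price
```

The `.direction` argument also accepts "up", "downup" and "updown"; those would fill the leading NAs as well, which is not what the question asks for.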

yk9xbfzb  #2

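These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the other solutions.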
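
The difference between the defaults matters because the result has to keep the same length as the input when it is assigned back into a column. A small sketch on one house's prices (the vector `x` is taken from house 1 in the question; the result names are illustrative only):

```r
library(zoo)

x <- c(NA, 100, NA, 120, NA)

dropped <- na.locf(x)                  # na.rm = TRUE by default: the leading NA is dropped
kept    <- na.locf(x, na.rm = FALSE)   # leading NA is kept, length is preserved
kept0   <- na.locf0(x)                 # same result as na.locf(x, na.rm = FALSE)

length(dropped)   # 4
length(kept)      # 5
```

Because `dropped` is shorter than `x`, it could not be used directly inside mutate() or ave(), which is why the solutions here use na.rm = FALSE or na.locf0.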

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

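The other solutions below give very similar output, so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr: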

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than a long result (one column per house, indexed by year):

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

This solution can be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

Input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
  1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
  1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
  30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
  "price"), class = "data.frame", row.names = c(NA, -15L))

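Revised: Rearranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied a fix and factored na.locf2 out of all solutions.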

lsmepo6l  #3

You can do a rolling self-join, which data.table supports:

require(data.table)
setDT(df)   ## change it to data.table in place
setkey(df, houseID, year)     ## needed for fast join
df.woNA <- df[!is.na(price)]  ## version without the NA rows

# rolling self-join will return what you want
df.woNA[df, roll=TRUE]  ## will match previous year if year not found
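
To make the result of the rolling join easier to inspect, the sketch below pulls only the rolled price out of the join and attaches it as a new column. It uses `on =` instead of a key, and the column name `price_filled` is an assumption made for this illustration, not part of the answer above.

```r
library(data.table)

# The 15-row example from the question
df <- data.table(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

# Rows that actually carry a price
df.woNA <- df[!is.na(price)]

# For every row of df, look up the price of the same house in the same year,
# or, if that year is missing from df.woNA, in the closest earlier year.
# The roll applies to the last join column (year) within each houseID.
rolled <- df.woNA[df, price, on = .(houseID, year), roll = TRUE]

# price_filled is a hypothetical column name used only in this sketch
df[, price_filled := rolled]
df
```

Years that precede a house's first known price have nothing to roll forward from, so they stay NA, which matches the desired output in the question.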

368yc8dk  #4

A pure dplyr solution (no zoo).

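df %>% 
  group_by(houseID) %>%
  # price_change increases by 1 at every non-NA price, so each observed
  # price and the NAs that follow it share one id within the house
  mutate(price_change = cumsum(0 + !is.na(price))) %>%
  # .add = TRUE keeps the houseID grouping (dplyr >= 1.0.0; formerly add = TRUE)
  group_by(price_change, .add = TRUE) %>%
  mutate(price_filled = nth(price, 1)) %>%
  ungroup() %>%
  select(-price_change) -> df2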

The interesting part of the result for the example data is at the end of df2.

> tail(df2, 20)
Source: local data frame [20 x 4]

    houseID year     price price_filled
 1       14 1995        NA           NA
 2       14 1996        NA           NA
 3       14 1997        NA           NA
 4       14 1998        NA           NA
 5       14 1999 0.8374778    0.8374778
 6       14 2000        NA    0.8374778
 7       14 2001        NA    0.8374778
 8       14 2002        NA    0.8374778
 9       14 2003 2.1918880    2.1918880
10       14 2004        NA    2.1918880
11       15 1995        NA           NA
12       15 1996 0.3982450    0.3982450
13       15 1997        NA    0.3982450
14       15 1998 1.7727000    1.7727000
15       15 1999        NA    1.7727000
16       15 2000        NA    1.7727000
17       15 2001        NA    1.7727000
18       15 2002 7.8636329    7.8636329
19       15 2003        NA    7.8636329
20       15 2004        NA    7.8636329
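
The cumsum step in this answer can be easier to follow on a single vector. The sketch below uses base R only and the prices of house 1 from the question; `run_id` and `filled` are names chosen for illustration.

```r
x <- c(NA, 100, NA, 120, NA)

# The counter goes up by one at each non-NA value, so every observed price
# and the NAs that follow it end up with the same id.
run_id <- cumsum(!is.na(x))   # 0 1 1 2 2

# Within each run, repeat the first element: the observed price,
# or NA for the leading run that has no observed price yet.
filled <- ave(x, run_id, FUN = function(v) v[1])

filled   # NA 100 100 120 120
```

In the dplyr pipeline the same idea is applied per house: the first group_by restarts the counter for each houseID, and the second one groups by the run id so that the first price of each run can be broadcast.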

t2a7ltrp  #5

A combination of dplyr and imputeTS.

library(dplyr)
library(imputeTS)
df %>% group_by(houseID) %>% 
mutate(price = na.locf(price, na.remaining="keep"))

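You can also swap na.locf for one of the more advanced missing-data replacement (imputation) functions in imputeTS, for example na.interpolation or na.kalman. To do so, just replace na.locf with the name of the function you prefer. Note that imputeTS 3.0 and later spell these with underscores (na_locf, na_interpolation, na_kalman), and the argument is then na_remaining.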

tnkciper  #6

Without dplyr (the data frame is called prices here, and its rows must already be sorted by houseID):

prices$price <-unlist(lapply(split(prices$price,prices$houseID),
function(x) zoo::na.locf(x,na.rm=FALSE)))

prices
   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

vnjpjtjt  #7

As of data.table v1.12.4, the package has an nafill() function, similar to tidyr::fill() or zoo::na.locf(), so you can do:

require(data.table)
setDT(df)

df[ , price := nafill(price, type = 'locf'), houseID ]

There is also setnafill(), which does not allow grouping but does allow multiple columns.

setnafill(df, type = 'locf', cols = 'price')
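
Because setnafill() has no grouping, the last price of one house is carried into the leading NAs of the next house. The sketch below contrasts the two calls on the example data; `grouped` and `ungrouped` are names invented for this comparison.

```r
library(data.table)

dt <- data.table(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

# Grouped fill: each house is filled independently
grouped <- copy(dt)[, price := nafill(price, type = "locf"), by = houseID]

# Ungrouped fill, by reference: values cross house boundaries
ungrouped <- copy(dt)
setnafill(ungrouped, type = "locf", cols = "price")

grouped$price     # NA 100 100 120 120  NA  NA  NA 30 30 NA 44 44 44 44
ungrouped$price   # NA 100 100 120 120 120 120 120 30 30 30 44 44 44 44
```

For the question as asked, the grouped nafill() call is therefore the one that reproduces the desired output.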

Data taken from @G. Grothendieck's answer:

df = data.frame(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                            2L, 3L, 3L, 3L, 3L, 3L),
                year = c(1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L,
                         1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 1998L, 1999L),
                price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L,
                          NA, NA, NA))
