选择每个项目R的最后观测值

7lrncoxx  于 2023-01-15  发布在  其他
关注(0)|答案(1)|浏览(145)

数据库(在摘录下面)由公司组成,对于这些公司,每个观察包含关于公司一年的信息。每个公司具有唯一标识符"gvkey"(第一列)。变量"costat"可以采取两种形式"A"或"I"。
我想选择变量"costat"中值为"I"的每个公司的最后一个观测值,并计算列"dlrsn"中每个值每年出现的次数("year"列)。变量"dlrsn"可以取1到14之间的数字。我想知道每个数字每年出现多少次("年"列),在选择每家公司的最后一次观察之后。
预期输出,假设只有5年,"dlrsn"有两个可能值:

year dlrsn1 dlrsn2
1 1977      1      0
2 1989      0     1

其中dlrsn1 = 9(如数据中所示)并且dlrsn2 = 7。
数据框如下

dput(example)
structure(list(gvkey = c(1000L, 1000L, 1000L, 1000L, 1000L, 1003L, 
1003L, 1003L, 1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L), fyear = c(1973L, 1974L, 
1975L, 1976L, 1977L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 
1989L, 1973L, 1974L, 1975L, 1976L, 1977L, 1978L, 1979L, 1980L, 
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 
1990L), costat = c("I", "I", "I", "I", "I", "I", "I", "I", "I", 
"I", "I", "I", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A"), dlrsn = c(9L, 9L, 9L, 
9L, 9L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), sale = c(37.75, 
50.325, 51.192, 66.414, 77.946, 13.793, 13.829, 24.189, 36.308, 
37.356, 32.808, 19.267, 57.312, 54.626, 65.797, 80.535, 92.82, 
118.667, 130.617, 132.482, 175.924, 155.006, 177.762, 218.946, 
248.012, 298.192, 347.64, 406.36, 444.875, 466.542)), row.names = c(NA, 
-30L), class = "data.frame")

我甚至不知道该怎么开始。谢谢

inn6fuwd

inn6fuwd1#

使用data.table

library(data.table)

dt2 <- dcast(
  setorder(
    setDT(dt),
    gvkey, fyear # make sure it's ordered by gvkey and fyear
  )[
    costat == "I", # select only rows with costat = "I
    .(year = fyear[.N], dlrsn = dlrsn[.N]), # get the last observation
    gvkey # group by gvkey
  ],
  year~dlrsn, length, value.var = "dlrsn" # cast long to wide by dlrsn count
)

setnames(dt2, names(dt2)[-1], paste0("dlrsn", names(dt2)[-1]))[] # rename columns
#>    year dlrsn7 dlrsn9
#> 1: 1977      0      1
#> 2: 1989      1      0

数据:

dt <- data.frame(
  gvkey = c(1000L, 1000L, 1000L, 1000L, 1000L, 1003L, 
            1003L, 1003L, 1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L, 
            1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
            1004L, 1004L, 1004L, 1004L, 1004L, 1004L),
  fyear = c(1973L, 1974L, 
            1975L, 1976L, 1977L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 
            1989L, 1973L, 1974L, 1975L, 1976L, 1977L, 1978L, 1979L, 1980L, 
            1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 
            1990L), 
  costat = c("I", "I", "I", "I", "I", "I", "I", "I", "I", 
             "I", "I", "I", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
             "A", "A", "A", "A", "A", "A", "A", "A"), 
  dlrsn = c(9L, 9L, 9L, 
            9L, 9L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, NA, NA, NA, NA, NA, NA, NA, 
            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)

相关问题