数据库(在摘录下面)由公司组成,对于这些公司,每个观察包含关于公司一年的信息。每个公司具有唯一标识符"gvkey"(第一列)。变量"costat"可以采取两种形式"A"或"I"。
我想选择变量"costat"中值为"I"的每个公司的最后一个观测值,并计算列"dlrsn"中每个值每年出现的次数("year"列)。变量"dlrsn"可以取1到14之间的数字。我想知道每个数字每年出现多少次("年"列),在选择每家公司的最后一次观察之后。
预期输出,假设只有5年,"dlrsn"有两个可能值:
year dlrsn1 dlrsn2
1 1977 1 0
2 1989 0 1
其中dlrsn1 = 9(如数据中所示)并且dlrsn2 = 7。
数据框如下
dput(example)
structure(list(gvkey = c(1000L, 1000L, 1000L, 1000L, 1000L, 1003L,
1003L, 1003L, 1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L,
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1004L, 1004L, 1004L, 1004L), fyear = c(1973L, 1974L,
1975L, 1976L, 1977L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L,
1989L, 1973L, 1974L, 1975L, 1976L, 1977L, 1978L, 1979L, 1980L,
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L,
1990L), costat = c("I", "I", "I", "I", "I", "I", "I", "I", "I",
"I", "I", "I", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A"), dlrsn = c(9L, 9L, 9L,
9L, 9L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), sale = c(37.75,
50.325, 51.192, 66.414, 77.946, 13.793, 13.829, 24.189, 36.308,
37.356, 32.808, 19.267, 57.312, 54.626, 65.797, 80.535, 92.82,
118.667, 130.617, 132.482, 175.924, 155.006, 177.762, 218.946,
248.012, 298.192, 347.64, 406.36, 444.875, 466.542)), row.names = c(NA,
-30L), class = "data.frame")
我甚至不知道该怎么开始。谢谢
1条答案
按热度按时间inn6fuwd1#
使用
data.table
:数据: