I have a dataset like this:
year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007")
id = c("X", "X", "X", "X", "Z", "Z", "Z")
product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon")
market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
df = data.frame(year, id, product, market)
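I want to create 3 variables indicating:
- FPFM = 1 if it is the first sale of that product in that given market
- FP = 1 if it is the first time for that product
- FM = 1 if it is the first entry into that market
So the new data would look like this: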
year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007")
id = c("X", "X", "X", "X", "Z", "Z", "Z")
product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon")
market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
FPFM = c(1, 1, 1, 0, 1, 1, 1)
FP = c(1, 1, 1, 0, 1, 0, 1)
FM = c(1, 1, 1, 0, 1, 1, 0)
df_desired = data.frame(year, id, product, market, FPFM, FP, FM)
I tried the following code for df_new, but it did not work:
library(dplyr)
df_new <- df %>%
  arrange(id, year) %>%
  group_by(id, product, market) %>%
  mutate(FPFM = row_number(year) == 1) %>%
  as.data.frame() %>%
  group_by(id, product) %>%
  mutate(FP = row_number(year) == 1) %>%
  as.data.frame() %>%
  group_by(id, market) %>%
  mutate(FM = row_number(year) == 1) %>%
  as.data.frame()
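It flags only the first observation in each group. I want the flag for every observation in the first year in which the product, the market, or the combination of both is observed.
Row 3 should be TRUE; TRUE; TRUE rather than FALSE; FALSE; FALSE, because it falls in the same year as row 2.
Another solution I thought of is to summarise df three times with unique values and then right-join each result with the original df. However, that would take a lot of time and memory, since I have a large amount of data.
Do you have a more efficient, more integrated solution?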
2 Answers
Answer #1 (shyt4zoc)
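I would just write a small helper function to make the code more concise. Note that we can use arithmetic to convert the logical values to binary (1/0).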
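The code that accompanied this answer did not survive, so the following is only a minimal sketch of what such a helper might look like, not the answerer's original. The name is_first_year is hypothetical, and comparing against min(year) assumes the years are four-digit strings (so that string order matches chronological order) or numbers.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  year    = c("2000", "2000", "2000", "2002", "2000", "2002", "2007"),
  id      = c("X", "X", "X", "X", "Z", "Z", "Z"),
  product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon"),
  market  = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
)

# Hypothetical helper (the name is an assumption): flags every row that
# falls in the earliest year of its group. The unary plus is the
# "arithmetic" trick that turns TRUE/FALSE into 1/0.
is_first_year <- function(year) +(year == min(year))

df_new <- df %>%
  group_by(id, product, market) %>%   # product-market combination
  mutate(FPFM = is_first_year(year)) %>%
  group_by(id, product) %>%           # product only
  mutate(FP = is_first_year(year)) %>%
  group_by(id, market) %>%            # market only
  mutate(FM = is_first_year(year)) %>%
  ungroup()
```

Because the helper compares against the group minimum instead of the first row, it does not rely on the data being sorted beforehand. Each group_by() call replaces the previous grouping, so no intermediate ungroup() or as.data.frame() is needed.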
Answer #2 (gpfsuwkq)
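Change row_number(year) == 1 to year == year[1]. Because the data are already sorted with arrange(id, year), year[1] is the earliest year within each group, so every row from that year is flagged, not just the first one.
Also, the repeated as.data.frame calls seem unnecessary. If you really want a data.frame rather than a tibble, you can keep the last one, but in my opinion a tibble is the better choice. See the section on tibbles in "Advanced R" for some of the reasons. With this change, the three columns match the desired output, with TRUE/FALSE in place of 1/0.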
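The printed result from this answer was lost, so here is a sketch of the question's pipeline with the suggested substitution applied. It assumes the example df from the question and drops the intermediate as.data.frame() calls; the final ungroup() is an addition for tidiness, not part of the original answer.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  year    = c("2000", "2000", "2000", "2002", "2000", "2002", "2007"),
  id      = c("X", "X", "X", "X", "Z", "Z", "Z"),
  product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon"),
  market  = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
)

df_new <- df %>%
  arrange(id, year) %>%               # sort so year[1] is each group's first year
  group_by(id, product, market) %>%
  mutate(FPFM = year == year[1]) %>%  # TRUE for all rows in the first year
  group_by(id, product) %>%
  mutate(FP = year == year[1]) %>%
  group_by(id, market) %>%
  mutate(FM = year == year[1]) %>%
  ungroup()                           # returns an ungrouped tibble
```

The flags come out as logicals; wrapping each comparison in as.integer() would give the 1/0 coding from the question. Note that this version depends on the arrange() step, since year[1] is only the earliest year when rows are sorted by year within each id.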