Replace NA with random numbers in an R column

icnyk63a  posted on 2023-04-27 in Other

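I have two columns, clientid and clientname. Some clients have no id. Rows with the same client name should get the same random id, and that id must not be reused for any other client name. I thought I could do it like this: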
df1 <- df %>% dplyr::group_by(clientname) %>% dplyr::mutate(clientid = ifelse(is.na(clientid), sample(1:10000), clientid))
But I still get repeated draws of random numbers, even when the client names are quite different.

clientname <- c("Mr A", "Mr B", "Mr B", "Mr C", "Mr D")
clientid <- c(NA,NA,NA,NA,1)

df <- data.frame(clientname,clientid)
df
  clientname clientid
1       Mr A       NA
2       Mr B       NA
3       Mr B       NA
4       Mr C       NA
5       Mr D        1

df1 <- df %>% dplyr::group_by(clientname) %>% dplyr::mutate(clientid = ifelse(is.na(clientid), sample(1:10000), clientid))
df1
# A tibble: 5 x 2
# Groups:   clientname [4]
  clientname clientid
  <chr>         <dbl>
1 Mr A            948
2 Mr B           4004
3 Mr B           7888
4 Mr C           4668
5 Mr D              1

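I would like the result to look like this, where both "Mr B" rows share the same clientid and that id differs from the ids of the other client names: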

> df1
# A tibble: 5 x 2
# Groups:   clientname [4]
  clientname clientid
  <chr>         <dbl>
1 Mr A            948
2 Mr B           4004
3 Mr B           4004
4 Mr C           4668
5 Mr D              1
fnvucqvd  #1

To make sure you do not reuse an existing ID, you can add an offset to the largest existing ID:

suppressPackageStartupMessages(library(dplyr))

clientname <- c("Mr A", "Mr B", "Mr B", "Mr C", "Mr D")
clientid <- c(NA,NA,NA,NA,1)

df <- data.frame(clientname,clientid)

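df %>% 
  # offset each name's factor code (1, 2, 3, ...) by the largest existing id,
  # so every candidate id is larger than any id already in use
  mutate(temp_id = max(clientid, na.rm = TRUE) + as.numeric(as.factor(clientname)),
         # keep existing ids; fill only the NAs with the candidate id
         clientid = coalesce(clientid, temp_id),
         temp_id = NULL)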
#>   clientname clientid
#> 1       Mr A        2
#> 2       Mr B        3
#> 3       Mr B        3
#> 4       Mr C        4
#> 5       Mr D        1

Created on 2023-04-18 with reprex v2.0.2
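
The ids produced above are deterministic: they follow the alphabetical order of the names. If the ids should also look random, one possible variation (a sketch, not part of the answer above) is to shuffle the factor codes before adding them to the largest existing id. The names `max_id`, `codes`, `offsets` and `result` are introduced here for illustration, and the sketch assumes at least one non-NA clientid exists.

```r
library(dplyr)

df <- data.frame(
  clientname = c("Mr A", "Mr B", "Mr B", "Mr C", "Mr D"),
  clientid   = c(NA, NA, NA, NA, 1)
)

set.seed(42)  # only so that the sketch is reproducible

# Largest id already in use (assumes at least one non-NA id)
max_id <- max(df$clientid, na.rm = TRUE)

# One integer code per distinct name: 1..k
codes <- as.integer(factor(df$clientname))

# Random permutation of 1..k, so each name gets a distinct shuffled offset
offsets <- sample(max(codes))

result <- df %>%
  mutate(clientid = coalesce(clientid, max_id + offsets[codes]))

result
```

Because the offsets are a permutation of 1..k, two different names can never receive the same new id, and every new id is larger than any existing one.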

ifmq2ha2  #2

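My original answer could assign duplicate random numbers (thanks to Ritchie Sacramento for pointing out my oversight!). See the edited answer below for a better solution.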

Original answer:

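dplyr's if_else() function helps explain the problem. When you call sample(1:1000) you generate a vector of 1000 numbers (a random permutation of 1:1000). ifelse() uses those numbers to 'fill' the NAs one by one and silently discards the rest, so each row in a group gets a different value. if_else() is stricter and raises an error in this situation (in effect: 'you need the right number of random numbers'). If you specify how many random numbers to draw from 1:1000 (i.e. one per clientname), you get the desired result, e.g.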

library(dplyr, warn = FALSE)

clientname <- c("Mr A", "Mr B", "Mr B", "Mr C", "Mr D")
clientid <- c(NA,NA,NA,NA,1)

df <- data.frame(clientname,clientid)
df
#>   clientname clientid
#> 1       Mr A       NA
#> 2       Mr B       NA
#> 3       Mr B       NA
#> 4       Mr C       NA
#> 5       Mr D        1

df %>%
  group_by(clientname) %>%
  mutate(clientid = if_else(is.na(clientid), sample(1:1000), clientid))
#> Error in `mutate()`:
#> ℹ In argument: `clientid = if_else(is.na(clientid), sample(1:1000),
#>   clientid)`.
#> ℹ In group 1: `clientname = "Mr A"`.
#> Caused by error in `if_else()`:
#> ! `true` must have size 1, not size 1000.
#> Backtrace:
#>      ▆
#>   1. ├─df %>% group_by(clientname) %>% ...
#>   2. ├─dplyr::mutate(...)
#>   3. ├─dplyr:::mutate.data.frame(...)
#>   4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
#>   5. │   ├─base::withCallingHandlers(...)
#>   6. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
#>   7. │     └─mask$eval_all_mutate(quo)
#>   8. │       └─dplyr (local) eval()
#>   9. └─dplyr::if_else(is.na(clientid), sample(1:1000), clientid)
#>  10.   └─dplyr:::vec_case_when(...)
#>  11.     └─vctrs::vec_assert(value, size = size, arg = value_arg, call = call)
#>  12.       └─vctrs:::stop_assert_size(x_size, size, arg, call = call)
#>  13.         └─vctrs:::stop_assert(...)
#>  14.           └─vctrs:::stop_vctrs(...)
#>  15.             └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)

df %>%
  group_by(clientname) %>%
  mutate(clientid = if_else(is.na(clientid), sample(1:1000, 1), clientid))
#> # A tibble: 5 × 2
#> # Groups:   clientname [4]
#>   clientname clientid
#>   <chr>         <dbl>
#> 1 Mr A            901
#> 2 Mr B            113
#> 3 Mr B            113
#> 4 Mr C             88
#> 5 Mr D              1

# If you change sample(1:1000) to sample(1:1000, 1), base R's ifelse() works as well:
df %>%
  group_by(clientname) %>%
  mutate(clientid = ifelse(is.na(clientid), sample(1:1000, 1), clientid))
#> # A tibble: 5 × 2
#> # Groups:   clientname [4]
#>   clientname clientid
#>   <chr>         <dbl>
#> 1 Mr A            555
#> 2 Mr B            374
#> 3 Mr B            374
#> 4 Mr C            979
#> 5 Mr D              1

Created on 2023-04-19 with reprex v2.0.2
More details on the differences between if_else() and ifelse(): https://medium.com/@statisticswithoutborders/r-function-of-the-week-ifelse-vs-if-else-bed37f474fca
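
To see the ifelse() behaviour in isolation, here is a minimal sketch outside of any data frame. The vector `test` and the names `perm`, `one`, `out_many` and `out_one` are made up for illustration.

```r
set.seed(1)  # only so that the sketch is reproducible

test <- c(TRUE, TRUE, FALSE)

# `yes` has 1000 elements, but ifelse() returns a vector the length of `test`,
# so only the values at positions 1 and 2 of the permutation are used.
perm <- sample(1:1000)
out_many <- ifelse(test, perm, 0)

# A length-1 `yes` is recycled, so every TRUE position gets the same value.
one <- sample(1:1000, 1)
out_one <- ifelse(test, one, 0)

out_many
out_one
```

Inside a grouped mutate() the same thing happens once per group: with sample(1:1000) each row of the group picks up a different element of the permutation, while sample(1:1000, 1) gives the whole group one shared value.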

Edit 1:

With the approach above, two different clientnames can end up with the same clientid (very bad). To avoid this, one possible option is:

library(dplyr, warn = FALSE)
library(vctrs, warn = FALSE)

clientname <- c("Mr A", "Mr B", "Mr B", "Mr C", "Mr D")
clientid <- c(NA,NA,NA,NA,1)

df <- data.frame(clientname,clientid)
# change '+ 5' to '+ 1000' or however many you need
# (the parentheses matter: `a:b + 5` is parsed as `(a:b) + 5`)
max_id <- max(df$clientid, na.rm = TRUE)
random_numbers <- sample(max_id:(max_id + 5))
available_numbers <- random_numbers[!random_numbers %in% df$clientid]
available_numbers
#> [1] 3 6 4 5 2
# Fill the NAs 'sequentially', regardless of group
df$clientid[is.na(df$clientid)] <- available_numbers
#> Warning in df$clientid[is.na(df$clientid)] <- available_numbers: number of
#> items to replace is not a multiple of replacement length
df
#>   clientname clientid
#> 1       Mr A        3
#> 2       Mr B        6
#> 3       Mr B        4
#> 4       Mr C        5
#> 5       Mr D        1

# Then, fill 'down' the clientid from the first clientname in each group
df %>%
  group_by(clientname) %>%
  mutate(groupid = cumsum(!is.na(clientname))) %>%
  mutate(clientid = ifelse(groupid >= 2, NA, clientid)) %>%
  mutate(clientid = vec_fill_missing(clientid, direction = "down")) %>%
  select(-groupid)
#> # A tibble: 5 × 2
#> # Groups:   clientname [4]
#>   clientname clientid
#>   <chr>         <dbl>
#> 1 Mr A              3
#> 2 Mr B              6
#> 3 Mr B              6
#> 4 Mr C              5
#> 5 Mr D              1

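Created on 2023-04-19 with reprex v2.0.2
This is a complicated answer to a simple question, but it ensures (I believe) that the 'random' numbers are assigned as required.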
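
For comparison, a more compact sketch of the same idea: draw one unused random id per distinct name that has no id yet, then map the ids back onto the rows. This is not the method from the answer above. The pool `1:10000` is taken from the question, the names `id_pool`, `known`, `new_names`, `free_ids`, `new_ids`, `lookup` and `result` are illustrative, and the sketch assumes each clientname has at most one known id and that the pool is large enough.

```r
library(dplyr)

df <- data.frame(
  clientname = c("Mr A", "Mr B", "Mr B", "Mr C", "Mr D"),
  clientid   = c(NA, NA, NA, NA, 1)
)

set.seed(123)  # only so that the sketch is reproducible

id_pool <- 1:10000  # assumed range of admissible ids

# Name/id pairs that already exist (assumes at most one id per name)
known <- df %>%
  filter(!is.na(clientid)) %>%
  distinct(clientname, clientid)

# Names that have no id on any row, and ids that are still free
new_names <- setdiff(unique(df$clientname), known$clientname)
free_ids  <- setdiff(id_pool, known$clientid)

# Sample positions rather than values: sample(x, n) would treat a
# length-1 numeric x as 1:x. Sampling is without replacement, so the
# new ids are distinct from each other and from the known ids.
new_ids <- free_ids[sample.int(length(free_ids), length(new_names))]

lookup <- bind_rows(
  known,
  data.frame(clientname = new_names, clientid = new_ids)
)

# Every row receives the id belonging to its name
result <- df %>%
  mutate(clientid = lookup$clientid[match(clientname, lookup$clientname)])

# Sanity checks: no NA left, and exactly one id per name
stopifnot(!anyNA(result$clientid))
stopifnot(n_distinct(result$clientid) == n_distinct(result$clientname))

result
```

Sampling once per distinct name, rather than once per row or per group, is what rules out both problems discussed in this thread: different ids within one name, and the same id shared by two names.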
