R: Conditionally join/merge two data frames

jgzswidk posted on 2023-11-14 in Other

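I have two data frames, df.old and df.new. I want to add the PID variable from df.old to df.new, matching rows on email address. Where no email address is available, the match should fall back to first and last name (paste(firstname, lastname)). How can I do this efficiently?
My first guess was to write two lookup functions, get_PID_by_mail(mail) and get_PID_by_name(firstname, lastname), vectorize them, and apply them with df.new %>% mutate(PID = get_PID_by_mail(mail)). That turned out to be rather inefficient because the data frames are large. How would you solve this? Thanks!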

df.old <- data.frame(PID = c(1, 2, 3, 4, NA),
                     firstname = c("", "Peter", "David", "Jessy", ""),
                     lastname = c("", "White", "Smith", "Connor", ""),
                     mail = c("[email protected]", "[email protected]", NA, "[email protected]", NA))

df.new <- data.frame(mail = c("[email protected]", "[email protected]", NA, NA , NA),
                     firstname = c("", "", "", "David", ""),
                     lastname = c("", "", "", "Smith", ""))
df.new

Expected output:

df.new
======
   mail                firstname lastname  PID
1  [email protected]                          1
2  [email protected]                          2
3  <NA>                                    <NA>
4  <NA>                David     Smith     3
5  <NA>                                    <NA>


pexxcrt2 #1

Using two left_join calls, you could do:

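library(dplyr, warn.conflicts = FALSE)

df.new |>
  # First pass: match on email, using only old rows that have one
  left_join(df.old |>
    filter(!is.na(mail)) |>
    select(PID, mail), by = "mail") |>
  # Second pass: match on name, using only old rows without an email
  left_join(df.old |>
    filter(is.na(mail)) |>
    select(-mail), by = c("firstname", "lastname")) |>
  # Prefer the email match (PID.x), fall back to the name match (PID.y)
  mutate(PID = coalesce(PID.x, PID.y), .keep = "unused")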
#>             mail firstname lastname PID
#> 1 [email protected]                      1
#> 2 [email protected]                      2
#> 3           <NA>                     NA
#> 4           <NA>     David    Smith   3
#> 5           <NA>                     NA

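If you would rather avoid joins altogether, the same lookup can be sketched in base R with match(). This is only a rough sketch under assumptions: the addresses below are made-up placeholders (not the ones from the question), emails and full names are assumed to be unique in the old data, and unlike the joins above it name-matches against every old row, not just those without an email.

```r
# Hypothetical data: the addresses are placeholders, not the original ones.
old <- data.frame(
  PID = c(1, 2, 3, 4),
  firstname = c("", "Peter", "David", "Jessy"),
  lastname = c("", "White", "Smith", "Connor"),
  mail = c("a@example.com", "b@example.com", NA, "d@example.com")
)
new <- data.frame(
  mail = c("a@example.com", "b@example.com", NA, NA),
  firstname = c("", "", "", "David"),
  lastname = c("", "", "", "Smith")
)

# Look up by email; incomparables = NA stops a missing email matching another NA
pid_by_mail <- old$PID[match(new$mail, old$mail, incomparables = NA)]

# Look up by full name; a blank name pastes to " " and is treated as unmatchable
key_old <- paste(old$firstname, old$lastname)
key_new <- paste(new$firstname, new$lastname)
pid_by_name <- old$PID[match(key_new, key_old, incomparables = " ")]

# Prefer the email match, fall back to the name match
new$PID <- ifelse(is.na(pid_by_mail), pid_by_name, pid_by_mail)
new
```

match() returns only the first hit for each value, so duplicated emails or names in the old data would need to be resolved beforehand.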


rsl1atfo #2

A minute after the other answer, but done with bind_rows instead:

library(tidyverse)
df.old <- data.frame(PID = c(1, 2, 3, 4, NA),
                     firstname = c("", "Peter", "David", "Jessy", ""),
                     lastname = c("", "White", "Smith", "Connor", ""),
                     mail = c("[email protected]", "[email protected]", NA, "[email protected]", NA))

df.new <- data.frame(mail = c("[email protected]", "[email protected]", NA, NA , NA),
                     firstname = c("", "", "", "David", ""),
                     lastname = c("", "", "", "Smith", ""))

bind_rows(
  df.new |>
    filter(mail != "") |>
    left_join(df.old |> select(mail, PID)),
  
  df.new |>
    filter(is.na(mail)) |>
    left_join(
      df.old |> filter(is.na(mail)) |> select(firstname, lastname, PID),
      by = join_by(firstname, lastname)
    )
)
#> Joining with `by = join_by(mail)`
#>             mail firstname lastname PID
#> 1 [email protected]                      1
#> 2 [email protected]                      2
#> 3           <NA>                     NA
#> 4           <NA>     David    Smith   3
#> 5           <NA>                     NA

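One thing worth checking with a split-and-bind approach like this is that the two filters together cover every row. filter() keeps only rows where the condition is TRUE, so mail != "" silently drops NA as well as "", and is.na(mail) picks up only the NA rows. The base R sketch below mimics those two conditions on a made-up vector (the address is a placeholder) to show which values would fall through both branches:

```r
# Hypothetical email column: one present, one missing, one empty string
mail <- c("a@example.com", NA, "")

# Rows the first branch keeps: filter(mail != "") drops NA as well as ""
has_mail <- !is.na(mail) & mail != ""

# Rows the second branch keeps: filter(is.na(mail))
no_mail <- is.na(mail)

# Rows kept by neither branch would vanish from the bound result
dropped <- !(has_mail | no_mail)
mail[dropped]
```

With the sample data in this answer no row has an empty-string email, so nothing is lost there.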

Edit: skipping bad emails

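The simplest way to make the email check more robust is to use a regular expression to test email validity in the first match, exclude those matches (and pick up the NAs) in the second match, and then combine the results: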

library(tidyverse)
df.old <- data.frame(PID = c(1, 2, 3, 4, 5, 6),
                     firstname = c("", "Peter", "David", "Jessy", "Bad", "Gordon"),
                     lastname = c("", "White", "Smith", "Connor", "Email", "Bennet"),
                     mail = c("[email protected]", "[email protected]", NA, "[email protected]", NA, "[email protected]"))

df.new <- data.frame(mail = c("[email protected]", "[email protected]", NA, "" , "bademail@none", "[email protected]"),
                     firstname = c("", "", "", "David", "Bad", "Gordon"),
                     lastname = c("", "", "", "Smith", "Email", "Bennet"))

# Step 1: all valid, present, emails with matching records:
first_join <- df.new |>
  filter(str_detect(
    mail,
    "^\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*$"
  )) |>
  inner_join(df.old |> select(mail, PID), by = join_by(mail))

# Step 2: all remaining records, with invalid emails, missing emails or
# non-matched emails; join to first set
df.new |>
  anti_join(first_join, by = join_by(mail)) |>
  left_join(df.old |> 
              anti_join(first_join, by = join_by(mail)) |> 
              select(firstname, lastname, PID),
            by = join_by(firstname, lastname)) |> 
  bind_rows(first_join, second = _)
#>                mail firstname lastname PID
#> 1    [email protected]                      1
#> 2    [email protected]                      2
#> 3              <NA>                     NA
#> 4                       David    Smith   3
#> 5     bademail@none       Bad    Email   5
#> 6 [email protected]    Gordon   Bennet   6


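Here Mr Bad Email cannot be matched by email (because it is not a valid email address), so he is skipped in the first step and matched by name. Mr Gordon Bennet has changed his email, so his new one is not found and he is matched by name. Finally, David has an empty string as his email (""), so he is skipped and matched by name. The invisible person with no name, email or PID is kept in the new data frame as given.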
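To see how the regex check sorts the kinds of values discussed above, here is a small base R sketch. The addresses are made-up placeholders, and it uses grepl() rather than str_detect(). The one difference to keep in mind is that grepl() returns FALSE for NA, whereas str_detect() returns NA, which filter() then drops; either way the row is excluded from the email match.

```r
# Same pattern as in the answer above
email_pattern <- "^\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*$"

# Hypothetical values: a well-formed address, one without a domain suffix,
# an empty string, and a missing value
mail <- c("user@example.com", "bademail@none", "", NA)

# TRUE only where the whole string looks like an email address
valid <- grepl(email_pattern, mail, perl = TRUE)

data.frame(mail = mail, valid = valid)
```

Only the first value passes, so the other three would all be left for the name-based step.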
