How do I remove duplicate columns after a dplyr join?

Posted by tjrkku2a on 2023-04-03 in Other

Consider two data frames, df1 and df2.
df1 has columns id, a, b.
df2 has columns id, a, c.
I want to perform a left join so that the combined data frame has columns id, a, b, c.

combined <- df1 %>% left_join(df2, by="id")

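But the columns in the combined data frame are id, a.x, b, a.y, c.
I could include "a" in the join key (i.e. left_join(df1, df2, by=c("id", "a"))), but there are too many columns like a. I want to join on the primary key id only and drop all the duplicated columns that come from df2.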

Source: https://stackoverflow.com/questions/61627921/how-to-remove-duplicate-columns-after-dplyr-join

6 Answers

col17t5w1#

I like to do things in as few steps as possible. I think this keeps the number of steps down:

combine <- df1 %>%
  left_join(df2, by = "id", suffix = c("", ".y")) %>%
  select(-ends_with(".y"))

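The minus sign in the select call means you select everything except those variables. If you want to drop every duplicated column, so that no column a remains at all, you can do this: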

combine <- df1 %>%
  left_join(df2, by = "id", suffix = c(".x", ".y")) %>%
  select(-ends_with(".x"), -ends_with(".y"))
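A minimal worked sketch of the first variant. The toy data frames below are invented for illustration and are not from the question; only the join and select calls follow the answer.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
df1 <- data.frame(id = 1:3, a = 2:4, b = 3:5)
df2 <- data.frame(id = c(3L, 2L), a = 4:5, c = 1:2)

combined <- df1 %>%
  # df1's columns keep their names; clashing columns from df2 get ".y"
  left_join(df2, by = "id", suffix = c("", ".y")) %>%
  # drop df2's copies of the clashing columns
  select(-ends_with(".y"))

names(combined)
```

Note that select(-ends_with(".y")) would also drop any original column whose name happens to end in ".y", so a less common suffix may be safer when that is a risk.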

omjgkv6w2#

First we perform the join on id:

combined <- df1 %>% left_join(df2, by="id")

Then we rename the columns ending in .x (stripping the suffix) and drop the columns ending in .y:

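library(stringr)  # provides str_replace()

combined <- combined %>%
  # strip the ".x" suffix from the columns that came from df1
  rename_at(
    vars(ends_with(".x")),
    ~str_replace(., "\\.x$", "")
  ) %>%
  # drop the duplicated columns that came from df2
  select_at(
    vars(-ends_with(".y"))
  )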
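The same two steps can also be sketched with rename_with(), which dplyr (1.0.0 and later) offers in place of the superseded rename_at()/select_at(). This is an alternative formulation, not the answerer's code, and the toy data is invented for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
df1 <- data.frame(id = 1:3, a = 2:4, b = 3:5)
df2 <- data.frame(id = c(3L, 2L), a = 4:5, c = 1:2)

combined <- df1 %>%
  left_join(df2, by = "id") %>%      # columns: id, a.x, b, a.y, c
  select(-ends_with(".y")) %>%       # drop df2's duplicated columns
  # strip the ".x" suffix with base sub(), so stringr is not needed
  rename_with(~ sub("\\.x$", "", .x), ends_with(".x"))

names(combined)
```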

xqnpmsa83#

A more general approach is to drop the columns before the left join; otherwise the combined data set may be very large at first:

df1 <- data.frame(id = 1:10, a = rnorm(10, 0.2), b = rpois(10, 0.2))
df2 <- data.frame(id = 1:10, a = rnorm(10, 0.2), c = rnorm(10, 0.2))

varList <- names(df2)[!(names(df2) %in% names(df1))] # names in df2 that are not in df1
varList <- c(varList, "id") # append the join key

combined <- df1 %>% left_join(df2 %>% select(all_of(varList)), by = "id")

The combined data set will not contain any .x or .y columns.

rjzwgtxy4#

I think this is the simplest way to achieve what you are trying to do:

df <- left_join(df1, df2, by = "id", suffix = c("", ".annoying_duplicate_column")) %>%
  select(-ends_with(".annoying_duplicate_column"))

(Combines @Ernest Han's answer with the very helpful comment from @David T above.)

v2g6jxz65#

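By "there is too many of columns like a", do you mean that you want to find all columns common to both sources? In that case, why not simply join on the intersection, which is the default behaviour?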

## two data.frames, only id = 3, a = 4 matches
(df1 <- data.frame(id = 1:3, a = 2:4, b = 3:5))
#>   id a b
#> 1  1 2 3
#> 2  2 3 4
#> 3  3 4 5
(df2 <- data.frame(id = 3:2, a = 4:5, c = 1:2))
#>   id a c
#> 1  3 4 1
#> 2  2 5 2

## this produces a.x and a.y                  
dplyr::left_join(df1, df2, by = "id")
#>   id a.x b a.y  c
#> 1  1   2 3  NA NA
#> 2  2   3 4   5  2
#> 3  3   4 5   4  1

## which columns are common?
intersect(names(df1), names(df2))
#> [1] "id" "a"

## this produces id, a, b, c
dplyr::left_join(df1, df2, by = intersect(names(df1), names(df2)))
#>   id a b  c
#> 1  1 2 3 NA
#> 2  2 3 4 NA
#> 3  3 4 5  1

## this is, however, the default behaviour for left_join
## i.e. use all columns which are present in both
dplyr::left_join(df1, df2)
#> Joining, by = c("id", "a")
#>   id a b  c
#> 1  1 2 3 NA
#> 2  2 3 4 NA
#> 3  3 4 5  1

Created on 2020-05-06 by the reprex package (v0.3.0)

svdrlsy46#

A simple solution might be:

df <- dplyr::inner_join(
   df1,
   dplyr::select(df2, -any_of(names(df1)), id),
   by = "id"
)

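names(df1) creates a vector of all the column names in df1.
any_of() is a tidyselect helper that selects whichever of those names exist as columns in df2 (see https://tidyselect.r-lib.org/reference/all_of.html).
The "-" in front of any_of() means those columns are dropped.

In other words, it removes from df2 the columns that already exist in df1.

Adding the primary key after any_of() keeps the "id" variable in df2.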
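A small sketch of this approach on toy data (the values are invented for illustration). The answer above uses inner_join, which discards df1 rows that have no match in df2; since the question asked for a left join, this sketch swaps in left_join as an assumption about what is wanted.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
df1 <- data.frame(id = 1:3, a = 2:4, b = 3:5)
df2 <- data.frame(id = c(3L, 2L), a = 4:5, c = 1:2)

# Keep only the df2 columns that df1 lacks, then add the key back
df2_new <- select(df2, -any_of(names(df1)), id)

# left_join keeps every row of df1, matched or not
combined <- left_join(df1, df2_new, by = "id")

names(combined)
```

Because any_of() ignores names that are absent from df2 (here "b"), no error is raised for columns that exist only in df1.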
