R语言 跨列Map以将两个数据框full_join在一起

qzlgjiam  于 2023-01-18  发布在  其他
关注(0)|答案(1)|浏览(100)

我有2个 Dataframe ,如下所示:

# A tibble: 52 × 2
   provincia     mean_price
   <chr>              <dbl>
 1 A Coruña         137624.
 2 Albacete           2115.
 3 Alicante          31798.

   col1                   col2                   col3       col4
1    04                Almería                Almería       <NA>
2    11                  Cádiz                  Cádiz       <NA>
3    14                Córdoba                Córdoba       <NA>

我尝试使用full_join将它们连接在一起,但由于拼写不同,并非所有名称都匹配正确。我可以运行以下命令(带有相应的输出):

df1 %>% 
  full_join(df2, by = c("provincia" = "col2"))

# A tibble: 60 × 5
   provincia     mean_price col1  col3      col4 
   <chr>              <dbl> <chr> <chr>     <chr>
 1 A Coruña         137624. NA    NA        NA   
 2 Albacete           2115. 02    Albacete  NA   
 3 Alicante          31798. NA    NA        NA   
 4 Almería         -210404. 04    Almería   NA

因此,它能够匹配许多名字,但在少数问题上失败了。在另一篇专栏文章中,我有不同的拼写变体:

df1 %>% 
  full_join(df2, by = c("provincia" = "col3")) %>% 
  drop_na(col4)

  provincia  mean_price col1  col2       col4      
  <chr>           <dbl> <chr> <chr>      <chr>     
1 Lugo         -134055. 27    Lugo       Lugo      
2 Pontevedra   -374778. 36    Pontevedra Pontevedra
3 La Coruña         NA  15    Coruña, A  A Coruña  
4 Orense            NA  32    Ourense    Ourense

因此,我有2个 Dataframe 要连接,在df1中,列不变,始终为provincias-在df2中,如果名称与col1中的名称不匹配,则尝试col2,然后尝试col3,依此类推,直到所有列都已应用,以尝试full_join数据。
一些问题示例如下:
df1中,我们有

A Coruña
Alicante
Castellón

df2中我们有:

col2 = Coruña, A
col3 = La Coruña
col4 = A Coruña  # joined up using this

以及

col2 = Alicante/Alacant
col3 = Alicante  # joined up using this
col4 = NA

以及

col2 = Castellón/Castelló
col3 = Castellón  # joined up using this
col4 = NA

预期产出:尝试创建一个函数来Map/循环,并尝试将数据联接在一起,如果前一列联接不成功,则沿着列移动。
数据

df1 = structure(list(provincia = c("A Coruña", "Albacete", "Alicante", 
"Almería", "Araba - Álava", "Asturias", "Ávila", "Badajoz", 
"Barcelona", "Bizkaia", "Burgos", "Cáceres", "Cádiz", "Cantabria", 
"Castellón", "Ceuta", "Ciudad Real", "Córdoba", "Cuenca", "Gipuzkoa", 
"Girona", "Granada", "Guadalajara", "Huelva", "Huesca", "Illes Balears", 
"Jaén", "La Rioja", "Las Palmas", "León", "Lleida", "Lugo", 
"Madrid", "Málaga", "Melilla", "Murcia", "Navarra", "Ourense", 
"Palencia", "Pontevedra", "Salamanca", "Santa Cruz de Tenerife", 
"Segovia", "Sevilla", "Soria", "Tarragona", "Teruel", "Toledo", 
"Valencia", "Valladolid", "Zamora", "Zaragoza"), mean_price = c(137624.198693295, 
2114.81534770306, 31797.6042278187, -210404.350690943, 677319.793678721, 
-105607.758762033, -149924.778464006, -2097656.49617517, 128038.251992261, 
-378733.392057684, 191990.956696032, 695.564248280345, 1428676.05256214, 
3153.68145681899, 718682.972504407, -190874.713744811, 106101.705786415, 
338330.550863887, 13849.7490258121, 87318.2683658272, 118338.163156024, 
162218.313332224, -851960.092180063, 221206.432450402, 90652.890697791, 
-8181130.3432743, 44727.9976028575, -1573959.50553556, -497997.406878741, 
927365.585585463, 145165.81648469, -134054.92318867, 987306.447344823, 
150960.003949561, -517183.316043251, -267412.822954099, -414282.66968961, 
1013507.94117714, 589115.889523749, -374778.481273493, 108896.549765545, 
-9019.89307188885, -52818.219435273, -217912.286950506, 6030463.0529272, 
361426.680584978, -1069517.98520776, 135542.864155717, -236847.461222156, 
156897.290850635, -80528.5521733855, 333167.648531738)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -52L))


df2 = structure(list(col1 = c("04", "11", "14", "18", "21", "23", "29", 
"41", "22", "44", "50", "33", "07", "35", "38", "39", "05", "09", 
"24", "34", "37", "40", "42", "47", "49", "02", "13", "16", "19", 
"45", "08", "17", "25", "43", "03", "12", "46", "06", "10", "15", 
"27", "32", "36", "28", "30", "31", "01", "20", "48", "26", "51", 
"52"), col2 = c("Almería", "Cádiz", "Córdoba", "Granada", 
"Huelva", "Jaén", "Málaga", "Sevilla", "Huesca", "Teruel", 
"Zaragoza", "Asturias", "Balears, Illes", "Palmas, Las", "Santa Cruz de Tenerife", 
"Cantabria", "Ávila", "Burgos", "León", "Palencia", "Salamanca", 
"Segovia", "Soria", "Valladolid", "Zamora", "Albacete", "Ciudad Real", 
"Cuenca", "Guadalajara", "Toledo", "Barcelona", "Girona", "Lleida", 
"Tarragona", "Alicante/Alacant", "Castellón/Castelló", "Valencia/València", 
"Badajoz", "Cáceres", "Coruña, A", "Lugo", "Ourense", "Pontevedra", 
"Madrid", "Murcia", "Navarra", "Araba/Álava", "Gipuzkoa", "Bizkaia", 
"Rioja, La", "Ceuta", "Melilla"), col3 = c("Almería", "Cádiz", 
"Córdoba", "Granada", "Huelva", "Jaén", "Málaga", "Sevilla", 
"Huesca", "Teruel", "Zaragoza", "Asturias", "Baleares", "Las Palmas", 
"Santa Cruz de Tenerife", "Cantabria", "Ávila", "Burgos", "León", 
"Palencia", "Salamanca", "Segovia", "Soria", "Valladolid", "Zamora", 
"Albacete", "Ciudad Real", "Cuenca", "Guadalajara", "Toledo", 
"Barcelona", "Gerona", "Lérida", "Tarragona", "Alicante", "Castellón", 
"Valencia", "Badajoz", "Cáceres", "La Coruña", "Lugo", "Orense", 
"Pontevedra", "Madrid", "Murcia", "Navarra", "Álava", NA, NA, 
"La Rioja", "Ceuta", "Melilla"), col4 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, "A Coruña", "Lugo", "Ourense", "Pontevedra", NA, NA, 
NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-52L))
n9vozmp4

n9vozmp41#

我们可以使用fuzzyjoin中的regex_full_join

library(fuzzyjoin)
regex_full_join(df1, df2, by = c("provincia" = "col3"))
  • 输出
# A tibble: 59 × 6
   provincia     mean_price col1  col2             col3      col4 
   <chr>              <dbl> <chr> <chr>            <chr>     <chr>
 1 Albacete           2115. 02    Albacete         Albacete  <NA> 
 2 Alicante          31798. 03    Alicante/Alacant Alicante  <NA> 
 3 Almería         -210404. 04    Almería          Almería   <NA> 
 4 Araba - Álava    677320. 01    Araba/Álava      Álava     <NA> 
 5 Asturias        -105608. 33    Asturias         Asturias  <NA> 
 6 Ávila           -149925. 05    Ávila            Ávila     <NA> 
 7 Badajoz        -2097656. 06    Badajoz          Badajoz   <NA> 
 8 Barcelona        128038. 08    Barcelona        Barcelona <NA> 
 9 Burgos           191991. 09    Burgos           Burgos    <NA> 
10 Cáceres             696. 10    Cáceres          Cáceres   <NA> 
# … with 49 more rows

如果我们想按顺序对col2、col3、col4中的每一个进行连接,下面是一个使用powerjoin的方法

library(powerjoin)
library(purrr)
out <- map(names(df2)[-1],
   ~ regex_left_join(df1, df2, by = c("provincia" = .x))) %>% 
  reduce(power_left_join, by = c("provincia", "mean_price"), 
    conflict = coalesce_xy)
  • 输出
> as.data.frame(out)
                provincia    mean_price col1                   col2                   col3       col4
1                A Coruña   137624.1987   15              Coruña, A              La Coruña   A Coruña
2                Albacete     2114.8153   02               Albacete               Albacete       <NA>
3                Alicante    31797.6042   03       Alicante/Alacant               Alicante       <NA>
4                 Almería  -210404.3507   04                Almería                Almería       <NA>
5           Araba - Álava   677319.7937   01            Araba/Álava                  Álava       <NA>
6                Asturias  -105607.7588   33               Asturias               Asturias       <NA>
7                   Ávila  -149924.7785   05                  Ávila                  Ávila       <NA>
8                 Badajoz -2097656.4962   06                Badajoz                Badajoz       <NA>
9               Barcelona   128038.2520   08              Barcelona              Barcelona       <NA>
10                Bizkaia  -378733.3921   48                Bizkaia                   <NA>       <NA>
11                 Burgos   191990.9567   09                 Burgos                 Burgos       <NA>
12                Cáceres      695.5642   10                Cáceres                Cáceres       <NA>
13                  Cádiz  1428676.0526   11                  Cádiz                  Cádiz       <NA>
14              Cantabria     3153.6815   39              Cantabria              Cantabria       <NA>
15              Castellón   718682.9725   12     Castellón/Castelló              Castellón       <NA>
16                  Ceuta  -190874.7137   51                  Ceuta                  Ceuta       <NA>
17            Ciudad Real   106101.7058   13            Ciudad Real            Ciudad Real       <NA>
18                Córdoba   338330.5509   14                Córdoba                Córdoba       <NA>
19                 Cuenca    13849.7490   16                 Cuenca                 Cuenca       <NA>
20               Gipuzkoa    87318.2684   20               Gipuzkoa                   <NA>       <NA>
21                 Girona   118338.1632   17                 Girona                 Gerona       <NA>
22                Granada   162218.3133   18                Granada                Granada       <NA>
23            Guadalajara  -851960.0922   19            Guadalajara            Guadalajara       <NA>
24                 Huelva   221206.4325   21                 Huelva                 Huelva       <NA>
25                 Huesca    90652.8907   22                 Huesca                 Huesca       <NA>
26          Illes Balears -8181130.3433 <NA>                   <NA>                   <NA>       <NA>
27                   Jaén    44727.9976   23                   Jaén                   Jaén       <NA>
28               La Rioja -1573959.5055   26              Rioja, La               La Rioja       <NA>
29             Las Palmas  -497997.4069   35            Palmas, Las             Las Palmas       <NA>
30                   León   927365.5856   24                   León                   León       <NA>
31                 Lleida   145165.8165   25                 Lleida                 Lérida       <NA>
32                   Lugo  -134054.9232   27                   Lugo                   Lugo       Lugo
33                 Madrid   987306.4473   28                 Madrid                 Madrid       <NA>
34                 Málaga   150960.0039   29                 Málaga                 Málaga       <NA>
35                Melilla  -517183.3160   52                Melilla                Melilla       <NA>
36                 Murcia  -267412.8230   30                 Murcia                 Murcia       <NA>
37                Navarra  -414282.6697   31                Navarra                Navarra       <NA>
38                Ourense  1013507.9412   32                Ourense                 Orense    Ourense
39               Palencia   589115.8895   34               Palencia               Palencia       <NA>
40             Pontevedra  -374778.4813   36             Pontevedra             Pontevedra Pontevedra
41              Salamanca   108896.5498   37              Salamanca              Salamanca       <NA>
42 Santa Cruz de Tenerife    -9019.8931   38 Santa Cruz de Tenerife Santa Cruz de Tenerife       <NA>
43                Segovia   -52818.2194   40                Segovia                Segovia       <NA>
44                Sevilla  -217912.2870   41                Sevilla                Sevilla       <NA>
45                  Soria  6030463.0529   42                  Soria                  Soria       <NA>
46              Tarragona   361426.6806   43              Tarragona              Tarragona       <NA>
47                 Teruel -1069517.9852   44                 Teruel                 Teruel       <NA>
48                 Toledo   135542.8642   45                 Toledo                 Toledo       <NA>
49               Valencia  -236847.4612   46      Valencia/València               Valencia       <NA>
50             Valladolid   156897.2909   47             Valladolid             Valladolid       <NA>
51                 Zamora   -80528.5522   49                 Zamora                 Zamora       <NA>
52               Zaragoza   333167.6485   50               Zaragoza               Zaragoza       <NA>

相关问题