R: Replace mismatched strings with the same, correct name

wgx48brx  posted 12 months ago in Other

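The problem I am working on involves a species list received from different sources. I have copied a small part of the original list to illustrate it. The first column contains the species names; names for the same species are not identical because, as mentioned, they come from different sources. For example, in the first three rows the correct name is the one at position C2 (Arabidopsis.thaliana), and the names at positions C1 and C3 should be corrected to match it. Likewise, the next correct names are the ones at positions D1, F3, and G1, and the other names within the same species must be corrected to match the correct name. Thanks in advance for your suggestions. Part of my data: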

dat <- data.frame(
  Species = c("A. thaliana", "Arabidopsis.thaliana", "ARABIDOPSIS Thaliana",
              "Pisum.sativum", "P. Sativum", "PISUM SATIVUM",
              "B. Vulgaris", "BETA VULGARIS", "Beta.vulgaris",
              "Secale.cereale", "S. CEREALE", "SECALE CEREALE"),
  Position = c("C1", "C2", "C3", "D1", "D2", "D3",
               "F1", "F2", "F3", "G1", "G2", "G3")
)

dat
                Species Position
1           A. thaliana       C1
2  Arabidopsis.thaliana       C2
3  ARABIDOPSIS Thaliana       C3
4         Pisum.sativum       D1
5            P. Sativum       D2
6         PISUM SATIVUM       D3
7           B. Vulgaris       F1
8         BETA VULGARIS       F2
9         Beta.vulgaris       F3
10       Secale.cereale       G1
11           S. CEREALE       G2
12       SECALE CEREALE       G3


Answer #1 (ttp71kqs)

I think this may help:

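library(dplyr)
library(stringr)
library(tidyr)

dat %>%
  mutate(
    # Keep only names already in the "Genus.species" form; all others become NA
    Species2 = str_extract(Species, "^[A-Z][a-z]+\\.[a-z]+$"),
    # Grouping key: the first non-digit character, i.e. the genus initial.
    # This only works while no two species in the list share an initial.
    P = str_extract(Species, "\\D")
  ) %>%
  group_by(P) %>%
  # Within each group, copy the one non-NA name to the rows above and below it
  fill(Species2, .direction = "downup") %>%
  ungroup() %>%
  select(-P)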
# A tibble: 12 × 3
   Species              Position Species2            
   <chr>                <chr>    <chr>               
 1 A. thaliana          C1       Arabidopsis.thaliana
 2 Arabidopsis.thaliana C2       Arabidopsis.thaliana
 3 ARABIDOPSIS Thaliana C3       Arabidopsis.thaliana
 4 Pisum.sativum        D1       Pisum.sativum       
 5 P. Sativum           D2       Pisum.sativum       
 6 PISUM SATIVUM        D3       Pisum.sativum       
 7 B. Vulgaris          F1       Beta.vulgaris       
 8 BETA VULGARIS        F2       Beta.vulgaris       
 9 Beta.vulgaris        F3       Beta.vulgaris       
10 Secale.cereale       G1       Secale.cereale      
11 S. CEREALE           G2       Secale.cereale      
12 SECALE CEREALE       G3       Secale.cereale
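
Grouping on the first letter alone would merge two species that share a genus initial. One possible way around that, shown here only as a sketch, is to build the key from the genus initial plus the lower-cased species epithet and look the correct name up by that key. The helper `make_key` and the two "Poa" rows are hypothetical additions for illustration, and the sketch assumes each species has exactly one entry already written as "Genus.species".

```r
# Example rows; the last two are hypothetical and share the initial "P"
# with Pisum to show why the initial alone is not a safe key.
dat <- data.frame(
  Species = c("A. thaliana", "Arabidopsis.thaliana", "ARABIDOPSIS Thaliana",
              "Pisum.sativum", "P. Sativum", "PISUM SATIVUM",
              "Poa.annua", "P. ANNUA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: genus initial + lower-cased epithet, e.g. "A_thaliana".
# Assumes the epithet is whatever follows the last space or dot.
make_key <- function(x) {
  initial <- toupper(substr(x, 1, 1))
  epithet <- tolower(sub("^.*[ .]", "", x))
  paste(initial, epithet, sep = "_")
}

dat$Key <- make_key(dat$Species)

# Rows that are already in the "Genus.species" form
is_canonical <- grepl("^[A-Z][a-z]+\\.[a-z]+$", dat$Species)

# Named vector: key -> correct name, then look every row up by its key
lookup <- setNames(dat$Species[is_canonical], dat$Key[is_canonical])
dat$Species2 <- unname(lookup[dat$Key])

dat
```

Rows whose key has no "Genus.species" entry in the data would get `NA` in `Species2`, which makes them easy to spot and fix by hand.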


Answer #2 (u2nhd7ah)

I'm not sure this is 100% correct, but you could try:

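library(tidyverse)
df <- tibble(Species = c("A. thaliana", "Arabidopsis.thaliana", "ARABIDOPSIS Thaliana"))

df %>%
  mutate(
    # First letter of the name, upper-cased (the genus initial)
    First = str_sub(Species, 1, 1) %>% toupper(),
    # Last piece after splitting on a space or a dot, lower-cased (the epithet)
    Second = str_split(Species, " |[.]") %>% map_chr(tail, 1) %>% tolower()
  ) %>%
  # Combine both pieces into one ID, joined by "_"
  unite(SpeciesID, First, Second)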
# A tibble: 3 × 2
  Species              SpeciesID 
  <chr>                <chr>     
1 A. thaliana          A_thaliana
2 Arabidopsis.thaliana A_thaliana
3 ARABIDOPSIS Thaliana A_thaliana
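
The ID above groups the variants but does not yet give the correct name. A possible next step, sketched here under stated assumptions, is to join the IDs against a small reference table of accepted names. The `reference` table and the helper `species_id` are hypothetical and not part of the original post; in practice the reference list would come from whichever source you trust.

```r
library(dplyr)

# Full example data from the question
dat <- data.frame(
  Species = c("A. thaliana", "Arabidopsis.thaliana", "ARABIDOPSIS Thaliana",
              "Pisum.sativum", "P. Sativum", "PISUM SATIVUM",
              "B. Vulgaris", "BETA VULGARIS", "Beta.vulgaris",
              "Secale.cereale", "S. CEREALE", "SECALE CEREALE"),
  Position = c("C1", "C2", "C3", "D1", "D2", "D3",
               "F1", "F2", "F3", "G1", "G2", "G3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: same idea as SpeciesID above
# (genus initial + "_" + lower-cased epithet)
species_id <- function(x) {
  paste(toupper(substr(x, 1, 1)), tolower(sub("^.*[ .]", "", x)), sep = "_")
}

# Assumed reference list of accepted names, keyed by the same ID
reference <- data.frame(
  Correct = c("Arabidopsis.thaliana", "Pisum.sativum",
              "Beta.vulgaris", "Secale.cereale"),
  stringsAsFactors = FALSE
) %>%
  mutate(SpeciesID = species_id(Correct))

# Attach the accepted name to every row via the shared ID
result <- dat %>%
  mutate(SpeciesID = species_id(Species)) %>%
  left_join(reference, by = "SpeciesID") %>%
  select(Species, Position, Correct)

result
```

Because this is a left join, a name with no match in the reference table keeps its row and gets `NA` in `Correct` instead of being dropped.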

