Extracting the lines between two partial string matches in R (excluding the matched lines)

gzjq41n4, posted 2022-12-05 in Other

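I have a file that consists essentially of lines of strings. I am trying to extract the lines that sit between certain marker lines and write them to separate files. The file looks like this: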

**File Begins**

"Name: XXX_2" 
"Description:  Object 1210 , 111"
"Sampling_info: statexy=1346"
"Num value: 15"
"32 707; 33 71; 37 11; 38 3; 40 146; " 
"41 64; 42 36; 43 24; 44 69; 45 324; " 
"46 49; 47 52; 50 11; 51 90; 52 22; " 
"Name: XXX_3" 
"Description:  Object 1341 , 111"
"Sampling_info: statexy=1346"
"Num value: 18"
"32 999; 33 4; 34 17; 39 84; 41 84; " 
"42 4; 44 137; 45 102; 50 13; 52 22; " 
"53 4; 54 4; 55 84; 58 40; 59 13; "
"65 57; 66 13; 67 173; " 
"Name: XXX_4" 
"Description:  Object 1561 , 111"
"Sampling_info: statexy=1346"
"Num value: 21"
"32 925; 34 5; 40 409; 41 55; 44 43; "   
"45 154; 46 5; 47 5; 50 38; 52 16; "  
"56 99; 58 5; 59 110; 61 5; 62 55; " 
"63 11; 68 5; 69 38; 70 22; 73 999; " 
"74 49; "
"Name: XXX_5" 

**And then the next entry begins**

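I want to get the lines between "Num value: 15" and "Name: XXX_3", excluding those two lines themselves, and put them into their own text file. The same goes for the next two entries. This would be done in a for loop or some other loop, so that every individual entry in the file is extracted into its own file.
I tried str_match, but it returns NA: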

str_match(data, "Name: UNK_1\\s*(.*?)\\s*Name: UNK_2")

I also tried gsub, but it returns the entire file:

gsub(".*Name: UNK_1 (.+) Name: UNK_2.*", "\\1", data)

Is there something wrong with the way I am using str_match and gsub?
Thanks in advance!


41ik7eoe #1

How about something like this:

library(tidyverse)
# Build dataset
df <- data.frame(
  col1 = c("Name: XXX_2" ,
           "Description:  Object 1210 , 111",
           "Sampling_info: statexy=1346",
           "Num value: 15",
           "32 707; 33 71; 37 11; 38 3; 40 146; " ,
           "41 64; 42 36; 43 24; 44 69; 45 324; " ,
           "46 49; 47 52; 50 11; 51 90; 52 22; " ,
           "Name: XXX_3" ,
           "Shouldn't get this number: 8675309")
)

df %>%
  # Combine row into single string
  map_chr(paste, collapse = " ") %>%
  # Remove everything before "Num value:"
  str_extract(" Num value:.*") %>%
  # Remove "Num value:" and the count that follows it
  str_remove("Num value: \\d+") %>%
  # Remove everything after "Name:"
  str_extract(" .*Name:") %>%
  # Extract digits
  str_extract_all("\\d+") %>%
  unlist() %>%
  as.numeric()

#[1]  32 707  33  71  37  11  38   3  40 146  41  64  42  36  43  24  44  69  45
#[20] 324  46  49  47  52  50  11  51  90  52  22
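The collapse step in the pipeline above matters because regex functions in R are applied to each element of a character vector separately, so a pattern that spans several lines cannot match while the lines are still separate elements. That is one likely reason for the NA in the question. A minimal sketch of the effect, written with base R so it runs without extra packages; the `lines` vector and `pattern` are illustrative assumptions, not the asker's actual data:

```r
# Illustrative input: one entry, already read in as one string per line
lines <- c("Name: XXX_2", "Num value: 15",
           "32 707; 33 71;", "41 64;", "Name: XXX_3")

# Lazy capture of everything between the "Num value" line and the next "Name:"
pattern <- "Num value: \\d+\\s*(.*?)\\s*Name:"

# Element by element, no single line contains both markers, so nothing matches
per_line_hit <- grepl(pattern, lines, perl = TRUE)

# After collapsing to one string, the same pattern matches
one_string <- paste(lines, collapse = " ")
m <- regmatches(one_string, regexec(pattern, one_string, perl = TRUE))
between <- m[[1]][2]  # first capture group
between
```

Collapsing with a space rather than a newline also sidesteps the question of whether `.` matches line breaks in a given regex engine.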

3phpmpom #2

A loop-free approach:

library(dplyr)
library(tidyr)

df <- read.delim('path_to_input_file/your_file.txt',
                 sep = ':', header = FALSE)

df %>%
    separate(V1, into = c('param', 'value'), sep = ' *: *') %>%
    filter(param == 'Name' | grepl(';', param)) %>%
    fill(value, .direction = 'down') %>%
    filter(param != 'Name') %>%
    separate_rows(param, sep = ' *; *')

## follow up with blank removal, conversion to numeric as needed

Output (the value column holds the name taken from the preceding Name: XXX line):

# A tibble: 18 x 2
   param    value   
   <chr>    <chr>   
 1 "32 707" "XXX_2 "
 2 "33 71"  "XXX_2 "
 3 "37 11"  "XXX_2 "
 4 "38 3"   "XXX_2 "
 5 "40 146" "XXX_2 "
 6 ""       "XXX_2 "

You may want to break the pipeline above into pieces and inspect the intermediate data frames to see what happens at each step.
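The core idea of the pipeline, labelling every data line with the most recent Name line, can also be sketched in base R, which may make the "fill down" step easier to inspect. This is only an illustration under assumptions: the `lines` vector is made up, and it assumes the first line is a Name line.

```r
# Illustrative input: two short entries
lines <- c("Name: XXX_2", "Num value: 15", "32 707; 33 71;",
           "Name: XXX_3", "Num value: 18", "32 999;", "42 4;")

is_name  <- grepl("^Name:", lines)
# Running count of Name lines = index of the entry each line belongs to
entry_id <- cumsum(is_name)
# Look up the entry name for every line (the base R analogue of filling down)
entry_name <- sub("^Name:\\s*", "", lines[is_name])[entry_id]

# Keep only the data lines (those containing ';') and group them by entry
data_rows <- grepl(";", lines)
by_entry  <- split(lines[data_rows], entry_name[data_rows])
by_entry
```

Printing `entry_id` and `entry_name` next to `lines` shows which step assigns what.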


htrmnn0y #3

With base R and a for loop, plus various notes.

library(stringr)
# msp_list <- scan(file='', what = character()) #paste in 1:24 above <return>
# dput(msp_list)
msp_list <- c("Name: XXX_2", "Description:  Object 1210 , 111", "Sampling_info: statexy=1346", 
"Num value: 15", "32 707; 33 71; 37 11; 38 3; 40 146; ", "41 64; 42 36; 43 24; 44 69; 45 324; ", 
"46 49; 47 52; 50 11; 51 90; 52 22; ", "Name: XXX_3", "Description:  Object 1341 , 111", 
"Sampling_info: statexy=1346", "Num value: 18", "32 999; 33 4; 34 17; 39 84; 41 84; ", 
"42 4; 44 137; 45 102; 50 13; 52 22; ", "53 4; 54 4; 55 84; 58 40; 59 13; ", 
"65 57; 66 13; 67 173; ", "Name: XXX_4", "Description:  Object 1561 , 111", 
"Sampling_info: statexy=1346", "Num value: 21", "32 925; 34 5; 40 409; 41 55; 44 43; ", 
"45 154; 46 5; 47 5; 50 38; 52 16; ", "56 99; 58 5; 59 110; 61 5; 62 55; ", 
"63 11; 68 5; 69 38; 70 22; 73 999; ", "74 49; ")
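# trimmed copy with trailing whitespace removed; note that it is stored as msp_lst
# and the code below keeps using the untrimmed msp_list, whose trailing spaces
# separate the values when the lines are pasted together later
msp_lst <- trimws(msp_list, 'r')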

Index msp_list to find the start and end of each future sub-data-frame. I am assuming your .msp file is properly formed, which is to say all entries are complete (I've wished away your line 25 above).

msp_name_rle <- rle(str_starts(msp_list, 'Name'))$lengths
msp_rle_mtx <- matrix(msp_name_rle, nrow = length(msp_name_rle) / 2, ncol = 2, byrow = TRUE)
msp_rowsums <- matrix(rowSums(msp_rle_mtx), ncol = 1)
starts <- which(str_starts(msp_list, 'Name'))
starts
[1]  1  8 16
ends <- as.vector(starts + msp_rowsums - 1)
ends
[1]  7 15 24

# and then get the names, as they will be useful later
msp_names <- trimws(str_extract(msp_list[which(str_starts(msp_list, 'Name'))], '\\s\\w+'))
# a similar extract could be done on ? whatever is informative and applied to attributes later (not done here)
# initialize an object to receive output from the `for` loop
many_msp <- list()
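The start/end bookkeeping above can also be sketched without rle: each entry ends one line before the next Name line, and the last entry ends at the last line. This is an alternative illustration, not the answer's method; the vector `x` is made up, and it assumes, as the answer does, that the file starts with a Name line and every entry is complete.

```r
# Illustrative input: two short entries
x <- c("Name: A", "Num value: 2", "1 2; ",
       "Name: B", "Num value: 3", "3 4; ", "5 6; ")

# Positions of the lines that open an entry
starts <- which(startsWith(x, "Name"))
# Each entry ends just before the next one starts; the last ends at the last line
ends <- c(starts[-1] - 1L, length(x))

starts
ends
```

The resulting `starts` and `ends` play the same role as the ones computed with rle.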

At this point the global environment holds everything the for loop needs, so it won't complain that some value isn't found. Things are also detailed enough to operate on a single index (i.e. no nested i, j), or at least I hope so. We'll do a bunch of data cleaning here and return a list of extracted .msp values, each as a two-column data frame (this relies on the regularity of the .msp file format).

# first checking that the indexing is working
for(i in seq_along(starts)) {
  many_msp[[i]] <- msp_list[starts[i]:ends[i]]
}
many_msp[[3]]
[[3]]
[1] "Name: XXX_4"                         
[2] "Description:  Object 1561 , 111"     
[3] "Sampling_info: statexy=1346"         
[4] "Num value: 21"                       
[5] "32 925; 34 5; 40 409; 41 55; 44 43; "
[6] "45 154; 46 5; 47 5; 50 38; 52 16; "  
[7] "56 99; 58 5; 59 110; 61 5; 62 55; "  
[8] "63 11; 68 5; 69 38; 70 22; 73 999; " 
[9] "74 49; "                             
# OK. Now, we can either make another `for`, or extend what happens within this one.

Extending:

for(i in seq_along(starts)) {
  many_msp[[i]] <- msp_list[starts[i]:ends[i]]
  # return only the value lines (drop the four header lines)
  many_msp[[i]] <- many_msp[[i]][5:lengths(many_msp)[i]]
  # take to a numeric vector, after a bunch of tidying up
  many_msp[[i]] <- as.numeric(strsplit(trimws(paste(gsub(';', '', many_msp[[i]]), collapse = ''), 'r'), ' ')[[1]])
  # take to data.frame: odd positions go to col1, even positions to col2
  many_msp[[i]] <- data.frame(col1 = many_msp[[i]][seq(1, length(many_msp[[i]]), 2)],
                              col2 = many_msp[[i]][seq(2, length(many_msp[[i]]), 2)])
  # name the data.frames
  names(many_msp)[i] <- msp_names[[i]]
}

names(many_msp)
[1] "XXX_2" "XXX_3" "XXX_4"

many_msp$XXX_4
   col1 col2
1    32  925
2    34    5
3    40  409
4    41   55
5    44   43
6    45  154
7    46    5
8    47    5
9    50   38
10   52   16
11   56   99
12   58    5
13   59  110
14   61    5
15   62   55
16   63   11
17   68    5
18   69   38
19   70   22
20   73  999
21   74   49
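The one-line tidy-up inside the loop depends on every value line ending in a single trailing space, since that space is what separates the last number of one line from the first number of the next when the lines are pasted together. A sketch of a variant that splits on `;` first and so does not depend on that whitespace; `value_lines` is an illustrative assumption:

```r
# Illustrative value lines from one entry
value_lines <- c("32 925; 34 5; ", "40 409; ")

# Split into "position value" pairs at each ';' and drop empty leftovers
pairs <- trimws(unlist(strsplit(value_lines, ";")))
pairs <- pairs[nzchar(pairs)]

# Split each pair on whitespace and lay the numbers out two per row
nums <- as.numeric(unlist(strsplit(pairs, "\\s+")))
m <- matrix(nums, ncol = 2, byrow = TRUE)
entry_df <- data.frame(col1 = m[, 1], col2 = m[, 2])
entry_df
```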

So it can be done with a for loop. Accessing elements of this list may be a little less obvious when reaching into the col1 and col2 values, since you write

many_msp$XXX_4$col1
 [1] 32 34 40 41 44 45 46 47 50 52 56 58 59 61 62 63 68 69 70 73 74

which may look unexpected at first.
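The question asked for each entry to end up in its own text file. Given a named list of data frames like many_msp, one way to do that is sketched below. The small list, the output directory under tempdir(), and the .txt naming are all assumptions made for illustration.

```r
# Illustrative stand-in for the many_msp list built above
many_msp <- list(
  XXX_2 = data.frame(col1 = c(32, 33), col2 = c(707, 71)),
  XXX_3 = data.frame(col1 = c(32, 33), col2 = c(999, 4))
)

# Hypothetical output location: a folder inside the session's temp directory
out_dir <- file.path(tempdir(), "msp_entries")
dir.create(out_dir, showWarnings = FALSE)

# One file per entry, named after the entry
out_files <- file.path(out_dir, paste0(names(many_msp), ".txt"))
for (i in seq_along(many_msp)) {
  write.table(many_msp[[i]], out_files[i], row.names = FALSE, quote = FALSE)
}
out_files
```

Swapping write.table for writeLines on the raw character lines would keep the original "32 707; ..." layout instead of the two-column form.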
