R: Regex for preserving case pattern, capitalization

Posted by dldeef67 on 2023-06-19 in Other


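The bounty expires in 5 days. Answers to this question are eligible for a +200 reputation bounty. Ricardo Saporta is looking for a more detailed answer to this question: The intent behind this question is to expand knowledge of RegEx. The example given is simplified in order to illustrate the curiosity.

This question is looking for a regex-based approach, not, for example, a solution that merely accomplishes the task in the example.
There are many great functions and packages that can help with the specific task.

Main question:

Is there a regular expression for preserving case pattern, in the vein of \U and \L?

Ideally, it would also respect word boundaries and anchors.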
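For reference, here is a small sketch of what \U and \L do in R today. With perl = TRUE they upper-case or lower-case the backreferences that follow them in the replacement. The input string is only an illustration. Note that both force a fixed case rather than copying the case pattern of the match, which is the gap this question is about.

```r
x <- "This Date is a DATE that is daTe and dated."

# \\U upper-cases the backreference that follows it
upper <- gsub("\\b(date)\\b", "\\U\\1", x, ignore.case = TRUE, perl = TRUE)
upper
# [1] "This DATE is a DATE that is DATE and dated."

# \\L lower-cases the backreference that follows it
lower <- gsub("\\b(date)\\b", "\\L\\1", x, ignore.case = TRUE, perl = TRUE)
lower
# [1] "This date is a date that is date and dated."

# Switching from \\U to \\L mid-replacement gives title case
title <- gsub("\\b(\\w)(\\w*)", "\\U\\1\\L\\2", "hello WORLD", perl = TRUE)
title
# [1] "Hello World"
```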

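Example

Suppose we have a large body of text and want to convert one word to another while preserving the word's capitalization. For example, replace all instances of "date" with "month":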

Input: `"This Date is a DATE that is daTe and date."`
Output: `"This Month is a MONTH that is moNth and month."`

input     output
------     -------
"date" ~~> "month"
"Date" ~~> "Month"
"DATE" ~~> "MONTH"
"daTe" ~~> "moNth"   ## This example might be asking for too much.

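Preserving word boundaries

I am interested in a solution that preserves word boundaries (i.e., one that matches only "whole words"). In the given example, "date" would be changed, but "dated" would not.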
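As a quick illustration of the "whole word" requirement, the sketch below compares a \b-delimited pattern with an anchored one. The test vector is made up for the purpose.

```r
words <- c("date", "Date", "dated", "update", "a date.")

# \\b matches at the edge of a word, so "dated" and "update" are rejected
whole <- grepl("\\bdate\\b", words, ignore.case = TRUE, perl = TRUE)
whole
# [1]  TRUE  TRUE FALSE FALSE  TRUE

# Anchors require the entire string to be the word
anchored <- grepl("^date$", words, ignore.case = TRUE, perl = TRUE)
anchored
# [1]  TRUE  TRUE FALSE FALSE FALSE
```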

Existing solution in `R`:

I currently use three nested calls to sub to accomplish this.

input <- c("date", "Date", "DATE")
expected.out <- c("month", "Month", "MONTH")

sub("date", "month", 
  sub("Date", "Month", 
    sub("DATE", "MONTH", input)
  )
)

The goal is to have a single pattern and a single replacement, such as

gsub("(date)", "\\Umonth", input, perl=TRUE)

which would produce the desired output.

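Notes (2023 update)

1. The motivation behind this question is to expand knowledge of RegEx capabilities. The example is given only as an illustration. The aim of the question is not to find alternative workarounds.
2. The question is asked with the R tag, but answers that invoke RegEx flavors not currently available in R will be accepted.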

Source: https://stackoverflow.com/questions/26171318/regex-for-preserving-case-pattern-capitalization

4 Answers


pbwdgjma1#

This is one of those occasions when I think a for loop is justified:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

for(i in seq_along(pat)) { input <- gsub(pat[i],ret[i],input) }
input
#[1] "Here are a month, a Month, and a MONTH" 
#[2] "Here are a month, a Month, and a MONTH"

Alternatively, courtesy of @flodel, the same logic as the loop can be implemented with Reduce:

Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input)

For some benchmarking of these options, see @TylerRinker's answer.
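The patterns above match anywhere in the text, so "dated" would also be changed. A minimal sketch of the same Reduce idea with word boundaries added is shown below. The pat_whole name and the input string are illustrative and not part of the original answer.

```r
input <- "A date, a Date, and a DATE, but not dated or UPDATE."
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

# Wrap each pattern in \\b so only whole words are replaced
pat_whole <- paste0("\\b", pat, "\\b")

out <- Reduce(
  function(str, i) gsub(pat_whole[i], ret[i], str, perl = TRUE),
  seq_along(pat),
  init = input
)
out
# [1] "A month, a Month, and a MONTH, but not dated or UPDATE."
```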


nukf8bse2#

Using the gsubfn package, you can avoid nested sub calls and do this in a single call.

> library(gsubfn)
> x <- 'Here we have a date, a different Date, and a DATE'
> gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), x, ignore.case=T)
# [1] "Here we have a month, a different Month, and a MONTH"
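For comparison, a similar lookup-table replacement can be sketched in base R with gregexpr() and the replacement form of regmatches(). This is an alternative technique, not how gsubfn works internally, and the lookup name is illustrative. A casing that is missing from the table (such as "daTe") would come back as NA, so every expected variant has to be listed.

```r
x <- "Here we have a date, a different Date, and a DATE"
lookup <- c(date = "month", Date = "Month", DATE = "MONTH")

# Locate every whole-word match, regardless of case
m <- gregexpr("\\bdate\\b", x, ignore.case = TRUE, perl = TRUE)

# Swap each matched string for its entry in the lookup table
regmatches(x, m) <- lapply(
  regmatches(x, m),
  function(found) unname(lookup[found])
)
x
# [1] "Here we have a month, a different Month, and a MONTH"
```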


ttcibm8c3#

Here is a qdap approach. It is pretty straightforward, but not the fastest:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

library(qdap)
mgsub(pat, ret, input)

## [1] "Here are a month, a Month, and a MONTH"
## [2] "Here are a month, a Month, and a MONTH"

Benchmarking:

input <- rep("Here are a date, a Date, and a DATE",1000)

library(microbenchmark)

(op <- microbenchmark( 
    GSUBFN = gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), 
             input, ignore.case=T),
    QDAP = mgsub(pat, ret, input),
    REDUCE = Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input),
    # Define the loop as a function and call it, so the loop itself is timed
    FOR = (function() {
       for(i in seq_along(pat)) { 
          input <- gsub(pat[i],ret[i],input) 
       }
       input
    })(),

times=100L))

## Unit: milliseconds
##    expr        min         lq     median         uq        max neval
##  GSUBFN 682.549812 815.908385 847.361883 925.385557 1186.66743   100
##    QDAP  10.499195  12.217805  13.059149  13.912157   25.77868   100
##  REDUCE   4.267602   5.184986   5.482151   5.679251   28.57819   100
##     FOR   4.244743   5.148132   5.434801   5.870518   10.28833   100


n6lpvg4x4#

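You'll need to write some logic

You will not find a pure regex solution. Similar SO questions for C# and JavaScript contain a fair amount of logic flow to determine which characters are capitals. The Perl response linked in the comments only works if you know in advance which positions the capital letters can occupy.
Moreover, those questions have additional constraints that make them much simpler than yours:
1. The pattern and replacement are the same length.
2. Each character in the pattern has a unique replacement character, e.g. "abcd" => "wxyz".
As a response to a similar question on the Rust subreddit puts it:
There are lots of ways this can go wrong. For example, what happens if you try to replace with a different number of characters ("abc" -> "wxyz")? What if you have a mapping with multiple outputs ("aaa" -> "xyz")?
This is precisely what you are trying to do. When the pattern and replacement have different lengths, you generally want the index of each capital letter in the pattern to map to the same index in the replacement, e.g. "daTe" => "moNth". Sometimes, however, you do not: you want "DATE" => "MONTH" rather than "MONTh". How would a regex know?
Furthermore, the letters in the pattern or the replacement are not guaranteed to be unique: you want to be able to replace "WEEK" with "MONTH" and vice versa. This rules out a hash map approach.

R solution

I wrote a function, swap(), which does this for two strings, even when they have different numbers of letters: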

x <- "This Date is a DATE that is daTe and date."
swap("date", "month", x)
# [1] "This Month is a MONTH that is moNth and month."

How it works

The swap() function uses Reduce() in much the same way as the earlier Reduce()-based answer:

swap <- function(old, new, str, preserve_boundaries = TRUE) {
    l <- create_replacement_pairs(old, new, str, preserve_boundaries)
    Reduce(\(x, l) gsub(l[1], l[2], x, fixed = TRUE), l, init = str)
}

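The workhorse function is create_replacement_pairs(), which builds a list of the pattern variants that actually appear in the string, e.g. c("daTe", "DATE"), and generates replacements with the correct case, e.g. c("moNth", "MONTH"). The function logic is:
1. Find all matches in the string, e.g. "Date" "DATE" "daTe" "date".
2. Create a Boolean mask indicating whether each letter is a capital.
3. If all letters are capitals, the replacement should also be all capitals, e.g. "DATE" => "MONTH". Otherwise, capitalize the letter at each index of the replacement where the letter at the corresponding index of the pattern is a capital.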
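The mask-and-shift arithmetic in steps 2 and 3 can be tried on a single match. The sketch below assumes ASCII letters only: code points up to 90 ("Z") count as capitals, and subtracting 32 turns a lowercase letter into its uppercase counterpart. The variable names are illustrative.

```r
found <- "daTe"   # one match from the text
new <- "month"    # lowercase replacement

# Step 2: TRUE where the matched letter is a capital (code point <= 90)
capitals <- utf8ToInt(found) <= 90
capitals
# [1] FALSE FALSE  TRUE FALSE

# Step 3: shift by 32 where the mask is TRUE, padding with 0 because
# the replacement is one letter longer than the match
shift <- c(ifelse(capitals, 32, 0), rep(0, nchar(new) - nchar(found)))

result <- intToUtf8(utf8ToInt(new) - shift)
result
# [1] "moNth"
```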

create_replacement_pairs <- function(old = "date", new = "month", str, preserve_boundaries) {
    if (preserve_boundaries) {
        pattern <- paste0("\\b", old, "\\b")
    } else {
        pattern <- old
    }

    matches <- unlist(
        regmatches(str, gregexpr(pattern, str, ignore.case = TRUE))
    ) # e.g. "Date" "DATE" "daTe" "date"

    capital_shift <- lapply(matches, \(x) {
        out_length <- nchar(new)
        # Boolean mask if <= capital Z
        capitals <- utf8ToInt(x) <= 90

        # If e.g. DATE, replacement should be
        # MONTH and not MONTh
        if (all(capitals)) {
            shift <- rep(32, out_length)
        } else {
            # If not all capitals replace corresponding
            # index with capital e.g. daTe => moNth

            # Pad with lower case if replacement is longer
            length_diff <- max(out_length - nchar(old), 0)
            shift <- c(
                ifelse(capitals, 32, 0),
                rep(0, length_diff)
            )[1:out_length] # truncate if replacement shorter than pattern
        }
    })

    replacements <- lapply(capital_shift, \(x) {
        paste(vapply(
            utf8ToInt(new) - x,
            intToUtf8,
            character(1)
        ), collapse = "")
    })

    replacement_list <- Map(\(x, y) c(old = x, new = y), matches, replacements)

    replacement_list
}

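Use cases

This approach is not subject to the same constraints as the Rust and C# answers. We have already seen that it works when the replacement is longer than the pattern. The reverse also holds: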

swap("date", "day", x)
# [1] "This Day is a DAY that is daY and day."

Also, since it does not use a hash map, it works when the letters in the replacement are not unique:

swap("date", "week", x)
# [1] "This Week is a WEEK that is weEk and week."

Finally, it also works when the letters in the pattern are not unique:

swap("that", "which", x)
# [1] "This Date is a DATE which is daTe and date."

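Edit: Thanks to @shs for pointing out in the comments that this did not preserve word boundaries. It now does so by default, but you can disable this with preserve_boundaries = FALSE: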

swap("date", "week", "this dAte is dated", preserve_boundaries = FALSE)
# [1] "this wEek is weekd"
swap("date", "week", "this dAte is dated")
# [1] "this wEek is dated"
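One caveat seems worth checking: the final gsub() pass in swap() uses fixed = TRUE, so a whole-word match such as "date" would presumably also be replaced inside "dated" when both appear with the same casing. A possible variant, sketched below, writes each replacement back at the matched position with the replacement form of regmatches(). The names match_case() and swap_in_place() are hypothetical, and the same ASCII-only capital test is assumed.

```r
# Hypothetical helper: copy the case pattern of `found` onto lowercase `new`
match_case <- function(found, new) {
  caps <- utf8ToInt(found) <= 90L          # ASCII-only capital test
  if (all(caps)) return(toupper(new))      # e.g. DATE => MONTH, not MONTh
  n <- nchar(new)
  mask <- c(caps, rep(FALSE, n))[seq_len(n)]  # pad or truncate to length of new
  chars <- strsplit(new, "")[[1]]
  chars[mask] <- toupper(chars[mask])
  paste(chars, collapse = "")
}

# Hypothetical variant of swap(): replace only at whole-word match positions
swap_in_place <- function(old, new, str) {
  m <- gregexpr(paste0("\\b", old, "\\b"), str, ignore.case = TRUE, perl = TRUE)
  regmatches(str, m) <- lapply(
    regmatches(str, m),
    function(found) {
      vapply(found, match_case, character(1), new = new, USE.NAMES = FALSE)
    }
  )
  str
}

swap_in_place("date", "week", "this date is dated")
# [1] "this week is dated"
swap_in_place("date", "month", "This Date is a DATE that is daTe and date.")
# [1] "This Month is a MONTH that is moNth and month."
```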

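Performance

In terms of performance, dynamically generating the matches from lowercase arguments in this way will not be as fast as hard-coding list(c("Date", "Month"), c("DATE", "MONTH"), c("daTe", "moNth"), c("date", "month")). However, a fair comparison should include the time it takes to type out the list, and I suspect that even the most dedicated vim user could not do that in under a thousandth of a second.
I saw the benchmarks in Tyler Rinker's answer, so I used Reduce() and gsub(), which was the fastest replacement method tested. In addition, the approach in this answer generates exact match and replacement pairs, so we can set fixed = TRUE in gsub(), which reduces the time taken to make replacements with a five-character pattern by about 75% compared with fixed = FALSE.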
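A rough way to check the fixed = TRUE effect on your own machine is sketched below with base R's system.time(). The input and the repetition count are arbitrary, and the timings will vary, so no expected numbers are shown.

```r
input <- rep("Here are a date, a Date, and a DATE", 1000)

# Time 50 passes with the regex engine
time_regex <- system.time(
  for (i in 1:50) out_regex <- gsub("date", "month", input)
)

# Time 50 passes with literal (fixed) matching
time_fixed <- system.time(
  for (i in 1:50) out_fixed <- gsub("date", "month", input, fixed = TRUE)
)

# Both calls should give the same result for a literal pattern
stopifnot(identical(out_regex, out_fixed))

# Elapsed seconds for each variant
c(regex = time_regex[["elapsed"]], fixed = time_fixed[["elapsed"]])
```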
