regex: Why is my regex for removing special characters adding more words to my text?

Posted by tez616oj on 2022-11-18 in Other

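I ran into this problem when I tried to run my regex function on my text, which can be found here.
I fetch the text from the link above with an HttpRequest, run a regex over it to clean it up, and then work out which word occurs most often.
After cleaning, I split the string on spaces into a string array, and I noticed a big difference in the number of elements.
Does anyone know why this happens? The word "the" should occur 6806 times.
Raw data: the correct answer is 6806.
With the regex, I get 8073 matches.
The regex I use is here in the sandbox with the text, and the code is below.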

//Application storing.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

// Cleaning up a bit
var words = CleanByRegex(rawSource);

string[] arr = words.Split(" ", StringSplitOptions.RemoveEmptyEntries);

string CleanByRegex(string rawSource)
{
    Regex r = RemoveSpecialChars();
    return r.Replace(rawSource, " ");
}

//  arr {string[220980]} - with regex
//  arr {string[157594]} - without regex

foreach (var word in arr)
{
    // some logic

}

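partial class Program
{
    // Matches any non-alphanumeric character, or a zero-width position preceded by a quote and whitespace.
    // The backslash in "\\s" is doubled so that the C# string literal compiles.
    [GeneratedRegex("(?:[^a-zA-Z0-9]|(?<=['\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-SE")]
    private static partial Regex RemoveSpecialChars();
}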

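I tried debugging it, and I suspect I am adding trailing whitespace, but I don't know how to deal with it.
I tried adding a whitespace-removal regex that replaces runs of multiple spaces with a single space.
The regex looks like this: [ ]{2,}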

partial class Program
{
    [GeneratedRegex("[ ]{2,}", RegexOptions.Compiled)]
    private static partial Regex RemoveWhiteSpaceTrails();
}
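
The difference in array length can be reproduced on a small input. The sketch below is only an illustration: the sample string is hypothetical, and the pattern is the character-class part of RemoveSpecialChars without the lookbehind branch. It suggests one plausible source of the extra entries: once hyphens, apostrophes and line breaks become spaces, tokens that Split(" ") used to keep together are broken apart.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample containing a hyphen, an apostrophe and a line break.
string raw = "well-known fox's den\nnext line";

// Splitting the raw text on spaces only: "den\nnext" stays a single entry.
string[] before = raw.Split(" ", StringSplitOptions.RemoveEmptyEntries);

// Replacing every non-alphanumeric character with a space first.
string cleaned = Regex.Replace(raw, "[^a-zA-Z0-9]", " ");
string[] after = cleaned.Split(" ", StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(before.Length); // 4
Console.WriteLine(after.Length);  // 7
```

Because StringSplitOptions.RemoveEmptyEntries already drops the empty entries produced by consecutive spaces, collapsing runs of spaces beforehand should not change the length of the array.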

Source: https://stackoverflow.com/questions/74475009/why-is-my-regex-for-removing-special-characters-adding-more-words-to-my-text

2 Answers

waxmsbnn1#

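It would help if you described what you are trying to clean up. But your specific question can be answered: from the sandbox, I can see that you are removing newlines and punctuation. That will certainly produce occurrences of "the " that were not there before: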

The quick brown fox jumps over the
lazy dog
//the+newline does not match

//after regex:
The quick brown fox jumps over the lazy dog
//now there's one more *the+space*

If you change the search to something less common, such as Seward, you should see the same result before and after the regex.
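
The effect can be checked with a small sketch. Everything here is illustrative: the sample text is the fox sentence from this answer, CountOccurrences is a hypothetical helper, and the case-insensitive comparison is an assumption meant to resemble a browser's find-in-page search.

```csharp
using System;
using System.Text.RegularExpressions;

// Two-line sample from the answer above; not the asker's actual text.
string raw = "The quick brown fox jumps over the\nlazy dog";

// Character-class part of the question's pattern, without the lookbehind branch.
string cleaned = Regex.Replace(raw, "[^a-zA-Z0-9]", " ");

Console.WriteLine(CountOccurrences(raw, "the "));     // 1
Console.WriteLine(CountOccurrences(cleaned, "the ")); // 2

// Hypothetical helper: counts non-overlapping, case-insensitive occurrences of needle in text.
static int CountOccurrences(string text, string needle)
{
    int count = 0;
    int index = 0;
    while ((index = text.IndexOf(needle, index, StringComparison.OrdinalIgnoreCase)) >= 0)
    {
        count++;
        index += needle.Length;
    }
    return count;
}
```

Before cleaning, only the leading "The " is followed by a space; after cleaning, the "the" at the end of the first line is followed by a space too, so the substring count goes up by one.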

2022-11-18

7dl7o3gd2#

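I believed the regex was creating more text when I replaced matches with string.Empty or " ", but that assumption was wrong; I was simply creating more matches.
The cause was that I thought the Ctrl+F search in Chrome would give me every occurrence of the word I searched for, which is definitely not true.
I tried my code on a subset of lorem ipsum text instead, because I questioned whether the search in Chrome really gave the correct answer.
The short answer is NO. If I search for "the ", I will not get "the" + Environment.NewLine, as **@simmetric** demonstrated.
Another scenario is a sentence that starts with the word "The ". Since I was interested in the words in the text, I used the regex \w+ to get the words. It returns a MatchCollection (IList<Match>), which I then loop through to add the values to my dictionary.
Code demo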

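using System.Collections.Generic;
using System.Text.RegularExpressions;

// Word counts, keyed case-insensitively (same dictionary as in the question)
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

var rawSource = "Some text";
var words = CleanByRegex(rawSource);

// Returns every run of word characters (\w+) in the source text
IList<Match> CleanByRegex(string rawSource)
{
    IList<Match> r = Regex.Matches(rawSource, "\\w+");
    return r;
}

foreach (var word in words)
{
    // \w+ never produces an empty match, so this check always passes
    if (word.Value.Length >= 1)
    {
        if (dictionary.ContainsKey(word.Value)) // if it's in the dictionary
            dictionary[word.Value] = dictionary[word.Value] + 1; // increment the count
        else
            dictionary[word.Value] = 1; // put it in the dictionary with a count of 1
    }
}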
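
If only the count of a single word is needed, a word-boundary pattern is one possible alternative to building the whole dictionary. This is a minimal sketch on a hypothetical sample string, not the code used in this answer; it counts "the" as a whole word, ignoring case, regardless of what whitespace or punctuation follows it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: "then" and "other" contain "the" but are not the word itself.
string raw = "The quick brown fox jumps over the\nlazy dog; then the other fox";

// \b anchors the match at word boundaries, so only the whole word "the" is counted.
int count = Regex.Matches(raw, @"\bthe\b", RegexOptions.IgnoreCase).Count;

Console.WriteLine(count); // 3
```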

2022-11-18
