C# regex output string does not match my expectations

wa7juj8i posted on 2023-02-20 in C#

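I am using the code below to retrieve the shipping cost from amazon.com by scanning the HTML source of a product page, but the output is not what I want. Here is the code: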

regexString = "<span class=\"plusShippingText\">(.*)</span>";
match = Regex.Match(htmlSource, regexString);
string shipCost = match.Groups[1].Value;
MessageBox.Show(shipCost);

It shows a message box with the shipping cost returned as

&nbsp;+&nbsp;Free Shipping</span>

But I actually need only the following clean text.

Free Shipping

Please help me fix this.

regex

Source: https://stackoverflow.com/questions/23307067/c-sharp-regex-output-string-is-not-according-to-my-expectations

2 Answers

jaxagkaj1#

Can you try the following code (although using regex to parse HTML is a bad idea):

string shipCostHtml = Regex.Match(htmlSource, "(?<=<span class=\"plusShippingText\">).*?(?=</span>)").Value;
string shipCost = System.Net.WebUtility.HtmlDecode(shipCostHtml);
shipCost = shipCost.Trim(' ', '+', '\xa0');

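Your regex is almost fine; you only need to replace the greedy (.*) with the lazy (.*?). The greedy form runs to the last </span> on the line, which is why a closing tag ends up inside the captured group.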
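To make the greedy versus lazy difference concrete, here is a minimal sketch. The class name, the `Capture` helper, and the sample HTML string are hypothetical; the sample assumes the page has a second </span> on the same line, which is consistent with the output shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical sample: the target span is nested in another span,
    // so two closing tags follow the text on the same line.
    public const string Sample =
        "<span><span class=\"plusShippingText\">&nbsp;+&nbsp;Free Shipping</span></span>";

    // Builds the question's pattern with the given quantifier and
    // returns whatever capture group 1 matched.
    public static string Capture(string html, string quantifier)
    {
        string pattern = "<span class=\"plusShippingText\">(" + quantifier + ")</span>";
        return Regex.Match(html, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        // Greedy: stops at the LAST </span>, so the first one is captured too.
        Console.WriteLine(Capture(Sample, ".*"));
        // Lazy: stops at the FIRST </span>.
        Console.WriteLine(Capture(Sample, ".*?"));
    }
}
```

With this sample, the greedy call reproduces the string from the question, while the lazy call returns only the span's inner HTML, which still needs decoding and trimming.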
Here is how it could be solved with HtmlAgilityPack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlSource);
string shipCostHtml = doc.DocumentNode.SelectSingleNode("//span[@class='plusShippingText']").InnerText;
string shipCost = System.Net.WebUtility.HtmlDecode(shipCostHtml);
shipCost = shipCost.Trim(' ', '+', '\xa0');

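Now you are protected if Amazon decides to add other attributes to the <span>, for example <span style='{color:blue}' class='plusShippingText'>. Note that the XPath above compares the whole class attribute, so <span class='plusShippingText newClass'> would not match as written; to cover that case, test for the class token instead, e.g. //span[contains(concat(' ', normalize-space(@class), ' '), ' plusShippingText ')].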

yeotifhr2#

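You need to clean up the HTML in the captured string. The following call decodes the entities (such as &nbsp;) and removes the "+"; it does not strip the trailing </span>, so the regex still needs the lazy (.*?) quantifier:

shipCost = System.Net.WebUtility.HtmlDecode(shipCost).Replace("+", "").Trim();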
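As a rough illustration of how this cleanup combines with the lazy regex, the sketch below wraps both steps in helpers. The names `ShippingTextCleaner`, `Clean`, and `Extract` are hypothetical, and the HTML in `Main` is a made-up sample rather than real Amazon markup. It relies on the parameterless `Trim()` treating the decoded non-breaking space (U+00A0) as white space, which .NET documents.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ShippingTextCleaner
{
    // Decodes HTML entities, drops the "+" sign, and trims white space
    // (Trim() also removes U+00A0, the decoded form of &nbsp;).
    public static string Clean(string raw)
    {
        return WebUtility.HtmlDecode(raw).Replace("+", "").Trim();
    }

    // Captures the span's inner HTML with a lazy quantifier, then cleans it.
    // Returns an empty string when the span is not found.
    public static string Extract(string html)
    {
        Match m = Regex.Match(html, "<span class=\"plusShippingText\">(.*?)</span>");
        return m.Success ? Clean(m.Groups[1].Value) : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample markup for demonstration only.
        string html =
            "<span><span class=\"plusShippingText\">&nbsp;+&nbsp;Free Shipping</span></span>";
        Console.WriteLine(Extract(html));
        // Cleaning the greedy capture from the question leaves the tag behind.
        Console.WriteLine(Clean("&nbsp;+&nbsp;Free Shipping</span>"));
    }
}
```

Returning an empty string for a missing span is a design choice made for this sketch; callers that need to distinguish "not found" from "empty" may prefer checking `Match.Success` directly.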
