Web scraping with Golang Colly: how do I handle an XML path that is not found?

Posted by 0x6upsns on 2022-12-31 in Go

I am using Colly to scrape an e-commerce website, looping over many products.
Here is the snippet of my code that fetches a subtitle:

c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
        fmt.Println(e.Text)
})

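However, not all products have a subtitle, so the XML path above does not apply in every case.
When I reach a product without a subtitle, my code crashes with this error:
panic: expression must evaluate to a node-set
Here is my code: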

c := colly.NewCollector()
c.OnError(func(_ *colly.Response, err error) {
    log.Println("Something went wrong:", err)
})

//Sub Title
c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
    fmt.Println(e.Text)
})

c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
})

c.Visit("https://www.lazada.vn/-i1701980654-s7563711492.html")

This is what I want:

c.OnXML("/html/b.....v/h1/1234", func(e *colly.XMLElement) {
    // Pseudocode: branch on whether the XML path lookup failed.
    if no error {
        fmt.Println("NO ERROR")
    } else {
        fmt.Println("GOT ERROR")
    }
})

Source: https://stackoverflow.com/questions/74949682/web-scrapping-using-golang-colly-how-to-handle-xml-path-not-found

1 Answer

ujv3wf0j1#

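I think I know where your code goes wrong. Let me start from the end. As you can see, the error comes from a panic statement on line 473 of parse.go. The xpath package has a method called parseNodeTest that performs the following check: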

func (p *parser) parseNodeTest(n node, axeTyp string) (opnd node) {
    switch p.r.typ {
    case itemName:
        if p.r.canBeFunc && isNodeType(p.r) {
            var prop string
            switch p.r.name {
            case "comment", "text", "processing-instruction", "node":
                prop = p.r.name
            }
            var name string
            p.next()
            p.skipItem(itemLParens)
            if prop == "processing-instruction" && p.r.typ != itemRParens {
                checkItem(p.r, itemString)
                name = p.r.strval
                p.next()
            }
            p.skipItem(itemRParens)
            opnd = newAxisNode(axeTyp, name, "", prop, n)
        } else {
            prefix := p.r.prefix
            name := p.r.name
            p.next()
            if p.r.name == "*" {
                name = ""
            }
            opnd = newAxisNode(axeTyp, name, prefix, "", n)
        }
    case itemStar:
        opnd = newAxisNode(axeTyp, "", "", "", n)
        p.next()
    default:
        panic("expression must evaluate to a node-set")
    }
    return opnd
}

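The value of p.r.typ is itemNumber (28), so the switch falls into the default branch and panics. The methods called before this one (you can see them in your IDE's call stack) set the typ of the literal 1234 to that value, which makes the XPath query invalid. You have to remove 1234 and put a valid step in its place.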
Let me know if this solves your problem, thanks!

Answered 2022-12-31
