axios: Web scraping with JavaScript

7d7tgy0s  posted on 2022-11-23  in iOS

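I am trying to scrape the contents of a multi-page website with JavaScript and export them to an Excel or CSV file.
The problem is that I only scrape the first page, and I cannot export the result to Excel or CSV.
Here is my code so far: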

const PORT =8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')

const app = express()
const url = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=1&selectedItem=viewAllAwardedContracts.do&T01_ps=100'
axios(url)
 .then(response => {
    const html = response.data
    const $ = cheerio.load(html)
    const articles = []
    $('#T01',html).each(function(){
        const contract = $(this).text()
        articles.push({
            contract
        })
        
    })
    console.log(articles)
   
 }).catch(err => console.log(err))


app.listen(PORT,() => console.log(`Server listening on port ${PORT}`))

I want to scrape all pages and store the result in a CSV or Excel file.


dbf7pr2w1#

Here is one possible solution:

import axios from "axios"
import {load} from "cheerio"
import fs from "fs"

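// Convert row objects to CSV text; fields containing commas, quotes,
// or newlines are quoted so they stay in one column.
const csv_field = (v) => /[",\n]/.test(String(v)) ? `"${String(v).replace(/"/g, '""')}"` : String(v)

const data_to_csv = (arr) => {
    const array = [].concat(arr)
    return array.map(el => {
        return Object.values(el).map(csv_field).join(',')
    }).join('\n') + '\n'
}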

const get_data = async (page) => {
    try {
        const response = await axios.get(`https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=${page}&selectedItem=viewAllAwardedContracts.do&T01_ps=100`)
        const html = response.data
        const $ = load(html)
        const data = []
        $('#T01>tbody>tr').each((_idx, el) => {
            const tender_no = $(el).find('td:nth-child(1)').text()
                .replace(/(\s+)/g, '')
                .replace(/,/g, '.')
            const procuring_entity = $(el).find('td:nth-child(2)').text()
                .replace(/(\s\s+)/g, '')
            const supplier_name = $(el).find('td:nth-child(3)').text()
                .replace(/(\s\s+)/g, '')
            const award_date = $(el).find('td:nth-child(4)').text()
                .replace(/(\s\s+)/g, '')
            const award_amount = $(el).find('td:nth-child(5)').text()
                .replace(/(\s\s+)/g, '')
            data.push({
                "Tender No": tender_no, 
                "Procuring Entity": procuring_entity, 
                "Supplier Name": supplier_name, 
                "Award Date": award_date, 
                "Award Amount": award_amount
            })
        });
        return data
    } catch (error) {
        throw error;
    }
};

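// Fetch pages one at a time so rows are appended in page order and the
// server does not receive 100 simultaneous requests.
const main = async () => {
    for (let i = 1; i <= 100; i++) {
        const data = await get_data(i)
        console.log(`Page number: ${i}`)
        fs.appendFileSync("taneps.csv", data_to_csv(data))
    }
}

main().catch((err) => console.error(err))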
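
The loop above assumes there are exactly 100 pages. If the page count is not known in advance, one option is to keep requesting pages until one comes back with no rows. The following is only a minimal sketch: `scrapeAllPages`, `fetchPage`, and `maxPages` are hypothetical names that do not appear in the answer, and it assumes the site returns an empty table past the last page, which has not been verified here.

```javascript
// Hypothetical helper: collect rows page by page until a page is empty.
// fetchPage(page) must return a Promise that resolves to an array of row objects.
// maxPages is a safety cap so an unexpected response cannot cause an endless loop.
async function scrapeAllPages(fetchPage, maxPages = 1000) {
    const rows = []
    for (let page = 1; page <= maxPages; page++) {
        const data = await fetchPage(page)
        if (data.length === 0) break // assumed end-of-data signal
        rows.push(...data)
    }
    return rows
}
```

With the answer's code, `get_data` could be passed as `fetchPage`, and the returned rows written in one go with `fs.writeFileSync("taneps.csv", data_to_csv(rows))`.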

Output CSV file taneps.csv:

PA/009/2021-22/HQ/G/19-,Muhimbili National Hospital,BADRA ATHUMAN NDULLAH,11/10/2021 11:29:36,451350.00(TZS)
PA/058/2021-2022/G/22,Mkwawa University College of Education,IANMAC TECHNOLOGIES,12/11/2021 17:03:52,1413000.00(TZS)
PA/055/2021-2022/HQ/NC/08,Institute of Social Work,LUMINA INVESTMENTS LIMITED,11/11/2021 08:02:03,2343480.00(TZS)
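
The file shown above has no header line, so Excel will label the columns A to E. Since each row object already carries the column names as its keys, a header can be derived from the first row. This is a small sketch only; `csvHeader` is a hypothetical helper, and it assumes the key names contain no commas or quotes (true for the five keys used in the answer).

```javascript
// Hypothetical helper: build a CSV header line from the keys of one row object.
// Assumes the keys contain no commas or quotes, so no escaping is done.
function csvHeader(row) {
    return Object.keys(row).join(',') + '\n'
}

// Sample row taken from the output above.
const sampleRow = {
    "Tender No": "PA/058/2021-2022/G/22",
    "Procuring Entity": "Mkwawa University College of Education",
    "Supplier Name": "IANMAC TECHNOLOGIES",
    "Award Date": "12/11/2021 17:03:52",
    "Award Amount": "1413000.00(TZS)"
}

console.log(csvHeader(sampleRow))
// prints: Tender No,Procuring Entity,Supplier Name,Award Date,Award Amount
```

The header would be written once, for example with `fs.writeFileSync`, before any data rows are appended.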

Tested on Node v16.15.0 with axios v1.1.3 and cheerio v1.0.0-rc.12.
