How can I remove JSON formatting for use in HTML?

Posted by kkih6yb8 on 2023-11-20 in Other


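I have converted my (scraped) JSON response into a string for use in HTML. The program simply tries to extract a book title from Amazon, strip the JSON formatting, and output that title as a regular string in my HTML body.
Is there a proper way to work the (replace) or (replaceAll) snippets at the bottom of this post into my code, or is there a different way to accomplish this? Here is my code. Thanks in advance for any help.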

JS code (scrapers.js):

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();

app.get('/scrape', async (req, res) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X');

  const [el2] = await page.$x('//*[@id="productTitle"]');
  const txt = await el2.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  
  const myObj = {rawTxt};

  res.json(myObj); // Send the JSON response

  browser.close();
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
});


HTML code (index.html):

<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <p id="myObjValue"></p>
  <script>
    fetch('/scrape') // Send a GET request to the server
      .then(response => response.json())
      .then(data => {
        const myObjValueElement = document.getElementById('myObjValue');
        myObjValueElement.textContent = `rawTxt: ${data.rawTxt}`;
      })
      .catch(error => console.error(error));
  </script>
</body>
</html>

I have looked online, and the main solutions I found were to apply one of these to my stringified JSON message:

.replaceAll("\\{","")
.replaceAll("\\}","")
.replaceAll("\\[","")
.replaceAll("\\]","")
.replaceAll(":","")
.replaceAll(",","")
.replaceAll(" ","");

.replace(/"/g, "")       // Remove double quotes
.replace(/\{/g, "")       // Remove opening curly braces
.replace(/\}/g, "")       // Remove closing curly braces
.replace(/\[/g, "")       // Remove opening square brackets
.replace(/\]/g, "")       // Remove closing square brackets
.replace(/:/g, "")        // Remove colons
.replace(/,/g, "")        // Remove commas
.replace(/ /g, "");       // Remove spaces

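Unfortunately, I have not been able to implement either of these solutions correctly, and every time the JSON-formatted string {"rawTxt":"The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto)"} is output in the browser on localhost:3000.
I would like this to be the output instead: The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto).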


Source: https://stackoverflow.com/questions/77371057/how-can-i-remove-json-formatting-for-use-in-html

2 Answers


Answer #1 (bfhwhh0e)

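It looks like you're starting the Express server and then navigating directly to the API route at http://localhost:3000/scrape, rather than viewing the HTML page, which uses fetch to hit that route. Because of this missing step, you see the raw JSON output from the API without the processing done by the script in your HTML file (namely response.json(), which parses the JSON into a plain JS object).
To serve the HTML page from the same origin as the server, you can adjust the server code as follows: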

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();
app.use(express.static('public')); // <-- added

app.get('/scrape', async (req, res) => {
// ...

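Then create a folder named public in the project root and move index.html into it.
Finally, restart the server and navigate to http://localhost:3000 (which serves index.html). You should see the expected output.
In short, your code is correct, but you may have misunderstood how to serve and view the HTML page.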
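To illustrate why no string stripping is needed, here is a minimal sketch comparing the two approaches on a shortened sample body. The helper names parseTitle and stripJsonSyntax are made up for this illustration and the title is abbreviated; JSON.parse stands in for what response.json() does in the browser.

```javascript
// Hypothetical helper: recover the title the way response.json() would,
// by parsing the JSON text into an object and reading its property.
function parseTitle(body) {
  return JSON.parse(body).rawTxt;
}

// Hypothetical helper: the character-stripping chain from the question.
function stripJsonSyntax(body) {
  return body
    .replace(/"/g, "")
    .replace(/\{/g, "")
    .replace(/\}/g, "")
    .replace(/\[/g, "")
    .replace(/\]/g, "")
    .replace(/:/g, "")
    .replace(/,/g, "")
    .replace(/ /g, "");
}

// Sample body shaped like the server's res.json({rawTxt}) output
// (title shortened for the example).
const body = JSON.stringify({ rawTxt: "The Black Swan: Second Edition (Incerto)" });

console.log(parseTitle(body));      // The Black Swan: Second Edition (Incerto)
console.log(stripJsonSyntax(body)); // rawTxtTheBlackSwanSecondEdition(Incerto)
```

Parsing keeps the colons and spaces that belong to the title, while the stripping chain deletes them and leaves the key name glued to the front.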
As an aside, you can simplify your Puppeteer selector and improve the error handling to ensure you always close the browser:

app.get('/scrape', async (req, res) => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('<Your URL>'); // TODO: paste your URL here!
    const rawTxt = await page.$eval(
      '#productTitle',
      el => el.textContent.trim()
    );
    res.json({rawTxt});
  }
  catch (err) {
    console.error(err);
    res.sendStatus(500); // respond on failure so the client's fetch doesn't hang
  }
  finally {
    await browser?.close();
  }
});

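Better yet, since the specific data you want is baked into the static HTML, you can speed things up by disabling JS, blocking all requests except the base HTML page, and using domcontentloaded: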

const puppeteer = require("puppeteer"); // ^21.2.1
const express = require("express"); // ^4.18.2
const app = express();
app.use(express.static("public"));

const url = "<Your URL>";

app.get("/scrape", async (req, res) => {
  let browser;
  try {
    browser = await puppeteer.launch({headless: "new"});
    const [page] = await browser.pages();
    await page.setJavaScriptEnabled(false);
    await page.setRequestInterception(true);
    page.on("request", req =>
      req.url() === url ? req.continue() : req.abort()
    );
    await page.goto(url, {waitUntil: "domcontentloaded"});
    const rawTxt = await page.$eval("#productTitle", el =>
      el.textContent.trim()
    );
    res.json({rawTxt});
  }
  catch (err) {
    console.error(err);
    res.sendStatus(500); // respond on failure so the client's fetch doesn't hang
  }
  finally {
    await browser?.close();
  }
});

app.listen(3000, () => {
  console.log("Server is running on http://localhost:3000");
});

You can also share a single browser instance across all requests. See this example if you're interested.
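One possible way to share a browser, sketched under assumptions: cache the promise returned by the launch call so that concurrent requests all wait on the same launch. The helper name createBrowserCache is hypothetical and not part of Puppeteer; the launch function is passed in so the helper does not depend on a particular library.

```javascript
// Hypothetical helper: wraps a launch function (for example
// () => puppeteer.launch()) and returns a getter that launches once
// and hands every caller the same promise.
function createBrowserCache(launch) {
  let browserPromise = null;
  return function getBrowser() {
    if (!browserPromise) {
      browserPromise = launch().catch(err => {
        browserPromise = null; // forget a failed launch so the next call retries
        throw err;
      });
    }
    return browserPromise;
  };
}

// Possible usage inside the Express app (assumes puppeteer is installed):
//   const getBrowser = createBrowserCache(() => puppeteer.launch());
//   app.get("/scrape", async (req, res) => {
//     const browser = await getBrowser();
//     const page = await browser.newPage();
//     try { /* scrape */ } finally { await page.close(); }
//   });
```

With a shared browser, each request should close only its own page rather than the browser itself.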


Answer #2 (6ovsh4lw)

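Here are some ways we could optimize this web scraping code:
1. Cache the browser instance: open the browser once and reuse it across requests instead of launching a new browser for every request.
2. Use async/await instead of promises: avoid promise chains to keep the code cleaner.
3. Extract constants: move strings such as the URL into constants to avoid repetition.
4. Use template literals to set text: avoid concatenation when setting the element's text.
5. Error handling: add some error handling to the scrape route in case the page fails.
6. Use headless mode: launch Chromium in headless mode for faster processing without a UI.
7. Handle requests concurrently: process multiple requests concurrently rather than sequentially.
8. Persistent storage: save the scraped data in a cache or database to avoid scraping it again.
9. Stream the response: stream the JSON response instead of building the full object, to reduce memory usage.

Here is one way to implement some of these optimizations: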

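// scraper.js
const express = require('express');
const puppeteer = require('puppeteer');
const { URL } = require('./constants'); // assumed to export the product page URL
const { getTitle } = require('./page-utils');

const app = express();

// Launch the browser once and reuse the same promise for every request.
let browserPromise;
const getBrowser = () => (browserPromise ??= puppeteer.launch());

app.get('/scrape', async (req, res) => {
  let page;
  try {
    const browser = await getBrowser(); // cached instance
    page = await browser.newPage();
    await page.goto(URL);

    const title = await getTitle(page);

    res.json({ title });
  } catch (e) {
    console.error(e);
    res.sendStatus(500);
  } finally {
    await page?.close(); // close the tab, keep the shared browser open
  }
});

app.listen(3000);

// page-utils.js
const getTitle = page =>
  page.$eval('#productTitle', el => el.textContent.trim());

module.exports = { getTitle };

// Covers: cached browser, extracted constant, async/await, error handling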
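The list above also suggests keeping scraped data in a cache so the same page is not scraped repeatedly. A minimal in-memory sketch of that idea follows; createTtlCache and its parameters are hypothetical names chosen for this example, and a database or external cache would be needed for data to survive a restart.

```javascript
// Hypothetical helper: remembers the promise produced by load() for each key
// and reuses it until ttlMs milliseconds have passed. The now parameter lets
// callers substitute a clock, which is mainly useful for testing.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return function cached(key, load) {
    const hit = entries.get(key);
    if (hit && now() - hit.time < ttlMs) {
      return hit.promise; // still fresh: skip the expensive work
    }
    // The executor runs synchronously, and a throw inside load() becomes a rejection.
    const promise = new Promise(resolve => resolve(load()));
    const entry = { promise, time: now() };
    entries.set(key, entry);
    // Do not keep failed results around.
    promise.catch(() => {
      if (entries.get(key) === entry) entries.delete(key);
    });
    return promise;
  };
}

// Possible usage in the route (scrapeTitle is a placeholder for the scraping code):
//   const cached = createTtlCache(10 * 60 * 1000); // keep titles for ten minutes
//   const title = await cached(url, () => scrapeTitle(url));
```

Storing the promise rather than the resolved value means two requests arriving at the same time trigger only one scrape.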


