PHP: Trying to use HTML DOM Parser to get the main image on an Amazon page

Posted by ubby3x7f on 2023-02-21 in PHP


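I'm trying to use HTMLDOMParser to get the image source of the "main" product image, regardless of which product page the parser is pointed at.
On every page, the ID of that image seems to be "landingImage". You would think this should do the trick: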

$finalarray[$i][2] = $html->find('img[id="landingImage"]', 0)->src;

But no such luck.
I've also tried

foreach($html->find('img') as $e)
    if (strpos($e,'landingImage') !== false) { 
        $finalarray[$i][2] = $e->src;
    }

I noticed that the image source usually contains SY300 or SX300, so I did this:

foreach($html->find('img') as $e)
    if (strpos($e,'SX300') !== false) { 
        $finalarray[$i][2] = $e->src;
    }
    else if (strpos($e,'SY300') !== false) { 
        $finalarray[$i][2] = $e->src;
    }

Sadly, some image source links don't contain this marker, for instance on this page:

http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20


Source: https://stackoverflow.com/questions/21842618/trying-to-use-html-dom-parser-to-get-main-image-on-amazon-page

3 Answers


qhhrdooz1#

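Using the Amazon API would probably be the better solution, but that is not the question.
When I download the HTML of the example page (without running any JavaScript), I can't find any tag with id="landingImage" [1]. I can, however, find an image tag with id="main-image". Trying to extract this tag with DOMDocument didn't succeed: for some reason, neither loadHTML() nor loadHTMLFile() could parse the HTML.
The interesting part can still be extracted with a regular expression. The following code gives you the image source: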

$url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';
$html = file_get_contents($url);

$matches = array();
if (preg_match('#<img[^>]*id="main-image"[^>]*src="(.*?)"[^>]*>#', $html, $matches)) {
    $src = $matches[1];
}

// The source of the image is
// $src: 'http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg'

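[1] The HTML source was downloaded in PHP with the file_get_contents function. Downloading the HTML source with Firefox yields different HTML. In that case you do find an image tag with the id attribute "landingImage" (with JavaScript disabled!). The downloaded HTML source apparently depends on the client (the headers in the HTTP request).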
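
Since the footnote suggests the returned HTML depends on the request headers, one thing to try is sending browser-like headers from PHP. The following is only a sketch: the helper name buildBrowserContext and the User-Agent string are assumptions made for illustration, and there is no guarantee that Amazon returns the "landingImage" variant for any particular header set.

```php
<?php
// Hypothetical helper: builds a stream context that sends browser-like headers.
// The header values are illustrative assumptions, not required values.
function buildBrowserContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "User-Agent: {$userAgent}\r\n" .
                         "Accept: text/html\r\n",
            'timeout' => 10, // seconds
        ],
    ]);
}

$context = buildBrowserContext(
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'
);

// Usage (performs a network request, so it is left commented out here):
// $html = file_get_contents($url, false, $context);
```

The context is passed as the third argument to file_get_contents, so the rest of the answer's code stays unchanged.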


mkh04yzy2#

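On the example page, the img tag with id="landingImage" does not contain a src attribute. That attribute is added by JavaScript.
The tag does, however, contain a data-a-dynamic-image attribute with the value {"http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg":[200,200]}.
You can read the value of this attribute and then parse just that value, either with a regexp or with the strpos and substr functions.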
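
The attribute value shown above looks like a JSON object that maps image URLs to [width, height] pairs, so json_decode may be simpler than a regexp or strpos/substr. The sketch below assumes that format holds; the helper name largestDynamicImage is hypothetical, and the entity decoding is there because the raw attribute may arrive with its quotes encoded as &quot;.

```php
<?php
// Hypothetical helper: returns the URL with the largest pixel area listed in
// a data-a-dynamic-image attribute value, or null if it cannot be parsed.
function largestDynamicImage(string $attributeValue): ?string
{
    // The raw attribute may be entity-encoded (&quot;), so decode it first.
    $json  = html_entity_decode($attributeValue, ENT_QUOTES);
    $sizes = json_decode($json, true);
    if (!is_array($sizes)) {
        return null;
    }

    $bestUrl  = null;
    $bestArea = -1;
    foreach ($sizes as $url => $dimensions) {
        // Skip entries that are not a [width, height] pair.
        if (!is_array($dimensions) || count($dimensions) < 2) {
            continue;
        }
        $area = (int) $dimensions[0] * (int) $dimensions[1];
        if ($area > $bestArea) {
            $bestArea = $area;
            $bestUrl  = (string) $url;
        }
    }
    return $bestUrl;
}

$value = '{&quot;http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg&quot;:[200,200]}';
echo largestDynamicImage($value), "\n";
```

With simple_html_dom, the value would come from something like $html->find('img[id="landingImage"]', 0)->getAttribute('data-a-dynamic-image'), after checking that find() returned an element.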


neskvpey3#

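It looks like not every page uses the same HTML. You need to check several possibilities and log the cases where no image is found, so that you can add support for them later. For example: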

$url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';

$html = file_get_html($url);

$image = $html->find('img[id="landingImage"]', 0);

if(!is_object($image)) {
  $image = $html->find('img[id="main-image"]', 0);
}

if(!is_object($image)) {
  // Log the error to apache error log
  error_log('Could not find amazon image: ' . $url);
} else {
  print $image->src;
}
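
The same fallback idea can be written as a loop over candidate ids, which makes it easier to add new page variants as they show up in the log. The sketch below is an illustration only: it uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom so that it runs without extra libraries, the helper name findImageSrc and the sample markup are made up, and the ids are assumed to contain no double quotes.

```php
<?php
// Hypothetical helper: tries each candidate id in order and returns the src
// of the first matching <img> that has one, or null when nothing matches.
function findImageSrc(string $html, array $candidateIds): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc      = new DOMDocument();
    $loaded   = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    foreach ($candidateIds as $id) {
        $nodes = $xpath->query('//img[@id="' . $id . '"]');
        if ($nodes !== false && $nodes->length > 0) {
            $src = $nodes->item(0)->getAttribute('src');
            if ($src !== '') {
                return $src;
            }
        }
    }
    return null;
}

// Made-up sample markup standing in for a downloaded product page.
$sample = '<html><body><img id="main-image" src="http://example.com/a.jpg"></body></html>';
$src    = findImageSrc($sample, ['landingImage', 'main-image']);

if ($src === null) {
    error_log('Could not find amazon image');
} else {
    echo $src, "\n";
}
```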

