How to scrape all pages of a website and get the meta description in PHP

Posted by cgvd09ve on 2022-12-02 in PHP


I want to scrape all pages of a website and get the meta description tag, such as
<meta name="description" content="I want to get this description of this meta tag" />
Likewise, for every other page I want to get its own meta description.
Here is my code:

add_action('woocommerce_before_single_product', 'my_function_get_description');

function my_function_get_description($url) {
   $the_html = file_get_contents('https://tipodense.dk/');
   print_r($the_html);
}

This print_r($the_html) outputs the whole site's HTML, and I don't know how to get the meta description of each page.
Please guide me, thanks.

php

Source: https://stackoverflow.com/questions/74639813/how-to-scrape-all-pages-of-a-website-and-get-the-meta-description-in-php

2 Answers


Answer 1 (cotxawn7)

You need to learn about preg_match and regular expressions. In this case it is fairly simple:

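function my_function_get_description($url) {
    $the_html = file_get_contents('https://tipodense.dk/');
    // A PCRE pattern needs delimiters (here "/"). [^"]* captures everything up to
    // the closing quote, so punctuation in the description does not break the match.
    preg_match('/<meta name="description" content="([^"]*)"/i', $the_html, $matches);
    print_r($matches);
}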

https://regex101.com/r/JMcaUh/1
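The description is captured by the capture group ( ) and stored in $matches[1]; $matches[0] holds the full matched text.
Edit: DOMDocument is also a good solution, but assuming you only want the description, a regex looks easier to me!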
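To make the layout of the match array concrete, here is a minimal sketch that runs preg_match against an inline string instead of a fetched page. The sample markup and the `$description` variable are assumptions made for illustration; the pattern also assumes the `name` attribute comes before `content`, which real pages do not guarantee.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<head><meta name="description" content="Events, food & music in Odense." /></head>';

// "/" are the pattern delimiters, "i" makes the match case-insensitive.
// [^"]* accepts any character except the closing quote.
$found = preg_match('/<meta\s+name="description"\s+content="([^"]*)"/i', $html, $matches);

// $matches[0] is the whole matched text, $matches[1] is the first capture group.
$description = ($found === 1) ? $matches[1] : null;

echo $description, PHP_EOL;
```

If the attributes appear in a different order, or the value is single-quoted, this pattern finds nothing and `$description` stays null, which is one reason a DOM parser tends to be more robust.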

Posted 2022-12-02

Answer 2 (kgsdhlau)

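A better way to parse HTML is to use DOMDocument, in many cases combined with DOMXPath to run queries against the DOM and find the elements of interest.
For example, in your case, to extract the meta description you could do: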

$url='https://tipodense.dk/';

# create the DOMDocument and load url
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->loadHTMLFile( $url );
libxml_clear_errors();

# load XPath
$xp=new DOMXPath( $dom );
$expr='//meta[@name="description"]';

$col=$xp->query($expr);
if( $col && $col->length > 0 ){
    foreach( $col as $node ){
        echo $node->getAttribute('content');
    }
}

This yields:

Har du brug for at vide hvad der sker i Odense? Vores fokuspunkter er især events, mad, musik, kultur og nyheder. Hvis du vil vide mere så læs med på sitet.
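The same lookup can be wrapped in a small function so it is easy to reuse and to try out without a network request. This is only a sketch: `get_meta_description()` is a hypothetical helper name, and the sample markup is made up for the example.

```php
<?php
// Hypothetical helper: returns the content of the first
// <meta name="description"> element, or null if there is none.
function get_meta_description(string $html): ?string
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $nodes = $xp->query('//meta[@name="description"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('content');
}

// Made-up sample pages, one with and one without a description.
$with    = '<html><head><meta name="description" content="Sample description"></head><body></body></html>';
$without = '<html><head><title>No description here</title></head><body></body></html>';

var_dump(get_meta_description($with));
var_dump(get_meta_description($without));
```

Returning null for a missing tag lets the caller tell "no description" apart from an empty one.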

Using the sitemap (or part of it), you could do this:

$url='https://tipodense.dk/';
$sitemap='https://tipodense.dk/sitemap-pages.xml';

$urls=array();

# create the DOMDocument and load url
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;

# read the sitemap & store urls
$dom->load( $sitemap );
libxml_clear_errors();

$col=$dom->getElementsByTagName('loc');
foreach( $col as $node )$urls[]=$node->nodeValue;


foreach( $urls as $url ){
    
    $dom->loadHTMLFile( $url );
    libxml_clear_errors();
    
    # load XPath
    $xp=new DOMXPath( $dom );
    $expr='//meta[@name="description"]';
    
    
    $col=$xp->query( $expr );
    if( $col && $col->length > 0 ){
        foreach( $col as $node ){
            printf('<div>%s: %s</div>', $url, $node->getAttribute('content') );
        }
    }
}
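The sitemap step can be checked in isolation by parsing an inline XML string. In the sketch below, `extract_sitemap_urls()` is a hypothetical helper and the example.com URLs are placeholders, not entries from the real sitemap.

```php
<?php
// Hypothetical helper: collects the text of every <loc> element in a sitemap.
function extract_sitemap_urls(string $xml): array
{
    libxml_use_internal_errors(true);   // do not print warnings for malformed XML
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xml);
    libxml_clear_errors();
    if (!$loaded) {
        return [];
    }

    $urls = [];
    foreach ($dom->getElementsByTagName('loc') as $node) {
        $urls[] = trim($node->nodeValue);
    }
    // Drop duplicates so no page is fetched twice.
    return array_values(array_unique($urls));
}

// Placeholder sitemap with one duplicated entry.
$sitemapXml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/</loc></url>
</urlset>
XML;

print_r(extract_sitemap_urls($sitemapXml));
```

Each returned URL can then be passed to loadHTMLFile() as in the loop above. Note that a sitemap index file also uses `<loc>`, but there the entries point to further sitemaps rather than to pages.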

Posted 2022-12-02
