How to get all URLs from a page (PHP)

q5iwbnjs posted on 2023-02-11 in PHP

I have a page that lists URLs with a description, one under another (something like a bookmarks/site list). How can I use PHP to get all the URLs from that page and write them to a txt file (one per line, just the URL, no description)?
The page looks like this:
Some description
Other description
Another one
I want the script's txt output to look like this:
http://link.com
http://link2.com
http://link3.com


wwtsj6pe #1

One way:

$url="http://wwww.somewhere.com";
$data=file_get_contents($url);
$data = strip_tags($data,"<a>");
$d = preg_split("/<\/a>/",$data);
foreach ( $d as $k=>$u ){
    if( strpos($u, "<a href=") !== FALSE ){
        $u = preg_replace("/.*<a\s+href=\"/sm","",$u);
        $u = preg_replace("/\".*/","",$u);
        print $u."\n";
    }
}
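
Since the question asks for the URLs in a txt file rather than printed to the screen, here is a minimal sketch along the same lines that collects them into an array and writes one per line (the output filename urls.txt is just an example):

$url = "http://wwww.somewhere.com";
$data = strip_tags(file_get_contents($url), "<a>");
$urls = array();
foreach ( preg_split("/<\/a>/", $data) as $u ){
    if( strpos($u, "<a href=") !== FALSE ){
        $u = preg_replace("/.*<a\s+href=\"/sm", "", $u);  // keep only the href value
        $u = preg_replace("/\".*/", "", $u);
        $urls[] = $u;
    }
}
// one URL per line, no descriptions
file_put_contents("urls.txt", implode("\n", $urls) . "\n");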

new9mtju #2

Another way:

$url = "http://wwww.somewhere.com";

$html = file_get_contents($url);

$doc = new DOMDocument();
$doc->loadHTML($html); //helps if html is well formed and has proper use of html entities!

$xpath = new DOMXpath($doc);

$nodes = $xpath->query('//a');

foreach($nodes as $node) {
    var_dump($node->getAttribute('href'));
}
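
If the page's HTML is not well formed, loadHTML() will emit parser warnings; a minimal sketch, assuming the same kind of page, that silences them with libxml_use_internal_errors() and keeps only anchors that actually carry an href (urls.txt is again just an example name):

$html = file_get_contents("http://wwww.somewhere.com");

libxml_use_internal_errors(true);   // keep warnings about messy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXpath($doc);
$urls = array();
foreach($xpath->query('//a[@href]') as $node) {
    $urls[] = $node->getAttribute('href');
}
file_put_contents("urls.txt", implode("\n", $urls) . "\n");   // one URL per line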

lf3rwulv #3

You can use this to grab all the links from a given web page.

<?php

    $url = "http://wwww.somewhere.com";   // the page to scan (placeholder)
    $var = fread_url($url);

    // $matches[1] will hold the href values, $matches[2] the link text
    preg_match_all ("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+".
                    "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
                    $var, $matches);

    $matches = $matches[1];

    foreach($matches as $var)
    {
        print($var."<br>");
    }

    // fetch a URL with cURL when available, otherwise fall back to fopen()
    function fread_url($url,$ref="")
    {
        $html = "";
        if(function_exists("curl_init")){
            $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; ".
                          "Windows NT 5.0)";
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
            curl_setopt( $ch, CURLOPT_HTTPGET, 1 );
            curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
            curl_setopt( $ch, CURLOPT_FOLLOWLOCATION , 1 );
            curl_setopt( $ch, CURLOPT_URL, $url );
            curl_setopt( $ch, CURLOPT_REFERER, $ref );
            curl_setopt( $ch, CURLOPT_COOKIEJAR, 'cookie.txt');
            $html = curl_exec($ch);
            curl_close($ch);
        }
        else{
            // requires allow_url_fopen to be enabled
            $hfile = fopen($url,"r");
            if($hfile){
                while(!feof($hfile)){
                    $html .= fgets($hfile,1024);
                }
                fclose($hfile);
            }
        }
        return $html;
    }

    ?>

wxclj1h5 #4

To get all the URLs of a page, grab all the links and then download the files:

<?php
$host = "http://urlname/";

// fetch the listing page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $host);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, false);
curl_setopt($ch, CURLOPT_REFERER, "http://192.168.2.104/filetest/");
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$result = curl_exec($ch);
curl_close($ch);

// pull the href values out of the <a> tags, keeping only ones that look like filenames
$files = array();
$data  = strip_tags($result,"<a>");
$d     = preg_split("/<\/a>/",$data);

foreach ( $d as $k=>$u ){
    if( strpos($u, "<a href=") !== FALSE )
    {
        $u = preg_replace("/.*<a\s+href=\"/sm","",$u);
        $u = preg_replace("/\".*/","",$u);
        if (strpos($u , ".") !== false)
        {
            array_push($files,urldecode($u));
        }
    }
}

// download each linked file into downloadfile/ (the directory must already exist)
foreach ($files as $filenm) {
    $url = "http://urlname/".$filenm;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');

    $result = curl_exec($ch);
    if (curl_errno($ch)) {
        echo 'Error:' . curl_error($ch);
    }
    curl_close($ch);

    file_put_contents("downloadfile/".$filenm, $result);
}
?>
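
Note that this answer assumes every href is a bare filename relative to the host. If the page mixes absolute and relative links, a small guard when building $url keeps the absolute ones intact (a sketch; this check is an assumption, not part of the original answer):

foreach ($files as $filenm) {
    // links that already carry a scheme (http://, https://) are left untouched
    if (parse_url($filenm, PHP_URL_SCHEME) !== null) {
        $url = $filenm;
    } else {
        $url = "http://urlname/" . ltrim($filenm, "/");
    }
    // ... same cURL download as above ...
}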
