PHP: simplexml_load_string strips any HTML between two XML tags. How can I keep it?

zmeyuzjn, posted on 2023-01-16 in PHP

Below are the XML and the object produced by parsing it. The code used is:

$XML = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $XML); 
echo "\n\n" . $XML; // double quotes, so \n is a real newline rather than a literal backslash-n
$xmldoc = simplexml_load_string($XML);
print_r($xmldoc);
$jsondoc = json_encode($xmldoc);
$phpobjectsdoc = json_decode($jsondoc, true); // was $json, which is never defined

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2023-01-06T02:06:06Z</responseDate>
<request identifier="journals:aajses19230810-01" metadataPrefix="oai_dc" verb="GetRecord"> https://x.x.edu/journals/cgi-bin/bcjournals-oaiserver</request>
<GetRecord>
<record>
<header>
<identifier>bcjournals:aajses19230810-01</identifier>
<datestamp>2020-12-03</datestamp>
<setSpec>bcjournals:aajses-documents</setSpec>
</header>
<metadata>
<oai_dcdc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dctitle>Bulletin of the American Association of Jesuit Scientists, Eastern Section</dctitle>
<dcdate>1923-08-10</dcdate>
<dcdescription>
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
<a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
<img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
</a>
(22 pages, 19 articles)
</dcdescription>
<dclanguage>en</dclanguage>
</oai_dcdc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>

SimpleXMLElement Object
(
    [responseDate] => 2023-01-06T02:06:06Z
    [request] =>  https://.ddd.edu/bcjournals/cg-bin/brnals-oaiserver
    [GetRecord] => SimpleXMLElement Object
        (
            [record] => SimpleXMLElement Object
                (
                    [header] => SimpleXMLElement Object
                        (
                            [identifier] => bcrnals:aajses19230810-01
                            [datestamp] => 2020-12-03
                            [setSpec] => bnals:aajses-documents
                        )

                    [metadata] => SimpleXMLElement Object
                        (
                            [oai_dcdc] => SimpleXMLElement Object
                                (
                                    [dctitle] => Bulletin of the American Association of Jesuit Scientists, Eastern Section
                                    [dcdate] => 1923-08-10
                                    [dcdescription] => 
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923

(22 pages, 19 articles)

                                    [dclanguage] => en
                                )

                        )

                )

        )

)

Answer #1 (y1aodyip)

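It's not your fault, but you have picked up other people's bad habits and ended up digging yourself into a completely unnecessary hole.
The first thing to understand is that SimpleXML is an API, not a way of creating plain PHP objects. The structure of an XML document is much more complex than a PHP object can easily represent, so SimpleXML gives you methods for accessing the data but does not lay all of it out in an obvious place.
So the first mistake everyone makes is expecting this to work: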

$xmldoc = simplexml_load_string($XML);
print_r($xmldoc);

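The first line creates a SimpleXMLElement object, and the second tries to display it. Unfortunately, print_r doesn't know how to display all the data that is available, so some parts of the object are invisible. This confuses people into doing a lot of unnecessary, counterproductive things, because they assume that whatever is invisible is "missing".
The reliable way to see everything inside the object is to turn it back into XML, at which point you will see that all the content is in fact still there: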

$xmldoc = simplexml_load_string($XML);
echo $xmldoc->asXML();
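
To make the difference concrete, here is a minimal sketch that compares the two views on a tiny, made-up document (the element names and the example.org URL are assumptions for illustration, not part of the original data). The print_r dump should omit the nested element and its attribute, while asXML() keeps them:

```php
<?php
// Hypothetical miniature document: <desc> mixes text with a child element.
$xml = '<doc><desc>Issue <a href="https://example.org/m.json">link</a> (22 pages)</desc></doc>';
$doc = simplexml_load_string($xml);

// print_r's view of the object, captured as a string instead of printed.
$dump = print_r($doc, true);

// The same object serialised back to XML.
$roundTrip = $doc->asXML();

echo $dump, "\n";       // the <a> element and its href are not shown here
echo $roundTrip, "\n";  // the full markup, including <a href="...">
```

The object is the same in both cases; only the debugging view differs.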

Once you realise that, you will also realise that this line is unnecessary:

$XML = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $XML);

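This is a *terrible* attempt to deal with element and attribute names that contain a colon. The colon actually marks an "XML namespace", a relatively complex but useful mechanism for combining different formats in one document without name collisions (my <link> and your <link> can sit side by side, and we can still tell them apart).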
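
As a small illustration of that point, the sketch below puts two <link> elements from different namespaces in one document and reads each through SimpleXMLElement::children(). The namespace URIs and the prefixes mine/yours are invented for the example:

```php
<?php
// Hypothetical namespaces and prefixes, chosen only for illustration.
$xml = <<<XML
<feed xmlns:mine="https://example.org/ns/mine" xmlns:yours="https://example.org/ns/yours">
    <mine:link>first</mine:link>
    <yours:link>second</yours:link>
</feed>
XML;

$feed = simplexml_load_string($xml);

// children() takes a namespace URI and exposes only the children in that namespace.
$mine  = (string) $feed->children('https://example.org/ns/mine')->link;
$yours = (string) $feed->children('https://example.org/ns/yours')->link;

echo $mine, "\n";
echo $yours, "\n";
```

Selecting by URI rather than by prefix means the code keeps working even if the document's author picks different prefixes.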
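Obviously, most of the time you don't just want to see the XML again; you want to get data out of it. Because print_r doesn't show them how, people reach for another "terrible" trick to get a different kind of object: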

$jsondoc = json_encode($xmldoc);
$phpobjectsdoc = json_decode($jsondoc, true); // the question passes $json, which is never defined

Don't do this. Once you do, the data that was merely invisible in print_r really is gone for good.
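
A minimal sketch of that loss, using a cut-down stand-in for the <request> element (the example.org URL is a placeholder): after the JSON round trip the verb attribute is no longer reachable, while the SimpleXML object still has it.

```php
<?php
// Cut-down, hypothetical stand-in for the <request> element in the question.
$xml = '<record><request verb="GetRecord">https://example.org/oai</request></record>';
$doc = simplexml_load_string($xml);

// The JSON round trip flattens <request> to a plain string.
$json  = json_encode($doc);
$array = json_decode($json, true);

// Reading the attribute through the SimpleXML API still works.
$verb = (string) $doc->request['verb'];

var_dump($array);
echo $verb, "\n";
```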
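Instead, read the examples in the manual of how to traverse a basic XML document.
Back to your example: I guessed where the colons had been before you removed them, then pasted the result into a programmer's editor (PhpStorm, in my case) and had it auto-indent the XML for readability: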

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2023-01-06T02:06:06Z</responseDate>
    <request identifier="journals:aajses19230810-01" metadataPrefix="oai_dc" verb="GetRecord"> https://x.x.edu/journals/cgi-bin/bcjournals-oaiserver</request>
    <GetRecord>
        <record>
            <header>
                <identifier>bcjournals:aajses19230810-01</identifier>
                <datestamp>2020-12-03</datestamp>
                <setSpec>bcjournals:aajses-documents</setSpec>
            </header>
            <metadata>
                <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
                    <dc:title>Bulletin of the American Association of Jesuit Scientists, Eastern Section</dc:title>
                    <dc:date>1923-08-10</dc:date>
                    <dc:description>
                        Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
                        <a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
                            <img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
                        </a>
                        (22 pages, 19 articles)
                    </dc:description>
                    <dc:language>en</dc:language>
                </oai_dc:dc>
            </metadata>
        </record>
    </GetRecord>
</OAI-PMH>

Now, to get to the dc:description element, we need to go step by step:

// 1. Our initial object represents the root element, `<OAI-PMH>`
$xmldoc = simplexml_load_string($XML);

// 2. Unprefixed children are in the namespace `http://www.openarchives.org/OAI/2.0/`
//    because of the xmlns="http://www.openarchives.org/OAI/2.0/"
$children = $xmldoc->children('http://www.openarchives.org/OAI/2.0/');

// 3. We want the element <GetRecord>
$GetRecord = $children->GetRecord;

// 4. Inside that, we want <record>, then <metadata>
$record = $GetRecord->record;
$metadata = $record->metadata;

// Or all in one statement:
$metadata = $xmldoc->children('http://www.openarchives.org/OAI/2.0/')
    ->GetRecord->record->metadata;

// 5. Now we have <oai_dc:dc> - a new namespace
// xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" so:
$dc = $metadata->children('http://www.openarchives.org/OAI/2.0/oai_dc/')->dc;

// 6. Another new namespace, for dc:description
// xmlns:dc="http://purl.org/dc/elements/1.1/"
$description = $dc->children('http://purl.org/dc/elements/1.1/')->description;

// 7. At this point, see what we've got:
echo $description->asXML();

which gives:

<dc:description>
Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
<a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
<img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
</a>
(22 pages, 19 articles)
</dc:description>

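The description, with all of its content in place!
One problem remains. If the element contained only text, we could write echo (string)$description;, but doing that here makes the HTML disappear again. That is because whoever generated this data slipped up by including HTML without escaping it, so SimpleXML treats the <a> and <img> elements as part of the XML structure rather than as content.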
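
The effect of the string cast can be seen in a short sketch on a simplified, namespace-free version of the element (the text and the example.org URL are stand-ins). The cast keeps only the text nodes that sit directly inside the element, whereas asXML() serialises the child elements too:

```php
<?php
// Hypothetical miniature of <dc:description>, without namespaces.
$xml = '<description>Bulletin, 10 August 1923 '
     . '<a href="https://example.org/manifest.json"><img alt="IIIF"/></a>'
     . ' (22 pages)</description>';
$description = simplexml_load_string($xml);

// Casting to string keeps only the text directly inside <description>.
$textOnly = (string) $description;

// asXML() serialises the element together with its child elements.
$full = $description->asXML();

echo $textOnly, "\n";
echo $full, "\n";
```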
Getting the "inner XML" is one thing SimpleXML has no neat method for, but there are tricks for doing it:

$content= '';
foreach (dom_import_simplexml($description)->childNodes as $child)
{
    $content .= $child->ownerDocument->saveXML( $child );
}
echo $content;

which gives:

Bulletin of the American Association of Jesuit Scientists, Eastern Section, 10 August 1923
<a href="https://xxx.xxx.edu/iiif/issue/aajses19230810-01/manifest.json?manifest=https%3a%2f%2fxxx.xxx.edu%2fiiif%2fissue%2faajses19230810-01%2fmanifest.json" target="_blank">
<img style="width: 20px;" alt="IIIF Collection Link" src="/custom/bournals/web/images/iiif-logo.png"/>
</a>
(22 pages, 19 articles)

Perfect!
If this looks long-winded, that is only because I stopped to explain along the way; here is the tidied-up version:

$xmldoc = simplexml_load_string($XML);

$description = $xmldoc
    ->children('http://www.openarchives.org/OAI/2.0/')
        ->GetRecord->record->metadata
    ->children('http://www.openarchives.org/OAI/2.0/oai_dc/')
        ->dc
    ->children('http://purl.org/dc/elements/1.1/')
        ->description;

$content= '';
foreach (dom_import_simplexml($description)->childNodes as $child)
{
    $content .= $child->ownerDocument->saveXML( $child );
}
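
If the same extraction is needed in several places, one possible variation is to wrap the loop in a small function and locate the element with XPath instead of chained children() calls. This is a sketch under assumptions, not part of the answer above: the function name innerXml is hypothetical, the XML is a trimmed-down stand-in for the real response, and the XPath lookup is a swapped-in technique built on SimpleXMLElement::registerXPathNamespace() and SimpleXMLElement::xpath().

```php
<?php
/**
 * Hypothetical helper: returns the serialised children of an element
 * (its "inner XML"), using the same DOM round trip as the loop above.
 */
function innerXml(SimpleXMLElement $element): string
{
    $content = '';
    foreach (dom_import_simplexml($element)->childNodes as $child) {
        $content .= $child->ownerDocument->saveXML($child);
    }
    return $content;
}

// Trimmed-down stand-in for the OAI-PMH response, for illustration only.
$XML = <<<XML
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:description>Bulletin <a href="https://example.org/manifest.json">IIIF</a> (22 pages)</dc:description>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>
XML;

$xmldoc = simplexml_load_string($XML);

// The prefix registered here is local to the query; only the URI has to
// match the namespace used in the document.
$xmldoc->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');
$matches = $xmldoc->xpath('//dc:description');

// xpath() yields an array of matches; fall back to '' when nothing is found.
$content = $matches ? innerXml($matches[0]) : '';
echo $content, "\n";
```

The XPath form is shorter, but it searches the whole document, so the explicit children() chain may be preferable when the same element name can appear in more than one place.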
