PHP - How to identify and count only parent elements of a very large XML efficiently

Posted by js5cn81o on 2023-03-11 in PHP


I have a very large XML file in the following format (this is a very small snippet of two of its sections):

<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>3DO Interactive Multiplayer</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
    <Developer>The 3DO Company</Developer>
  </Platform>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
    <Developer>Commodore International</Developer>
  </Platform>
</LaunchBox>

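I want to quickly find every instance of the parent elements (i.e. Game and Platform in the example above) so that I can count them and also extract their contents.
To complicate matters, there is also a Platform "child" inside Game, which I don't want to count. I only want the parent (i.e. I don't want Game -> Platform, only Platform).
From a combination of this site and Google, I came up with the following function code: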

$attributeCount = 0;

$xml = new XMLReader();
$xml->open($xmlFile);
$elements = new \XMLElementIterator($xml, $sectionNameWereGetting);
// $sectionNameWereGetting is a variable that changes to Game and Platform etc

foreach ($elements as $key => $indElement) {
    if ($xml->nodeType == XMLReader::ELEMENT && $xml->name == $sectionNameWereGetting) {
        $parseElement = new SimpleXMLElement($xml->readOuterXML());
        // NOW I CAN COUNT IF THE ELEMENT HAS CHILDREN
        $thisCount = $parseElement->count();
        unset($parseElement);
        if ($thisCount == 0) {
            // IF THERE'S NO CHILDREN THEN SKIP THIS ELEMENT
            continue;
        }
        // IF THERE IS CHILDREN THEN INCREMENT THE COUNT
        // - IN ANOTHER FUNCTION I GRAB THE CONTENTS HERE
        // - AND PUT THEM IN THE DATABASE
        $attributeCount++;
    }
}
unset($elements);
$xml->close();
unset($xml);

return  $attributeCount;

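I'm using the excellent script written by Hakre at https://github.com/hakre/XMLReaderIterator/blob/master/src/XMLElementIterator.php
This does work, but I think allocating a new SimpleXMLElement for every element is slowing the operation down.
I only need the SimpleXMLElement to check whether the element has children. I use that to determine whether the element is nested inside another parent: if it is a parent it *will* have children, so I want to count it, but if it sits inside another parent it will have no children and I want to ignore it.
But maybe there is a better solution than counting children? For example an $xml->isParent() function or something similar?
The current function times out before it has counted all sections of the XML (there are about 8 different sections, some of which have a few hundred thousand records).
How can I make this process more efficient? I also use similar code to grab the contents of the main sections and put them into a database, so it will pay dividends to be as efficient as possible.
It is also worth noting that I'm not particularly good at programming, so please feel free to point out other mistakes I may have made so that I can improve.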

php

Source: https://stackoverflow.com/questions/75686615/php-how-to-identify-and-count-only-parent-elements-of-a-very-large-xml-efficie

4 Answers

ou6hu8tu #1

You do not need to serialize the XML in order to load it into DOM or SimpleXML. You can expand the current node into a DOM document:

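$reader = new XMLReader();
$reader->open(getXMLDataURL());

$document = new DOMDocument();

// navigate to the first node to process using read()/next()
// (placeholder: this simply takes the first node in the document)
$found = $reader->read();

while ($found) {
  if ($reader->nodeType === XMLReader::ELEMENT) {
    // expand into DOM
    $node = $reader->expand($document);
    // import DOM into SimpleXML
    $simpleXMLObject = simplexml_import_dom($node);
  }

  // navigate using read()/next(); next() moves to the following sibling
  $found = $reader->next();
}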

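However, by calling XMLReader::read() and XMLReader::next() appropriately, you can count the child elements of the document element without expanding anything. read() navigates to the next node, descending into descendants, while next() navigates to the next sibling node, ignoring descendants.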

$reader = new XMLReader();
$reader->open(getXMLDataURL());

$document = new DOMDocument();
$xpath = new DOMXpath($document);

$found = false;
// look for the document element
do {
  $found = $found ? $reader->next() : $reader->read();
} while (
  $found && 
  $reader->localName !== 'LaunchBox'
);

// go to first child of the document element
if ($found) {
    $found = $reader->read();
}

$counts = [];

// found a node at depth 1 
while ($found && $reader->depth === 1) {
     if ($reader->nodeType === XMLReader::ELEMENT) {
        if (isset($counts[$reader->localName])) {
            $counts[$reader->localName]++;
        } else {
            $counts[$reader->localName] = 1;
        }
    }
    // go to next sibling node
    $found = $reader->next();
}

var_dump($counts);

function getXMLDataURL() {
   $xml = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>3DO Interactive Multiplayer</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
    <Developer>The 3DO Company</Developer>
  </Platform>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
    <Developer>Commodore International</Developer>
  </Platform>
</LaunchBox>
XML;
    return 'data:application/xml;base64,'.base64_encode($xml);
}

Output:

array(2) {
  ["Game"]=>
  int(2)
  ["Platform"]=>
  int(2)
}
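
The difference between read() and next() described in this answer can be seen in a small trace. This is only a sketch: the function name visitedElements() and the shortened sample document are made up for illustration, and the document is written without whitespace between elements to keep the trace short.

```php
<?php
// Hypothetical sample, shortened from the question's XML.
$xml = '<LaunchBox><Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>'
     . '<Platform><Name>Commodore Amiga</Name></Platform></LaunchBox>';

/**
 * Walk the document and record "depth:name" for every element visited.
 * With $skipDescendants = true, nodes at depth 1 are left via next(),
 * so their subtrees are never visited.
 */
function visitedElements(string $xml, bool $skipDescendants): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $visited = [];
    $ok = $reader->read(); // first node: the document element at depth 0
    while ($ok) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $visited[] = $reader->depth . ':' . $reader->localName;
        }
        // next() steps over the current node's subtree, read() descends into it
        $ok = ($skipDescendants && $reader->depth === 1)
            ? $reader->next()
            : $reader->read();
    }
    $reader->close();

    return $visited;
}

print_r(visitedElements($xml, true));  // only depth 0 and depth 1 elements
print_r(visitedElements($xml, false)); // every element, including Game -> Platform
```

With next(), the Platform nested inside Game is never reached, which is why checking the depth is enough to tell the two kinds of Platform apart.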

Posted 2023-03-11

pengsaosao #2

It sounds like using XPath instead of iterating over the XML might suit your use case. With XPath you can select the specific nodes you need:

$xml = simplexml_load_string($xmlStr);

$games = $xml->xpath('/LaunchBox/Game');

echo count($games).' games'.PHP_EOL;

foreach ($games as $game) {
    print_r($game);
}

https://3v4l.org/bLLEi#v8.2.3
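
As a self-contained sketch of the same idea, the snippet below contrasts an absolute path with a descendant search. The sample string is a shortened, made-up version of the question's XML, and the variable names are illustrative only.

```php
<?php
// Hypothetical sample, shortened from the question's XML.
$xmlStr = <<<'XML'
<LaunchBox>
  <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
  <Game><Name>Wishbringer</Name><Platform>ZiNc</Platform></Game>
  <Platform><Name>Commodore Amiga</Name></Platform>
</LaunchBox>
XML;

$xml = simplexml_load_string($xmlStr);

// Absolute path: only Platform elements that are direct children of LaunchBox.
$topLevelPlatforms = $xml->xpath('/LaunchBox/Platform');

// Descendant search: also matches the Platform nested inside each Game.
$allPlatforms = $xml->xpath('//Platform');

echo count($topLevelPlatforms), ' top-level, ', count($allPlatforms), ' in total', PHP_EOL;
```

Note that simplexml_load_string() parses the whole document into memory, so this approach may not be practical for a file with several hundred thousand records.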

Posted 2023-03-11

ukqbszuj #3

I'm not sure I've fully understood your requirements, but if the output you are looking for is:

{ "Game":2, "Platform":2 }

then you can achieve it with this streamable XSLT 3.0 stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:map="http://www.w3.org/2005/xpath-functions/map"
   version="3.0">
  
   <xsl:mode streamable="yes"/>
   <xsl:output method="json" indent="yes"/>
   <xsl:template match="/">
      <xsl:sequence select="fold-left(/*/*/local-name(), map{}, 
         function($map, $name){
           map:put($map, $name, 
             if (map:contains($map, $name)) 
             then map:get($map, $name) + 1 
             else 1)})"/>
   </xsl:template>
   
</xsl:stylesheet>

XSLT 3.0 is available through the PHP API of the SaxonC product (disclosure: this is my company's product).
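
For readers unfamiliar with fold-left, the counting logic of the stylesheet can be mirrored in plain PHP with array_reduce(). This is not XSLT and does not stream; it only illustrates the fold, and the $names array is a hypothetical stand-in for the sequence produced by /*/*/local-name().

```php
<?php
// Hypothetical input: local names of the document element's children, in order.
$names = ['Game', 'Game', 'Platform', 'Platform'];

// Fold over the names, carrying a map of name => count, starting from an empty map.
$counts = array_reduce(
    $names,
    function (array $map, string $name): array {
        // same rule as the stylesheet: existing count + 1, otherwise 1
        $map[$name] = ($map[$name] ?? 0) + 1;
        return $map;
    },
    []
);

echo json_encode($counts), PHP_EOL; // prints {"Game":2,"Platform":2}
```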

Posted 2023-03-11

nkoocmlb #4

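This solution is built on the shoulders of giants (thanks to everyone who replied, especially @ThW). I used the DOMDocument solution. Over time I found that seeking through the document to reach the correct starting point took a lot of time, so I looped the "while" to keep the pointer in the right place. This changed the transfer time from 4 [...]. When I "break" out of the while loop, I return to the AJAX query, which updates the screen and re-runs until the whole XML has been imported.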

// Excerpt from a class method. $xmlFile, $sectionNameWereGetting, $dbTableName,
// $positionInDocument, $currentElementKey, $currentRecordNumberTransferring,
// $nextStoppingPoint, $loopTheWhileForSpeed, $maxLoops, $totalNumberOfRecords,
// $maxRecordsAtATime and $allRecordsToAdd are expected to be initialised by the
// surrounding method (not shown).
$reader = new XMLReader();
$reader->open($xmlFile);

$document = new DOMDocument();

$found = false;
// look for the document element
do {
    $found = $found ? $reader->next() : $reader->read();
} while (
    $found &&
    $reader->localName !== 'LaunchBox'
);

// go to first child of the document element
if ($found) {
    $found = $reader->read();
}

// iterate over the nodes at depth 1 (direct children of the document element)
while ($found && $reader->depth === 1) {

    $currentElementKey++;

    if ($currentElementKey <= $positionInDocument) {
        // WE DON'T WANT THIS RECORD AS WE'VE ALREADY ADDED IT
        $found = $reader->next();
        continue;
    }

    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName == $sectionNameWereGetting) {

        // expand into DOM
        $node = $reader->expand($document);
        // import DOM into SimpleXML
        $simpleXMLObject = simplexml_import_dom($node);

        // TRANSFER OBJECT INTO ARRAY READY FOR DATABASE
        $addRecord = []; // start clean so fields of the previous record don't leak in
        foreach ($simpleXMLObject as $elIndex => $elContent) {
            $addRecord[$elIndex] = trim((string) $elContent);
        }

        // MAKE ARRAY OF ARRAYS FOR DATABASE
        $allRecordsToAdd[] = $addRecord;
        // INCREMENT THE COUNT OF RECORDS WE'VE TRANSFERRED
        $currentRecordNumberTransferring++;
        // clearing current element
        unset($simpleXMLObject);
    }

    $positionInDocument = $currentElementKey;
    $found = $reader->next();

    if ($currentRecordNumberTransferring >= $nextStoppingPoint) {
        // WE NEED TO STOP AND REPORT BACK

        \DB::disableQueryLog();
        \DB::table($dbTableName)->insert($allRecordsToAdd);
        $allRecordsToAdd = array();

        $loopTheWhileForSpeed++;
        if ($loopTheWhileForSpeed < $maxLoops) {
            $nextStoppingPoint = self::calculateNextAjaxStoppingPoint($currentRecordNumberTransferring, $totalNumberOfRecords, $maxRecordsAtATime);
        } else {
            break;
        }
    }
}

$documentStats["positionInDocument"] = $positionInDocument;
$documentStats["currentRecordNumberTransferring"] = $currentRecordNumberTransferring;

$reader->close();
unset($reader);
unset($document);

return $documentStats;
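
The idea of keeping the reader positioned while handing records over in batches can also be sketched with a generator. This is an illustrative variant, not the code used in this answer: readInBatches() is a made-up name, the database insert is left to the caller, and the sketch assumes the first node in the document is the document element.

```php
<?php
/**
 * Yield arrays of up to $batchSize records named $section that are direct
 * children of the document element. The reader stays open between batches,
 * so no record is scanned twice.
 */
function readInBatches(string $xml, string $section, int $batchSize): Generator
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $document = new DOMDocument();

    // move to the document element, then to its first child (depth 1)
    $found = $reader->read() && $reader->read();

    $batch = [];
    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $section) {
            $record = [];
            // expand only the current record and copy its child values
            foreach (simplexml_import_dom($reader->expand($document)) as $name => $value) {
                $record[$name] = trim((string) $value);
            }
            $batch[] = $record;
            if (count($batch) === $batchSize) {
                yield $batch; // the caller stores the batch, then the loop resumes here
                $batch = [];
            }
        }
        $found = $reader->next(); // next sibling, skipping the record's subtree
    }
    if ($batch !== []) {
        yield $batch; // final, possibly smaller batch
    }
    $reader->close();
}

// Hypothetical sample data.
$xml = '<LaunchBox>'
     . '<Game><Name>Violet</Name></Game>'
     . '<Game><Name>Wishbringer</Name></Game>'
     . '<Game><Name>Zork</Name></Game>'
     . '<Platform><Name>Commodore Amiga</Name></Platform>'
     . '</LaunchBox>';

$batches = iterator_to_array(readInBatches($xml, 'Game', 2), false);
echo count($batches), ' batches', PHP_EOL;
```

A generator only helps within a single request. Resuming across separate AJAX requests still needs a stored position, as the answer does with $positionInDocument.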

Posted 2023-03-11
