PHP - How to identify and count only parent elements of a very large XML efficiently

Question

I have a very large XML file with the following format (this is a very small snippet showing two of the sections):

<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>3DO Interactive Multiplayer</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
    <Developer>The 3DO Company</Developer>
  </Platform>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
    <Developer>Commodore International</Developer>
  </Platform>
</LaunchBox>

I would like to quickly find the instances of all the parent elements (i.e. Game and Platform in the above example) to count them but also to extract the contents.

To complicate matters, there is also a Platform "child" inside Game, which I don't want to count. I only want the parent (i.e. I do not want Game -> Platform, but I do want the top-level Platform).

From a combination of this site and Google I came up with the following function code:

$attributeCount = 0;

$xml = new XMLReader();
$xml->open($xmlFile);
$elements = new \XMLElementIterator($xml, $sectionNameWereGetting);
// $sectionNameWereGetting is a variable that changes to Game and Platform etc

foreach ($elements as $key => $indElement) {
    if ($xml->nodeType == XMLReader::ELEMENT && $xml->name == $sectionNameWereGetting) {
        $parseElement = new SimpleXMLElement($xml->readOuterXML());
        // NOW I CAN COUNT IF THE ELEMENT HAS CHILDREN
        $thisCount = $parseElement->count();
        unset($parseElement);
        if ($thisCount == 0) {
            // IF THERE ARE NO CHILDREN THEN SKIP THIS ELEMENT
            continue;
        }
        // IF THERE ARE CHILDREN THEN INCREMENT THE COUNT
        // - IN ANOTHER FUNCTION I GRAB THE CONTENTS HERE
        // - AND PUT THEM IN THE DATABASE
        $attributeCount++;
    }
}
unset($elements);
$xml->close();
unset($xml);

return $attributeCount;

I'm using the excellent script by Hakre at https://github.com/hakre/XMLReaderIterator/blob/master/src/XMLElementIterator.php

This does work, but I think creating a new SimpleXMLElement for every element is slowing the operation down.

I only need the SimpleXMLElement to check whether the element has children. I'm using that to work out whether the element is inside another parent: if it's a parent it will have children, so I want to count it; if it's inside another parent it won't have children, and I want to ignore it.

But perhaps there is a better solution than counting children, e.g. a $xml->isParent() function or something?

The current function times out before it has fully counted all the sections of the XML (there are around 8 different sections and some of them have several hundred thousand records).

How can I make this process more efficient? I'm also using similar code to grab the contents of the main sections and put them into a database, so it will pay dividends to be as efficient as possible.

Also worth noting that I'm not particularly good at programming so please feel free to point out other mistakes I may have made so that I can improve.

Answer 1

Score: 1

It sounds like using an XPath expression instead of iterating over the XML might work for your use case. With XPath you can select the specific nodes you need:

$xml = simplexml_load_string($xmlStr);

$games = $xml->xpath('/LaunchBox/Game');

echo count($games).' games'.PHP_EOL;

foreach ($games as $game) {
    print_r($game);
}

https://3v4l.org/bLLEi#v8.2.3
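The same idea can be generalized to count every section at once. The following is a minimal sketch, not part of the original answer: it assumes the whole document fits in memory (SimpleXML builds the full tree), and the shortened sample data and the `$counts` array are illustrative. The path `/LaunchBox/*` selects only direct children of the root, so a Platform nested inside a Game is not matched.

```php
<?php
// Shortened sample data based on the question; $xmlStr is an assumed variable name.
$xmlStr = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>Commodore Amiga</Name>
  </Platform>
</LaunchBox>
XML;

$xml = simplexml_load_string($xmlStr);

// '/LaunchBox/*' matches only direct children of the root element,
// so the <Platform> inside each <Game> is ignored.
$counts = [];
foreach ($xml->xpath('/LaunchBox/*') as $element) {
    $name = $element->getName();
    $counts[$name] = ($counts[$name] ?? 0) + 1;
}

print_r($counts);
```

For a file with hundreds of thousands of records this approach may need more memory than is available, in which case a streaming reader is the safer choice.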

Answer 2

Score: 1

You do not need to serialize the XML to load it into DOM or SimpleXML. You can expand into a DOM document:

$reader = new XMLReader();
$reader->open(getXMLDataURL());

$document = new DOMDocument();

// navigate using read()/next()

while ($found) {
  // expand into DOM 
  $node = $reader->expand($document);
  // import DOM into SimpleXML 
  $simpleXMLObject = simplexml_import_dom($node);
 
  // navigate using read()/next()
}
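The skeleton above leaves the navigation out. One way to fill it in, as a hedged sketch: the inline sample XML, the `$records` array and the choice to collect only the Name of each Game are assumptions for illustration. It also assumes the first node the reader sees is the document element (no comment or DOCTYPE before it).

```php
<?php
// Inline sample standing in for the real file.
$xml = <<<'XML'
<LaunchBox>
  <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
  <Platform><Name>Commodore Amiga</Name><Emulated>true</Emulated></Platform>
</LaunchBox>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$document = new DOMDocument();

// First read() lands on <LaunchBox>, second on its first child node.
$found = $reader->read() && $reader->read();

$records = [];
while ($found && $reader->depth === 1) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'Game') {
        // Expand only the current element into DOM, then view it as SimpleXML.
        $node = $reader->expand($document);
        $game = simplexml_import_dom($node);
        $records[] = (string) $game->Name;
    }
    // Skip the element's descendants and go to the next sibling.
    $found = $reader->next();
}
$reader->close();

print_r($records);
```

Only one record is expanded at a time, so memory use stays roughly constant regardless of file size.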

However, counting the element children of the document element can be done with just the right calls to XMLReader::read() and XMLReader::next(). read() navigates to the following node, including descendants, while next() goes to the following sibling node, ignoring the descendants.

$reader = new XMLReader();
$reader->open(getXMLDataURL());

$document = new DOMDocument();
$xpath = new DOMXpath($document);

$found = false;
// look for the document element
do {
  $found = $found ? $reader->next() : $reader->read();
} while (
  $found && 
  $reader->localName !== 'LaunchBox'
);

// go to first child of the document element
if ($found) {
    $found = $reader->read();
}

$counts = [];

// found a node at depth 1 
while ($found && $reader->depth === 1) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        if (isset($counts[$reader->localName])) {
            $counts[$reader->localName]++;
        } else {
            $counts[$reader->localName] = 1;
        }
    }
    // go to next sibling node
    $found = $reader->next();
}

var_dump($counts);


function getXMLDataURL() {
   $xml = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>3DO Interactive Multiplayer</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
    <Developer>The 3DO Company</Developer>
  </Platform>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
    <Developer>Commodore International</Developer>
  </Platform>
</LaunchBox>
XML;
    return 'data:application/xml;base64,'.base64_encode($xml);
}

Output:

array(2) {
  ["Game"]=>
  int(2)
  ["Platform"]=>
  int(2)
}
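To make the difference between read() and next() concrete, here is a small sketch that records which elements each call visits. The helper `visitedElements()` and the compact sample XML are hypothetical. The `depth` property does the job of the `isParent()` check the question asks about: a top-level Platform is at depth 1, a Platform inside a Game is at depth 2.

```php
<?php
$xml = '<LaunchBox>'
     . '<Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>'
     . '<Platform><Name>Commodore Amiga</Name></Platform>'
     . '</LaunchBox>';

/**
 * Lists "depth:name" for every element start tag visited below the root.
 * With $skipDescendants = true the cursor advances with next(), otherwise with read().
 */
function visitedElements(string $xml, bool $skipDescendants): array
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $reader->read();            // document element, depth 0
    $found = $reader->read();   // first child, depth 1

    $visited = [];
    while ($found && $reader->depth >= 1) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $visited[] = $reader->depth . ':' . $reader->localName;
        }
        $found = $skipDescendants ? $reader->next() : $reader->read();
    }
    $reader->close();

    return $visited;
}

print_r(visitedElements($xml, false)); // read(): walks into every subtree
print_r(visitedElements($xml, true));  // next(): stays at depth 1
```

With read() the nested Name and Platform elements at depth 2 are visited as well; with next() only the two depth-1 elements are.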

Answer 3

Score: 1

I'm not sure I've fully understood your requirement but if the output you are looking for is:

{ "Game":2, "Platform":2 }

then you can achieve it with this streamable XSLT 3.0 stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    version="3.0">

  <xsl:mode streamable="yes"/>
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/">
    <xsl:sequence select="fold-left(/*/*/local-name(), map{},
        function($map, $name){
          map:put($map, $name,
            if (map:contains($map, $name))
            then map:get($map, $name) + 1
            else 1)})"/>
  </xsl:template>

</xsl:stylesheet>

XSLT 3.0 is available via a PHP API in the SaxonC product (caveat, this is my company's product).
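For readers less familiar with fold-left, the counting logic of the stylesheet corresponds roughly to a reduce over the list of element names. The sketch below shows that logic in plain PHP with `array_reduce`. It does not use SaxonC or any streaming, and the hard-coded `$names` list stands in for the result of `/*/*/local-name()`.

```php
<?php
// Stand-in for the sequence of names produced by /*/*/local-name().
$names = ['Game', 'Game', 'Platform', 'Platform'];

// Fold the names into a map of name => count, starting from an empty map.
$counts = array_reduce(
    $names,
    function (array $map, string $name): array {
        $map[$name] = ($map[$name] ?? 0) + 1;
        return $map;
    },
    []
);

echo json_encode($counts), PHP_EOL;
```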

Answer 4

Score: 0

Solution
Building on the shoulders of giants (thanks to all who replied, especially @ThW), I used the DOMDocument solution. With some time logging I found that searching the document to get to the correct starting point was taking a lot of the time. So I looped inside the 'while' to keep the pointer in the correct position. This cut the transfer time from 4.5 hours to a few minutes. When I 'break' from the while loop I return to an Ajax query that updates the screen and re-runs until the whole XML has been imported.

$reader = new XMLReader();
$reader->open($xmlFile);

$document = new DOMDocument();
$xpath = new DOMXpath($document);

$found = false;
// look for the document element
do {
    $found = $found ? $reader->next() : $reader->read();
} while (
    $found &&
    $reader->localName !== 'LaunchBox'
);

// go to first child of the document element
if ($found) {
    $found = $reader->read();
}

$counts = [];

while ($found && $reader->depth === 1) {

    $currentElementKey++;

    if ($currentElementKey <= $positionInDocument) {
        // WE DON'T WANT THIS RECORD AS WE'VE ALREADY ADDED IT
        // skip it and re-check the loop condition
        $found = $reader->next();
        continue;
    }

    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName == $sectionNameWereGetting) {
        // expand into DOM
        $node = $reader->expand($document);
        // import DOM into SimpleXML
        $simpleXMLObject = simplexml_import_dom($node);

        // TRANSFER OBJECT INTO ARRAY READY FOR DATABASE
        // (start from an empty array so fields of the previous record cannot leak in)
        $addRecord = [];
        foreach ($simpleXMLObject as $elIndex => $elContent) {
            $addRecord[$elIndex] = trim((string) $elContent);
        }

        // MAKE ARRAY OF ARRAYS FOR DATABASE
        $allRecordsToAdd[] = $addRecord;
        // INCREMENT THE COUNT OF RECORDS WE'VE TRANSFERRED
        $currentRecordNumberTransferring++;
        // clearing current element
        unset($simpleXMLObject);
    }

    $positionInDocument = $currentElementKey;
    $found = $reader->next();

    if ($currentRecordNumberTransferring >= $nextStoppingPoint) {
        // WE NEED TO STOP AND REPORT BACK
        \DB::disableQueryLog();
        DB::table($dbTableName)->insert($allRecordsToAdd);
        $allRecordsToAdd = array();

        $loopTheWhileForSpeed++;
        if ($loopTheWhileForSpeed < $maxLoops) {
            $nextStoppingPoint = self::calculateNextAjaxStoppingPoint($currentRecordNumberTransferring, $totalNumberOfRecords, $maxRecordsAtATime);
        } else {
            break;
        }
    }
}

$documentStats["positionInDocument"] = $positionInDocument;
$documentStats["currentRecordNumberTransferring"] = $currentRecordNumberTransferring;

$reader->close();
unset($reader);
unset($document);
unset($xpath);

return $documentStats;
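The batching idea can also be expressed with a generator, so the reader stays open and is never re-positioned between batches. This is an illustrative sketch rather than the code used above: `readInBatches()` is a hypothetical helper, the inline XML replaces the real file, and the database insert is replaced by collecting batch sizes. It assumes the first node in the document is the root element.

```php
<?php
/**
 * Streams the direct children of the root named $section and yields them
 * in arrays of at most $batchSize records (each record is field => value).
 */
function readInBatches(string $xml, string $section, int $batchSize): Generator
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $document = new DOMDocument();

    // Root element first, then its first child node.
    $found = $reader->read() && $reader->read();

    $batch = [];
    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $section) {
            $record = [];
            foreach (simplexml_import_dom($reader->expand($document)) as $field => $value) {
                $record[$field] = trim((string) $value);
            }
            $batch[] = $record;

            if (count($batch) === $batchSize) {
                yield $batch;   // hand a full batch to the caller
                $batch = [];
            }
        }
        $found = $reader->next();
    }

    if ($batch !== []) {
        yield $batch;           // final, possibly smaller, batch
    }
    $reader->close();
}

$xml = '<LaunchBox>'
     . '<Game><Name>A</Name></Game><Game><Name>B</Name></Game>'
     . '<Platform><Name>P</Name></Platform>'
     . '<Game><Name>C</Name></Game>'
     . '</LaunchBox>';

$sizes = [];
foreach (readInBatches($xml, 'Game', 2) as $batch) {
    // A real import would insert $batch into the database here.
    $sizes[] = count($batch);
}
print_r($sizes);
```

Flushing the remainder after the loop keeps the last partial batch from being lost when the record count is not a multiple of the batch size.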

huangapple
  • Posted on March 9, 2023 at 23:19:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75686615.html