PHP - How to identify and count only parent elements of a very large XML efficiently

Question

    // Counts top-level <$sectionNameWereGetting> elements that contain child elements.
    // Assumes the sections are direct children of the root element (depth 1).
    $attributeCount = 0;
    $xml = new XMLReader();
    $xml->open($xmlFile);
    $awaitingChild = false; // Flag: the start tag of a section was just read
    while ($xml->read()) {
        if ($xml->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($awaitingChild) {
            // XMLReader is forward-only, so instead of peeking ahead and moving
            // back, check whether the next element is one level deeper
            if ($xml->depth === 2) {
                $attributeCount++; // The section has at least one child element
            }
            $awaitingChild = false;
        }
        // Only depth 1 counts as a section; this skips Platform inside Game
        if ($xml->depth === 1 && $xml->name === $sectionNameWereGetting) {
            $awaitingChild = !$xml->isEmptyElement;
        }
    }
    $xml->close();
    return $attributeCount;
I have a very large XML file in the following format (this is a very small snippet showing two of the sections).

    <?xml version="1.0" standalone="yes"?>
    <LaunchBox>
      <Game>
        <Name>Violet</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Game>
        <Name>Wishbringer</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Platform>
        <Name>3DO Interactive Multiplayer</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
        <Developer>The 3DO Company</Developer>
      </Platform>
      <Platform>
        <Name>Commodore Amiga</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
        <Developer>Commodore International</Developer>
      </Platform>
    </LaunchBox>

I would like to quickly find all instances of the parent elements (i.e. Game and Platform in the above example), both to count them and to extract their contents.

To complicate matters, there is also a Platform "child" inside Game, which I don't want to count. I only want the parent (i.e. I do not want Game -> Platform, but I do want the top-level Platform).

From a combination of this site and Google I came up with the following function code:

    $attributeCount = 0;
    $xml = new XMLReader();
    $xml->open($xmlFile);
    $elements = new \XMLElementIterator($xml, $sectionNameWereGetting);
    // $sectionNameWereGetting is a variable that changes to Game and Platform etc
    foreach( $elements as $key => $indElement ){
        if ($xml->nodeType == XMLReader::ELEMENT && $xml->name == $sectionNameWereGetting) {
            $parseElement = new SimpleXMLElement($xml->readOuterXML());
            // NOW I CAN COUNT IF THE ELEMENT HAS CHILDREN
            $thisCount = $parseElement->count();
            unset($parseElement);
            if ($thisCount == 0){
                // IF THERE'S NO CHILDREN THEN SKIP THIS ELEMENT
                continue;
            }
            // IF THERE IS CHILDREN THEN INCREMENT THE COUNT
            // - IN ANOTHER FUNCTION I GRAB THE CONTENTS HERE
            // - AND PUT THEM IN THE DATABASE
            $attributeCount++;
        }
    }
    unset($elements);
    $xml->close();
    unset($xml);
    return $attributeCount;

I'm using the excellent script by Hakre at https://github.com/hakre/XMLReaderIterator/blob/master/src/XMLElementIterator.php

This does work. But I think assigning a new SimpleXMLElement is slowing the operation down.

I only need the SimpleXMLElement to check if the element has children (which I'm using to ascertain if the element is inside another parent or not - i.e. if it's a parent it 'will' have children so I want to count it but, if it's inside another parent then it won't have children and I want to ignore it).

But perhaps there is a better solution than counting children? i.e. a $xml->isParent() function or something?

The current function times out before it has fully counted all the sections of the XML (there are around 8 different sections and some of them have several hundred thousand records).

How can I make this process more efficient? I'm also using similar code to grab the contents of the main sections and put them into a database, so it will pay dividends to be as efficient as possible.

Also worth noting that I'm not particularly good at programming so please feel free to point out other mistakes I may have made so that I can improve.

Answer 1

Score: 1

It sounds like using an XPath expression instead of iterating over the XML might work for your use case. With XPath you can select the specific nodes you need:

    // $xmlStr holds the XML document as a string
    $xml = simplexml_load_string($xmlStr);
    $games = $xml->xpath('/LaunchBox/Game');
    echo count($games).' games'.PHP_EOL;
    foreach ($games as $game) {
        print_r($game);
    }

https://3v4l.org/bLLEi#v8.2.3
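
As a hedged extension of the same idea: the expression `/LaunchBox/*` selects every direct child of the root, so the Platform nested inside Game is never matched, and the results can be grouped by element name. The sketch below uses a shortened, made-up sample in the question's format. Note that SimpleXML parses the whole document into memory, so this may not be practical for a very large file.

```php
<?php
// Shortened sample in the question's format (illustrative data only).
$xmlStr = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
  <Game><Name>Wishbringer</Name><Platform>ZiNc</Platform></Game>
  <Platform><Name>Commodore Amiga</Name></Platform>
</LaunchBox>
XML;

$xml = simplexml_load_string($xmlStr);

// '/LaunchBox/*' matches only direct children of the root element,
// so a <Platform> nested inside <Game> is not selected.
$counts = [];
foreach ($xml->xpath('/LaunchBox/*') as $element) {
    $name = $element->getName();
    $counts[$name] = ($counts[$name] ?? 0) + 1;
}

print_r($counts);
```

Grouping by `getName()` avoids running one XPath query per section name.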

Answer 2

Score: 1

You do not need to serialize the XML to load it into DOM or SimpleXML. You can expand into a DOM document:

    $reader = new XMLReader();
    $reader->open(getXMLDataURL());
    $document = new DOMDocument();
    // navigate using read()/next()
    $found = $reader->read();
    while ($found) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->depth === 1) {
            // expand into DOM
            $node = $reader->expand($document);
            // import DOM into SimpleXML
            $simpleXMLObject = simplexml_import_dom($node);
            // the subtree is already expanded, so skip to the next sibling
            $found = $reader->next();
        } else {
            $found = $reader->read();
        }
    }

However, counting the element children of the document element can be done with just the right calls to XMLReader::read() and XMLReader::next(). read() navigates to the following node, including descendants, while next() goes to the following sibling node, ignoring the descendants.

    $reader = new XMLReader();
    $reader->open(getXMLDataURL());
    $document = new DOMDocument();
    $xpath = new DOMXpath($document);
    $found = false;
    // look for the document element
    do {
        $found = $found ? $reader->next() : $reader->read();
    } while (
        $found &&
        $reader->localName !== 'LaunchBox'
    );
    // go to first child of the document element
    if ($found) {
        $found = $reader->read();
    }
    $counts = [];
    // found a node at depth 1
    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if (isset($counts[$reader->localName])) {
                $counts[$reader->localName]++;
            } else {
                $counts[$reader->localName] = 1;
            }
        }
        // go to next sibling node
        $found = $reader->next();
    }
    var_dump($counts);

    function getXMLDataURL() {
        $xml = <<<'XML'
    <?xml version="1.0" standalone="yes"?>
    <LaunchBox>
      <Game>
        <Name>Violet</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Game>
        <Name>Wishbringer</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Platform>
        <Name>3DO Interactive Multiplayer</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
        <Developer>The 3DO Company</Developer>
      </Platform>
      <Platform>
        <Name>Commodore Amiga</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
        <Developer>Commodore International</Developer>
      </Platform>
    </LaunchBox>
    XML;
        return 'data:application/xml;base64,'.base64_encode($xml);
    }

Output:

    array(2) {
      ["Game"]=>
      int(2)
      ["Platform"]=>
      int(2)
    }
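
Since the question also asks about extracting the contents, the two techniques above (sibling navigation with next() and expand() into DOM) can be combined. The following is only a sketch under stated assumptions: `iterateTopLevelElements()` is a hypothetical helper name, the sections are assumed to be direct children of the document element, and the sample data is written to a temporary file purely for illustration.

```php
<?php
/**
 * Yield every depth-1 element called $name as a SimpleXMLElement.
 * iterateTopLevelElements() is a hypothetical helper, not part of any library.
 */
function iterateTopLevelElements(string $uri, string $name): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }
    $document = new DOMDocument();

    // read() until the first node below the document element (depth 1)
    do {
        $found = $reader->read();
    } while ($found && $reader->depth < 1);

    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $name) {
            // expand() builds a DOM subtree for the current element only
            yield simplexml_import_dom($reader->expand($document));
        }
        // next() jumps to the following sibling, skipping all descendants
        $found = $reader->next();
    }
    $reader->close();
}

// Usage with a small illustrative file
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $file,
    '<LaunchBox><Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>'
    . '<Platform><Name>Commodore Amiga</Name></Platform></LaunchBox>'
);
$names = [];
foreach (iterateTopLevelElements($file, 'Platform') as $platform) {
    $names[] = (string) $platform->Name;
}
unlink($file);
print_r($names);
```

Because only one element is expanded at a time, memory use should stay roughly proportional to the largest single record rather than to the whole file.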

Answer 3

Score: 1

I'm not sure I've fully understood your requirement but if the output you are looking for is:

    { "Game":2, "Platform":2 }

then you can achieve it with this streamable XSLT 3.0 stylesheet:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:map="http://www.w3.org/2005/xpath-functions/map"
        version="3.0">

      <xsl:mode streamable="yes"/>
      <xsl:output method="json" indent="yes"/>
      <xsl:template match="/">
        <xsl:sequence select="fold-left(/*/*/local-name(), map{},
            function($map, $name){
              map:put($map, $name,
                if (map:contains($map, $name))
                  then map:get($map, $name) + 1
                  else 1)})"/>
      </xsl:template>

    </xsl:stylesheet>

XSLT 3.0 is available via a PHP API in the SaxonC product (caveat, this is my company's product).

Answer 4

Score: 0

Solution

Building on the shoulders of giants (thanks to all who replied, especially @ThW), I used the DOMDocument solution. With some time logging I found that searching the document to get to the correct starting point was taking a lot of the time. So I looped around the 'while' to keep the pointer in the correct position. This cut the transfer time from 4.5 hours down to a few minutes. When I 'break' from the while loop I return to an Ajax query that then updates the screen and re-runs until the whole XML has been imported.

    // $currentElementKey, $positionInDocument, $currentRecordNumberTransferring,
    // $nextStoppingPoint, $loopTheWhileForSpeed, $maxLoops and the other inputs are
    // assumed to be initialised earlier in the surrounding method (not shown)
    $reader = new XMLReader();
    $reader->open($xmlFile);
    $document = new DOMDocument();
    $xpath = new DOMXpath($document);
    $found = false;
    // look for the document element
    do {
        $found = $found ? $reader->next() : $reader->read();
    } while (
        $found &&
        $reader->localName !== 'LaunchBox'
    );
    // go to first child of the document element
    if ($found) {
        $found = $reader->read();
    }
    $counts = [];
    while ($found && $reader->depth === 1) {
        $currentElementKey++;
        if( $currentElementKey <= $positionInDocument ){
            // WE DON'T WANT THIS RECORD AS WE'VE ALREADY ADDED IT
            $found = $reader->next();
            continue; // falling through would handle the next node under the wrong key
        }
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName == $sectionNameWereGetting) {
            // expand into DOM
            $node = $reader->expand($document);
            // import DOM into SimpleXML
            $simpleXMLObject = simplexml_import_dom($node);
            // TRANSFER OBJECT INTO ARRAY READY FOR DATABASE
            $addRecord = []; // reset so fields of the previous record cannot carry over
            foreach($simpleXMLObject as $elIndex => $elContent){
                $addRecord[$elIndex] = trim($elContent);
            }
            // MAKE ARRAY OF ARRAYS FOR DATABASE
            $allRecordsToAdd[] = $addRecord;
            // INCREMENT THE COUNT OF RECORDS WE'VE TRANSFERRED
            $currentRecordNumberTransferring++;
            // clearing current element
            unset($simpleXMLObject);
        }
        $positionInDocument = $currentElementKey;
        $found = $reader->next();
        if( $currentRecordNumberTransferring >= $nextStoppingPoint ){
            // WE NEED TO STOP AND REPORT BACK
            \DB::disableQueryLog();
            DB::table($dbTableName)->insert($allRecordsToAdd);
            $allRecordsToAdd = array();
            $loopTheWhileForSpeed++;
            if( $loopTheWhileForSpeed < $maxLoops ){
                $nextStoppingPoint = self::calculateNextAjaxStoppingPoint($currentRecordNumberTransferring, $totalNumberOfRecords, $maxRecordsAtATime);
            } else {
                break;
            }
        }
    }
    $documentStats["positionInDocument"] = $positionInDocument;
    $documentStats["currentRecordNumberTransferring"] = $currentRecordNumberTransferring;
    $reader->close();
    unset($reader);
    unset($document);
    unset($xpath);
    return $documentStats;
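
The batching part of this approach can be sketched without the framework specifics. This is a simplified single-pass illustration, not the code above: `importInBatches()` and the `$flush` callback are hypothetical names, `$flush` stands in for a bulk database insert, and the Ajax stop-and-resume logic is left out.

```php
<?php
/**
 * Read depth-1 <$name> elements and hand them to $flush in batches.
 * importInBatches() and $flush are hypothetical names for illustration.
 * Returns the number of records passed to $flush.
 */
function importInBatches(string $uri, string $name, int $batchSize, callable $flush): int
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }
    $document = new DOMDocument();
    $batch = [];
    $total = 0;

    // move to the first node below the document element
    do {
        $found = $reader->read();
    } while ($found && $reader->depth < 1);

    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $name) {
            $record = []; // start from an empty array for every element
            foreach (simplexml_import_dom($reader->expand($document)) as $field => $value) {
                $record[$field] = trim((string) $value);
            }
            $batch[] = $record;
            if (count($batch) >= $batchSize) {
                $flush($batch);
                $total += count($batch);
                $batch = [];
            }
        }
        $found = $reader->next();
    }
    if ($batch !== []) { // flush the final, partially filled batch
        $flush($batch);
        $total += count($batch);
    }
    $reader->close();
    return $total;
}

// Usage: three illustrative records, flushed in batches of two
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $file,
    '<LaunchBox><Game><Name>A</Name></Game><Game><Name>B</Name></Game>'
    . '<Game><Name>C</Name></Game></LaunchBox>'
);
$batchSizes = [];
$total = importInBatches($file, 'Game', 2, function (array $rows) use (&$batchSizes) {
    $batchSizes[] = count($rows); // a real callback would insert $rows into the database
});
unlink($file);
```

Flushing the remaining rows after the loop matters here; otherwise the last partial batch would never reach the database.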

huangapple
  • Posted on 2023-03-09 23:19:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75686615.html