更快的将大型嵌套XML转换为R数据框的方法

huangapple go评论99阅读模式
英文:

Faster way to put large nested xml into R dataframe

问题

我有一个XML文件,它的嵌套结构与我见过的其他示例不同。以下是它的格式。我对np节点中的数据感兴趣,但我也想在每一行中获取组(group)和ID信息。

我有下面的一个工作解决方案,但实际文件非常庞大,包含数千个节点。这个解决方案在下面的示例代码上运行良好,但在大文件上运行需要数小时。

我的问题是 - 有没有更快的方法来获取我想要的数据框(dataframe)?

  1. <File>
  2. <Time>
  3. <date>20220301</date>
  4. <Name>1</Name>
  5. <folder>
  6. <group>800</group>
  7. <ID>ESK</ID>
  8. <Type>S</Type>
  9. <Customer>1</Customer>
  10. <currency>USD</currency>
  11. <Port>
  12. <ec>X</ec>
  13. <np>
  14. <A>FIRST</A>
  15. <B>ES</B>
  16. <C>GOR</C>
  17. <D>2021</D>
  18. <E>-1000</E>
  19. </np>
  20. <np>
  21. <A>TEST</A>
  22. <B>ES</B>
  23. <C>RUN</C>
  24. <D>202303</D>
  25. <E>202303</E>
  26. <F>C</F>
  27. <G>3200</G>
  28. <H>32</H>
  29. </np>
  30. </Port>
  31. </folder>
  32. <folder>
  33. <group>900</group>
  34. <ID>ABC</ID>
  35. <Type>D</Type>
  36. <Customer>1</Customer>
  37. <currency>USD</currency>
  38. <Port>
  39. <ec>X</ec>
  40. <np>
  41. <A>CAT</A>
  42. <B>ES</B>
  43. <C>GO</C>
  44. <D>202303</D>
  45. <E>-500</E>
  46. </np>
  47. </Port>
  48. </folder>
  49. </Time>
  50. </File>

这是我当前的工作解决方案。这个解决方案在小型XML文件上运行良好,但对于具有这种XML格式的大型文件来说,它运行得太慢,需要数小时。

  1. URL <- 'H:/testSO.xml'
  2. doc <- read_xml(URL)
  3. df <- xml_find_all(doc, ".//np") %>%
  4. map_df( function(x) {
  5. set_names( c(
  6. xml_find_all( x, "./ancestor::folder/group") %>% xml_text(),
  7. xml_find_all( x, "./ancestor::folder/ID") %>% xml_text(),
  8. xml_find_all( x, ".//A") %>% xml_text(),
  9. xml_find_all( x, ".//B") %>% xml_text(),
  10. xml_find_all( x, ".//C") %>% xml_text(),
  11. xml_find_all( x, ".//D") %>% xml_text(),
  12. xml_find_all( x, ".//E") %>% xml_text()),
  13. c( "group","id", "A", "B", "C","D","E") ) %>%
  14. as.list() %>%
  15. flatten_df()
  16. }) %>%
  17. type_convert()
  18. head(df)

谢谢!

英文:

I have a XML file that has nesting that is different from other examples I have seen. Here's the format it takes. I am interested in the data in the np nodes, but I also want the group and ID info in each row.

I have a working solution below, but the actual file is very large with thousands of <folder> nodes. The solution takes hours and hours to run on the large file, although it works on the sample code below.

My question is - is there a faster way to get the dataframe I want?

  1. &lt;File&gt;
  2. &lt;Time&gt;
  3. &lt;date&gt;20220301&lt;/date&gt;
  4. &lt;Name&gt;1&lt;/Name&gt;
  5. &lt;folder&gt;
  6. &lt;group&gt;800&lt;/group&gt;
  7. &lt;ID&gt;ESK&lt;/ID&gt;
  8. &lt;Type&gt;S&lt;/Type&gt;
  9. &lt;Customer&gt;1&lt;/Customer&gt;
  10. &lt;currency&gt;USD&lt;/currency&gt;
  11. &lt;Port&gt;
  12. &lt;ec&gt;X&lt;/ec&gt;
  13. &lt;np&gt;
  14. &lt;A&gt;FIRST&lt;/A&gt;
  15. &lt;B&gt;ES&lt;/B&gt;
  16. &lt;C&gt;GOR&lt;/C&gt;
  17. &lt;D&gt;2021&lt;/D&gt;
  18. &lt;E&gt;-1000&lt;/E&gt;
  19. &lt;/np&gt;
  20. &lt;np&gt;
  21. &lt;A&gt;TEST&lt;/A&gt;
  22. &lt;B&gt;ES&lt;/B&gt;
  23. &lt;C&gt;RUN&lt;/C&gt;
  24. &lt;D&gt;202303&lt;/D&gt;
  25. &lt;E&gt;202303&lt;/E&gt;
  26. &lt;F&gt;C&lt;/F&gt;
  27. &lt;G&gt;3200&lt;/G&gt;
  28. &lt;H&gt;32&lt;/H&gt;
  29. &lt;/np&gt;
  30. &lt;/Port&gt;
  31. &lt;/folder&gt;
  32. &lt;folder&gt;
  33. &lt;group&gt;900&lt;/group&gt;
  34. &lt;ID&gt;ABC&lt;/ID&gt;
  35. &lt;Type&gt;D&lt;/Type&gt;
  36. &lt;Customer&gt;1&lt;/Customer&gt;
  37. &lt;currency&gt;USD&lt;/currency&gt;
  38. &lt;Port&gt;
  39. &lt;ec&gt;X&lt;/ec&gt;
  40. &lt;np&gt;
  41. &lt;A&gt;CAT&lt;/A&gt;
  42. &lt;B&gt;ES&lt;/B&gt;
  43. &lt;C&gt;GO&lt;/C&gt;
  44. &lt;D&gt;202303&lt;/D&gt;
  45. &lt;E&gt;-500&lt;/E&gt;
  46. &lt;/np&gt;
  47. &lt;/Port&gt;
  48. &lt;/folder&gt;
  49. &lt;/Time&gt;
  50. &lt;/File&gt;

Here is my current working solution. This works on small xml files, but for large files with that xml format it is too slow and takes hours to run.

  1. URL &lt;- &#39;H:/testSO.xml&#39;
  2. doc &lt;- read_xml(URL)
  3. df &lt;-
  4. xml_find_all(doc, &quot;.//np&quot;) %&gt;%
  5. map_df( function(x) {
  6. set_names( c(
  7. xml_find_all( x, &quot;./ancestor::folder/group&quot;) %&gt;% xml_text(),
  8. xml_find_all( x, &quot;./ancestor::folder/ID&quot;) %&gt;% xml_text(),
  9. xml_find_all( x, &quot;.//A&quot;) %&gt;% xml_text(),
  10. xml_find_all( x, &quot;.//B&quot;) %&gt;% xml_text(),
  11. xml_find_all( x, &quot;.//C&quot;) %&gt;% xml_text(),
  12. xml_find_all( x, &quot;.//D&quot;) %&gt;% xml_text(),
  13. xml_find_all( x, &quot;.//E&quot;) %&gt;% xml_text()),
  14. #set the column names
  15. c( &quot;group&quot;,&quot;id&quot;, &quot;A&quot;, &quot;B&quot;, &quot;C&quot;,&quot;D&quot;,&quot;E&quot;) ) %&gt;%
  16. as.list() %&gt;% #make list
  17. flatten_df()
  18. }) %&gt;%
  19. type_convert()
  20. head(df)

Thanks!

答案1

得分: 1

您可以利用xml2库的向量化功能,避免使用map_df循环,这应该会显著提高速度。

此例程查找所有的np节点,将所请求的信息提取到一系列向量中,然后使用这些结果创建一个数据框。

  1. library(xml2)
  2. library(dplyr)
  3. npnodes <- xml_find_all(doc, ".//np")
  4. group <- npnodes %>% xml_find_first( "./ancestor::folder/group") %>% xml_text()
  5. id <- npnodes %>% xml_find_first("./ancestor::folder/ID") %>% xml_text()
  6. A <- npnodes %>% xml_find_first(".//A") %>% xml_text()
  7. B <- npnodes %>% xml_find_first(".//B") %>% xml_text()
  8. C <- npnodes %>% xml_find_first(".//C") %>% xml_text()
  9. D <- npnodes %>% xml_find_first(".//D") %>% xml_text()
  10. E <- npnodes %>% xml_find_first(".//E") %>% xml_text()
  11. answer <- data.frame(group, id, A, B, C, D, E)
  12. answer
  1. group id A B C D E
  2. 1 800 ESK FIRST ES GOR 2021 -1000
  3. 2 800 ESK TEST ES RUN 202303 202303
  4. 3 900 ABC CAT ES GO 202303 -500

希望这个翻译对您有帮助。

英文:

You can take advantage of the vectorization of the xml2 library and avoid the map_df loop, this should provide a significant improvement in speed.

This routine finds all of the np nodes, extracts the requested information into a series of vectors, and then creates a data frame with the results.

  1. library(xml2)
  2. library(dplyr)
  3. npnodes &lt;- xml_find_all(doc, &quot;.//np&quot;)
  4. group &lt;- npnodes %&gt;% xml_find_first( &quot;./ancestor::folder/group&quot;) %&gt;% xml_text()
  5. id &lt;- npnodes %&gt;% xml_find_first(&quot;./ancestor::folder/ID&quot;) %&gt;% xml_text()
  6. A &lt;- npnodes %&gt;% xml_find_first(&quot;.//A&quot;) %&gt;% xml_text()
  7. B &lt;- npnodes %&gt;% xml_find_first(&quot;.//B&quot;) %&gt;% xml_text()
  8. C &lt;- npnodes %&gt;% xml_find_first(&quot;.//C&quot;) %&gt;% xml_text()
  9. D &lt;- npnodes %&gt;% xml_find_first(&quot;.//D&quot;) %&gt;% xml_text()
  10. E &lt;- npnodes %&gt;% xml_find_first(&quot;.//E&quot;) %&gt;% xml_text()
  11. answer &lt;- data.frame(group,id, A, B, C, D, E)
  12. answer
  13. group id A B C D E
  14. 1 800 ESK FIRST ES GOR 2021 -1000
  15. 2 800 ESK TEST ES RUN 202303 202303
  16. 3 900 ABC CAT ES GO 202303 -500

huangapple
  • 本文由 发表于 2023年2月23日 23:20:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/75546815.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定