2023年2月23日 23:20:24go评论99阅读模式

英文:

Faster way to put large nested xml into R dataframe

问题

我有一个XML文件，它的嵌套结构与我见过的其他示例不同。以下是它的格式。我对np节点中的数据感兴趣，但我也想在每一行中获取组(group)和ID信息。

我有下面的一个工作解决方案，但实际文件非常庞大，包含数千个节点。这个解决方案在下面的示例代码上运行良好，但在大文件上运行需要数小时。

我的问题是 - 有没有更快的方法来获取我想要的数据框(dataframe)？

<File>
  <Time>
    <date>20220301</date>
    <Name>1</Name>
    <folder>
      <group>800</group>
      <ID>ESK</ID>
      <Type>S</Type>
      <Customer>1</Customer>
      <currency>USD</currency>
      <Port>
        <ec>X</ec>
        <np>
          <A>FIRST</A>
          <B>ES</B>
          <C>GOR</C>
          <D>2021</D>
          <E>-1000</E>
        </np>
        <np>
          <A>TEST</A>
          <B>ES</B>
          <C>RUN</C>
          <D>202303</D>
          <E>202303</E>
          <F>C</F>
          <G>3200</G>
          <H>32</H>
        </np>
      </Port>
    </folder>
    <folder>
      <group>900</group>
      <ID>ABC</ID>
      <Type>D</Type>
      <Customer>1</Customer>
      <currency>USD</currency>
      <Port>
        <ec>X</ec>
        <np>
          <A>CAT</A>
          <B>ES</B>
          <C>GO</C>
          <D>202303</D>
          <E>-500</E>
        </np>
      </Port>
    </folder>
  </Time>
</File>

这是我当前的工作解决方案。这个解决方案在小型XML文件上运行良好，但对于具有这种XML格式的大型文件来说，它运行得太慢，需要数小时。

URL <- 'H:/testSO.xml'
doc <- read_xml(URL)
df <- xml_find_all(doc, ".//np") %>%
  map_df( function(x) {
    set_names( c(  
      xml_find_all( x, "./ancestor::folder/group") %>% xml_text(),
      xml_find_all( x, "./ancestor::folder/ID") %>% xml_text(),
      xml_find_all( x, ".//A") %>% xml_text(),
      xml_find_all( x, ".//B") %>% xml_text(),
      xml_find_all( x, ".//C") %>% xml_text(),
      xml_find_all( x, ".//D") %>% xml_text(),
      xml_find_all( x, ".//E") %>% xml_text()), 
      c( "group","id", "A", "B", "C","D","E") ) %>%
      as.list() %>%
      flatten_df() 
  }) %>%
  type_convert() 
head(df)

谢谢！

英文:

I have a XML file that has nesting that is different from other examples I have seen. Here's the format it takes. I am interested in the data in the np nodes, but I also want the group and ID info in each row.

I have a working solution below, but the actual file is very large with thousands of <folder> nodes. The solution takes hours and hours to run on the large file, although it works on the sample code below.

My question is - is there a faster way to get the dataframe I want?

&lt;File&gt;
  &lt;Time&gt;
    &lt;date&gt;20220301&lt;/date&gt;
    &lt;Name&gt;1&lt;/Name&gt;
&lt;folder&gt;
      &lt;group&gt;800&lt;/group&gt;
      &lt;ID&gt;ESK&lt;/ID&gt;
      &lt;Type&gt;S&lt;/Type&gt;
      &lt;Customer&gt;1&lt;/Customer&gt;
      &lt;currency&gt;USD&lt;/currency&gt;
      &lt;Port&gt;
        &lt;ec&gt;X&lt;/ec&gt;
        &lt;np&gt;
          &lt;A&gt;FIRST&lt;/A&gt;
          &lt;B&gt;ES&lt;/B&gt;
          &lt;C&gt;GOR&lt;/C&gt;
          &lt;D&gt;2021&lt;/D&gt;
          &lt;E&gt;-1000&lt;/E&gt;
        &lt;/np&gt;
        &lt;np&gt;
          &lt;A&gt;TEST&lt;/A&gt;
          &lt;B&gt;ES&lt;/B&gt;
          &lt;C&gt;RUN&lt;/C&gt;
          &lt;D&gt;202303&lt;/D&gt;
          &lt;E&gt;202303&lt;/E&gt;
          &lt;F&gt;C&lt;/F&gt;
          &lt;G&gt;3200&lt;/G&gt;
          &lt;H&gt;32&lt;/H&gt;
        &lt;/np&gt;
      &lt;/Port&gt;
&lt;/folder&gt;
&lt;folder&gt;
      &lt;group&gt;900&lt;/group&gt;
      &lt;ID&gt;ABC&lt;/ID&gt;
      &lt;Type&gt;D&lt;/Type&gt;
      &lt;Customer&gt;1&lt;/Customer&gt;
      &lt;currency&gt;USD&lt;/currency&gt;
      &lt;Port&gt;
        &lt;ec&gt;X&lt;/ec&gt;
        &lt;np&gt;
          &lt;A&gt;CAT&lt;/A&gt;
          &lt;B&gt;ES&lt;/B&gt;
          &lt;C&gt;GO&lt;/C&gt;
          &lt;D&gt;202303&lt;/D&gt;
          &lt;E&gt;-500&lt;/E&gt;
        &lt;/np&gt;
       &lt;/Port&gt;
&lt;/folder&gt;
&lt;/Time&gt;
&lt;/File&gt;

Here is my current working solution. This works on small xml files, but for large files with that xml format it is too slow and takes hours to run.


URL &lt;- &#39;H:/testSO.xml&#39;
doc &lt;- read_xml(URL)
df &lt;-
  xml_find_all(doc, &quot;.//np&quot;) %&gt;%
  map_df( function(x) {
    set_names( c(  
      xml_find_all( x, &quot;./ancestor::folder/group&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;./ancestor::folder/ID&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//A&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//B&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//C&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//D&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//E&quot;) %&gt;% xml_text()), 
      #set the column names
      c( &quot;group&quot;,&quot;id&quot;, &quot;A&quot;, &quot;B&quot;, &quot;C&quot;,&quot;D&quot;,&quot;E&quot;) ) %&gt;% 
      as.list() %&gt;% #make list
      flatten_df() 
  }) %&gt;%
  type_convert() 
head(df)

Thanks!

答案1

得分: 1

您可以利用xml2库的向量化功能，避免使用map_df循环，这应该会显著提高速度。

此例程查找所有的np节点，将所请求的信息提取到一系列向量中，然后使用这些结果创建一个数据框。

library(xml2)
library(dplyr)
npnodes <- xml_find_all(doc, ".//np") 
group <- npnodes %>% xml_find_first( "./ancestor::folder/group") %>% xml_text()
id <- npnodes %>% xml_find_first("./ancestor::folder/ID") %>% xml_text()
A <- npnodes %>% xml_find_first(".//A") %>% xml_text()
B <- npnodes %>% xml_find_first(".//B") %>% xml_text()
C <- npnodes %>% xml_find_first(".//C") %>% xml_text()
D <- npnodes %>% xml_find_first(".//D") %>% xml_text()
E <- npnodes %>% xml_find_first(".//E") %>% xml_text() 
answer <- data.frame(group, id, A, B, C, D, E)
answer

  group  id     A  B   C      D      E
1   800 ESK FIRST ES GOR   2021  -1000
2   800 ESK  TEST ES RUN 202303 202303
3   900 ABC   CAT ES  GO 202303   -500

希望这个翻译对您有帮助。

英文:

You can take advantage of the vectorization of the xml2 library and avoid the map_df loop, this should provide a significant improvement in speed.

This routine finds all of the np nodes, extracts the requested information into a series of vectors, and then creates a data frame with the results.

library(xml2)
library(dplyr)
npnodes &lt;- xml_find_all(doc, &quot;.//np&quot;) 
   
group &lt;- npnodes %&gt;% xml_find_first( &quot;./ancestor::folder/group&quot;) %&gt;% xml_text()
id &lt;- npnodes %&gt;% xml_find_first(&quot;./ancestor::folder/ID&quot;) %&gt;% xml_text()
A &lt;- npnodes %&gt;% xml_find_first(&quot;.//A&quot;) %&gt;% xml_text()
B &lt;- npnodes %&gt;% xml_find_first(&quot;.//B&quot;) %&gt;% xml_text()
C &lt;- npnodes %&gt;% xml_find_first(&quot;.//C&quot;) %&gt;% xml_text()
D &lt;- npnodes %&gt;% xml_find_first(&quot;.//D&quot;) %&gt;% xml_text()
E &lt;- npnodes %&gt;% xml_find_first(&quot;.//E&quot;) %&gt;% xml_text() 
answer &lt;- data.frame(group,id, A, B, C, D, E)
answer
  group  id     A  B   C      D      E
1   800 ESK FIRST ES GOR   2021  -1000
2   800 ESK  TEST ES RUN 202303 202303
3   900 ABC   CAT ES  GO 202303   -500

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

更快的将大型嵌套XML转换为R数据框的方法

问题

答案1

移除由R中的geom_smooth引起的趋势线。

A*路径搜索器（Java）在使用1024维多维数组时效率低下。

无法在Java中解析DatatypeConverter类。

使用提供的数据集，在R中如何创建一个分组条形图可视化。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。