更快的将大型嵌套XML转换为R数据框的方法

huangapple go评论80阅读模式
英文:

Faster way to put large nested xml into R dataframe

问题

我有一个XML文件,它的嵌套结构与我见过的其他示例不同。以下是它的格式。我对np节点中的数据感兴趣,但我也想在每一行中获取组(group)和ID信息。

我有下面的一个工作解决方案,但实际文件非常庞大,包含数千个节点。这个解决方案在下面的示例代码上运行良好,但在大文件上运行需要数小时。

我的问题是 - 有没有更快的方法来获取我想要的数据框(dataframe)?

<File>
  <Time>
    <date>20220301</date>
    <Name>1</Name>
    <folder>
      <group>800</group>
      <ID>ESK</ID>
      <Type>S</Type>
      <Customer>1</Customer>
      <currency>USD</currency>
      <Port>
        <ec>X</ec>
        <np>
          <A>FIRST</A>
          <B>ES</B>
          <C>GOR</C>
          <D>2021</D>
          <E>-1000</E>
        </np>
        <np>
          <A>TEST</A>
          <B>ES</B>
          <C>RUN</C>
          <D>202303</D>
          <E>202303</E>
          <F>C</F>
          <G>3200</G>
          <H>32</H>
        </np>
      </Port>
    </folder>
    <folder>
      <group>900</group>
      <ID>ABC</ID>
      <Type>D</Type>
      <Customer>1</Customer>
      <currency>USD</currency>
      <Port>
        <ec>X</ec>
        <np>
          <A>CAT</A>
          <B>ES</B>
          <C>GO</C>
          <D>202303</D>
          <E>-500</E>
        </np>
      </Port>
    </folder>
  </Time>
</File>

这是我当前的工作解决方案。这个解决方案在小型XML文件上运行良好,但对于具有这种XML格式的大型文件来说,它运行得太慢,需要数小时。

URL <- 'H:/testSO.xml'
doc <- read_xml(URL)

df <- xml_find_all(doc, ".//np") %>%
  map_df( function(x) {
    set_names( c(  
      xml_find_all( x, "./ancestor::folder/group") %>% xml_text(),
      xml_find_all( x, "./ancestor::folder/ID") %>% xml_text(),
      xml_find_all( x, ".//A") %>% xml_text(),
      xml_find_all( x, ".//B") %>% xml_text(),
      xml_find_all( x, ".//C") %>% xml_text(),
      xml_find_all( x, ".//D") %>% xml_text(),
      xml_find_all( x, ".//E") %>% xml_text()), 
      c( "group","id", "A", "B", "C","D","E") ) %>%
      as.list() %>%
      flatten_df() 
  }) %>%
  type_convert() 

head(df)

谢谢!

英文:

I have a XML file that has nesting that is different from other examples I have seen. Here's the format it takes. I am interested in the data in the np nodes, but I also want the group and ID info in each row.

I have a working solution below, but the actual file is very large with thousands of <folder> nodes. The solution takes hours and hours to run on the large file, although it works on the sample code below.

My question is - is there a faster way to get the dataframe I want?

&lt;File&gt;
  &lt;Time&gt;
    &lt;date&gt;20220301&lt;/date&gt;
    &lt;Name&gt;1&lt;/Name&gt;
&lt;folder&gt;
      &lt;group&gt;800&lt;/group&gt;
      &lt;ID&gt;ESK&lt;/ID&gt;
      &lt;Type&gt;S&lt;/Type&gt;
      &lt;Customer&gt;1&lt;/Customer&gt;
      &lt;currency&gt;USD&lt;/currency&gt;
      &lt;Port&gt;
        &lt;ec&gt;X&lt;/ec&gt;
        &lt;np&gt;
          &lt;A&gt;FIRST&lt;/A&gt;
          &lt;B&gt;ES&lt;/B&gt;
          &lt;C&gt;GOR&lt;/C&gt;
          &lt;D&gt;2021&lt;/D&gt;
          &lt;E&gt;-1000&lt;/E&gt;
        &lt;/np&gt;
        &lt;np&gt;
          &lt;A&gt;TEST&lt;/A&gt;
          &lt;B&gt;ES&lt;/B&gt;
          &lt;C&gt;RUN&lt;/C&gt;
          &lt;D&gt;202303&lt;/D&gt;
          &lt;E&gt;202303&lt;/E&gt;
          &lt;F&gt;C&lt;/F&gt;
          &lt;G&gt;3200&lt;/G&gt;
          &lt;H&gt;32&lt;/H&gt;
        &lt;/np&gt;
      &lt;/Port&gt;
&lt;/folder&gt;
&lt;folder&gt;
      &lt;group&gt;900&lt;/group&gt;
      &lt;ID&gt;ABC&lt;/ID&gt;
      &lt;Type&gt;D&lt;/Type&gt;
      &lt;Customer&gt;1&lt;/Customer&gt;
      &lt;currency&gt;USD&lt;/currency&gt;
      &lt;Port&gt;
        &lt;ec&gt;X&lt;/ec&gt;
        &lt;np&gt;
          &lt;A&gt;CAT&lt;/A&gt;
          &lt;B&gt;ES&lt;/B&gt;
          &lt;C&gt;GO&lt;/C&gt;
          &lt;D&gt;202303&lt;/D&gt;
          &lt;E&gt;-500&lt;/E&gt;
        &lt;/np&gt;
       &lt;/Port&gt;
&lt;/folder&gt;
&lt;/Time&gt;
&lt;/File&gt;

Here is my current working solution. This works on small xml files, but for large files with that xml format it is too slow and takes hours to run.


URL &lt;- &#39;H:/testSO.xml&#39;
doc &lt;- read_xml(URL)

df &lt;-
  xml_find_all(doc, &quot;.//np&quot;) %&gt;%
  map_df( function(x) {
    set_names( c(  
      xml_find_all( x, &quot;./ancestor::folder/group&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;./ancestor::folder/ID&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//A&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//B&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//C&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//D&quot;) %&gt;% xml_text(),
      xml_find_all( x, &quot;.//E&quot;) %&gt;% xml_text()), 
      #set the column names
      c( &quot;group&quot;,&quot;id&quot;, &quot;A&quot;, &quot;B&quot;, &quot;C&quot;,&quot;D&quot;,&quot;E&quot;) ) %&gt;% 
      as.list() %&gt;% #make list
      flatten_df() 
  }) %&gt;%
  type_convert() 

head(df)

Thanks!

答案1

得分: 1

您可以利用xml2库的向量化功能,避免使用map_df循环,这应该会显著提高速度。

此例程查找所有的np节点,将所请求的信息提取到一系列向量中,然后使用这些结果创建一个数据框。

library(xml2)
library(dplyr)
npnodes <- xml_find_all(doc, ".//np") 

group <- npnodes %>% xml_find_first( "./ancestor::folder/group") %>% xml_text()
id <- npnodes %>% xml_find_first("./ancestor::folder/ID") %>% xml_text()
A <- npnodes %>% xml_find_first(".//A") %>% xml_text()
B <- npnodes %>% xml_find_first(".//B") %>% xml_text()
C <- npnodes %>% xml_find_first(".//C") %>% xml_text()
D <- npnodes %>% xml_find_first(".//D") %>% xml_text()
E <- npnodes %>% xml_find_first(".//E") %>% xml_text() 

answer <- data.frame(group, id, A, B, C, D, E)
answer
  group  id     A  B   C      D      E
1   800 ESK FIRST ES GOR   2021  -1000
2   800 ESK  TEST ES RUN 202303 202303
3   900 ABC   CAT ES  GO 202303   -500

希望这个翻译对您有帮助。

英文:

You can take advantage of the vectorization of the xml2 library and avoid the map_df loop, this should provide a significant improvement in speed.

This routine finds all of the np nodes, extracts the requested information into a series of vectors, and then creates a data frame with the results.

library(xml2)
library(dplyr)
npnodes &lt;- xml_find_all(doc, &quot;.//np&quot;) 
   
group &lt;- npnodes %&gt;% xml_find_first( &quot;./ancestor::folder/group&quot;) %&gt;% xml_text()
id &lt;- npnodes %&gt;% xml_find_first(&quot;./ancestor::folder/ID&quot;) %&gt;% xml_text()
A &lt;- npnodes %&gt;% xml_find_first(&quot;.//A&quot;) %&gt;% xml_text()
B &lt;- npnodes %&gt;% xml_find_first(&quot;.//B&quot;) %&gt;% xml_text()
C &lt;- npnodes %&gt;% xml_find_first(&quot;.//C&quot;) %&gt;% xml_text()
D &lt;- npnodes %&gt;% xml_find_first(&quot;.//D&quot;) %&gt;% xml_text()
E &lt;- npnodes %&gt;% xml_find_first(&quot;.//E&quot;) %&gt;% xml_text() 

answer &lt;- data.frame(group,id, A, B, C, D, E)
answer

  group  id     A  B   C      D      E
1   800 ESK FIRST ES GOR   2021  -1000
2   800 ESK  TEST ES RUN 202303 202303
3   900 ABC   CAT ES  GO 202303   -500

huangapple
  • 本文由 发表于 2023年2月23日 23:20:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/75546815.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定