英文:
Faster way to put large nested xml into R dataframe
问题
我有一个XML文件,它的嵌套结构与我见过的其他示例不同。以下是它的格式。我对np节点中的数据感兴趣,但我也想在每一行中获取组(group)和ID信息。
我有下面的一个工作解决方案,但实际文件非常庞大,包含数千个
我的问题是 - 有没有更快的方法来获取我想要的数据框(dataframe)?
<File>
<Time>
<date>20220301</date>
<Name>1</Name>
<folder>
<group>800</group>
<ID>ESK</ID>
<Type>S</Type>
<Customer>1</Customer>
<currency>USD</currency>
<Port>
<ec>X</ec>
<np>
<A>FIRST</A>
<B>ES</B>
<C>GOR</C>
<D>2021</D>
<E>-1000</E>
</np>
<np>
<A>TEST</A>
<B>ES</B>
<C>RUN</C>
<D>202303</D>
<E>202303</E>
<F>C</F>
<G>3200</G>
<H>32</H>
</np>
</Port>
</folder>
<folder>
<group>900</group>
<ID>ABC</ID>
<Type>D</Type>
<Customer>1</Customer>
<currency>USD</currency>
<Port>
<ec>X</ec>
<np>
<A>CAT</A>
<B>ES</B>
<C>GO</C>
<D>202303</D>
<E>-500</E>
</np>
</Port>
</folder>
</Time>
</File>
这是我当前的工作解决方案。这个解决方案在小型XML文件上运行良好,但对于具有这种XML格式的大型文件来说,它运行得太慢,需要数小时。
URL <- 'H:/testSO.xml'
doc <- read_xml(URL)
df <- xml_find_all(doc, ".//np") %>%
map_df( function(x) {
set_names( c(
xml_find_all( x, "./ancestor::folder/group") %>% xml_text(),
xml_find_all( x, "./ancestor::folder/ID") %>% xml_text(),
xml_find_all( x, ".//A") %>% xml_text(),
xml_find_all( x, ".//B") %>% xml_text(),
xml_find_all( x, ".//C") %>% xml_text(),
xml_find_all( x, ".//D") %>% xml_text(),
xml_find_all( x, ".//E") %>% xml_text()),
c( "group","id", "A", "B", "C","D","E") ) %>%
as.list() %>%
flatten_df()
}) %>%
type_convert()
head(df)
谢谢!
英文:
I have a XML file that has nesting that is different from other examples I have seen. Here's the format it takes. I am interested in the data in the np nodes, but I also want the group and ID info in each row.
I have a working solution below, but the actual file is very large with thousands of <folder> nodes. The solution takes hours and hours to run on the large file, although it works on the sample code below.
My question is - is there a faster way to get the dataframe I want?
<File>
<Time>
<date>20220301</date>
<Name>1</Name>
<folder>
<group>800</group>
<ID>ESK</ID>
<Type>S</Type>
<Customer>1</Customer>
<currency>USD</currency>
<Port>
<ec>X</ec>
<np>
<A>FIRST</A>
<B>ES</B>
<C>GOR</C>
<D>2021</D>
<E>-1000</E>
</np>
<np>
<A>TEST</A>
<B>ES</B>
<C>RUN</C>
<D>202303</D>
<E>202303</E>
<F>C</F>
<G>3200</G>
<H>32</H>
</np>
</Port>
</folder>
<folder>
<group>900</group>
<ID>ABC</ID>
<Type>D</Type>
<Customer>1</Customer>
<currency>USD</currency>
<Port>
<ec>X</ec>
<np>
<A>CAT</A>
<B>ES</B>
<C>GO</C>
<D>202303</D>
<E>-500</E>
</np>
</Port>
</folder>
</Time>
</File>
Here is my current working solution. This works on small xml files, but for large files with that xml format it is too slow and takes hours to run.
URL <- 'H:/testSO.xml'
doc <- read_xml(URL)
df <-
xml_find_all(doc, ".//np") %>%
map_df( function(x) {
set_names( c(
xml_find_all( x, "./ancestor::folder/group") %>% xml_text(),
xml_find_all( x, "./ancestor::folder/ID") %>% xml_text(),
xml_find_all( x, ".//A") %>% xml_text(),
xml_find_all( x, ".//B") %>% xml_text(),
xml_find_all( x, ".//C") %>% xml_text(),
xml_find_all( x, ".//D") %>% xml_text(),
xml_find_all( x, ".//E") %>% xml_text()),
#set the column names
c( "group","id", "A", "B", "C","D","E") ) %>%
as.list() %>% #make list
flatten_df()
}) %>%
type_convert()
head(df)
Thanks!
答案1
得分: 1
您可以利用xml2库的向量化功能,避免使用map_df
循环,这应该会显著提高速度。
此例程查找所有的np节点,将所请求的信息提取到一系列向量中,然后使用这些结果创建一个数据框。
library(xml2)
library(dplyr)
npnodes <- xml_find_all(doc, ".//np")
group <- npnodes %>% xml_find_first( "./ancestor::folder/group") %>% xml_text()
id <- npnodes %>% xml_find_first("./ancestor::folder/ID") %>% xml_text()
A <- npnodes %>% xml_find_first(".//A") %>% xml_text()
B <- npnodes %>% xml_find_first(".//B") %>% xml_text()
C <- npnodes %>% xml_find_first(".//C") %>% xml_text()
D <- npnodes %>% xml_find_first(".//D") %>% xml_text()
E <- npnodes %>% xml_find_first(".//E") %>% xml_text()
answer <- data.frame(group, id, A, B, C, D, E)
answer
group id A B C D E
1 800 ESK FIRST ES GOR 2021 -1000
2 800 ESK TEST ES RUN 202303 202303
3 900 ABC CAT ES GO 202303 -500
希望这个翻译对您有帮助。
英文:
You can take advantage of the vectorization of the xml2 library and avoid the map_df
loop, this should provide a significant improvement in speed.
This routine finds all of the np nodes, extracts the requested information into a series of vectors, and then creates a data frame with the results.
library(xml2)
library(dplyr)
npnodes <- xml_find_all(doc, ".//np")
group <- npnodes %>% xml_find_first( "./ancestor::folder/group") %>% xml_text()
id <- npnodes %>% xml_find_first("./ancestor::folder/ID") %>% xml_text()
A <- npnodes %>% xml_find_first(".//A") %>% xml_text()
B <- npnodes %>% xml_find_first(".//B") %>% xml_text()
C <- npnodes %>% xml_find_first(".//C") %>% xml_text()
D <- npnodes %>% xml_find_first(".//D") %>% xml_text()
E <- npnodes %>% xml_find_first(".//E") %>% xml_text()
answer <- data.frame(group,id, A, B, C, D, E)
answer
group id A B C D E
1 800 ESK FIRST ES GOR 2021 -1000
2 800 ESK TEST ES RUN 202303 202303
3 900 ABC CAT ES GO 202303 -500
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论