Run a Java XML parser with a number of Erlang processes
Question
I have a project in a concurrent and distributed programming course.
In this course we use Erlang.
I need to use a database stored in an XML file, which already has a parser written in Java (the XML and the parser are linked here: https://dblp.org/faq/1474681.html).
The XML file is 2.5 GB, so I understand that the first step is to create a number of Erlang processes that will parse the XML, with each process parsing one chunk of it.
This is the first time I'm doing something like this (combining Erlang and Java, and parsing a really big XML file), so I'm not sure how to approach the problem. Should I divide the XML into chunks before I start parsing it? Or should I somehow set a start and end position for each process that parses the XML?
Just to clarify: the course is about Erlang and using processes in Erlang, so I must use it (I'm sure there are Java multi-threading solutions).
I would really appreciate any ideas or help!
Thanks!
Answer 1
Score: 1
You can do it in Erlang without using Java. You do not need to read the whole file before processing it. You should use an XML parser that supports a streaming API. I recommend fast_xml, which is very fast (it uses C functions to parse XML).
After initializing the stream parser state, read the file chunk by chunk in a loop (a recursive function), for example 1024 bytes per chunk, and pass each chunk to the parser. When the parser finds new XML elements, it sends them to your callback process as Erlang messages. In your callback process you can spawn more processes to work on each XML element.
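The chunk-by-chunk loop and the callback process described above could look roughly like the sketch below. It uses only the Erlang standard library. The module name `chunk_feed`, the `FeedFun` argument and the `{element, El}` message shape are all made up for illustration: `FeedFun` stands in for the streaming parser's "parse this chunk" call (in fast_xml that is, as far as I know, `fxml_stream:parse/2` on a state created with `fxml_stream:new/1`, but check the library's documentation for the exact calls and message formats).

```erlang
-module(chunk_feed).
-export([feed_file/4, callback_loop/1]).

%% Read Path in pieces of at most ChunkSize bytes and hand each piece to
%% FeedFun(State, Chunk), which returns the next state.
%% FeedFun is a placeholder for a streaming parser's parse call.
feed_file(Path, ChunkSize, FeedFun, State0) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, ChunkSize, FeedFun, State0)
    after
        file:close(Fd)
    end.

%% Recursive read loop: only one chunk is held in memory at a time.
loop(Fd, ChunkSize, FeedFun, State) ->
    case file:read(Fd, ChunkSize) of
        {ok, Chunk} ->
            loop(Fd, ChunkSize, FeedFun, FeedFun(State, Chunk));
        eof ->
            State
    end.

%% Callback process body: for every parsed element it receives, spawn a
%% worker process that runs HandleFun on that element.
%% The {element, El} and stop messages are hypothetical; a real parser
%% library defines its own message format.
callback_loop(HandleFun) ->
    receive
        {element, El} ->
            spawn(fun() -> HandleFun(El) end),
            callback_loop(HandleFun);
        stop ->
            ok
    end.
```

To try it without a parser, you can pass a `FeedFun` that just counts bytes, for example `chunk_feed:feed_file("dblp.xml", 1024, fun(N, Chunk) -> N + byte_size(Chunk) end, 0)`. Note that spawning one process per element is unbounded; for a 2.5 GB file you may prefer to hand elements to a fixed pool of workers instead.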