August 6, 2020 16:34:19

Run a Java XML parser with a number of Erlang processes

Question

I have a project in a concurrent and distributed programming course.

In this course we use Erlang.

I need to use a database stored in an XML file, which already has a parser written in Java (this is the link for the XML and the parser: https://dblp.org/faq/1474681.html).
The XML file is 2.5 GB, so I understand that the first step is to create a number of Erlang processes, each of which parses a chunk of the XML.

The thing is that this is the first time I'm doing something like this (combining Erlang and Java, and parsing a really big XML file), so I'm not sure how to approach the problem. Should I divide the XML into chunks before I start parsing it? Or somehow set a start and end position for each process that parses the XML?

Just to clarify: the course is about Erlang and using processes in Erlang, so I must use it (I'm sure there are Java multi-threading solutions).

I would really appreciate any ideas or help!
Thanks!

Answer 1

Score: 1

You can do it in Erlang without using Java. You do not need to read the whole file before processing it. You should use an XML parser that supports a streaming API. I recommend fast_xml, which is very fast (it uses C functions to parse XML).
After initializing the stream parser state, read the file chunk by chunk in a loop (a recursive function), for example 1024 bytes per chunk, and pass each chunk to the parser. When the parser finds new XML elements, it sends them to your callback process as Erlang messages. In your callback process you can spawn more processes to work on each XML element.
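
The loop described above could be sketched roughly as follows. This is an untested sketch, not a definitive implementation. It assumes fast_xml is available as a dependency and that the parser delivers events to the callback process wrapped as `{'$gen_event', Event}`, as shown in the fast_xml README. The module name `xml_stream_sketch`, the functions `run/1`, `feed/2`, `callback_loop/1` and `handle_element/1`, and the 64 KB chunk size are all hypothetical names and values chosen for illustration.

```erlang
-module(xml_stream_sketch).
-export([run/1]).

%% Gives us the #xmlel{} record used for parsed elements.
-include_lib("fast_xml/include/fxml.hrl").

%% Arbitrary chunk size; the answer suggests 1024 bytes as an example.
-define(CHUNK, 65536).

%% Parse the file at Path and return {ok, NumberOfTopLevelElements}.
run(Path) ->
    {ok, _} = application:ensure_all_started(fast_xml),
    Parent = self(),
    %% The callback process receives the parser events as messages.
    Callback = spawn_link(fun() ->
                                  Parent ! {done, self(), callback_loop(0)}
                          end),
    Stream0 = fxml_stream:new(Callback),
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    Stream1 = feed(Fd, Stream0),
    ok = file:close(Fd),
    fxml_stream:close(Stream1),
    %% Assumes the document is well formed, so the root's end tag arrives.
    receive
        {done, Callback, Count} -> {ok, Count}
    end.

%% Recursive read loop: hand each chunk to the stream parser.
feed(Fd, Stream) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Bin} -> feed(Fd, fxml_stream:parse(Stream, Bin));
        eof -> Stream
    end.

%% Spawn one worker per child element of the root element.
callback_loop(Count) ->
    receive
        {'$gen_event', {xmlstreamelement, El}} ->
            spawn(fun() -> handle_element(El) end),
            callback_loop(Count + 1);
        {'$gen_event', {xmlstreamend, _RootName}} ->
            Count;
        {'$gen_event', {xmlstreamerror, Reason}} ->
            exit({xml_error, Reason});
        {'$gen_event', _Other} ->
            %% Root start tag, whitespace between elements, and so on.
            callback_loop(Count)
    end.

%% Placeholder for the real per-record work (hypothetical).
handle_element(#xmlel{name = _Name, attrs = _Attrs, children = _Children}) ->
    ok.
```

It would be called as `xml_stream_sketch:run("dblp.xml")`. Two caveats. Spawning one process per element is the simplest way to show the idea, but for a file with millions of records a fixed pool of workers is probably kinder to memory, and because the parser sends events asynchronously the callback process's mailbox can grow if the workers fall behind. Also, the DBLP file refers to a DTD that defines character entities, and whether fast_xml resolves them has not been checked here.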
