May 24, 2023, 22:07:25

How to extract text from very large XML files in Python without interrupting tags while parsing incrementally?

Question

EDIT: After I tried the first answer, I still get text broken at interrupting tags (just as before). For reference, here is the code:

    for event, element in ET.iterparse(path):
        if element.tag == "idsText":
            # move sentences from stack
            if sents:
                cat_texts.extend(sents)
                visited[current_doc] = visited.get(current_doc, 0) + 1
                # another function call
                annotate(cat_texts, filename, current_doc, visited[current_doc])
                # reset cat_texts
                cat_texts = []
            # set new current document's name
            current_doc = element.get("n")
            sents = []
        # new sentence starts
        elif element.tag == "s" and type(element.text) == str:
            if element.text.strip():
                sentence = ' '.join(element.itertext())
                new_sent.extend(nltk.word_tokenize(sentence, language='german'))
                sents.append(new_sent)
                new_sent = []

        element.clear()

I have some very large XML files (> 6 GB) and want to extract text from them. The problem is that the text is interrupted by other tags. Here is a dummy file to give you an idea:

<doc>
    <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>

I want to get this:

Here you can see a reference in the text.

I use this code (though I leave out the specifics here):

import xml.etree.ElementTree as ET
for event, element in ET.iterparse(path):
    if element.tag == "doc":
        pass  # do something here

    elif element.tag == "s" and type(element.text) == str:
        if element.text.strip():
            pass  # again do something here

    element.clear()

This code applied to the dummy file would produce this:

Here you can see a

I know that itertext() would produce the output I want:

import xml.etree.ElementTree as ET
myxml = '<doc><s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s></doc>'
tree = ET.fromstring(myxml)
print(''.join(tree.itertext()))

Output:

Here you can see a reference in the text.

But I can't think of a way to combine this with iterparse() (or any other incremental parsing method), because the XML is too large to parse into a tree all at once. And doing it incrementally means that itertext() won't work, because the following tag (<ref> in this case) isn't parsed yet when the element with the tag <s> is parsed.

Is there a way to get all the text inside an element, with the tags stripped, when parsing incrementally?

Thank you very much!

Answer 1

Score: 1

You can use the itertext method to recursively iterate over all the text content contained in an element. If we rewrite your code like this:

import xml.etree.ElementTree as ET
for event, element in ET.iterparse('data.xml'):
    if element.tag == 's':
        print(' '.join(element.itertext()))

Then given your sample input we get the following output:

 Here you can see a  reference  in the text.
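
A likely reason the code in the question's EDIT still truncates the sentence is the unconditional element.clear() at the bottom of the loop. By default iterparse() reports only "end" events, and the end of <ref> arrives before the end of <s>. Element.clear() resets text and tail to None, so by the time <s> is handled its child has already lost both "reference" and " in the text. ". The sketch below reproduces that on the dummy data and shows one possible fix: clear an element only after its sentence has been read. The names SAMPLE, texts_clearing_everything and iter_sentence_texts are made up for this illustration and are not part of the question's code.

```python
import io
import xml.etree.ElementTree as ET

# The dummy document from the question, as bytes so it can be re-read.
SAMPLE = (
    b'<doc><s> Here you can see a '
    b'<ref target="SOME_URL" targOrder="u">reference</ref>'
    b' in the text. </s></doc>'
)


def texts_clearing_everything(source):
    """Reproduce the truncation: every element is cleared at its end event."""
    texts = []
    for _, element in ET.iterparse(source, events=("end",)):
        if element.tag == "s":
            texts.append("".join(element.itertext()))
        # <ref> ends before <s>, so its text and tail are already gone above.
        element.clear()
    return texts


def iter_sentence_texts(source, sentence_tag="s"):
    """Yield the whitespace-normalized text of each sentence element."""
    for _, element in ET.iterparse(source, events=("end",)):
        if element.tag == sentence_tag:
            # Children are still intact, so itertext() sees the whole sentence.
            yield " ".join("".join(element.itertext()).split())
            # Release the finished sentence and its children.
            element.clear()


broken = texts_clearing_everything(io.BytesIO(SAMPLE))
fixed = list(iter_sentence_texts(io.BytesIO(SAMPLE)))
print(broken)  # [' Here you can see a ']
print(fixed)   # ['Here you can see a reference in the text.']
```

Joining with an empty string and then normalizing whitespace avoids the doubled spaces that ' '.join() produces. The cleared <s> elements stay attached to their parent as empty shells, so for multi-gigabyte input it may also be worth clearing the enclosing document element once its own end event arrives.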

Answer 2

Score: 0

<details>
<summary>English:</summary>

If you would also like to see the tags:

import xml.etree.ElementTree as ET
import io
from html.parser import HTMLParser

xml = """<doc>
<s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>"""

infile = io.StringIO(xml)

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if len(attrs) != 0:
            print("Encountered a start tag:", tag, attrs)
        else:
            print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

# Only "end" events are requested, so each <s> is complete when it arrives.
for event, element in ET.iterparse(infile, events=("end",)):
    if event == "end" and element.tag == 's':
        print(ET.tostring(element).decode("utf-8"))
        # feed() returns None, so this print also emits a trailing "None".
        print(parser.feed(ET.tostring(element).decode("utf-8")))

Output:

<s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>

Encountered a start tag: s
Encountered some data : Here you can see a
Encountered a start tag: ref [('target', 'SOME_URL'), ('targorder', 'u')]
Encountered some data : reference
Encountered an end tag : ref
Encountered some data : in the text.
Encountered an end tag : s
Encountered some data : None


</details>
