How to extract text from very large XML files in Python without interrupting tags while parsing incrementally?

Question

EDIT: After I tried the first answer, I still get text broken at interrupting tags (just as before). For reference, here is the code:

    for event, element in ET.iterparse(path):
        if element.tag == "idsText":
            # move sentences from stack
            if sents:
                cat_texts.extend(sents)
                visited[current_doc] = visited.get(current_doc, 0) + 1
                # another function call
                annotate(cat_texts, filename, current_doc, visited[current_doc])
                # reset cat_texts
                cat_texts = []
            # set new current document's name
            current_doc = element.get("n")
            sents = []
        # new sentence starts
        elif element.tag == "s" and type(element.text) == str:
            if element.text.strip():
                sentence = ' '.join(element.itertext())
                new_sent.extend(nltk.word_tokenize(sentence, language='german'))
                sents.append(new_sent)
                new_sent = []

        element.clear()

I have some very large XML files (> 6 GB) and want to extract text from them. The problem is that the text is interrupted by other tags. Here is a dummy file to give you an idea:

    <doc>
        <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
    </doc>

I want to get this:

Here you can see a reference in the text.

I use this code (though I leave out the specifics here):

    import xml.etree.ElementTree as ET

    for event, element in ET.iterparse(path):
        if element.tag == "doc":
            pass  # do something here
        elif element.tag == "s" and type(element.text) == str:
            if element.text.strip():
                pass  # again do something here

        element.clear()

This code applied to the dummy file would produce this:

Here you can see a

I know that itertext() would produce the output I want:

    import xml.etree.ElementTree as ET

    myxml = '<doc><s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s></doc>'
    tree = ET.fromstring(myxml)
    print(''.join(tree.itertext()))

Output:

Here you can see a reference in the text.

But I can't think of a way to combine this with iterparse() (or any other incremental parsing method), because the file is too large to parse into a tree all at once. As far as I can tell, parsing incrementally means that itertext() won't work, because the following tag (<ref> in this case) hasn't been parsed yet when the <s> element is processed.

Is there a way to get all the text inside an element, with the tags stripped, when parsing incrementally?

Thank you very much!

Answer 1

Score: 1

You can use the itertext method to recursively iterate over all the text content contained in an element. If we rewrite your code like this:

    import xml.etree.ElementTree as ET

    for event, element in ET.iterparse('data.xml'):
        if element.tag == 's':
            print(' '.join(element.itertext()))

Then given your sample input we get the following output:

 Here you can see a  reference  in the text. 
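
A likely reason the code in the question's EDIT still returns truncated text is that `element.clear()` runs on every end event. The `<ref>` element closes before its parent `<s>`, so its text and tail are wiped before `itertext()` is ever called on `<s>`. The following is a minimal sketch of one way around that, under the assumption that sentences live in `<s>` and are grouped under a container element. The names `iter_sentences`, `sentence_tag`, and `container_tag` are illustrative, not part of the original code; for the real files the container would presumably be `idsText`.

```python
import io
import xml.etree.ElementTree as ET

# Dummy input from the question; a real run would pass a file path instead.
xml_data = b"""<doc>
    <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>"""


def iter_sentences(source, sentence_tag="s", container_tag="doc"):
    """Yield the whitespace-normalized text of each sentence element."""
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == sentence_tag:
            # Children such as <ref> have not been cleared, so itertext() sees them.
            text = "".join(element.itertext())
            yield " ".join(text.split())
            element.clear()  # safe now: the sentence text has been extracted
        elif element.tag == container_tag:
            element.clear()  # drop the emptied sentence elements of a finished container


sentences = list(iter_sentences(io.BytesIO(xml_data)))
print(sentences)
```

Clearing only at the sentence and container level keeps memory bounded by the size of one container rather than the whole file. Joining with `""` and then normalizing whitespace also avoids the doubled spaces that `' '.join(...)` produces.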

Answer 2

Score: 0

If you would also like to see the tags:

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

xml = """<doc>
<s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>"""

infile = io.StringIO(xml)


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if len(attrs) != 0:
            print("Encountered a start tag:", tag, attrs)
        else:
            print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)


parser = MyHTMLParser()

for event, element in ET.iterparse(infile, events=("end",)):
    if event == "end" and element.tag == "s":
        serialized = ET.tostring(element).decode("utf-8")
        print(serialized)
        parser.feed(serialized)  # feed() returns None, so its result is not printed
```

Output:

    <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>

    Encountered a start tag: s
    Encountered some data  :  Here you can see a 
    Encountered a start tag: ref [('target', 'SOME_URL'), ('targorder', 'u')]
    Encountered some data  : reference
    Encountered an end tag : ref
    Encountered some data  :  in the text. 
    Encountered an end tag : s
    Encountered some data  : 

The final data event is the newline that follows `</s>`, which ET.tostring() includes as the element's tail. HTMLParser lowercases attribute names, which is why targOrder is reported as targorder.
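
Since the question also asks about other incremental parsing methods, one option is to skip tree building altogether and pass a custom target object to `ET.XMLParser`, which calls the target's `start`, `data`, `end`, and `close` methods as the input is fed in chunks. The sketch below is only an illustration under the assumption that sentences are `<s>` elements; `SentenceCollector`, `extract_sentences`, and `chunk_size` are hypothetical names introduced here.

```python
import io
import xml.etree.ElementTree as ET


class SentenceCollector:
    """Parser target that gathers text inside sentence elements; no tree is built."""

    def __init__(self, sentence_tag="s"):
        self.sentence_tag = sentence_tag
        self.depth = 0       # greater than 0 while inside a sentence element
        self.parts = []      # text fragments of the sentence being read
        self.sentences = []  # completed sentences

    def start(self, tag, attrib):
        # Count nesting so that inline tags such as <ref> stay part of the sentence.
        if tag == self.sentence_tag or self.depth:
            self.depth += 1

    def data(self, text):
        if self.depth:
            self.parts.append(text)

    def end(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.sentences.append(" ".join("".join(self.parts).split()))
                self.parts = []

    def close(self):
        return self.sentences


def extract_sentences(stream, chunk_size=65536):
    """Feed a binary stream to the parser chunk by chunk and return the sentences."""
    parser = ET.XMLParser(target=SentenceCollector())
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()  # returns whatever the target's close() returns


xml_data = b"""<doc>
    <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>"""
result = extract_sentences(io.BytesIO(xml_data))
print(result)
```

For a multi-gigabyte file, collecting every sentence in one list may itself be too large, so the `end` method could instead hand each finished sentence to a callback or write it to disk.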



huangapple
  • Published on May 24, 2023, 22:07:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76324455.html