Parsing an XML file with Python using root.iter does not list text

Question

I am trying to parse an XML file with Python. I would like to extract the text that occurs between specified XML tags.

The code I am running is:

import xml.etree.ElementTree as ET
tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.iter('w'):
    print(w.text)

The XML file is as follows. It is a complex file with quite a loose structure, combining elements of sequence and hierarchy (I have simplified it for the purposes of this query), but there clearly is a "w" tag, which should be picked up by the code.

Thanks.

<?xml version="1.0" encoding="UTF-8"?>

<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://www.talkbank.org/ns/talkbank"
      xsi:schemaLocation="http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd"
      Media="020012" Mediatypes="audio"
      DesignType="long"
      ActivityType="toyplay"
      GroupType="TD"
      PID="11312/c-00018213-1"
      Version="2.20.0"
      Lang="eng"
      Options="bullets"
      Corpus="xxxx"
      Date="xxxx-xx-xx"
      >
  <Participants>
    <participant
      id="MOT"
      name="Mother"
      role="Mother"
      language="eng"
      sex="female"
    />
  </Participants>
  <comment type="Date">15-APR-1999</comment>
  <u who="INV" uID="u0">
    <w untranscribed="untranscribed">www</w>
    <t type="p"></t>
    <media
      start="7.639"
      end="9.648"
      unit="s"
    />
    <a type="addressee">MOT</a>
  </u>
  <u who="MOT" uID="u1">
    <w untranscribed="untranscribed">www</w>
    <t type="p"></t>
    <media
      start="7.640"
      end="9.455"
      unit="s"
    />
    <a type="addressee">INV</a>
  </u>
  <u who="CHI" uID="u2">
    <w untranscribed="unintelligible">xxx</w>
    <w formType="family-specific">choo_choos<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw><gra type="gra" index="1" head="0" relation="INCROOT"/></mor></w>
    <t type="p"><mor type="mor"><mt type="p"/><gra type="gra" index="2" head="1" relation="PUNCT"/></mor></t>
    <postcode>I</postcode>
    <media
      start="10.987"
      end="12.973"
      unit="s"
    />
    <a type="comments">looking at pictures of trains</a>
  </u>

</CHAT>

Answer 1

Score: 1

I think you have to prepend the namespace to the tag name. The root element declares a default namespace (xmlns="http://www.talkbank.org/ns/talkbank"), so ElementTree stores every tag as {namespace}name, and a bare 'w' matches nothing:

for w in root.iter("{http://www.talkbank.org/ns/talkbank}w"):
    print(w.text)

You might want to check out this question for more on similar namespace problems:

https://stackoverflow.com/questions/14853243/parsing-xml-with-namespace-in-python-via-elementtree
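
To make the cause visible, here is a minimal sketch that parses a trimmed-down stand-in for the question's file from a string (the sample XML and the variable names are assumptions for illustration, not the original data). It prints the root tag, then compares a bare tag name with the namespace-qualified one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the file in the question.
XML = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="INV" uID="u0">
    <w untranscribed="untranscribed">www</w>
  </u>
</CHAT>"""

root = ET.fromstring(XML)

# The default namespace becomes part of every element's tag name.
print(root.tag)

# A bare tag name is compared against the full '{uri}w' tag, so nothing matches.
bare = [w.text for w in root.iter("w")]

# The fully qualified name matches as expected.
TALKBANK = "{http://www.talkbank.org/ns/talkbank}"
qualified = [w.text for w in root.iter(TALKBANK + "w")]

print(bare)       # empty list
print(qualified)  # one entry: 'www'
```

Printing root.tag is a quick way to check whether a document uses a default namespace before writing any queries against it.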

Answer 2

Score: 1

You can also define a prefix-to-namespace mapping once, reuse it in later queries, and search with iterfind:

NS = {'ww': 'http://www.talkbank.org/ns/talkbank'}
for w in root.iterfind('.//ww:w', NS):
    print(w.text)

The result would be:

www
www
xxx
choo_choos
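
As a possible extension of the mapping approach, the sketch below reads the namespace URI from the root tag instead of hard-coding it, then groups words by speaker. The helper namespace_of, the prefix tb, and the inline sample XML are hypothetical names and data chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same shape as the question's file.
XML = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="MOT" uID="u1"><w>www</w></u>
  <u who="CHI" uID="u2"><w>xxx</w><w>choo_choos</w></u>
</CHAT>"""

root = ET.fromstring(XML)


def namespace_of(element):
    """Return the namespace URI from a '{uri}local' tag, or '' if there is none."""
    tag = element.tag
    return tag[1:tag.index("}")] if tag.startswith("{") else ""


# Build the prefix map from the document itself rather than hard-coding the URI.
NS = {"tb": namespace_of(root)}

# Unprefixed attributes such as 'who' are not in any namespace,
# so they are read with a plain name.
words_by_speaker = {
    u.get("who"): [w.text for w in u.findall("tb:w", NS)]
    for u in root.iterfind(".//tb:u", NS)
}

print(words_by_speaker)
```

The prefix in the mapping is local to the Python code; it does not need to match any prefix used in the XML file.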

Answer 3

Score: 1

Your XML has namespaces, and one of the <w> tags contains nested tags. I changed your code a little: the {*} wildcard (supported since Python 3.8) matches the tag in any namespace, and itertext() collects the text of the element together with all of its descendants:

import xml.etree.ElementTree as ET

tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.findall('.//{*}w'):
    print("".join(w.itertext()))

Output:

www
www
xxx
choo_choosfamchoo_choos
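
The last output line concatenates the word with the text of its nested annotation elements. If only the word itself is wanted, the following sketch contrasts the three options on a single element (the inline XML is a shortened, hypothetical copy of the nested <w> from the question).

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened copy of the nested <w> element from the question.
XML = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="CHI" uID="u2">
    <w formType="family-specific">choo_choos<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw></mor></w>
  </u>
</CHAT>"""

TB = "{http://www.talkbank.org/ns/talkbank}"
root = ET.fromstring(XML)
w = root.find(f".//{TB}w")

# .text holds only the text that precedes the first child element.
surface = w.text

# itertext() walks the element and every descendant in document order.
everything = "".join(w.itertext())

# findtext() picks out the text of one specific descendant.
stem = w.findtext(f".//{TB}stem")

print(surface)
print(everything)
print(stem)
```

Which of the three is appropriate depends on whether the transcribed word, the full annotation text, or a single annotation field is the target.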

Posted by huangapple on July 31, 2023 at 23:29:36
Original link: https://go.coder-hub.com/76805082.html