Parsing an XML file with Python using root.iter does not list text

Question

I am trying to use Python to parse an XML file. I would like to extract the text that occurs between specified XML tags.

The code I am running is:

    import xml.etree.ElementTree as ET

    tree = ET.parse('020012_doctored.xml')
    root = tree.getroot()
    for w in root.iter('w'):
        print(w.text)

The XML file is as follows. It is a complex file with quite a loose structure that combines elements of sequence and hierarchy (I have simplified it for the purposes of this question), but it clearly contains "w" tags, which the code should be picking up.

Thanks.

    <?xml version="1.0" encoding="UTF-8"?>
    <CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns="http://www.talkbank.org/ns/talkbank"
          xsi:schemaLocation="http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd"
          Media="020012" Mediatypes="audio"
          DesignType="long"
          ActivityType="toyplay"
          GroupType="TD"
          PID="11312/c-00018213-1"
          Version="2.20.0"
          Lang="eng"
          Options="bullets"
          Corpus="xxxx"
          Date="xxxx-xx-xx"
          >
      <Participants>
        <participant
            id="MOT"
            name="Mother"
            role="Mother"
            language="eng"
            sex="female"
        />
      </Participants>
      <comment type="Date">15-APR-1999</comment>
      <u who="INV" uID="u0">
        <w untranscribed="untranscribed">www</w>
        <t type="p"></t>
        <media
            start="7.639"
            end="9.648"
            unit="s"
        />
        <a type="addressee">MOT</a>
      </u>
      <u who="MOT" uID="u1">
        <w untranscribed="untranscribed">www</w>
        <t type="p"></t>
        <media
            start="7.640"
            end="9.455"
            unit="s"
        />
        <a type="addressee">INV</a>
      </u>
      <u who="CHI" uID="u2">
        <w untranscribed="unintelligible">xxx</w>
        <w formType="family-specific">choo_choos<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw><gra type="gra" index="1" head="0" relation="INCROOT"/></mor></w>
        <t type="p"><mor type="mor"><mt type="p"/><gra type="gra" index="2" head="1" relation="PUNCT"/></mor></t>
        <postcode>I</postcode>
        <media
            start="10.987"
            end="12.973"
            unit="s"
        />
        <a type="comments">looking at pictures of trains</a>
      </u>
    </CHAT>

Answer 1

Score: 1

I think you have to prepend the namespace to the tag name:

    for w in root.iter("{http://www.talkbank.org/ns/talkbank}w"):
        print(w.text)

You might want to check out this question for more on similar namespace problems: https://stackoverflow.com/questions/14853243/parsing-xml-with-namespace-in-python-via-elementtree
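
To see why the bare tag name matches nothing, it can help to print the tags as ElementTree stores them. The following is a minimal sketch, assuming a cut-down inline XML string (the `SAMPLE` constant is a hypothetical stand-in for the file in the question, not the real data). It shows that the default `xmlns` declaration makes every element's tag take the `{namespace-uri}localname` form, so only the qualified name matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down stand-in for the question's file;
# only the default namespace declaration matters here.
SAMPLE = (
    '<CHAT xmlns="http://www.talkbank.org/ns/talkbank">'
    '<u who="CHI"><w>xxx</w></u>'
    '</CHAT>'
)

root = ET.fromstring(SAMPLE)

# ElementTree stores each tag as "{namespace-uri}localname".
tags = [el.tag for el in root.iter()]
print(tags)

# The bare name never equals a qualified tag, so nothing is found.
bare = [w.text for w in root.iter('w')]

# The fully qualified name matches the <w> element.
qualified = [w.text for w in root.iter('{http://www.talkbank.org/ns/talkbank}w')]
print(bare, qualified)
```

Printing `root.tag` on the real file is a quick way to check which namespace URI, if any, needs to be prepended.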

Answer 2

Score: 1

You can also define the namespace once for further use and search with iterfind:

    NS = {'ww': 'http://www.talkbank.org/ns/talkbank'}
    for w in root.iterfind('.//ww:w', NS):
        print(w.text)

The result would be:

    www
    www
    xxx
    choo_choos
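
A prefix map like this can be reused with `find` and `findall` as well. Here is a minimal sketch, assuming a small inline sample rather than the real file; the `tb` prefix and the `words_by_speaker` name are arbitrary choices made for illustration. It collects the words of each utterance together with its speaker. Unprefixed attributes such as `who` are not in any namespace, so they are read with a plain `get`.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down sample; the real file has many more elements.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="INV" uID="u0"><w>www</w></u>
  <u who="CHI" uID="u2"><w>xxx</w><w>choo_choos</w></u>
</CHAT>"""

# The prefix is local to this script; it need not appear in the XML.
NS = {'tb': 'http://www.talkbank.org/ns/talkbank'}

root = ET.fromstring(SAMPLE)

words_by_speaker = []
for u in root.findall('tb:u', NS):            # direct <u> children of the root
    words = [w.text for w in u.findall('tb:w', NS)]
    words_by_speaker.append((u.get('who'), words))

print(words_by_speaker)
```

Keeping the URI in one dictionary means a change to the schema namespace only has to be made in one place.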

Answer 3

Score: 1

Your XML uses namespaces and has tags nested inside a <w> element. I changed your code a little:

    import xml.etree.ElementTree as ET

    tree = ET.parse('020012_doctored.xml')
    root = tree.getroot()
    for w in root.findall('.//{*}w'):
        print("".join(w.itertext()))

Output:

    www
    www
    xxx
    choo_choosfamchoo_choos
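
The last output line is longer because `itertext()` also walks the text of nested elements, whereas `.text` returns only the text that comes before the first child. A minimal sketch of the difference, assuming a cut-down inline sample (the `SAMPLE` constant is hypothetical) and Python 3.8 or later, which is where the `{*}` namespace wildcard was added to the find functions:

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down sample containing one <w> with nested annotation.
SAMPLE = (
    '<CHAT xmlns="http://www.talkbank.org/ns/talkbank"><u who="CHI">'
    '<w>choo_choos<mor type="mor"><mw><pos><c>fam</c></pos>'
    '<stem>choo_choos</stem></mw></mor></w>'
    '</u></CHAT>'
)

root = ET.fromstring(SAMPLE)

# "{*}w" matches <w> in any namespace (Python 3.8+).
w = root.find('.//{*}w')

# Only the text directly inside <w>, before its first child element.
direct = w.text

# All text in <w> and its descendants, concatenated in document order.
everything = ''.join(w.itertext())

print(direct)
print(everything)
```

If only the transcribed word is wanted, `w.text` is probably the better fit; `itertext()` suits cases where the nested annotation text should be included.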

huangapple
  • Posted on July 31, 2023 at 23:29:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76805082.html