Parsing xml file with Python using root.iter does not list text
Question
I am trying to use Python to parse an XML file. I would like to extract the text that occurs between specified XML tags.
The code I am running is:
import xml.etree.ElementTree as ET
tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.iter('w'):
    print(w.text)
The XML file is as follows. It is a complex file with a fairly loose structure that combines elements of sequence and hierarchy (I have simplified it for the purposes of this question), but there clearly is a "w" tag, which the code should be picking up.
Thanks.
<?xml version="1.0" encoding="UTF-8"?>
<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.talkbank.org/ns/talkbank"
xsi:schemaLocation="http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd"
Media="020012" Mediatypes="audio"
DesignType="long"
ActivityType="toyplay"
GroupType="TD"
PID="11312/c-00018213-1"
Version="2.20.0"
Lang="eng"
Options="bullets"
Corpus="xxxx"
Date="xxxx-xx-xx"
>
<Participants>
<participant
id="MOT"
name="Mother"
role="Mother"
language="eng"
sex="female"
/>
</Participants>
<comment type="Date">15-APR-1999</comment>
<u who="INV" uID="u0">
<w untranscribed="untranscribed">www</w>
<t type="p"></t>
<media
start="7.639"
end="9.648"
unit="s"
/>
<a type="addressee">MOT</a>
</u>
<u who="MOT" uID="u1">
<w untranscribed="untranscribed">www</w>
<t type="p"></t>
<media
start="7.640"
end="9.455"
unit="s"
/>
<a type="addressee">INV</a>
</u>
<u who="CHI" uID="u2">
<w untranscribed="unintelligible">xxx</w>
<w formType="family-specific">choo_choos<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw><gra type="gra" index="1" head="0" relation="INCROOT"/></mor></w>
<t type="p"><mor type="mor"><mt type="p"/><gra type="gra" index="2" head="1" relation="PUNCT"/></mor></t>
<postcode>I</postcode>
<media
start="10.987"
end="12.973"
unit="s"
/>
<a type="comments">looking at pictures of trains</a>
</u>
</CHAT>
Answer 1
Score: 1
I think you have to prepend the namespace URI, in curly braces, to the tag name:
for w in root.iter("{http://www.talkbank.org/ns/talkbank}w"):
    print(w.text)
You might want to check out this question for similar problems with namespaces: https://stackoverflow.com/questions/14853243/parsing-xml-with-namespace-in-python-via-elementtree
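To see why the bare tag name finds nothing, it can help to print the tag names that ElementTree actually stores. The following is only a minimal sketch: SAMPLE is a hypothetical, trimmed-down stand-in for the file in the question, parsed from a string rather than from disk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the question's file.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="INV" uID="u0">
    <w untranscribed="untranscribed">www</w>
  </u>
</CHAT>"""

TALKBANK = "http://www.talkbank.org/ns/talkbank"

root = ET.fromstring(SAMPLE)

# The default namespace is folded into every tag name as "{uri}local".
print(root.tag)

# A bare local name therefore matches no element, so the loop body never runs.
bare_matches = list(root.iter("w"))
print(len(bare_matches))

# The fully qualified name does match.
qualified_matches = list(root.iter(f"{{{TALKBANK}}}w"))
print([w.text for w in qualified_matches])
```

The xmlns="..." declaration on the root element applies to all unprefixed child elements, which is why every tag in the file needs the qualified form.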
Answer 2
Score: 1
You can also define a prefix-to-namespace mapping once, reuse it in later queries, and search with iterfind:
NS = {'ww': 'http://www.talkbank.org/ns/talkbank'}
for w in root.iterfind('.//ww:w', NS):
    print(w.text)
The result would be:
www
www
xxx
choo_choos
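The same mapping can be reused for nested queries. As a rough sketch (the inline SAMPLE string and the "tb" prefix are assumptions made for illustration; the prefix is arbitrary and only has to match the key in the dictionary), this groups the words by the uID attribute of the enclosing utterance:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the question's file.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="INV" uID="u0"><w>www</w></u>
  <u who="CHI" uID="u2"><w>xxx</w><w>choo_choos<mor type="mor"/></w></u>
</CHAT>"""

# The prefix is a local alias; it does not need to appear in the file.
NS = {"tb": "http://www.talkbank.org/ns/talkbank"}

root = ET.fromstring(SAMPLE)

words_by_utterance = {}
for u in root.iterfind(".//tb:u", NS):
    # Unprefixed attributes such as uID are not namespaced, so get() needs no prefix.
    words_by_utterance[u.get("uID")] = [w.text for w in u.findall("tb:w", NS)]

print(words_by_utterance)
```

Note that w.text only returns the text before the first child element, so the nested mor content is not included here.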
Answer 3
Score: 1
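Your XML uses a namespace, and one of the <w> elements contains nested tags. I changed your code a little (the {*} wildcard requires Python 3.8 or later, and itertext() also collects the text of nested elements):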
import xml.etree.ElementTree as ET
tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.findall('.//{*}w'):
    print("".join(w.itertext()))
Output:
www
www
xxx
choo_choosfamchoo_choos
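The last output line differs from the other answers because itertext() walks into the nested elements. A small sketch of the difference, using a hypothetical one-utterance SAMPLE string modelled on the question's file:

```python
import xml.etree.ElementTree as ET

# Hypothetical single-utterance sample modelled on the question's file.
SAMPLE = (
    '<CHAT xmlns="http://www.talkbank.org/ns/talkbank">'
    '<u who="CHI" uID="u2">'
    '<w formType="family-specific">choo_choos'
    '<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw></mor>'
    '</w></u></CHAT>'
)

root = ET.fromstring(SAMPLE)

# "{*}" matches the tag in any namespace (Python 3.8 or later).
w = root.find(".//{*}w")

# .text is only the text before the first child element.
own_text = w.text

# itertext() yields the element's text plus all descendant text, in document order.
all_text = "".join(w.itertext())

print(own_text)
print(all_text)
```

If only the transcribed word is wanted, w.text is probably the better choice; itertext() is useful when the nested annotation text should be included too.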