lxml .text returns None when string contains tags
Question
I am traversing a complex XML file with millions of TU nodes and extracting strings from <seg> elements. Whenever a <seg> element contains serialized tags, I get a None object instead of a string.
Code that returns None:
source_segment = ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text
Sample content of a <seg> element that causes the issue:
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
Expected value of the string variable source_segment:
<bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" />
I can't serialize ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text because it is a None object. If I serialize only the element itself, ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg'), I get this:
b'<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>\n '
Sample XML content:
<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<header creationtool="XXXXXXXX" creationtoolversion="100" o-tmf="XXXXXXXX" datatype="xml" segtype="sentence" adminlang="en-GB" srclang="en-GB" creationdate="XXXXXXXX" creationid="XXXXXXXX">
<prop type="x-Note:SingleString"></prop>
<prop type="x-Recognizers">RecognizeAll</prop>
<prop type="x-IncludesContextContent">True</prop>
<prop type="x-TMName">XXXXXXXX</prop>
<prop type="x-TokenizerFlags">DefaultFlags</prop>
<prop type="x-WordCountFlags">DefaultFlags</prop>
</header>
<body>
<tu creationdate="XXXXXXXX" creationid="XXXXXXXX" changedate="XXXXXXXX" changeid="XXXXXXXX" lastusagedate="XXXXXXXX" usagecount="1">
<prop type="x-LastUsedBy">XXXXXXXX</prop>
<prop type="x-Context">0, 0</prop>
<prop type="x-Origin">TM</prop>
<prop type="x-ConfirmationLevel">Translated</prop>
<prop type="x-StructureContext:MultipleString">sdl:cdata</prop>
<prop type="x-Note:SingleString">XXXXXXXX</prop>
<tuv xml:lang="en-GB">
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
<tuv xml:lang="lt-LT">
<seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
</tu>
</body>
</tmx>
How do I extract the string from a <seg> element when it contains serialized tags?
Answer 1
Score: 1
You can iterate over the <seg> elements; what you extract depends on what you are interested in:
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('seg.xml')
root = tree.getroot()

def elem_to_string(child):
    # Serialize the whole element, including its own <seg> tags
    print("Your wish as a string", ET.tostring(child).decode())

data = []
for elem in root:
    if elem.tag == "body":
        for child in elem.findall(".//seg"):
            elem_to_string(child)
            # iter() yields <seg> itself first, then every descendant
            for sub_c in child.iter():
                print(sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail)
                row = sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail
                data.append(row)

df = pd.DataFrame(data)
print(df.to_string())
Output:
   0    1                                   2     3
0  seg  {}                                  None  \n
1  bpt  {'i': '1', 'type': '14', 'x': '1'}  None  Coded glass plate
2  ept  {'i': '1'}                          None  None
3  ph   {'x': '4', 'type': '33'}            None  None
4  seg  {}                                  None  \n
5  bpt  {'i': '1', 'type': '14', 'x': '1'}  None  YYYYYYYYYYYYY
6  ept  {'i': '1'}                          None  None
7  ph   {'x': '4', 'type': '33'}            None  None
Optionally, each <seg> element serialized as a string:
Your wish as a string <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
Your wish as a string <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
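The None values in column 2 of the table follow from the ElementTree data model: .text holds only the text that comes before an element's first child, and text that follows a child is stored in that child's .tail. Since <seg> starts directly with <bpt>, seg.text is None and "Coded glass plate" ends up in bpt.tail. A minimal sketch of a helper that rebuilds the content without the outer <seg> tags is shown here; inner_xml is a hypothetical name, not part of the library, and the standard-library ElementTree is assumed:

```python
import xml.etree.ElementTree as ET

def inner_xml(elem):
    """Return the content of elem as a string, without elem's own tags."""
    # .text is the text before the first child (None if there is none).
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child and appends child.tail as well,
        # so text such as "Coded glass plate" is preserved.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

seg = ET.fromstring(
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)
print(seg.text)      # None: no text precedes <bpt>
print(seg[0].tail)   # Coded glass plate
print(inner_xml(seg))
```

Because the helper works on the parsed tree, it does not depend on how the <seg> start tag happens to be written in the file.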
Answer 2
Score: 0
The best approach I found is to serialize the <seg> element to a string, passing the parameter encoding=str so that lxml returns a str directly; this avoids the extra step of decoding a bytes object and preserves non-ASCII characters. Then use a regex to strip the <seg></seg> tags from the resulting string.
import re
from lxml import etree as ET
root = ET.parse('seg.xml').getroot()
seg_elem = root.find('body').findall('tu')[0].findall('tuv')[0].find('seg')
seg_string = ET.tostring(seg_elem, encoding=str)
# Regex to strip <seg> tags
seg_pattern = '(?<=<seg>).*?(?=</seg>)'
# Strip <seg> tags
final_string = re.search(seg_pattern, seg_string).group()
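For a file with millions of TU nodes, as in the question, parsing the whole tree up front may use a lot of memory. The following is a rough sketch, not a tested solution for real TMX files, of a streaming variant built on the standard library's xml.etree.ElementTree.iterparse instead of lxml. The name iter_segments and the in-memory sample document are assumptions made for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def iter_segments(source):
    """Yield (language, inner XML of <seg>) pairs, one <tu> at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is None:
                continue
            # Leading text plus each serialized child (tostring adds the tail).
            inner = (seg.text or "") + "".join(
                ET.tostring(child, encoding="unicode") for child in seg
            )
            yield tuv.get(XML_LANG), inner
        # Drop the finished <tu> subtree; the emptied element itself
        # stays attached to <body>.
        elem.clear()

sample = io.BytesIO(b"""<tmx version="1.4"><body><tu>
<tuv xml:lang="en-GB"><seg><bpt i="1" />Coded glass plate<ept i="1" /></seg></tuv>
<tuv xml:lang="lt-LT"><seg>YYYYYYYYYYYYY</seg></tuv>
</tu></body></tmx>""")

pairs = list(iter_segments(sample))
for lang, text in pairs:
    print(lang, text)
```

In practice a file path can be passed in place of the BytesIO object, and the pairs can be consumed lazily rather than collected into a list.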