lxml .text returns None when string contains tags

Question

I am traversing a complex XML file with millions of TU nodes and extracting strings from <seg> elements. Whenever a <seg> element contains serialized tags, I get None instead of a string.

Code that returns None:

    source_segment = ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text

Sample content of a <seg> element that causes the issue:

    <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>

Expected value of the string variable source_segment:

    <bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" />

I can't serialize ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text because it is None. If I serialize only the element itself, ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg'), I get this:

    b'<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>\n '

Sample XML content:

    <?xml version="1.0" encoding="utf-8"?>
    <tmx version="1.4">
      <header creationtool="XXXXXXXX" creationtoolversion="100" o-tmf="XXXXXXXX" datatype="xml" segtype="sentence" adminlang="en-GB" srclang="en-GB" creationdate="XXXXXXXX" creationid="XXXXXXXX">
        <prop type="x-Note:SingleString"></prop>
        <prop type="x-Recognizers">RecognizeAll</prop>
        <prop type="x-IncludesContextContent">True</prop>
        <prop type="x-TMName">XXXXXXXX</prop>
        <prop type="x-TokenizerFlags">DefaultFlags</prop>
        <prop type="x-WordCountFlags">DefaultFlags</prop>
      </header>
      <body>
        <tu creationdate="XXXXXXXX" creationid="XXXXXXXX" changedate="XXXXXXXX" changeid="XXXXXXXX" lastusagedate="XXXXXXXX" usagecount="1">
          <prop type="x-LastUsedBy">XXXXXXXX</prop>
          <prop type="x-Context">0, 0</prop>
          <prop type="x-Origin">TM</prop>
          <prop type="x-ConfirmationLevel">Translated</prop>
          <prop type="x-StructureContext:MultipleString">sdl:cdata</prop>
          <prop type="x-Note:SingleString">XXXXXXXX</prop>
          <tuv xml:lang="en-GB">
            <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
          </tuv>
          <tuv xml:lang="lt-LT">
            <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
          </tuv>
        </tu>
      </body>
    </tmx>

How do I extract the content of a <seg> element as a string when it contains serialized tags?

Answer 1

Score: 1

You can iterate over the <seg> elements; which fields you read depends on what you are interested in:

    import xml.etree.ElementTree as ET
    import pandas as pd

    tree = ET.parse('seg.xml')
    root = tree.getroot()

    def elem_to_string(child):
        # Serialize the whole element, child tags included, to a str
        print("Your wish as a string", ET.tostring(child).decode())

    data = []
    for elem in root:
        if elem.tag == "body":
            for child in elem.findall(".//seg"):
                elem_to_string(child)
                # iter() yields <seg> itself first, then its descendants
                for sub_c in child.iter():
                    print(sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail)
                    row = sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail
                    data.append(row)

    df = pd.DataFrame(data)
    print(df.to_string())

Output:

         0                                    1     2                  3
    0  seg                                   {}  None                 \n
    1  bpt  {'i': '1', 'type': '14', 'x': '1'}  None  Coded glass plate
    2  ept                           {'i': '1'}  None               None
    3   ph             {'x': '4', 'type': '33'}  None               None
    4  seg                                   {}  None                 \n
    5  bpt  {'i': '1', 'type': '14', 'x': '1'}  None      YYYYYYYYYYYYY
    6  ept                           {'i': '1'}  None               None
    7   ph             {'x': '4', 'type': '33'}  None               None

Optionally, as a string:

    Your wish as a string <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
    Your wish as a string <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
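
The table explains the None in the question: .text holds only the text that comes before an element's first child, and here <seg> starts directly with <bpt />. The words "Coded glass plate" are stored as the .tail of <bpt />. A minimal sketch of this, using the standard-library ElementTree rather than lxml; inner_xml is a hypothetical helper name, not part of either library:

```python
import xml.etree.ElementTree as ET

# Same <seg> markup as in the question.
SEG_XML = (
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)


def inner_xml(elem):
    """Return the markup between an element's start and end tags.

    Hypothetical helper: the element's own .text (if any) followed by
    every direct child serialized as a string.
    """
    parts = [elem.text or ""]
    for child in elem:
        # tostring() also writes the child's tail text, so the words
        # that follow <bpt /> are not lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


seg = ET.fromstring(SEG_XML)
print(seg.text)                 # None: no text precedes the first child
print(repr(seg[0].tail))        # the text sits in the tail of <bpt />
print("".join(seg.itertext()))  # plain text only, tags dropped
print(inner_xml(seg))           # content of <seg> without the wrapper
```

If only the words are needed, itertext() is enough; inner_xml keeps the inline tags. With lxml the same approach should work, although lxml may write empty elements without the space before "/>".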

Answer 2

Score: 0

The best approach I found is to serialize the <seg> element, children included, to a string, passing encoding=str so that the result is already a str (no bytes-to-string decoding step) and non-ASCII characters are preserved. Then use a regex to strip the <seg></seg> tags from the resulting string.

    import re
    from lxml import etree as ET

    root = ET.parse('seg.xml').getroot()
    seg_elem = root.find('body').findall('tu')[0].findall('tuv')[0].find('seg')
    # encoding=str makes lxml return a str rather than bytes
    seg_string = ET.tostring(seg_elem, encoding=str)

    # Regex that matches everything between <seg> and </seg>
    seg_pattern = r'(?<=<seg>).*?(?=</seg>)'
    # Strip the <seg> tags; DOTALL lets '.' span line breaks inside a segment
    final_string = re.search(seg_pattern, seg_string, flags=re.DOTALL).group()
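
Since the question mentions millions of TU nodes, it may be worth avoiding both the regex and a fully loaded tree. The sketch below is one possible way to do that with the standard-library iterparse; SAMPLE_TMX is a shortened stand-in for the real file, and seg_inner_xml and iter_segments are hypothetical names introduced only for this example.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the real TMX file (assumption for this sketch).
SAMPLE_TMX = """<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<body>
<tu>
<tuv xml:lang="en-GB">
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
<tuv xml:lang="lt-LT">
<seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
</tu>
</body>
</tmx>
"""

# ElementTree expands the xml: prefix to its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def seg_inner_xml(seg):
    """Content of <seg> as a string, inline tags kept, wrapper dropped."""
    parts = [seg.text or ""]
    parts.extend(ET.tostring(child, encoding="unicode") for child in seg)
    return "".join(parts)


def iter_segments(source):
    """Yield (language, inner markup) pairs, handling one <tu> at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        for tuv in elem.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                yield tuv.get(XML_LANG), seg_inner_xml(seg)
        # Drop the children of the finished <tu> to keep memory use low.
        elem.clear()


for lang, markup in iter_segments(io.BytesIO(SAMPLE_TMX.encode("utf-8"))):
    print(lang, markup)
```

For a real file, pass the file path or an open binary file to iter_segments instead of the BytesIO object. Each <tu> is cleared after it has been processed, so its segments do not stay in memory for the rest of the run.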

Posted by huangapple on February 6, 2023
When reposting, please keep the link to this article: https://go.coder-hub.com/75359058.html