lxml .text returns None when string contains tags

Question

I am traversing a complex XML file with millions of TU nodes and extracting strings from <seg> elements. Whenever a <seg> element contains serialized tags, I get a None object instead of a string.

Code that returns None:

source_segment = ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text

Sample content of <seg> element that causes the issue:

<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>

Expected value of string variable source_segment:

<bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" />

I can't serialize ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text because it is a None object. If I serialize only the element itself, ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg'), I get this:

b'<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>\n      '

Sample XML content:

<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
  <header creationtool="XXXXXXXX" creationtoolversion="100" o-tmf="XXXXXXXX" datatype="xml" segtype="sentence" adminlang="en-GB" srclang="en-GB" creationdate="XXXXXXXX" creationid="XXXXXXXX">
    <prop type="x-Note:SingleString"></prop>
    <prop type="x-Recognizers">RecognizeAll</prop>
    <prop type="x-IncludesContextContent">True</prop>
    <prop type="x-TMName">XXXXXXXX</prop>
    <prop type="x-TokenizerFlags">DefaultFlags</prop>
    <prop type="x-WordCountFlags">DefaultFlags</prop>
  </header>
  <body>
    <tu creationdate="XXXXXXXX" creationid="XXXXXXXX" changedate="XXXXXXXX" changeid="XXXXXXXX" lastusagedate="XXXXXXXX" usagecount="1">
      <prop type="x-LastUsedBy">XXXXXXXX</prop>
      <prop type="x-Context">0, 0</prop>
      <prop type="x-Origin">TM</prop>
      <prop type="x-ConfirmationLevel">Translated</prop>
      <prop type="x-StructureContext:MultipleString">sdl:cdata</prop>
      <prop type="x-Note:SingleString">XXXXXXXX</prop>
      <tuv xml:lang="en-GB">
        <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
      </tuv>
      <tuv xml:lang="lt-LT">
        <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
      </tuv>
    </tu>
  </body>
</tmx>

How do I extract the string from a <seg> element when it contains serialized tags?

Answer 1

Score: 1

You can iterate over each <seg> element and its children, depending on what you are interested in:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('seg.xml')
root = tree.getroot()

def elem_to_string(child):
    # Serialize the whole element, including its child tags, to a str
    print("Your wish as a string", ET.tostring(child).decode())

data = []
for elem in root:
    if elem.tag == "body":
        for child in elem.findall(".//seg"):
            elem_to_string(child)
            # iter() yields <seg> itself, then each descendant in document order
            for sub_c in child.iter():
                print(sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail)
                row = sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail
                data.append(row)

df = pd.DataFrame(data)
print(df.to_string())

Output:

     0                                   1     2                  3
0  seg                                  {}  None           \n      
1  bpt  {'i': '1', 'type': '14', 'x': '1'}  None  Coded glass plate
2  ept                          {'i': '1'}  None               None
3   ph            {'x': '4', 'type': '33'}  None               None
4  seg                                  {}  None           \n      
5  bpt  {'i': '1', 'type': '14', 'x': '1'}  None      YYYYYYYYYYYYY
6  ept                          {'i': '1'}  None               None
7   ph            {'x': '4', 'type': '33'}  None               None

Optionally, as a string:

Your wish as a string <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
Your wish as a string <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
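
For context on why .text comes back as None, here is a small sketch of how ElementTree-style APIs store mixed content. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree exposes the same .text, .tail and itertext() members. The inline sample string and the name plain_text are illustrative assumptions, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Sample <seg> copied from the question, parsed from a string for brevity.
seg = ET.fromstring(
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)

# .text only covers the text between <seg> and its first child element.
# Here <bpt> follows <seg> immediately, so there is nothing to return.
print(seg.text)

# Text that follows a child element is stored on that child as .tail.
print(seg[0].tag, repr(seg[0].tail))

# itertext() walks every .text and .tail in document order, dropping the tags.
plain_text = "".join(seg.itertext())
print(plain_text)
```

This is why the table above shows "Coded glass plate" in the tail column of the bpt row rather than in the text column of the seg row. Note that itertext() gives only the readable text; it does not keep the inline tags the question asks for.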

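Answer 2

Score: 0

The best approach I found is to serialize the <seg> element to a string, passing encoding=str so that the result is already a str (no bytes-to-string decoding step, and non-ASCII characters are preserved). Then use a regex to strip the enclosing <seg></seg> tags from the resulting string.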

import re
from lxml import etree as ET

root = ET.parse('seg.xml').getroot()

seg_elem = root.find('body').findall('tu')[0].findall('tuv')[0].find('seg')

seg_string = ET.tostring(seg_elem, encoding=str)

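# Regex that matches everything between <seg> and </seg>
seg_pattern = r'(?<=<seg>).*?(?=</seg>)'
# Strip the <seg> tags; re.DOTALL lets '.' match newlines inside a segment
final_string = re.search(seg_pattern, seg_string, re.DOTALL).group()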
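
If you would rather not depend on a regex, one possible alternative (a sketch, not the method from the answer above) is to rebuild the inner markup from the element's own text plus its serialized children. The helper name inner_xml is hypothetical. The example uses the standard library's xml.etree.ElementTree so it runs as-is; the exact spelling of empty tags (for example " />" versus "/>") may differ between serializers.

```python
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return the markup between elem's start and end tags as a str."""
    # Text before the first child (may be None when a tag comes first).
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child and, by default, its tail text,
        # so text that follows each inline tag is kept in place.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


seg = ET.fromstring(
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)
result = inner_xml(seg)
print(result)
```

Because the wrapper tag is never serialized, this does not depend on how <seg> is written in the file (for instance, whether it carries attributes), which a fixed pattern such as '<seg>' would not tolerate.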

Posted by huangapple on February 6, 2023, 16:45:22
Source: https://go.coder-hub.com/75359058.html