How to remove several lines in an XML file and then save it in Python

Question

I want to remove every line that contains any of the words in the 'xml_lines' list. I created this script:

from pathlib import Path

# Provide relative or absolute file path to your xml file
filename = './.content.xml'
path = Path(filename)

contents = path.read_text()

xml_lines = [
    'first',
    'second',
]

lines = contents.splitlines()

removed_lines = 0

for line in lines:
    for xml_line in xml_lines:
        if xml_line in line:
            lines.remove(line)
            removed_lines += 1
            print(f'Line: "{line.strip()}" has been removed!')

print(f"\n\n{removed_lines} lines have been removed!")

path.write_text(str(lines))

At the end I have a file that does not look like XML. Can anyone help?

Example (before):

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        first="2d2md"
        second="m3d39d93">
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        first="hfdfherbre"
        second="m3d39d93">
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        first="th54b4"
        second="45b45gt45h">
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

If the script finds any line that contains 'first' or 'second', the entire line should be removed:

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        >
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        >
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        >
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

This is only an example; the entire XML file consists of 9999999 lines...
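
The line-based behaviour described above can be sketched as follows. This is a minimal sketch, not a general XML tool: it assumes every matching line holds exactly one attribute, as in the example, and the function names `drop_matching_lines` and `clean_file` are hypothetical. It builds a new list instead of removing items from the list being iterated, keeps the closing bracket when a removed line also closes the start tag, and joins the lines back into text instead of writing `str(lines)`.

```python
from pathlib import Path

KEYWORDS = ("first", "second")  # same words as xml_lines in the script above


def drop_matching_lines(text, keywords=KEYWORDS):
    """Return text without the lines that contain any of the keywords."""
    kept = []
    for line in text.splitlines():
        if not any(word in line for word in keywords):
            kept.append(line)
            continue
        # Assumption: a matching line holds a single attribute. If it also
        # closes the start tag, keep the bracket so the tag stays well-formed.
        stripped = line.strip()
        indent = line[: len(line) - len(line.lstrip())]
        if stripped.endswith("/>"):
            kept.append(indent + "/>")
        elif stripped.endswith(">"):
            kept.append(indent + ">")
    # Join the lines back into text; str(kept) would write a Python list repr
    return "\n".join(kept) + "\n"


def clean_file(filename):
    """Rewrite the file in place without the matching lines."""
    path = Path(filename)
    path.write_text(drop_matching_lines(path.read_text()))
```

Calling `clean_file('./.content.xml')` would rewrite the file from the script above in place. Because this works on text rather than on parsed XML, it would also drop any unrelated line that happens to contain one of the words.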

Answer 1

Score: 1

Consider XSLT, the special-purpose language designed to transform XML files. Specifically, an identity template combined with an empty template can remove the unwanted attributes across the entire document without a single for loop. Python's third-party lxml package can run XSLT 1.0 scripts.

XSLT (save as .xsl file, a special XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO REMOVE CONTENT -->
    <xsl:template match="@first|@second"/>
</xsl:stylesheet>

Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xsl")

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# OUTPUT TO FILE
result.write_output("Output.xml")
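
To make the idea behind the two templates concrete, here is a rough plain-Python analogy using only the standard library. It is not how lxml executes the stylesheet; it is a sketch of the same logic, with a hypothetical helper `copy_without`: copy every node as it is (the identity template) while skipping the attributes named in the empty template. Comments and processing instructions are ignored here.

```python
import xml.etree.ElementTree as ET

DROPPED = {"first", "second"}  # mirrors match="@first|@second"


def copy_without(elem, dropped=DROPPED):
    """Copy elem recursively, leaving out the attributes named in dropped."""
    kept = {k: v for k, v in elem.attrib.items() if k not in dropped}
    clone = ET.Element(elem.tag, kept)
    clone.text, clone.tail = elem.text, elem.tail
    for child in elem:
        clone.append(copy_without(child, dropped))
    return clone


source = ET.fromstring(
    '<data><country name="Liechtenstein" first="2d2md" second="m3d39d93">'
    '<rank updated="yes">2</rank></country></data>'
)
result = copy_without(source)
print(ET.tostring(result, encoding="unicode"))
```

The source tree is left untouched and a new tree is returned, which is also how an XSLT transformation behaves.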

Answer 2

Score: 0

You could do something simple along the lines described in this answer, basically using XPath and lxml (there may be other ways to do the same):

from lxml import etree

doc = etree.parse("your xml file")

to_drop = ["first", "second"]
for td in to_drop:
    # '//*' selects every element in the document
    for target in doc.xpath('//*'):
        # pop() with a default does nothing when the attribute is absent
        target.attrib.pop(td, None)
print(etree.tostring(doc).decode())

Output should be your expected output.
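
If lxml is not available, roughly the same thing can be done with the standard library's `xml.etree.ElementTree`, including saving the result to a file, which the question asks for. This is a sketch under that assumption; the function names `strip_attributes` and `clean_file` are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_attributes(root, names=("first", "second")):
    """Remove the named attributes from root and all of its descendants."""
    removed = 0
    for elem in root.iter():
        for name in names:
            # pop() returns None when the attribute is not present
            if elem.attrib.pop(name, None) is not None:
                removed += 1
    return removed


def clean_file(src, dst):
    """Parse src, strip the attributes, and save the result to dst."""
    tree = ET.parse(src)
    count = strip_attributes(tree.getroot())
    tree.write(dst, encoding="utf-8", xml_declaration=True)
    return count
```

A call such as `clean_file("Input.xml", "Output.xml")` (file names chosen for illustration) returns the number of attributes removed. `ET.parse` loads the whole document into memory, so this suits files that fit comfortably in RAM.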

Answer 3

Score: 0

For huge XML files you can use iterparse() and remove the attributes while streaming:

import xml.etree.ElementTree as ET

source = "pop_del.xml"
filename = "outfile.xml"
attrib_list = ['first', 'second']

with open(filename, 'wb') as out:
    # The root element is written by hand so that its children can be streamed
    out.write(b'<?xml version="1.0"?>\n<data>\n')

    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # At the "end" event the element is complete, so it is safe to edit
        for key in attrib_list:
            elem.attrib.pop(key, None)
        # depth == 1 means a direct child of the root has just been closed
        if depth == 1:
            out.write(ET.tostring(elem))
            elem.clear()  # free the memory held by the finished subtree

    out.write(b'</data>\n')

Output:

<?xml version="1.0"?>
<data>
  <country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E" />
    <neighbor name="Switzerland" direction="W" />
  </country>
  <tiger name="Singapore">
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N" />
  </tiger>
  <car name="Panama">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W" />
    <neighbor name="Colombia" direction="E" />
  </car>
</data>

You can use pop() or the del statement to remove an attribute from an element.
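
The difference between the two can be seen in a short standard-library sketch; the sample element is taken from the question's data, and the attribute name `no_such_attribute` is made up for illustration.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring(
    '<country name="Liechtenstein" first="2d2md" second="m3d39d93"/>'
)

# pop() returns the removed value; with a default it never raises KeyError
removed = elem.attrib.pop("first", None)
missing = elem.attrib.pop("no_such_attribute", None)

# del raises KeyError for a missing key, so check for the attribute first
if "second" in elem.attrib:
    del elem.attrib["second"]

print(ET.tostring(elem, encoding="unicode"))
```

`pop()` with a default is usually the more convenient choice when not every element carries the attribute.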

huangapple
  • Posted on May 29, 2023 at 23:53:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76358732.html