May 29, 2023, 23:53:34

How to remove several lines in an XML file and then save it in Python

Question

I want to remove every line that contains any of the words in the 'xml_lines' list. I created this script:

from pathlib import Path

# Provide relative or absolute file path to your xml file
filename = './.content.xml'
path = Path(filename)

contents = path.read_text()

xml_lines = [
    'first',
    'second',
]

lines = contents.splitlines()

removed_lines = 0

for line in lines:
    for xml_line in xml_lines:
        if xml_line in line:
            lines.remove(line)
            removed_lines += 1
            print(f'Line: "{line.strip()}" has been removed!')

print(f"\n\n{removed_lines} lines have been removed!")

path.write_text(str(lines))

At the end I have a file that does not look like XML. Can anyone help?

Example (before):

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        first="2d2md"
        second="m3d39d93">
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        first="hfdfherbre"
        second="m3d39d93">
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        first="th54b4"
        second="45b45gt45h">
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

If the script finds any line that contains 'first' or 'second', the entire line should be removed. Expected result (after):

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        >
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        >
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        >
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

This is only an example; the entire XML file consists of 9999999 lines...
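
A note on why the script above produces a file that no longer looks like XML: str(lines) writes the Python list representation (brackets, quotes, commas) rather than the text, and calling lines.remove(line) while looping over lines makes the loop skip the item that follows each removal. A minimal line-based sketch that avoids both problems could look like the following. The names drop_lines and clean_file are hypothetical, and the rule of leaving a bare ">" behind when a dropped line also closed a tag is an assumption taken from the expected result above.

```python
def drop_lines(lines, needles):
    """Yield the lines that contain none of the needles.

    Assumption: if a dropped line also closed a tag (ends with ">" or "/>"),
    a bare closer is yielded in its place so the markup stays well-formed.
    """
    for line in lines:
        if not any(needle in line for needle in needles):
            yield line
            continue
        stripped = line.rstrip()
        if stripped.endswith(">"):
            # Reuse the dropped line's indentation for the replacement closer
            indent = line[: len(line) - len(line.lstrip())]
            closer = "/>" if stripped.endswith("/>") else ">"
            yield f"{indent}{closer}\n"


def clean_file(src, dst, needles):
    """Stream src to dst line by line, so a huge file never sits in memory."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(drop_lines(fin, needles))
```

For example, clean_file('./.content.xml', './.content.cleaned.xml', ['first', 'second']) would write the filtered text to a new file. Plain substring matching also hits element names or text that happen to contain the words, so it is only safe when those words occur nowhere else in the document.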

Answer 1

Score: 1

Consider XSLT, the special-purpose language designed to transform XML files. Specifically, an identity template combined with an empty template can remove the unwanted attributes across the entire document without a single for loop. Python's third-party lxml package can run XSLT 1.0 scripts.

XSLT (save as a .xsl file, which is itself an XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO REMOVE CONTENT -->
    <xsl:template match="@first|@second"/>
</xsl:stylesheet>

Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xsl")

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# OUTPUT TO FILE
result.write_output("Output.xml")

Answer 2

Score: 0

You could do something simple along the lines described in this answer, basically using xpath and lxml (and there may be other ways to do the same):

from lxml import etree
doc = etree.parse("your xml file")

to_drop = ["first", "second"]
for td in to_drop:
    for target in doc.xpath('//*'):
        target.attrib.pop(td, None)
print(etree.tostring(doc).decode())

Output should be your expected output.
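
The snippet above only prints the result. If lxml is not available, a similar idea can be sketched with the standard library's xml.etree.ElementTree, which can also write the tree back to disk. This swaps lxml for ElementTree and is not the answerer's code; the function name strip_attributes and the file names in the usage line are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_attributes(src, dst, names):
    """Remove the named attributes from every element and save the result."""
    tree = ET.parse(src)
    for elem in tree.iter():  # visits the root and all of its descendants
        for name in names:
            # pop() with a default is a no-op when the attribute is absent
            elem.attrib.pop(name, None)
    tree.write(dst, encoding="utf-8", xml_declaration=True)
```

A call such as strip_attributes("Input.xml", "Output.xml", ["first", "second"]) would save the cleaned document. Note that this loads the whole document into memory and does not preserve the original line breaks between attributes.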

Answer 3

Score: 0

For huge xml files you can use iterparse() and manipulate the attribute values:

import xml.etree.ElementTree as ET

infile = "pop_del.xml"
outfile = "outfile.xml"
attrib_list = ['first', 'second']

with open(outfile, 'wb') as out:
    # The root tag is hard-coded to match the example document
    out.write(b'<?xml version="1.0"?>\n<data>\n')
    depth = 0
    for event, elem in ET.iterparse(infile, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # An element is only complete at its "end" event, so edit it here
        for key in attrib_list:
            elem.attrib.pop(key, None)
        # Write each direct child of the root, then free its memory
        if depth == 1:
            out.write(ET.tostring(elem))
            elem.clear()
    out.write(b'</data>')

Output:

<?xml version="1.0"?>
<data>
  <country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E" />
    <neighbor name="Switzerland" direction="W" />
  </country>
  <tiger name="Singapore">
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N" />
  </tiger>
  <car name="Panama">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W" />
    <neighbor name="Colombia" direction="E" />
  </car>
</data>

You can use pop() or del to remove an attribute from an element.
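
The difference between the two can be illustrated on a single element. This is a small sketch using the standard library; the sample element is taken from the question's data.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring(
    '<country name="Liechtenstein" first="2d2md" second="m3d39d93"/>'
)

# pop() with a default never raises, even if the attribute is absent
elem.attrib.pop("first", None)
elem.attrib.pop("missing", None)

# del raises KeyError for a missing attribute, so guard it
if "second" in elem.attrib:
    del elem.attrib["second"]

remaining = dict(elem.attrib)
serialized = ET.tostring(elem, encoding="unicode")
```

After this runs, only the name attribute remains on the element.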
