How to remove several lines in xml file and then save it in Python
Question
I want to remove every line that contains any of the words in the 'xml_lines' list. I wrote this script:
from pathlib import Path
# Provide relative or absolute file path to your xml file
filename = './.content.xml'
path = Path(filename)
contents = path.read_text()
xml_lines = [
    'first',
    'second',
]
lines = contents.splitlines()
removed_lines = 0
for line in lines:
    for xml_line in xml_lines:
        if xml_line in line:
            lines.remove(line)
            removed_lines += 1
            print(f'Line: "{line.strip()}" has been removed!')
print(f"\n\n{removed_lines} lines have been removed!")
path.write_text(str(lines))
At the end I have a file that does not look like XML. Can anyone help?
Example (before):
<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        first="2d2md"
        second="m3d39d93">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        first="hfdfherbre"
        second="m3d39d93">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        first="th54b4"
        second="45b45gt45h">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </car>
</data>
If the script finds a line that contains 'first' or 'second', the entire line should be removed. Expected result (after):
<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        >
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        >
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        >
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </car>
</data>
This is only an example; the entire XML file consists of 9999999 lines...
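For reference, here is a minimal sketch of why the script above produces a file that is not XML, together with a line filter that avoids the problem. Two things go wrong in the original: `lines.remove(line)` mutates the list while the `for` loop is still iterating over it, so the line right after each removed line is skipped, and `str(lines)` writes the Python list representation (`['<?xml ...', ...]`) instead of the text. The helper name `drop_matching_lines` and the sample string are assumptions made for illustration.

```python
def drop_matching_lines(text, needles):
    """Return text without the lines that contain any of the needles."""
    kept = [
        line for line in text.splitlines()
        if not any(needle in line for needle in needles)
    ]
    # Join the kept lines back into text; str(kept) would write the list repr.
    return "\n".join(kept) + "\n"


# Hypothetical sample in which the closing ">" sits on its own line.
sample = """<country
  name="Liechtenstein"
  first="2d2md"
  second="m3d39d93"
  >
</country>
"""

result = drop_matching_lines(sample, ["first", "second"])
print(result)
```

Note that in the question's sample the closing `>` sits on the same line as `second=...`, so dropping that line would also drop the `>` and leave the tag unterminated. That is why the answers remove the attributes with an XML parser instead of filtering lines.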
Answer 1
Score: 1
Consider XSLT, the special-purpose language designed to transform XML files. Specifically, an identity template and an empty template can remove the unwanted attributes across the entire document without a single for loop. Python's third-party lxml package can run XSLT 1.0 scripts.
XSLT (save as .xsl file, a special XML file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- EMPTY TEMPLATE TO REMOVE CONTENT -->
    <xsl:template match="@first|@second"/>
</xsl:stylesheet>
Python
import lxml.etree as lx
# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xsl")
# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)
# OUTPUT TO FILE
result.write_output("Output.xml")
Answer 2
Score: 0
You could do something simple along the lines described in this answer, basically using xpath and lxml (and there may be other ways to do the same):
from lxml import etree
doc = etree.parse("your xml file")
to_drop = ["first", "second"]
for td in to_drop:
    for target in doc.xpath('//*'):
        target.attrib.pop(td, None)
print(etree.tostring(doc).decode())
The output should match your expected output.
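The same idea can be sketched with the standard library's `xml.etree.ElementTree`, in case lxml is not available. This is a minimal sketch rather than the answer's own code: the helper name `strip_attributes`, the inline sample, and the output file name in the final comment are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def strip_attributes(root, names):
    """Remove the given attributes from root and all of its descendants."""
    for elem in root.iter():
        for name in names:
            # pop() with a default ignores elements that lack the attribute.
            elem.attrib.pop(name, None)
    return root


sample = (
    '<data><country name="Liechtenstein" first="2d2md" second="m3d39d93">'
    '<year>2008</year></country></data>'
)
root = strip_attributes(ET.fromstring(sample), ["first", "second"])
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)

# To save to a file instead of printing (hypothetical file name):
# ET.ElementTree(root).write("output.xml", encoding="utf-8", xml_declaration=True)
```

Walking the tree once and popping each name per element avoids evaluating the element query once per attribute name.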
Answer 3
Score: 0
For huge XML files you can use iterparse() and manipulate the attributes as the file is parsed:
import xml.etree.ElementTree as ET
filename = "outfile.xml"
attrib_list = ['first', 'second']
with open(filename, 'wb') as out:
    # The root element is written by hand so that its children can be streamed
    out.write(b'<?xml version="1.0"?>\n<data>\n')
    depth = 0
    for event, elem in ET.iterparse("pop_del.xml", events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # An element is complete at its "end" event; drop the unwanted attributes
        for key in attrib_list:
            elem.attrib.pop(key, None)
        if depth == 1:
            # A direct child of <data> is finished: write it and free its memory
            out.write(ET.tostring(elem))
            elem.clear()
    out.write(b'</data>')
Output:
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank updated="yes">2</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E" />
<neighbor name="Switzerland" direction="W" />
</country>
<tiger name="Singapore">
<rank updated="yes">5</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N" />
</tiger>
<car name="Panama">
<rank updated="yes">69</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W" />
<neighbor name="Colombia" direction="E" />
</car>
</data>
You can use pop() or del to remove an attribute from an element.
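To illustrate the difference between pop() and del, here is a small sketch on an in-memory element (the sample element is made up for the example). `pop()` with a default is safe when the attribute may be missing, while `del` raises `KeyError` in that case.

```python
import xml.etree.ElementTree as ET

# Hypothetical element that has "first" but no "second" attribute.
elem = ET.fromstring('<country name="Panama" first="th54b4"/>')

# pop() with a default never raises, even if the attribute is absent.
elem.attrib.pop("first", None)
elem.attrib.pop("second", None)

# del requires the attribute to exist.
try:
    del elem.attrib["second"]
    missing_raised = False
except KeyError:
    missing_raised = True

remaining = dict(elem.attrib)
print(remaining, missing_raised)
```

When the set of attributes varies from element to element, as in a large mixed file, `pop(name, None)` avoids having to check for the attribute first.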