Replacing all text in an HTML document with BeautifulSoup4 while keeping the original DOM structure
Question
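I am trying to replace all text in an HTML document using BeautifulSoup4 in Python, including text in elements that contain both text and other elements. For instance, I want `<div>text1<strong>text2</strong>text3</div>` to become `<div>text1_changed<strong>text2_changed</strong>text3_changed</div>`.
I am aware of the thread https://stackoverflow.com/questions/42040730/faster-way-of-replacing-text-in-all-dom-elements, but it uses JavaScript, so the functions it relies on are not available in Python. I would like to accomplish the same goal in native Python.
I have code that already works if every tag contains either only tags or only text (the `rand_text` function returns a random string):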
```python
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        if el.string is None:
            # Mixed content (text plus child tags): skipped here
            pass
        else:
            # rand_text is the asker's own helper that returns a random string
            replacement = rand_text(el.text)
            el.string.replace_with(replacement)
    return soup
```
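However, this code does not work on the example above: when an element contains both text and other elements, its `string` attribute is `None`, so the element is skipped.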
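To make the behavior of `.string` concrete, here is a minimal sketch, assuming only `bs4` with the built-in `html.parser` (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>text1<strong>text2</strong>text3</div>", "html.parser")

# <div> has three children (text, <strong>, text), so .string is None
mixed = soup.div.string

# <strong> has exactly one text child, so .string is that text node
single = soup.strong.string

# A tag whose only child is another tag takes over that child's .string
nested = BeautifulSoup("<div><strong>text2</strong></div>", "html.parser").div.string

print(mixed, single, nested)
```

The third case matters for any loop that visits every tag: the same text node can be reachable through more than one tag's `.string`.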
I have also tried creating a new element when the `string` attribute is `None` and then replacing the entire element:
```python
from bs4 import BeautifulSoup as bs

def anonymize2(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            # Build a copy of the element and try to swap it in
            new_el = soup.new_tag(el.name)
            new_el.attrs = el.attrs
            for sub_el in el.contents:
                new_el.append(sub_el)
            new_el.string = replacement
            parent = el.parent
            if parent:
                if new_el not in soup:
                    soup.append(new_el)
                parent.replace_with(new_el)  # raises the ValueError quoted below
    return soup
```
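However, this code raises the error "ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree." I think I am getting this error because the algorithm has already replaced the parent of the element it is trying to replace.
What logic could I implement to fix this? Or how could I accomplish my original goal using a different method?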
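For reference, this error is raised whenever `replace_with` is called on a node that has no parent. The following is a minimal sketch that reproduces it, assuming only `bs4` with `html.parser`; the markup and variable names are illustrative and not taken from the code above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a</p></div>", "html.parser")

# extract() detaches <p> from the tree, so p.parent becomes None
p = soup.p.extract()

try:
    p.replace_with(soup.new_tag("b"))
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# The top-level BeautifulSoup object has no parent either
root_parent = soup.parent
```

Since the top-level `BeautifulSoup` object also has no parent, calling `replace_with` on the parent of a top-level element is likely to fail in the same way.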
# Answer 1
**Score**: 0
You can iterate over the element's `contents`, check whether each item is a string, and replace the strings.
```python
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            # Mixed content: replace only the direct text children.
            # Iterate over a copy because replace_with mutates el.contents.
            for sub_el in list(el.contents):
                if isinstance(sub_el, str):
                    sub_el.replace_with(rand_text(sub_el))
    return soup

# Defined for testing purposes. Replace this with your own logic.
def rand_text(text):
    return text + "_changed"

html = "<div>text1<strong>text2</strong>text3</div>"
print(randomize(html))
```
Output:
```
<div>text1_changed<strong>text2_changed</strong>text3_changed</div>
```
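One caveat with checking `el.string` on every tag: when a tag's only child is another tag, `.string` resolves to the child's text node, so the same text can be replaced more than once. A possible alternative is to walk the text nodes directly with `find_all(string=True)`. The sketch below is not part of the original answer. The function name `replace_all_text` and the `transform` parameter are hypothetical, and it assumes you want to leave comments, `<script>`/`<style>` contents, and whitespace-only nodes untouched:

```python
from bs4 import BeautifulSoup, NavigableString

def replace_all_text(html, transform):
    """Replace every plain text node, leaving the tag structure untouched."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe
    for node in soup.find_all(string=True):
        # Skip NavigableString subclasses such as Comment or Doctype
        if type(node) is not NavigableString:
            continue
        # Assumption: script and style contents should stay as they are
        if node.parent is not None and node.parent.name in ("script", "style"):
            continue
        # Assumption: whitespace-only nodes are formatting, not content
        if not node.strip():
            continue
        node.replace_with(transform(str(node)))
    return soup

# Stand-in for the asker's rand_text helper
def add_suffix(text):
    return text + "_changed"

mixed = str(replace_all_text("<div>text1<strong>text2</strong>text3</div>", add_suffix))
nested = str(replace_all_text("<div><strong>text2</strong></div>", add_suffix))
print(mixed)
print(nested)
```

Because each text node is visited exactly once, nested single-child tags such as `<div><strong>text2</strong></div>` get one replacement rather than two.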