Replacing all text in an HTML document with BeautifulSoup4 while keeping the original DOM structure

Question


I am trying to replace all text in an HTML document using BeautifulSoup4 in Python, including the text of elements that contain both text and other elements. For instance, I want `<div>text1<strong>text2</strong>text3</div>` to become `<div>text1_changed<strong>text2_changed</strong>text3_changed</div>`.

I am aware of the thread https://stackoverflow.com/questions/42040730/faster-way-of-replacing-text-in-all-dom-elements, but it uses JavaScript, so the functions used there are not available in Python. I would like to accomplish the same goal in native Python.

I have code that already works if every tag contains either only tags or only text (the `rand_text` function returns a random string):

```python
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        # el.string is None for tags that mix text with child tags
        if el.string is None:
            pass
        else:
            replacement = rand_text(el.text)
            el.string.replace_with(replacement)
    return soup
```

However, this code does not work on the example above: the element's `string` attribute is `None` because the element contains both other elements and text.

I have also tried creating a new element when the `string` attribute is `None` and then replacing the entire element:

```python
from bs4 import BeautifulSoup as bs

def anonymize2(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            new_el = soup.new_tag(el.name)
            new_el.attrs = el.attrs
            for sub_el in el.contents:
                new_el.append(sub_el)
            new_el.string = replacement
            parent = el.parent
            if parent:
                if new_el not in soup:
                    soup.append(new_el)
                parent.replace_with(new_el)
    return soup
```

However, this version raises "ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree." I think I get this error because the algorithm has already replaced the parent of the element it is trying to replace.

What logic could I implement to fix this? Or how could I accomplish my original goal using a different method?
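The error message itself can be reproduced in isolation. The following is a minimal sketch, not part of the original post: the helper name `replace_detached_fails` and the replacement `<em>` tag are illustrative assumptions. It shows that `replace_with` raises `ValueError` when called on a node whose `parent` is `None`, for example a tag that has already been detached from the tree.

```python
from bs4 import BeautifulSoup

def replace_detached_fails(html):
    """Return True if replace_with raises ValueError on a detached tag."""
    soup = BeautifulSoup(html, features="html.parser")
    strong = soup.find("strong")
    strong.extract()  # detach the tag; strong.parent is now None
    try:
        # A node without a parent cannot be swapped for another node.
        strong.replace_with(soup.new_tag("em"))
    except ValueError:
        return True
    return False

print(replace_detached_fails("<div>text1<strong>text2</strong>text3</div>"))
```

Any approach that swaps out whole elements while looping over a previously collected list of elements can run into this state, which is one reason to replace only the text nodes and leave the tags in place.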

# Answer 1

**Score**: 0

You can iterate over each element's `contents`, check whether each item is a string, and replace the strings.

```python
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        if el.string:
            # Tag with a single string: replace it directly
            el.string.replace_with(rand_text(el.text))
        else:
            # Mixed content: replace only the direct string children
            for sub_el in list(el.contents):
                if isinstance(sub_el, str):
                    sub_el.replace_with(rand_text(sub_el))
    return soup

# Defined for testing purposes. Replace this with your own logic.
def rand_text(text):
    return text + "_changed"

html = "<div>text1<strong>text2</strong>text3</div>"
print(randomize(html))
```

Output:

```
<div>text1_changed<strong>text2_changed</strong>text3_changed</div>
```
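One caveat when relying on `.string`: for a tag whose only child is another tag, `.string` resolves to that child's string, so the same text node can be reached through both the parent and the child and be replaced twice. A possible alternative, sketched here under assumptions rather than taken from the answer, is to skip the tags entirely and visit each text node once with `find_all(string=True)`. The function name `replace_all_text` and the `transform` parameter are hypothetical. The sketch also treats every non-blank string node alike, so comments and similar nodes are not filtered out.

```python
from bs4 import BeautifulSoup

def replace_all_text(html, transform):
    """Apply transform to every non-blank text node, leaving tags untouched."""
    soup = BeautifulSoup(html, features="html.parser")
    # Collect the text nodes first so replacing them cannot disturb traversal.
    for node in list(soup.find_all(string=True)):
        if node.strip():  # skip whitespace-only nodes between tags
            node.replace_with(transform(str(node)))
    return soup

def add_suffix(text):
    # Stand-in for rand_text, used only for demonstration.
    return text + "_changed"

print(replace_all_text("<div>text1<strong>text2</strong>text3</div>", add_suffix))
```

Because only string nodes are replaced, every tag and attribute stays where it was, and each text node is transformed exactly once regardless of nesting depth.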

huangapple
  • Posted on May 18, 2023, 06:34:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76276604.html