2023年5月18日 06:34:40go评论57阅读模式

英文:

Replacing all text in a HTML with BeautifulSoup4, while keeping the original DOM structure

问题

Here is the translated code part you requested:

我正在尝试使用Python中的Beautifulsoup4替换HTML文档中的所有文本，包括那些既包含文本又包含其他元素的元素。例如，我想将`&lt;div&gt;text1&lt;strong&gt;text2&lt;/strong&gt;text3&lt;/div&gt;`变成`&lt;div&gt;text1_changed&lt;strong&gt;text2_changed&lt;/strong&gt;text3_changed&lt;/div&gt;`。

我知道有一个相关的线程https://stackoverflow.com/questions/42040730/faster-way-of-replacing-text-in-all-dom-elements，但是这个线程使用了Javascript，所以其中使用的函数在Python中不可用。我想要使用原生Python实现相同的目标。

如果所有标签都包含标签或文本（rand_text函数返回一个随机字符串），我已经有能够工作的代码：
```python
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()

    for el in elements:
        if el.string == None:
            pass
        else:
            replacement = rand_text(el.text)
        el.string.replace_with(replacement)
    return soup

然而，当元素的"string"属性为None时，这段代码在上面的示例中不起作用，因为该元素既包含其他元素又包含文本。

我还尝试过，如果"string"属性为None，则创建一个新元素，然后替换整个元素：

from bs4 import BeautifulSoup as bs

def anonymize2(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            new_el = soup.new_tag(el.name)
            new_el.attrs = el.attrs
            for sub_el in el.contents:
                new_el.append(sub_el)
            new_el.string = replacement
            parent = el.parent
            if parent:
                if new_el not in soup:
                    soup.append(new_el)
                parent.replace_with(new_el)
    return soup

然而，这段代码会引发错误"ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree."。我认为我之所以会遇到这个错误，是因为算法已经替换了要替换的元素的父元素。

我应该实现什么逻辑来解决这个问题？或者如何使用不同的方法实现我的原始目标？


<details>
<summary>英文:</summary>

I am trying to replace all text in a HTML document using Beautifulsoap4 in Python, including elements that have both text and other elements inside them. For instance I want  
`&lt;div&gt;text1&lt;strong&gt;text2&lt;/strong&gt;text3&lt;/div&gt;` to become  
`&lt;div&gt;text1_changed&lt;strong&gt;text2_changed&lt;/strong&gt;text3_changed&lt;/div&gt;`.

I am aware of the thread https://stackoverflow.com/questions/42040730/faster-way-of-replacing-text-in-all-dom-elements, however this uses Javascript, so the functions used are not available in Python. I would like to accomplish the same goal using native Python.

I have code that already works if all tags contain either tags or text (the rand_text function returns a random string):

from bs4 import BeautifulSoup as bs

def randomize(html):
soup = bs(html, features='html.parser')
elements = soup.find_all()

for el in elements:
    if el.string == None:
        pass
    else:
        replacement = rand_text(el.text)
    el.string.replace_with(replacement)
return soup

However this code will not work in the above example, when the element&#39;s &quot;string&quot; attribute is None, because it has both other elements and text inside.

I have also tried creating a new element if the &quot;string&quot; attribute is None and then replace the entire element:

from bs4 import BeautifulSoup as bs

def anonymize2(html):
soup = bs(html, features='html.parser')
elements = soup.find_all()
for el in elements:
replacement = rand_text(el.text)
if el.string:
el.string.replace_with(replacement)
else:
new_el = soup.new_tag(el.name)
new_el.attrs = el.attrs
for sub_el in el.contents:
new_el.append(sub_el)
new_el.string = replacement
parent = el.parent
if parent:
if new_el not in soup:
soup.append(new_el)
parent.replace_with(new_el)
return soup


however this one gives the error &quot;*ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree.*&quot;  
I think I am getting this error, because the algorithm already replaced the parent of the element it is trying to replace.  

What logic could I implement to fix this?  
Or how could I accomplish my original goal using a different method?

</details>


# 答案1
**得分**: 0

你可以遍历元素的 `contents` 并检查每个项目是否为字符串，然后替换字符串。

```python
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()

    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            for sub_el in el.contents:
                if isinstance(sub_el, str):
                    sub_el.replace_with(rand_text(sub_el))
    return soup

# 用于测试目的定义的。请用您自己的逻辑替换这个。
def rand_text(text):
    return text + "_changed"

html = "<div>text1<strong>text2</strong>text3</div>"
print(randomize(html))

输出:

<div>text1_changed<strong>text2_changed</strong>text3_changed</div>

英文:

You can iterate over the contents of the element and check if each item is a string and then replace the string.

from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features=&#39;html.parser&#39;)
    elements = soup.find_all()

    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            for sub_el in el.contents:
                if isinstance(sub_el, str):
                    sub_el.replace_with(rand_text(sub_el))
    return soup

# defined for testing purposes. Replace this with your own logic
def rand_text(text):
    return text + &quot;_changed&quot;



html = &quot;&lt;div&gt;text1&lt;strong&gt;text2&lt;/strong&gt;text3&lt;/div&gt;&quot;
print(randomize(html))

Outputs:

&lt;div&gt;text1_changed&lt;strong&gt;text2_changed&lt;/strong&gt;text3_changed&lt;/div&gt;

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

用BeautifulSoup4替换HTML中的所有文本，同时保持原始DOM结构。

问题

二叉搜索树中特定范围内节点值的总和

将Snowflake表中的值存储到Python变量中

如何在chaquopy中更改内置库文件？

Multiple dispatch – "function requires at least 1 positional argument"

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论