Python - replace some text in html only between p-tags

Question

I'm looking for a way to do something like this:

  1. I have HTML text like this:
<h2>some heading 1</h2>
<p>some text 1, some text 2</p>
<p>some text 3, some text 4</p>
...
<h2>some heading 2</h2>
<p>some text 5, some text 6</p>
<p>some text 7, some text n</p>

<pre> some code </pre>

(...)
  2. I need to replace some words and sentences, but only between p tags.
  3. Any ideas?

I need this to improve a function that I am developing.
The function loads publication data from WordPress,
and then some phrases in the publication text need to be wrapped in HTML strong tags.
The publication data contains many HTML tags. I need to change only the parts of the text that are inside p tags (paragraphs).

The post_text variable is the word or chunk that I need to replace (in practice, wrap in HTML strong tags).

import requests


def wp_bold_post_text(wordpress_url, wordpress_header, object_type, id, post_text):

    # fetch the current publication
    api_url = wordpress_url + f'wp-json/wp/v2/{object_type}/{id}'
    data = {}  # {'status': 'inherit, publish, auto-draft, draft, trash, private, pending'}
    response = requests.get(api_url, headers=wordpress_header, json=data)

    # publication change: this wraps the first match anywhere in the HTML,
    # not only inside p tags, which is the part that needs improving
    publication_json = response.json()
    old_publication_str = publication_json["content"]["rendered"]
    new_publication_str = old_publication_str.replace(post_text, "<strong>" + post_text + "</strong>", 1)

    # publication: send the changed content back
    api_url = wordpress_url + f'wp-json/wp/v2/{object_type}/{id}'
    data = {'content': new_publication_str}
    response = requests.post(api_url, headers=wordpress_header, json=data)

    return print(response)  # (response.json())  # ["content"]["rendered"]

Answer 1

Score: 1

What you want is to replace text between two string referentials: the start-referential '<p>' and the end-referential '</p>'.

To get this result, start by writing a function that locates the start-referential with the find string method. text.find('<p>') returns the first index of the string '<p>', that is, the index of <. Since you want the text in between, store the index just after > as the start index:

result = text.find('<p>')
start_index = result + len('<p>')

Then you want to find the end-referential, which comes after the place where you found the start-referential. For this, pass the start index as the second argument of the index string method:

end_index = text.index('</p>', start_index)

find and index do essentially the same thing; the difference shows when the substring is not found. find returns -1 and index raises a ValueError.

If you want to assert that something is found, use index, because it will stop the program with an error. If you want to handle a miss programmatically, use find (this avoids try/except blocks). After the last paragraph in the file there is never another start-referential, so a miss is normal there and find fits; but once an opening tag has been found, a closing p tag is always expected, so index fits.
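A minimal sketch of that difference, using a made-up sample string (the variable names here are illustrative and not part of the answer's code):

```python
text = "<h2>title</h2><p>body</p>"

# find reports a miss with -1, so a plain `if` can handle it
first_p = text.find("<p>")
missing = text.find("<div>")
print(first_p, missing)

# index raises ValueError on a miss, which stops the program unless caught
try:
    text.index("<div>")
    raised = False
except ValueError:
    raised = True
print(raised)

# The second argument makes the search start at a given position
end_index = text.index("</p>", first_p + len("<p>"))
print(end_index)
```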

Putting this together, the find_start_and_end_idx function looks like this:

def find_start_and_end_idx(start_referential, end_referential, text):
    found_idx = text.find(start_referential)
    if found_idx == -1:
        return  # returns None
    # the content starts right after the opening string
    start_idx = found_idx + len(start_referential)
    end_idx = text.index(end_referential, start_idx)
    return start_idx, end_idx

To use the returned indexes, put them after the string as a slice, like text[start_idx:end_idx]. This returns the text whose first index is start_idx and whose last index is end_idx - 1.

> A slice from start to end does not include end; it goes up to end.

Exactly what's between the p tags.
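As a small illustration of the slice, assuming the start index is taken just after the opening tag as described above (the sample string is made up):

```python
text = "<h2>title</h2><p>body</p>"

# Start right after '<p>', stop right before '</p>'
start_idx = text.find("<p>") + len("<p>")
end_idx = text.index("</p>", start_idx)

inner = text[start_idx:end_idx]
print(inner)
```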

Replacing:

start, end = find_start_and_end_idx('<p>', '</p>', text)
replaced_text = text[start:end].replace(old_text, new_text)
text = text[:start] + replaced_text + text[end:]

Now we just need to repeat this for the whole HTML document.
For that we can use a while True loop that replaces the text inside every p element and, once no start-referential is found, returns the edited string.

def replace_in_between(text, open_, close, old_text, new_text) -> str:
    """Replaces text in between two recognized strings.

    :param text: The text that contains all data
    :param open_: The opening string, in your case '<p>'
    :param close: The closing string, in your case '</p>'
    :param old_text: The substring that you wish to replace
    :param new_text: The string to replace the substring with

    This will change all instances of `old_text` found between `open_` and `close` inside `text`.
    """

    def find_start_and_end_idx(start_referential, end_referential, text):
        found_idx = text.find(start_referential)
        if found_idx == -1:
            return  # returns None
        # The content starts right after the opening string
        start_idx = found_idx + len(start_referential)
        end_idx = text.index(end_referential, start_idx)
        return start_idx, end_idx

    text_idx = 0
    while True:
        result = find_start_and_end_idx(open_, close, text[text_idx:])
        if result is None:
            break
        start = result[0] + text_idx  # The offset was cut off by text[text_idx:]
        end = result[1] + text_idx    # so add it back here

        replaced_text = text[start:end].replace(old_text, new_text)
        text = text[:start] + replaced_text + text[end:]
        # Move text_idx past this closing string so the loop finds the next p tag
        text_idx = start + len(replaced_text) + len(close)
    return text

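This works as intended for plain '<p>' opening tags. An opening tag with attributes, such as <p class="x">, does not contain the literal string '<p>' and will be skipped.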
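If the paragraphs may carry attributes, one possible variation is to match the opening tag with a regular expression instead of a literal string. This is only a sketch and not the answer's method: the names P_BLOCK and bold_in_paragraphs are hypothetical, it assumes p elements are not nested, and it would also touch a phrase that appears inside an attribute of a tag nested within a paragraph.

```python
import re

# '<p' followed by a word boundary matches <p> and <p class="x">, but not <pre>
P_BLOCK = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.DOTALL | re.IGNORECASE)


def bold_in_paragraphs(html, phrase):
    """Wrap `phrase` in strong tags, but only inside <p>...</p> blocks."""

    def wrap(match):
        opening, body, closing = match.groups()
        wrapped = body.replace(phrase, "<strong>" + phrase + "</strong>")
        return opening + wrapped + closing

    # re.sub calls `wrap` once for every paragraph it finds
    return P_BLOCK.sub(wrap, html)


html = (
    "<h2>some text 1</h2>\n"
    '<p class="intro">some text 1, some text 2</p>\n'
    "<pre>some text 1</pre>\n"
)
result = bold_in_paragraphs(html, "some text 1")
print(result)
```

In the question's function, a call like this could stand in for the plain str.replace on the rendered content. For heavily nested or malformed markup, a real HTML parser is likely the safer choice.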

huangapple
  • Published on February 6, 2023, 05:42:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75355719.html