Python - replace some text in html only between p-tags
Question
I'm looking for a way to do something like this:
1. I have HTML text such as:
<h2> some heading 1</h2>
<p> some text 1, some text 2</p>
<p> some text 3, some text 4</p>
...
<h2>some heading 2</h2>
<p> some text 5, some text 6</p>
<p> some text 7, some text n</p>
<pre> some code </pre>
(...)
- I need to replace some words and sentences, but only between p tags.
- Any ideas?
I need this to improve a function I am developing. The function loads publication data from WordPress, and then some phrases in the publication text need to be wrapped in HTML strong tags. The publication data contains many HTML tags, and I need to change only the parts of the text that are enclosed in p tags (paragraphs).
The post_text variable is the word or chunk of text that I need to replace (in practice, wrap in strong tags).
import requests

def wp_bold_post_text(wordpress_url, wordpress_header, object_type, id, post_text):
    api_url = wordpress_url + f'wp-json/wp/v2/{object_type}/{id}'
    data = {}  # {'status': 'inherit, publish, auto-draft, draft, trash, private, pending'}
    response = requests.get(api_url, headers=wordpress_header, json=data)
    # publication change: wraps the first occurrence anywhere in the content
    publication_json = response.json()
    old_publication_str = publication_json["content"]["rendered"]
    new_publication_str = old_publication_str.replace(post_text, "<strong>" + post_text + "</strong>", 1)
    # publication
    api_url = wordpress_url + f'wp-json/wp/v2/{object_type}/{id}'
    data = {'content': new_publication_str}
    response = requests.post(api_url, headers=wordpress_header, json=data)
    return print(response)  # (response.json()) # ["content"]["rendered"]
Answer 1
Score: 1
What you want is to replace text between two marker strings: the start marker '<p>' and the end marker '</p>'.
To do this, start by locating the start marker with the find string method. text.find('<p>') returns the index of the first occurrence of '<p>', that is, the index of its <. Since you want the text between the tags, store the index just after the > as the start index:
start_index = text.find('<p>') + len('<p>')
Next, find the end marker. It comes after the start marker, so pass the start index as the second argument of the index string method:
end_index = text.index('</p>', start_index)
find and index do essentially the same thing; they differ only when the substring is not found: find returns -1, whereas index raises a ValueError.
If you want to assert that something is found, use index, because the exception stops the program and tells you what went wrong. If you want to handle the missing case programmatically, use find (which avoids try/except blocks). Here, the search for a start marker is expected to fail eventually, once the last paragraph has been processed, so find fits the start marker. A closing </p> is always expected after an opening <p>, so index fits the end marker.
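To make the difference concrete, here is a small sketch. The sample string and the '<table>' marker are made up for illustration; only the built-in str.find and str.index methods are used.

```python
text = "<h2>heading</h2><p>some text</p>"

# find() reports a missing substring with -1, so a plain `if` can handle it.
found = text.find("<p>")        # index of the "<" of the first <p>
missing = text.find("<table>")  # no such substring in the sample
print(found, missing)

# index() raises ValueError instead, which makes malformed input fail loudly.
try:
    text.index("</table>")
except ValueError as exc:
    print("not found:", exc)
```
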
Putting this together, the find_start_and_end_idx function looks like this:
def find_start_and_end_idx(start_referential, end_referential, text):
    tag_idx = text.find(start_referential)
    if tag_idx == -1:
        return None  # no start marker left
    # The content starts right after the start marker
    start_idx = tag_idx + len(start_referential)
    end_idx = text.index(end_referential, start_idx)
    return start_idx, end_idx
To use the returned indexes, slice the string: text[start_idx:end_idx]. This returns the substring whose first character is at start_idx and whose last character is at end_idx - 1.
> A slice from start to end does not include end; it goes up to, but not including, end.
That is exactly what is between the p tags.
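As a quick check of the slicing, here is a minimal sketch on a made-up one-paragraph string:

```python
text = "<h2>heading</h2><p>some text</p>"

# Start right after the opening tag, stop right before the closing tag.
start_idx = text.find("<p>") + len("<p>")
end_idx = text.index("</p>", start_idx)

inner = text[start_idx:end_idx]
print(inner)  # some text
```
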
Replacing:
start, end = find_start_and_end_idx('<p>', '</p>', text)
replaced_text = text[start:end].replace(old_text, new_text)
text = text[:start] + replaced_text + text[end:]
Now we just need to repeat this for the whole HTML document.
For that we can use a while True loop that replaces the text in every p section and, once no start marker is found, returns the edited string.
def replace_in_between(text, open_, close, old_text, new_text) -> str:
    """Replaces text in between two recognized strings.
    :param text: The text that contains all data
    :param open_: The opening string, in your case '<p>'
    :param close: The closing string, in your case '</p>'
    :param old_text: The substring that you wish to replace
    :param new_text: The string to replace the substring with
    This will change all instances of `old_text` that lie between
    `open_` and `close` inside `text`.
    """
    def find_start_and_end_idx(start_referential, end_referential, text):
        tag_idx = text.find(start_referential)
        if tag_idx == -1:
            return None  # no start marker left
        start_idx = tag_idx + len(start_referential)
        end_idx = text.index(end_referential, start_idx)
        return start_idx, end_idx

    text_idx = 0
    while True:
        result = find_start_and_end_idx(open_, close, text[text_idx:])
        if result is None:
            break
        start = result[0] + text_idx  # The indexes are relative to text[text_idx:]
        end = result[1] + text_idx  # so add the offset back here
        replaced_text = text[start:end].replace(old_text, new_text)
        text = text[:start] + replaced_text + text[end:]
        # Move past the closing marker so the loop finds the next p tag
        text_idx = start + len(replaced_text) + len(close)
    return text
This works as intended.
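For a quick test, here is a condensed, self-contained sketch of the same idea, run on a sample shaped like the one in the question. It uses the start argument of find instead of slicing the text, and the sample HTML is made up. It assumes the paragraphs are written as plain '<p>' without attributes; a tag such as '<p class="x">' would not match this literal marker, and an HTML parser would be the safer choice for that case.

```python
def replace_in_between(text, open_, close, old_text, new_text):
    """Replace old_text with new_text only between open_ and close markers."""
    text_idx = 0
    while True:
        tag_idx = text.find(open_, text_idx)
        if tag_idx == -1:
            break  # no more opening markers
        start = tag_idx + len(open_)
        end = text.index(close, start)  # ValueError if a marker is unclosed
        replaced = text[start:end].replace(old_text, new_text)
        text = text[:start] + replaced + text[end:]
        # Continue after the closing marker of the section just edited.
        text_idx = start + len(replaced) + len(close)
    return text


html = (
    "<h2>some text 1</h2>\n"
    "<p>some text 1, some text 2</p>\n"
    "<pre>some text 1</pre>\n"
    "<p>some text 1 again</p>\n"
)

result = replace_in_between(
    html, "<p>", "</p>", "some text 1", "<strong>some text 1</strong>"
)
print(result)
```

In the question's wp_bold_post_text, the call to old_publication_str.replace(...) could be swapped for a call like this one. Note that this replaces every occurrence inside the paragraphs, whereas the original call replaced only the first occurrence in the whole content.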