Replace the < and > of invalid HTML tags with &lt; and &gt;

Question

When a bot sends a message to Telegram with HTML formatting, a message containing something like <argument> causes an error, so these angle brackets have to be replaced with &lt; and &gt;. Writing the entities by hand every time in a large block of text is really inconvenient. I would like to write a regular expression that replaces them automatically but does not touch valid HTML tags. For example:

Example of a valid tag that does not need to be replaced:

<a href="tg://user?id=1">Bot</a>

Invalid tag that needs to be replaced:

<argument>

Here is the code I tried, but it doesn't work:

import re

def replace_invalid_tags(html_string):
    invalid_tag_pattern = r'<[^a-zA-Z/!?](.*?)>'
    fixed_html = re.sub(invalid_tag_pattern, r'&lt;&gt;', html_string)
    return fixed_html

html_string = '<a href="#">Link</a> <argument1> <argument2>'
fixed_html = replace_invalid_tags(html_string)
print(fixed_html)

Answer 1

Score: 1

I would personally suggest you use a Python HTML parsing or sanitizing
library to do the job. Regular expressions are great and I love them. But
in some cases I prefer using well-tested libraries which are especially
built for the purpose of the problem.

I'm not a Python programmer, but mostly a PHP programmer. In good
CMS projects, you can add some sanitizing libraries such as
HTMLPurifier and define rules.

In your case, some tags should be transformed to HTML entities so that
they display as normal text, while other tags have to be left as they
are. Some attributes and specific tags should certainly be dropped as
well (ex: <img onload="alert('xss attack')"> or
<script>alert('bad')</script>). This is where parsers or sanitizing
libs will do a safer job.

Let's say that these tags are allowed:

  • <a> with the href attribute. Probably no other attributes should
    be allowed. Typically, I would drop style="font-size: 100px".
  • <strong> and <em> without attributes. What about the old <b> and
    <i> tags? I would transform them to <strong> and <em> respectively,
    as they might be useful for readability. (Telegram's HTML parse
    mode accepts <b> and <i> directly, so this conversion is optional
    there.)

All other tags should be converted to <var> (if allowed) with the content
converted to HTML special chars (< to &lt; and > to &gt;) as you
mentioned. It may be safe to handle other conversions too, if needed.

In Python, I see that you can use the
html-sanitizer library.

I see that one can define some preprocessor functions, typically
to convert some tags if needed. This means that you could create a
function that will convert all unauthorized tags to a <var> or
<pre> tag and fill its content with the escaped equivalent HTML of
the found tag. Some pre-built preprocessor functions already exist,
such as bold_span_to_strong(), so there are some examples to solve your problem.
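
To illustrate the parser-based idea without relying on any third-party API, here is a minimal sketch built on Python's standard html.parser module instead of html-sanitizer. The class name AllowlistEscaper, the function escape_unknown_tags and the ALLOWED_TAGS set are made up for this example. It keeps allowed tags verbatim (attributes are not filtered) and escapes every other tag; comments and doctype declarations are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the tags we want to keep as real HTML.
ALLOWED_TAGS = {"a", "strong", "em"}


class AllowlistEscaper(HTMLParser):
    """Re-emit allowed tags as-is and escape every other tag."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # the tag exactly as written
        self.parts.append(raw if tag in ALLOWED_TAGS else escape(raw, quote=False))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>) are treated like opening tags.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        raw = f"</{tag}>"
        self.parts.append(raw if tag in ALLOWED_TAGS else escape(raw, quote=False))

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def escape_unknown_tags(text):
    parser = AllowlistEscaper()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(escape_unknown_tags('<a href="#">Link</a> <argument1> <argument2>'))
```

A real sanitizer would also filter the attributes of the allowed tags, which this sketch deliberately leaves out.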

A pure regex solution to find invalid tags could be done with this:

<\s*/?\s*(?!(?:a|strong|em)\b)\w+[^>]*>

Demo here: https://regex101.com/r/xiZl1n/1

I'm accepting optional spaces and the slash for the closing tag, and then
using a negative lookahead to avoid matching the tags you want to accept.
I added the word boundary \b after the valid tag names so that
<argument>, which starts with an "a", is not mistaken for an allowed
<a> tag. I want to match the full word only.

You can then decide what to do with all your matches. If you directly
want to replace < by &lt; and > by &gt; then you can do this:

https://regex101.com/r/xiZl1n/3
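
In Python, the same pattern could be applied with re.sub and a callback that escapes each match. This is only a sketch: the function name escape_invalid_tags is made up for the example, and the list of allowed tags (a, strong, em) is the assumption used throughout this answer.

```python
import re

# Matches any tag whose name is NOT one of the allowed ones (a, strong, em).
INVALID_TAG = re.compile(r"<\s*/?\s*(?!(?:a|strong|em)\b)\w+[^>]*>", re.IGNORECASE)


def escape_invalid_tags(text):
    """Escape the angle brackets of every tag that is not allowed."""
    def _escape(match):
        return match.group(0).replace("<", "&lt;").replace(">", "&gt;")

    return INVALID_TAG.sub(_escape, text)


print(escape_invalid_tags('<a href="#">Link</a> <argument1> <argument2>'))
```

Note that this only touches things that look like tags, so a lone < or > (as in -> or >.<) is left unescaped.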

EDIT: handle ->, >.<, =>, etc

I'm still convinced that a parser is the best choice. But you asked
if the regex could be changed to handle more cases. I personally
don't think one regex can do it. It will certainly be unsafe too.

But as I commented, you could try something in several steps:

  1. Convert <i> tags to <em> and <b> tags to <strong> if
    you think it's worth it.

  2. Find all valid tags, such as <a>, <strong> and <em>, and
    replace them by [a], [strong] and [em] respectively.
    This can be done with a pattern such as:

    <\s*(/?)\s*(?=(?:a|strong|em)\b)(\w+[^>]*)>

    and replace it by [\1\2]. In action:
    https://regex101.com/r/xiZl1n/4

  3. Now you can replace < by &lt; and > by &gt;.

  4. Transform the valid tags back to correct HTML. It will be
    the same regex as in step 2, but with brackets instead:

    \[\s*(/?)\s*((?:a|strong|em)\b)([^\]]*)\]

    and replace it by <\1\2\3>. In action:
    https://regex101.com/r/xiZl1n/8

    In this step, capturing group 3 contains all the
    tag attributes. This is where you could filter them and only
    accept some specific ones, such as href, id, title,
    but remove all the others (ex: class, style, onclick).
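
Since the question is about Python, the four steps above could be sketched like this. The function name escape_except_valid_tags is made up for the example, attributes are passed through unfiltered, and the limitation of the bracket trick applies: text that already contains something like [a] or [em] would be turned into a tag in step 4.

```python
import re

FLAGS = re.IGNORECASE

# Step 2 pattern: valid tags (a, strong, em), opening or closing.
VALID_TAG = re.compile(r"<\s*(/?)\s*(?=(?:a|strong|em)\b)(\w+[^>]*)>", FLAGS)
# Step 4 pattern: the same tags in their temporary bracket form.
BRACKET_TAG = re.compile(r"\[\s*(/?)\s*((?:a|strong|em)\b)([^\]]*)\]", FLAGS)


def escape_except_valid_tags(text):
    # Step 1 (optional): <b> -> <strong> and <i> -> <em>, dropping attributes.
    text = re.sub(r"<\s*(/?)\s*b\b[^>]*>", r"<\1strong>", text, flags=FLAGS)
    text = re.sub(r"<\s*(/?)\s*i\b[^>]*>", r"<\1em>", text, flags=FLAGS)
    # Step 2: protect the valid tags by turning them into [tag] form.
    text = VALID_TAG.sub(r"[\1\2]", text)
    # Step 3: escape every remaining angle bracket.
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    # Step 4: turn the protected tags back into real HTML.
    return BRACKET_TAG.sub(r"<\1\2\3>", text)


print(escape_except_valid_tags('<a href="#">Link</a> <argument> -> >.< <B>bold</B>'))
```

Unlike the single-regex version, this also escapes the stray < and > in ->, =>, and >.<.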

It may be important to enable the i flag to be case-insensitive.
This is what it could look like in JavaScript:

<!-- begin snippet: js hide: false console: true babel: false -->

<!-- language: lang-js -->

const input = `<a href="https://www.example.com/page">Example page</a> is a <strong>valid</strong> tag.
< EM class="something"> is also allowed</em>
But <param> or <argument> should be escaped.
This also: <br/> <
br /> <img onload="alert('xss')" src="http://images.com/nice-cat.jpg" alt="Nice cat" />
<script type="text/javascript">alert('Bad JS code')</` + `script>
<a href="javascript:alert('XSS attack')" onclick="alert('xss')">A bad link</a>
<a href = http://test.com title="This is just a test">test.com</a>

Turn left <- or turn right ->
Also, accept this => or a smiley >.<

Accept <B>bold</B> and <i style="color:green">italic without style</i> converted to new tags.
Also strip <b href="https://www.google.com">wrong attributes</b>`;

// Attributes to drop are all design changes done with classes or style
// and all attributes such as onload, onclick, etc.
const regexAttributeToDrop = /^(?:style|class|on.*)$/i;
// The attributes which can have a URL.
const regexAttributeWithURL = /^(?:href|xlink:href|src)$/i;
// Only accept absolute URLs and not bad stuff like javascript:alert('xss')
const regexValidURL = /^(https?|ftp|mailto|tel):/i;

/**
 * Filter tag attributes, based on the tag name, if provided.
 *
 * @param string attributes All the attributes of the tag.
 * @param string tagName Optional tag name (in lowercase).
 * @return string The filtered string of attributes.
 */
function filterAttributes(attributes, tagName = '') {
  // Regex for attribute: $1 = attribute name, $2 = value, $3 = single/double quote or nothing.
  // The \3 backreference makes a quoted value run up to its matching closing quote.
  const regexAttribute = /\s*([\w-]+)\s*=\s*((?:(["']).*?\3|[^"'=<>\s]+))/g;
  
  attributes = attributes.replace(regexAttribute, function (attrMatch, attrName, attrValue, quote) {
    // Don't keep attributes that can change the rendering or run JavaScript.
    if (attrName.match(regexAttributeToDrop)) {
      return '';
    }
    // Not an attribute to drop.
    else {
      // If the attribute is "href" or "xlink:href" then only accept full URLs
      // with the correct protocols and only for <a> tags.
      if (attrName.match(/^(?:xlink:)?href$/i)) {
        // If it's not a link then drop the attribute.
        if (tagName !== 'a') {
          return '';
        }
        // The attribute value can be quoted or not so we'll remove the quotes.
        const url = attrValue.replace(/^['"]|['"]$/g, '');
        // If the URL is valid.
        if (url.match(regexValidURL)) {
          return ` ${attrName}="${url}"`;
        }
        // Invalid URL: drop href and notify visually.
        else {
          return ' class="invalid-url" title="Invalid URL"';
        }
      }
      // All other attributes: nothing special to do.
      else {
        return ` ${attrName}=${attrValue}`;
      }
    }
  });

  // Clean up: trim spaces around. If it's not empty then just add a space before.
  attributes = attributes.trim();
  if (attributes.length) {
    attributes = ' ' + attributes;
  }
  
  return attributes;
}

const steps = [
  {
    // Replace <b> by <strong>.
    search: /<\s*(\/?)\s*(b\b)([^>]*)>/gi,
    replace: '<$1strong>' // or '<$1strong$3>' to keep the attributes.
  },
  {
    // Replace <i> by <em>.
    search: /<\s*(\/?)\s*(i\b)([^>]*)>/gi,
    replace: '<$1em>' // or '<$1em$3>' to keep the attributes.
  },
  {
    // Transform accepted HTML tags into bracket tags.
    search: /<\s*(\/?)\s*(?=(?:a|strong|em)\b)(\w+[^>]*)>/gi,
    replace: '[$1$2]'
  },
  {
    // Replace "<" by "&lt;".
    search: /</g,
    replace: '&lt;'
  },
  {
    // Replace ">" by "&gt;".
    search: />/g,
    replace: '&gt;'
  },
  {
    // Transform the accepted tags back to correct HTML.
    search: /\[\s*(\/?)\s*((?:a|strong|em)\b)([^\]]*)\]/gi,
    // For the replacement, we'll provide a callback function
    // so that we can alter the attributes if needed.
    replace: function (fullMatch, slash, tagName, attributes) {
      // Convert the tag name to lowercase.
      tagName = tagName.toLowerCase();
      // If the slash group is empty then it's the opening tag.
      if (slash === '') {
        attributes = filterAttributes(attributes, tagName);
        return '<' + tagName + attributes + '>';
      }
      // The closing tag.
      else {
        return '</' + tagName + '>';
      }
    }
  },
  {
    // Optional: inject <br /> tags at each new line to preserve user input.
    search: /(\r?\n)/gm,
    replace: '<br />$1'
  }
];

let output = input;

steps.forEach((step, i) => {
  output = output.replace(step.search, step.replace);
});

document.getElementById('input').innerText = input;
document.getElementById('output').innerText = output;
document.getElementById('rendered-output').innerHTML = output;

<!-- language: lang-css -->

body {
  font-family: Georgia, "Times New Roman", serif;
  margin: 0;
  padding: 0 1em 1em 1em;
}

p:first-child { margin-top: 0; }

pre {
  background: #f8f8f8;
  border: 1px solid gray;
  padding: .5em;
  overflow-x: scroll;
}

.box {
  background: white;
  box-shadow: 0 0 .5em rgba(0, 0, 0, 0.1);
  padding: 1em;
}

.invalid-url {
  color: darkred;
}

<!-- language: lang-html -->

<h1>Quick &amp; dirty HTML filtering with regular expressions</h1>

<p>Input:</p>
<pre id="input"></pre>

<p>Output:</p>
<pre id="output"></pre>

<p>Rendered output:</p>
<div id="rendered-output" class="box"></div>

<!-- end snippet -->
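
For a Python bot, the attribute filtering done by filterAttributes in the JavaScript snippet could look roughly like the sketch below. The name filter_attributes and the three patterns are assumptions for this example. Unlike the JavaScript version, it simply drops an href with an invalid URL instead of adding a class and title, and it always re-quotes the values with double quotes.

```python
import re

# name = "value" | 'value' | unquoted-value
ATTRIBUTE = re.compile(r"""\s*([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^"'=<>\s]+))""")
# Attributes that change the rendering or can run JavaScript.
DROP = re.compile(r"^(?:style|class|on.*)$", re.IGNORECASE)
# Only absolute URLs with a known scheme are accepted for href.
VALID_URL = re.compile(r"^(?:https?|ftp|mailto|tel):", re.IGNORECASE)


def filter_attributes(attributes, tag_name=""):
    """Return the filtered attribute string, with a leading space if non-empty."""
    kept = []
    for match in ATTRIBUTE.finditer(attributes):
        name = match.group(1).lower()
        # Exactly one of the three value groups is set for each match.
        value = next(g for g in match.group(2, 3, 4) if g is not None)
        if DROP.match(name):
            continue
        if name == "href" and (tag_name != "a" or not VALID_URL.match(value)):
            continue
        kept.append('{}="{}"'.format(name, value.replace('"', "&quot;")))
    return " " + " ".join(kept) if kept else ""


print(filter_attributes(' href = http://test.com title="This is just a test"', "a"))
```

It could be called from the replacement callback of step 4, with the third capturing group as its first argument.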
