Remove special characters and punctuation from a string

Question

I have a function that gets all the hashtag words in a post and outputs the words separated by a comma (because a post can have many hashtags) to be stored in a database column.

function getHashtags ($text) {
    // explode on spaces
    $text = explode(" ", $text);
    $hashtag = "";
    $hashReg = "/^[a-zA-Z0-9]+$/";
    // for every word in post
    foreach ($text as $word) {
        // 1st character #
        $char = substr($word, 0, 1);
        // word after character #
        $ref = substr($word, 1);
        // if 1st character in word is #
        if ($char == "#") {
            // check if only letters & numbers
            if (preg_match ($hashReg, $ref)) {
                // check hashtag length
                if (strlen($ref) <= 11) {
                    // set hashtag
                    $hashtag .= substr($word, 1).",";
                }
            }
        }
    }
    return $hashtag;
}

The function works well, e.g.:

$post = "#rock #music is good";
echo getHashtags($post);

// output: rock,music,

However, if the example were $post = "#rock, #music, is good", the comma after #rock and #music makes the function fail. The same happens with other characters such as full stops, question marks, etc. I have tried adding preg_replace('/[^A-Za-z0-9]/', '', $post), but it does not work. How can I fix it so that "#rock, #music," or "#rock. #music." still outputs the desired result of rock,music?

Answer 1

Score: 1

You can simply use preg_replace to strip the punctuation character and space that precede each #, then explode the result on #.

Example:

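function getHashtags ($text) {
    // collapse "<non-alphanumeric char><space>#" into a bare "#"
    $clean = preg_replace("/[^A-Za-z0-9] #/", "#", $text);

    // split on "#"; each piece is the text that followed a "#"
    $text = explode("#", $clean);

    $hashtag = [];

    foreach ($text as $word) {
        // skip empty pieces (e.g. the one before the first "#")
        if ($word){
            $hashtag[]= $word;
        }
    }

    return implode(',', $hashtag);
}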

Output should be:

getHashtags("#rock, #music, is good, #metal, #is not so good");
=> string(40) "rock,music, is good,metal,is not so good"
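
Note that the output above still carries the plain text that follows a tag (music, is good). If only the tag words themselves are wanted, one possible alternative is to collect them with preg_match_all rather than cleaning and splitting the string. The following is a minimal sketch, not part of the original answer: the function name extractHashtags and the $maxLength parameter are made up for illustration, and the 11-character limit is carried over from the question.

```php
<?php
// Hypothetical helper (not from the original answers): collect hashtags with one regex.
function extractHashtags(string $text, int $maxLength = 11): string
{
    // '#' followed by one or more ASCII letters/digits; group 1 holds the tag word.
    preg_match_all('/#([A-Za-z0-9]+)/', $text, $matches);

    $tags = [];
    foreach ($matches[1] as $tag) {
        // Keep the length limit used by the question's function.
        if (strlen($tag) <= $maxLength) {
            $tags[] = $tag;
        }
    }

    // Joining with implode avoids the trailing comma.
    return implode(',', $tags);
}

echo extractHashtags("#rock, #music, is good"), "\n"; // rock,music
echo extractHashtags("#rock. #music. #metal?"), "\n"; // rock,music,metal
```

Unlike the question's function, this sketch also picks up a tag that is glued to other text, such as #rock#music, so the pattern would need tightening if tags must start a word.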

Answer 2

Score: 0

To handle the case where hashtags are separated by non-alphanumeric characters, you can modify the regular expression used to match hashtags. Currently, the regular expression /^[a-zA-Z0-9]+$/ matches only alphanumeric characters.

You can update it to allow non-alphanumeric characters after the '#' symbol. One way to do this is to match the whole word, '#' included, against a character class that accepts any non-whitespace character, like this:

$hashReg = "/^#[^\s]+$/";

Here is the modified getHashtags function:

function getHashtags($text) {
    // explode on spaces
    $text = explode(" ", $text);
    $hashtags = [];
    $hashReg = "/^#[^\s]+$/";
    // for every word in post
    foreach ($text as $word) {
        // 1st character #
        $char = substr($word, 0, 1);
        // word after character #
        $ref = substr($word, 1);
        // if 1st character in word is #
        if ($char == "#") {
            // check if hashtag matches pattern
            if (preg_match($hashReg, $word)) {
                // check hashtag length
                if (strlen($ref) <= 11) {
                    // add hashtag to array
                    $hashtags[] = $ref;
                }
            }
        }
    }
    // join hashtags with comma and return as string
    return implode(",", $hashtags);
}
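
With this pattern a word such as "#rock," passes the check, and $ref still carries the trailing comma, so the punctuation ends up in the result. One possible adjustment, sketched here on the assumption that only trailing punctuation needs to go, is to strip it before validating the word. The name getHashtagsTrimmed is illustrative and does not come from the original answer.

```php
<?php
// Hypothetical variant of getHashtags: trims trailing punctuation from each tag.
function getHashtagsTrimmed(string $text): string
{
    $hashtags = [];

    // Split on any run of whitespace instead of a single space.
    foreach (preg_split('/\s+/', trim($text)) as $word) {
        // Only words that start with '#' are candidates.
        if (substr($word, 0, 1) !== '#') {
            continue;
        }

        // Drop every trailing character that is not a letter or digit.
        $ref = preg_replace('/[^A-Za-z0-9]+$/', '', substr($word, 1));

        // Same checks as the question: alphanumeric only, at most 11 characters.
        if (preg_match('/^[A-Za-z0-9]+$/', $ref) && strlen($ref) <= 11) {
            $hashtags[] = $ref;
        }
    }

    return implode(',', $hashtags);
}

echo getHashtagsTrimmed("#rock, #music, is good"), "\n"; // rock,music
```

This keeps the word-by-word structure of the question's function, so a tag with punctuation in the middle (for example #rock#music) is still rejected rather than split.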

huangapple
  • Posted on May 10, 2023, 22:36:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76219713.html