Remove special characters and punctuation from a string
Question
I have a function that gets all the hashtag words in a post and outputs the words separated by a comma (because a post can have many hashtags) to be stored in a database column.
function getHashtags ($text) {
    // explode on spaces
    $text = explode(" ", $text);
    $hashtag = "";
    $hashReg = "/^[a-zA-Z0-9]+$/";
    // for every word in post
    foreach ($text as $word) {
        // 1st character #
        $char = substr($word, 0, 1);
        // word after character #
        $ref = substr($word, 1);
        // if 1st character in word is #
        if ($char == "#") {
            // check if only letters & numbers
            if (preg_match ($hashReg, $ref)) {
                // check hashtag length
                if (strlen($ref) <= 11) {
                    // set hashtag
                    $hashtag .= substr($word, 1).",";
                }
            }
        }
    }
    return $hashtag;
}
The function works well, e.g.:
$post = "#rock #music is good";
echo getHashtags($post);
// output: rock,music,
However, if the example were $post = "#rock, #music, is good", the commas after #rock and #music stop the function from working. The same happens with other characters such as full stops and question marks. I have tried adding preg_replace('/[^A-Za-z0-9]/', '', $post), but it does not work. How can I fix it so that "#rock, #music," or "#rock. #music." still outputs the desired result rock,music?
Answer 1
Score: 1
You can simply use preg_replace to remove the characters and spaces between the tags, and then explode the result on #.
Example:
function getHashtags ($text) {
    $clean = preg_replace("/[^A-Za-z0-9] #/", "#", $text);
    
    $text = explode("#", $clean);
    $hashtag = [];
    foreach ($text as $word) {
        if ($word){
            $hashtag[]= $word;
        }
    }
    
    return implode(',', $hashtag);
}
Output should be:
getHashtags("#rock, #music, is good, #metal, #is not so good");
=> string(40) "rock,music, is good,metal,is not so good"
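Note that the output above still carries non-hashtag text such as " is good", because exploding on # keeps everything up to the next #. If only the tag words are wanted, one possible alternative is to pull them out directly with preg_match_all. The following is a minimal sketch, not the answer author's code: the function name extractHashtags and the $maxLength parameter are hypothetical, and the 11-character limit is simply carried over from the question.

```php
<?php
// Hypothetical helper (an assumption, not part of the original answer):
// collect hashtags in a single regex pass.
function extractHashtags(string $text, int $maxLength = 11): string
{
    // (?<!\S) requires '#' to start a word (start of string or after whitespace).
    // The capture group takes the run of letters/digits directly after '#',
    // so trailing punctuation such as ',' or '.' is left out.
    preg_match_all('/(?<!\S)#([A-Za-z0-9]+)/', $text, $matches);

    $tags = [];
    foreach ($matches[1] as $tag) {
        // Same length limit as in the question.
        if (strlen($tag) <= $maxLength) {
            $tags[] = $tag;
        }
    }

    return implode(',', $tags);
}

echo extractHashtags("#rock, #music, is good"); // rock,music
```

One design difference to be aware of: a word like #rock-music yields rock here, whereas the loop in the question would reject the whole word.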
Answer 2
Score: 0
To handle the case where hashtags are separated by non-alphanumeric characters, you can modify the regular expression used to match hashtags. Currently, the regular expression /^[a-zA-Z0-9]+$/ matches only alphanumeric characters.
You can update it to accept non-alphanumeric characters after the '#' symbol. One way to do this is to use a negated character class that matches any non-whitespace character, like this:
$hashReg = "/^#[^\s]+$/";
Here is the modified getHashtags function:
function getHashtags($text) {
    // explode on spaces
    $text = explode(" ", $text);
    $hashtags = [];
    $hashReg = "/^#[^\s]+$/";
    // for every word in post
    foreach ($text as $word) {
        // 1st character #
        $char = substr($word, 0, 1);
        // word after character #
        $ref = substr($word, 1);
        // if 1st character in word is #
        if ($char == "#") {
            // check if hashtag matches pattern
            if (preg_match($hashReg, $word)) {
                // check hashtag length
                if (strlen($ref) <= 11) {
                    // add hashtag to array
                    $hashtags[] = $ref;
                }
            }
        }
    }
    // join hashtags with comma and return as string
    return implode(",", $hashtags);
}
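With the pattern above, a word such as "#rock," passes the check, but the comma stays in $ref, so the stored tag is "rock," rather than "rock". If the punctuation should be dropped, one option is to trim trailing non-alphanumeric characters from each word and then keep the question's original checks. This is only a sketch under that assumption; the name getHashtagsTrimmed is hypothetical and not part of the answer.

```php
<?php
// Hypothetical variant of the question's loop (an assumption, not the
// answer author's code): trim trailing punctuation, then validate.
function getHashtagsTrimmed(string $text): string
{
    $hashtags = [];

    // Split on any run of whitespace rather than on single spaces.
    foreach (preg_split('/\s+/', trim($text)) as $word) {
        // Skip words that do not start with '#'.
        if (substr($word, 0, 1) !== '#') {
            continue;
        }

        // Drop trailing non-alphanumeric characters such as ',' '.' or '?'.
        $ref = preg_replace('/[^A-Za-z0-9]+$/', '', substr($word, 1));

        // Original checks: letters and digits only, at most 11 characters.
        if (preg_match('/^[A-Za-z0-9]+$/', $ref) && strlen($ref) <= 11) {
            $hashtags[] = $ref;
        }
    }

    return implode(',', $hashtags);
}

echo getHashtagsTrimmed("#rock. #music? is good"); // rock,music
```

Because the alphanumeric check still runs after trimming, a word with punctuation in the middle (for example #ro-ck) is rejected, as in the original function.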