February 28, 2022, 22:04:29, go

Identify the correct hashtag indexes in tweet messages

Question

I need to identify the correct indexes in twitter messages (various languages, emojis, etc).

I can't find a solution that returns these positions as shown in the example below.

package main

import (
	"regexp"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestA(t *testing.T) {
	text := "🇷🇺 [URGENT] Les forces de dissuasion #nucleaire de la #Russie"
	re := regexp.MustCompile(`#([_A-Za-z0-9]+)`)
	pos := re.FindAllStringIndex(text, -1)
	// FindAllStringIndex returns
	// [0][43,53]
	// [1][60,67]
	// These are the expected positions.
	require.Equal(t, 37, pos[0][0])
	require.Equal(t, 47, pos[0][1])
	require.Equal(t, 54, pos[1][0])
	require.Equal(t, 61, pos[1][1])
}

Answer 1

Score: 2

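The FindAllStringIndex() function returns byte offsets, not rune offsets. The flag emoji at the start of the text is two code points but eight UTF-8 bytes, which is why every returned offset is 6 higher than the one you expect.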
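To make the byte/rune difference concrete, here is a minimal sketch on a shortened version of the tweet. The helper name byteAndRuneOffset is made up for this illustration; it only uses strings.Index and utf8.RuneCountInString from the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// byteAndRuneOffset (hypothetical helper) reports where sub first occurs in s,
// both as a byte offset and as a rune (code point) offset.
// Both values are -1 if sub does not occur.
func byteAndRuneOffset(s, sub string) (byteOff, runeOff int) {
	byteOff = strings.Index(s, sub)
	if byteOff < 0 {
		return -1, -1
	}
	// Count the code points in the prefix that precedes the match.
	return byteOff, utf8.RuneCountInString(s[:byteOff])
}

func main() {
	// The flag is two regional-indicator code points, 4 bytes each in UTF-8.
	text := "\U0001F1F7\U0001F1FA #Russie"
	b, r := byteAndRuneOffset(text, "#")
	fmt.Println(b, r) // 9 3
}
```

The byte offset counts 8 bytes for the flag plus 1 for the space, while the rune offset counts 2 code points plus 1.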

Import "unicode/utf8" and use utf8.RuneCountInString(text[:pos[0][0]]), and so on, instead of pos[0][0], so that you count Unicode code points rather than bytes:

package main

import (
	"regexp"
	"testing"
	"unicode/utf8"

	"github.com/stretchr/testify/require"
)

func TestA(t *testing.T) {
	text := "🇷🇺 [URGENT] Les forces de dissuasion #nucleaire de la #Russie"
	re := regexp.MustCompile(`#\w+`)
	pos := re.FindAllStringIndex(text, -1)
	// Convert each byte offset to a rune offset by counting the
	// code points in the prefix that ends at that byte offset.
	require.Equal(t, 37, utf8.RuneCountInString(text[:pos[0][0]]))
	require.Equal(t, 47, utf8.RuneCountInString(text[:pos[0][1]]))
	require.Equal(t, 54, utf8.RuneCountInString(text[:pos[1][0]]))
	require.Equal(t, 61, utf8.RuneCountInString(text[:pos[1][1]]))
}

See the Go demo.

Also, #\w+ is a shorter pattern that matches a # followed by one or more ASCII letters, digits, or underscores.
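One caveat for tweets in various languages: in Go's regexp package \w is ASCII-only, so a hashtag containing an accented letter gets cut short. The sketch below compares it with a Unicode-aware character class. The variable names are made up, and treating "letters, numbers, and underscore" as the hashtag alphabet is an assumption for illustration, not Twitter's actual tokenization rule.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// \w is equivalent to [0-9A-Za-z_] in Go's RE2 syntax.
	asciiTag = regexp.MustCompile(`#\w+`)
	// \p{L} and \p{N} match any Unicode letter and number (assumed alphabet).
	unicodeTag = regexp.MustCompile(`#[\p{L}\p{N}_]+`)
)

func main() {
	// \u00e9 is the precomposed "é".
	text := "dissuasion #nucl\u00e9aire"
	fmt.Println(asciiTag.FindString(text))   // #nucl
	fmt.Println(unicodeTag.FindString(text)) // #nucléaire
}
```

The byte-to-rune conversion from the answer applies unchanged to the offsets returned by either pattern.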
