A Regular Expression to make acronyms with word boundaries and remove characters preceding a word

Go Version

go version go1.16.7 linux/amd64

Problem

I am working through an exercise about creating acronyms, and I chose to do it with regular expressions.

Some of the test cases given to me are as follows:

	input:    "Ruby on Rails",
	expected: "ROR"
	
	input:    "GNU Image Manipulation Program",
	expected: "GIMP"

	input:    "Complementary metal-oxide semiconductor",
	expected: "CMOS"

	input:    "Something - I made up from thin air",
	expected: "SIMUFTA"

	input:    "Halley's Comet",
	expected: "HC"

	input:    "The Road _Not_ Taken",
	expected: "TRNT"

The following code passes many of the simple tests. It extracts the first letter of each word and builds an acronym from those letters, for example:

 Portable Network Graphics -> PNG

Code

// Package acronym creates an acronym based on Capitalized Letters
package acronym

import (
	"regexp"
	"strings"
)

// Abbreviate: creates an acronym for a full form string
func Abbreviate(s string) string {
	re := regexp.MustCompile(`\b[A-Za-z]`)
	abbreviation := strings.Join(re.FindAllString(s, -1), "")
	return strings.ToUpper(abbreviation)
}

The only tests I am failing are:

=== RUN   TestAcronym
    acronym_test.go:11: Acronym test [Halley's Comet], expected [HC], actual [HSC]
    acronym_test.go:11: Acronym test [The Road _Not_ Taken], expected [TRNT], actual [TRT]
--- FAIL: TestAcronym (0.00s)
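
Listing the individual matches shows where the two extra or missing letters come from. The following is a minimal sketch; the helper name firstLetters is hypothetical and used only for illustration, while the pattern is the one from the code above.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstLetters returns every match of the pattern `\b[A-Za-z]`.
// The name is illustrative only; it is not part of the exercise.
func firstLetters(s string) []string {
	re := regexp.MustCompile(`\b[A-Za-z]`)
	return re.FindAllString(s, -1)
}

func main() {
	// The apostrophe is a non-word character, so a word boundary
	// sits between ' and s, and the s is matched as well.
	fmt.Println(firstLetters("Halley's Comet"))

	// The underscore is a word character, so there is no word
	// boundary between _ and N, and the N is skipped.
	fmt.Println(firstLetters("The Road _Not_ Taken"))
}
```

The first call yields the matches H, s, C and the second yields T, R, T, which agrees with the HSC and TRT reported by the failing tests.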

Regex101 Playground

Link to Playground in Regex 101

Problem

I am unable to figure out how to match only the H and C in the Halley's Comet test case, and how to pick up the N in the The Road _Not_ Taken test case.

One reason I have to keep lower-case characters [a-z] is the Complementary metal-oxide semiconductor case, along with the lower-case words in certain other test cases.

I could remove characters such as - or _ before applying the regexp, but I think that would not make my function more generic (it would be a hack just to pass the tests).

How do I handle the characters ' and _ so that the acronym function is more robust?

Answer 1

Score: 1


You may use

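// Abbreviate creates an acronym for a full form string.
func Abbreviate(s string) string {
    // The first alternative consumes an apostrophe between word chars without
    // capturing; the second captures a letter after _ or a word boundary.
    re := regexp.MustCompile(`\w'\w|(?:_|\b)([A-Za-z])`)
    var abbreviation strings.Builder
    for _, match := range re.FindAllStringSubmatch(s, -1) {
        abbreviation.WriteString(match[1]) // empty when the first alternative matched
    }
    return strings.ToUpper(abbreviation.String())
}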

Details:

  • \w'\w - a word char, then ', then a word char. This alternative consumes an apostrophe that sits between word chars without capturing anything, so the letter after the ' is never picked up. If you have issues with consecutive matches, replace it with \b'\w.
  • | - or
  • (?:_|\b) - either _ or a word boundary
  • ([A-Za-z]) - Group 1: an ASCII letter (use \p{L} to match any Unicode letter).
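
To check the pattern against the test cases from the question, a small self-contained program along these lines can be used. It is only a sketch: the package-level variable acronymRe and the main harness are additions for illustration and are not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// acronymRe uses the pattern from the answer. Compiling it once at
// package level (an assumption of this sketch) avoids recompiling per call.
var acronymRe = regexp.MustCompile(`\w'\w|(?:_|\b)([A-Za-z])`)

// Abbreviate joins the captured first letters and upper-cases the result.
func Abbreviate(s string) string {
	var b strings.Builder
	for _, m := range acronymRe.FindAllStringSubmatch(s, -1) {
		// m[1] is the empty string when the apostrophe alternative matched.
		b.WriteString(m[1])
	}
	return strings.ToUpper(b.String())
}

func main() {
	// Test cases copied from the question.
	cases := []struct{ input, expected string }{
		{"Ruby on Rails", "ROR"},
		{"GNU Image Manipulation Program", "GIMP"},
		{"Complementary metal-oxide semiconductor", "CMOS"},
		{"Something - I made up from thin air", "SIMUFTA"},
		{"Halley's Comet", "HC"},
		{"The Road _Not_ Taken", "TRNT"},
	}
	for _, c := range cases {
		fmt.Printf("%q -> %s (expected %s)\n", c.input, Abbreviate(c.input), c.expected)
	}
}
```

In "Halley's Comet" the sequence y's is consumed by the first alternative, so the s never reaches the capturing group. In "The Road _Not_ Taken" the leading _ is matched by (?:_|\b), which lets the N be captured.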

huangapple
  • Published on August 12, 2021, at 04:46:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/68748747.html