Regex to parse badly formatted polynomials

Question

Shortish version

I am using this regex:

(^|[yY]\s{0,}\=|\+|\-)\s{0,}([0-9]{0,}\.?[0-9]{0,})\s{0,}(\*{0,1}[xX]{0,1})\s{0,}(\^{0,1})(-?)([0-9]{0,}\.?[0-9]{0,})(\s{0,}|$)?

to try to extract the coefficient and order of every element from equations like this:

y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1

I want the regex to ignore the erroneous 4x^ which is missing its power number (doesn't currently do this) and allow me to get to this final result:

((1.0, 1.0), (3.3, -50.0), (15.0, 25.5), (2.0, 2.0), (3.0, -2.5), (1.1, 0.0))

Here the first coordinate is the coefficient and the second is the order of each element. Currently the regex above 'nearly' works if I take groups 1&2 and 5&6 to give me the coefficient and order respectively.

It just falls over on the erroneous 4x^ and feels extremely inelegant, but I am fairly new to regex and am not sure what improvements to make.

How can I improve this regex, and also fix it so that 4x^ is considered 'wrong' but 4x2 and 4x^2 are both fine?

tl;dr version

I am trying to parse polynomial equations entered by users in order to validate each equation and then decompose it into a series of elements. The equations will be presented as strings.

Here is an example of how the users are asked to format their string:

y = 2.0x^2.5 - 3.1x + 5.2

Where x is the independent variable (not a times symbol) and y is the dependent variable.

In reality the users commonly make any of the following mistakes:

  • Forgetting to include y =
  • Adding a * to coefficients such as y = 2.0*x
  • Using integers instead of floats, e.g. y = 5x
  • Missing the ^ when setting the order e.g. y = x3
  • Adding or removing whitespace anywhere

However, for all of these I'd say it's still easily understandable what the user is trying to write. By that I mean it is obvious what the coefficient and order are meant to be for each element.

So what I want to do is write some regex that correctly splits the entered string into separate elements and can get me A (the coefficient) and B (the order) of each element where an element in general is of the form Ax^B and A and B can each be any real number.
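Several of the mistakes listed above (stray whitespace, an upper-case X, a missing y =) can be removed before any pattern is applied, which leaves the regex with fewer cases to handle. The following is only a sketch of that idea; normalize is a hypothetical helper name, not something from the original post. It assumes that stripping every space is acceptable, which is lossy: an input such as "1 2x" would become "12x".

```go
package main

import (
	"fmt"
	"strings"
)

// normalize is a hypothetical pre-processing helper (an assumption, not part
// of the original question). It lower-cases the input, removes all
// whitespace, and drops a leading "y=" if one is present.
func normalize(s string) string {
	s = strings.ToLower(s)
	// strings.Fields splits on any run of whitespace; joining the pieces
	// with "" therefore removes every space and tab.
	s = strings.Join(strings.Fields(s), "")
	return strings.TrimPrefix(s, "y=")
}

func main() {
	fmt.Println(normalize("Y = 2.0x^2.5 - 3.1 X + 5.2"))
	fmt.Println(normalize("5x"))
}
```

After this step a pattern only needs to deal with a lower-case x and no spaces, at the cost of losing the original spacing for error messages.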

I devised the following example:

y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1

I believe this covers all of the potential issues I outlined above, plus one outright mistake: in 4x^+2x^2 the element 4x^ is missing its order.

For this example I'd like to get to: ((1.0, 1.0), (3.3, -50.0), (15.0, 25.5), (2.0, 2.0), (3.0, -2.5), (1.1, 0.0)) where 4x^ has been ignored.

I am somewhat new to regex but I have made an effort using regex101.com to create the following:

(^|[yY]\s{0,}\=|\+|\-)\s{0,}([0-9]{0,}\.?[0-9]{0,})\s{0,}(\*{0,1}[xX]{0,1})\s{0,}(\^{0,1})(-?)([0-9]{0,}\.?[0-9]{0,})(\s{0,}|$)?

This appears to nearly work, with the following issues:

  • It does not catch the missing order in the 4x^ example given above. I am not sure how to make the order number mandatory when ^ is present while keeping it optional when ^ is absent, as in y = 4x2
  • It feels extremely verbose / inelegant, but being inexperienced I am struggling to see where improvements can be made

Also please note I am happily ignoring the issue of repeated elements with the same order not being summed, e.g. I am happy to ignore y = x^2 + x^2 not appearing as y = 2x^2.

Thank you for any help.

P.S. The program will be written in Go, but I am also new to Go, so I am prototyping in Python first. I am not sure whether this makes any difference to the regex (I really am that new to regex).

Answer 1

Score: 1

The following regex will mostly do:

(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)

I say mostly because this solution treats the "4x^" case as having order 1. The requirements are already pretty lenient, and trying to ignore such a term would make the regex much more complicated, or even impossible, because it creates an ambiguity that a regex cannot resolve.

Please note that absent coefficients/exponents will not be captured as '1.0' as in your example result. That has to be done after applying the regex, by treating every empty capture group as '1' (or as '0' for the exponent, depending on which groups captured).

The regex is available on regex101.com if you want to check or experiment with how it works.

Here is a working Go program that tests a couple of cases:

package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// e matches one term. Capture groups by index:
//   1 (c1): coefficient of an x term    2 (e1): exponent written after ^
//   3 (e2): exponent written without ^  4 (c2): constant term (no x)
const e = `(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)`

var cases = []string{
	"y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1",
	"3.3X^-50",
}

// parse returns the first non-empty string in ss as a float64, with any
// embedded spaces removed, or the default d if every string is empty.
func parse(d float64, ss ...string) float64 {
	for _, s := range ss {
		if s != "" {
			// The groups only capture well-formed numbers, so the error is ignored.
			c, _ := strconv.ParseFloat(strings.Replace(s, " ", "", -1), 64)
			return c
		}
	}
	return d
}

func main() {
	re := regexp.MustCompile(e)
	for i, c := range cases {
		fmt.Printf("testing case %v: %q\n", i, c)
		ms := re.FindAllStringSubmatch(c, -1)
		if ms == nil {
			fmt.Println("no match")
			continue
		}
		for j, m := range ms {
			fmt.Printf("  match %v: %q\n", j, m[0])
			// A missing coefficient defaults to 1.
			coef := parse(1.0, m[1], m[4])
			// A missing exponent defaults to 1 for an x term, 0 for a constant.
			de := 1.0
			if m[4] != "" {
				de = 0.0
			}
			exp := parse(de, m[2], m[3])
			fmt.Printf("    c: %v\n", coef)
			fmt.Printf("    e: %v\n", exp)
		}
	}
}

Which outputs:

testing case 0: "y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1"
  match 0: "x"
    c: 1
    e: 1
  match 1: "+3.3X^-50"
    c: 3.3
    e: -50
  match 2: "+ 15x25.5"
    c: 15
    e: 25.5
  match 3: "- 4x"
    c: -4
    e: 1
  match 4: "+2x^2"
    c: 2
    e: 2
  match 5: "+3*x-2.5"
    c: 3
    e: -2.5
  match 6: "+1.1"
    c: 1.1
    e: 0
testing case 1: "3.3X^-50"
  match 0: "3.3X^-50"
    c: 3.3
    e: -50

The program is also available on the Go Playground to try.

huangapple
  • Published on 2016-12-25 08:06:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/41317786.html