January 9, 2023, 06:40:09

Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

Question

I need to tokenize a string on all spaces that are not between quotes. I am using regex in JavaScript notation.

For example:

" Test Test " ab c " Test" "Test " "Test" "T e s t"

becomes

[" Test Test ",ab,c," Test","Test ","Test","T e s t"]

For my use case however, the solution should work in the following test setting:
https://www.regextester.com/

All spaces not within quotes should be highlighted in the above setting. If they are highlighted there, they will be parsed correctly in my program.

More specifically, I am using Boost::Regex in C++ to do the parsing as follows:

...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+     : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\")  : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!

// Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);

For those of you who do not use Boost::Regex or C++, the above link should let you test a candidate regex for this use case.

Thank you all for your assistance. I hope you can help me with the above problem.

Answer 1

Score: 2

I would 100% not use regular expressions for this. First off, because it's way easier to express as a PEG grammar instead. E.g.:

#include <boost/spirit/home/x3.hpp>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> tokens(std::string_view input) {
    namespace x3 = boost::spirit::x3;
    std::vector<std::string> r;

    // An atom is a bracketed, braced or quoted run, or any single graphic character.
    auto atom                            //
        = '[' >> *~x3::char_(']') >> ']' //
        | '{' >> *~x3::char_('}') >> '}' //
        | '"' >> *~x3::char_('"') >> '"' //
        | x3::graph;

    // raw[] exposes the matched source text, delimiters included.
    auto token = x3::raw[*atom];

    // Tokens are separated by one or more whitespace characters.
    parse(input.begin(), input.end(), token % +x3::space, r);
    return r;
}

This, off the bat, already performs as you intend:

Live On Coliru

#include <iomanip>
#include <iostream>

int main() {
    for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
        std::cout << input << "\n";
        for (auto& tok : tokens(input))
            std::cout << " - " << std::quoted(tok, '\'') << "\n";
    }
}

Output:

" Test Test " ab c " Test" "Test " "Test" "T e s t"
 - '" Test Test "'
 - 'ab'
 - 'c'
 - '" Test"'
 - '"Test "'
 - '"Test"'
 - '"T e s t"'

BONUS

Where this really makes a difference is when you realize that you want to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye" ]).

Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.
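To give a rough idea of what nesting-aware tokenizing involves, here is a minimal sketch in plain standard C++. It is not Spirit and not a recursive grammar; it swaps in a hand-written scanner that keeps a stack of the closing brackets it is still waiting for. The name split_outside_nesting is hypothetical, and the sketch assumes quotes contain no escaped quotes and that only [] and {} nest.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of Boost): splits `input` on whitespace that is
// outside double quotes and outside any open [] or {} pair.
// Assumptions: no escape sequences inside quotes; mismatched closers are ignored.
std::vector<std::string> split_outside_nesting(std::string_view input) {
    std::vector<std::string> tokens;
    std::string current;    // token being accumulated
    std::string closers;    // stack of closing brackets still expected
    bool in_quotes = false;

    for (char ch : input) {
        if (in_quotes) {                    // inside "...": copy everything verbatim
            current += ch;
            if (ch == '"') in_quotes = false;
            continue;
        }
        const bool is_space = ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r';
        if (is_space && closers.empty()) {  // a delimiter only at nesting depth zero
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
            continue;
        }
        current += ch;
        if (ch == '"') in_quotes = true;
        else if (ch == '[') closers.push_back(']');
        else if (ch == '{') closers.push_back('}');
        else if (!closers.empty() && ch == closers.back()) closers.pop_back();
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Called on the nested example above, this should yield the quoted string as one token and the whole bracketed group as another. An unclosed bracket or quote swallows the rest of the input into a single token, which a real grammar would more likely report as an error.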

Answer 2

Score: 0

You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:

const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);

If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:

// String.raw keeps the backslashes; a plain template literal would turn \" into "
const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
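Since the question is about C++, here is a hedged sketch of a related idea using std::regex, whose default grammar is ECMAScript and so close to JS notation. Instead of splitting on delimiters it matches the tokens themselves, which sidesteps the placeholder step. The pattern is a different one from the regex above: "(?:[^"\\]|\\.)*" consumes a backslash together with the character after it, so it should also cope with an empty quoted string and with a quoted string that ends in an escaped backslash. The function name match_tokens is hypothetical, and the sketch assumes quoted strings are always separated from other tokens by whitespace.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects tokens by matching them rather than by
// splitting on a delimiter.
// Assumption: a quoted string never touches unquoted text without whitespace.
std::vector<std::string> match_tokens(const std::string& input) {
    // "(?:[^"\\]|\\.)*"   a quoted string; \\. consumes a backslash plus the
    //                     character after it, so \" does not end the string
    // \S+                 otherwise, any run of non-whitespace characters
    static const std::regex token(R"re("(?:[^"\\]|\\.)*"|\S+)re");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), token), end; it != end; ++it)
        tokens.push_back(it->str());
    return tokens;
}
```

The same pattern can be pasted into a JS-notation regex tester, where it highlights the tokens rather than the spaces between them. It does not handle [] or {}.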

If you want to parse nested brackets you need a proper language parser. You can also do that with regexes however: https://stackoverflow.com/questions/74414740/parsing-javascript-objects-with-functions-as-json/74437880#74437880

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
