Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

Question

I need to tokenize a string on every space that is not between quotes. I am using regex in JavaScript notation.

For example:

" Test Test " ab c " Test" "Test " "Test" "T e s t"

becomes

[" Test Test ",ab,c," Test","Test ","Test","T e s t"]

For my use case, however, the solution should work in the following test setting:
https://www.regextester.com/

All spaces not within quotes should be highlighted in that tester. If they are highlighted there, they will be parsed correctly in my program.

To be more specific, I am using Boost.Regex in C++ to do the parsing, as follows:

...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+     : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\")  : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!

//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);

For those of you who do not use Boost.Regex or C++, the above link should let you test candidate regexes for this use case.

Thank you all for your assistance. I hope you can help me with the above problem.

Answer 1

Score: 2

I would 100% not use regular expressions for this. First off, it is far easier to express as a PEG grammar instead, e.g. with Boost Spirit X3:

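#include <boost/spirit/home/x3.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> tokens(std::string_view input) {
    namespace x3 = boost::spirit::x3;
    std::vector<std::string> r;

    // An atom is a bracketed, braced, or quoted group (spaces allowed inside),
    // or else any single printable non-space character.
    auto atom                            //
        = '[' >> *~x3::char_(']') >> ']' //
        | '{' >> *~x3::char_('}') >> '}' //
        | '"' >> *~x3::char_('"') >> '"' //
        | x3::graph;

    // A token is the raw source text covered by a run of atoms.
    auto token = x3::raw[*atom];

    // Tokens are separated by one or more whitespace characters.
    parse(input.begin(), input.end(), token % +x3::space, r);
    return r;
}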

Right off the bat, this already behaves as you intend:

int main() {
    for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
        std::cout << input << "\n";
        for (auto& tok : tokens(input))
            std::cout << " - " << quoted(tok, '\'') << "\n";
    }
}

Output:

" Test Test " ab c " Test" "Test " "Test" "T e s t"
 - '" Test Test "'
 - 'ab'
 - 'c'
 - '" Test"'
 - '"Test "'
 - '"Test"'
 - '"T e s t"'
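
For readers who cannot pull in Spirit, the same atom/token structure can be approximated in JavaScript regex notation by matching tokens instead of splitting on delimiters. This is only a sketch and not part of the original answer: the names `TOKEN` and `tokenize` are made up for illustration, and it assumes that quotes and brackets are neither nested nor escaped.

```javascript
// Hypothetical helper mirroring the grammar above: a token is a run of "atoms",
// where an atom is a [...] group, a {...} group, a "..." string, or any single
// non-whitespace character. Alternatives are tried left to right.
const TOKEN = /(?:\[[^\]]*\]|\{[^}]*\}|"[^"]*"|\S)+/g;

function tokenize(input) {
  // match() with the g flag returns all matches, or null when there are none.
  return input.match(TOKEN) || [];
}

console.log(tokenize('" Test Test " ab c " Test" "Test " "Test" "T e s t"'));
```

Because the alternatives are ordered, an unterminated `"` simply falls through to `\S` and is treated as an ordinary character, which is roughly what the `x3::graph` fallback does in the grammar.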

BONUS

Where this really makes a difference is when you realize that you want to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).

Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive, though. If you make your grammar description more explicit, I could show you examples.
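
To make the nesting point concrete without Spirit, here is a rough hand-written scanner in JavaScript that tracks bracket depth with a stack. It is a sketch under stated assumptions, not the recursive Spirit grammar the answer refers to: `tokenizeNested` is a hypothetical name, backslash escapes inside quotes are assumed to be wanted, and the sample input is a balanced variant of the example above.

```javascript
// Hypothetical helper: split on whitespace that is outside quotes and outside
// any (possibly nested) [] or {} group.
function tokenizeNested(input) {
  const closers = new Map([["[", "]"], ["{", "}"]]);
  const stack = [];   // expected closing brackets, innermost last
  const tokens = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (inQuotes) {
      current += ch;
      if (ch === "\\" && i + 1 < input.length) {
        current += input[++i];        // keep the escaped character verbatim
      } else if (ch === '"') {
        inQuotes = false;             // closing quote
      }
    } else if (ch === '"') {
      inQuotes = true;
      current += ch;
    } else if (closers.has(ch)) {
      stack.push(closers.get(ch));    // open a nested group
      current += ch;
    } else if (stack.length > 0 && ch === stack[stack.length - 1]) {
      stack.pop();                    // close the innermost group
      current += ch;
    } else if (/\s/.test(ch) && stack.length === 0) {
      // Whitespace at depth zero ends the current token.
      if (current !== "") { tokens.push(current); current = ""; }
    } else {
      current += ch;                  // includes whitespace inside a group
    }
  }
  if (current !== "") tokens.push(current);
  return tokens;
}

console.log(tokenizeNested('"string" [ {1,2,"3,4", [true,"more [string]"], 9 } ] "bye"'));
```

A closing bracket that does not match the innermost open group is kept as an ordinary character here; a real grammar would more likely report it as an error.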

Answer 2

Score: 0

You can use multiple regexes, if you are OK with that. The idea is to replace spaces inside quotes with a non-printable character (\x01) and restore them after the split:

const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);

If you have escaped quotes within a string, such as "a \"quoted\" token", you can use this regex instead:

// String.raw keeps the backslashes; in a plain template literal, \" would collapse to ".
const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
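
If a single pattern is required, for example to highlight the separating spaces in a regex tester, the placeholder step can be replaced by a lookahead that only accepts whitespace followed by an even number of double quotes. This is a sketch rather than part of the original answer: `OUTSIDE_QUOTES` and `splitOutsideQuotes` are hypothetical names, and the pattern assumes balanced quotes with no escaped quotes inside them.

```javascript
// Whitespace is "outside quotes" when the rest of the input contains an even
// number of double quotes: (?:[^"]*"[^"]*")* consumes quotes in pairs and
// [^"]*$ requires that no unpaired quote is left before the end.
const OUTSIDE_QUOTES = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/;

function splitOutsideQuotes(input) {
  // Drop the empty strings produced by leading or trailing whitespace.
  return input.split(OUTSIDE_QUOTES).filter(token => token !== "");
}

console.log(splitOutsideQuotes('" Test Test " ab c " Test" "Test " "Test" "T e s t"'));
```

The lookahead rescans the remainder of the string for every candidate space, so the cost grows roughly quadratically with input length. That is fine for short lines but worth keeping in mind for large inputs.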

If you want to parse nested brackets, you really need a proper language parser, although it can also be done with regexes: https://stackoverflow.com/questions/74414740/parsing-javascript-objects-with-functions-as-json/74437880#74437880

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

huangapple
  • Posted on January 9, 2023
  • Please keep this link when reposting: https://go.coder-hub.com/75051783.html