Split some text with a specific format

Question


I need some advice on the following problem. I receive text in the following format:

    "text (textInBrackets), text2 (textInBrackets2), text3 (textInBrackets3),..."

Each text and textInBrackets can contain letters, numbers, and brackets. Pairs are separated by commas, and the closing bracket next to a comma marks where the right element of the pair ends.

I need to split the text so that each pair of text and textInBrackets is separated and stored in an array like this:

    String[][] pairs = new String[n][2];
    pairs[0][0] = "text";
    pairs[0][1] = "textInBrackets";
    pairs[1][0] = "text2";
    pairs[1][1] = "textInBrackets2";

Example:

    String text = "texttext(text)text(subtext), othertext152(de)sert(subothertext), textwithoutbracket, elems(subelem)";
    String[][] result = splitFunction(text);

The returned array is:

    String[][] pairs = new String[n][2];
    pairs[0][0] = "texttext(text)text";
    pairs[0][1] = "subtext";
    pairs[1][0] = "othertext152(de)sert";
    pairs[1][1] = "subothertext";
    pairs[2][0] = "textwithoutbracket";
    pairs[2][1] = null;
    pairs[3][0] = "elems";
    pairs[3][1] = "subelem";

I already have a solution for the problem, but it is not bulletproof and has some bugs.

Answer 1

Score: 4


What you are trying to achieve is actually a hard problem to implement (if brackets inside the bracketed text must themselves be properly enclosed, for example "(sa(ssa)sa)"). If the text inside brackets could not contain further bracketed text, the solution would be quite easy, as people have already proposed to you. Code to verify such a pattern and to obtain groups from it would look like this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String text = "text (textInBrackets), text2 (textInBrackets2), text3 (textInBrackets3)";
    // One "text (textInBrackets)" pair, then zero or more ", "-separated pairs
    Pattern pattern = Pattern.compile("(\\w+ \\(\\w+\\))((, \\w+ \\(\\w+\\))*)");
    Matcher matcher = pattern.matcher(text);
    System.out.println(matcher.matches());  // the whole input must match
    System.out.println(matcher.group(0));   // entire match
    System.out.println(matcher.group(1));   // first pair
    System.out.println(matcher.group(2));   // all remaining pairs
    System.out.println(matcher.group(3));   // last repetition of the inner group

with output:

    true
    text (textInBrackets), text2 (textInBrackets2), text3 (textInBrackets3)
    text (textInBrackets)
    , text2 (textInBrackets2), text3 (textInBrackets3)
    , text3 (textInBrackets3)

But your specification also says that the text inside brackets might contain further bracketed text, and so on. (I don't know whether the inner brackets must always be closed again; if not, what follows does not apply to your case.) Such text is no longer described by a regular grammar (which a regex can parse) but by a context-free grammar. To verify and parse such text, you need an implementation with a stack: push each left bracket and pop it once you find the matching right bracket. This is what a pushdown automaton, which is able to parse context-free grammars, actually does. Your text would still be regular if you knew the maximum depth to which bracketed text can be nested.
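The push/pop idea can be sketched as follows. This is a minimal illustration rather than a full parser: `BracketCheck` and `isBalanced` are hypothetical names, and the sketch only verifies that every left bracket has a matching right bracket, assuming `(` and `)` are the only bracket characters.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketCheck {
    // Hypothetical helper: true when every '(' is closed by a later ')'.
    static boolean isBalanced(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : s.toCharArray()) {
            if (c == '(') {
                stack.push(c);              // remember the open bracket
            } else if (c == ')') {
                if (stack.isEmpty()) {
                    return false;           // ')' with nothing to close
                }
                stack.pop();                // close the most recent '('
            }
        }
        return stack.isEmpty();             // leftovers mean unclosed '('
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("text (sa(ssa)sa)")); // true
        System.out.println(isBalanced("text (sa(ssa)sa"));  // false
    }
}
```

Because there is only one bracket type here, the stack could be replaced by an integer depth counter; the stack form is kept to mirror the pushdown-automaton description.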

For example:

    "text (sad(sdasddsa)sadas)"

Here you know that the bracketed text is nested at most one level deep, and you can adjust your manual implementation or regex accordingly. Such a pattern could look like this (it may differ considerably depending on how you want it to behave, for example whether empty brackets are also valid):

    Pattern pattern = Pattern.compile("(\\w+ \\(\\w+(\\(\\w*\\))*\\w+\\))((, \\w+ \\(\\w+(\\(\\w*\\))*\\w+\\))*)");

You can see that I had to adjust my pattern so that it encodes the nested brackets. You can repeat this adjustment a fixed number of times, but you cannot do it indefinitely. That is exactly where this problem stops being regular and becomes context-free.
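To make the limit concrete, here is a small check of the depth-1 pattern against inputs with one and two nesting levels. The pattern string is the one from the answer, unchanged; the class name `NestingDepthDemo`, the helper `accepts`, and the second input string are illustrative assumptions.

```java
import java.util.regex.Pattern;

class NestingDepthDemo {
    // The depth-1 pattern from the answer, unchanged.
    static final Pattern DEPTH_ONE = Pattern.compile(
        "(\\w+ \\(\\w+(\\(\\w*\\))*\\w+\\))((, \\w+ \\(\\w+(\\(\\w*\\))*\\w+\\))*)");

    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean accepts(String s) {
        return DEPTH_ONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // One nested level inside the outer brackets: accepted.
        System.out.println(accepts("text (sad(sdasddsa)sadas)")); // true
        // Two nested levels: rejected, the pattern has no slot for them.
        System.out.println(accepts("text (sad(sd(x)sa)sadas)"));  // false
    }
}
```

Supporting the second input would mean nesting another bracket group inside the pattern, and so on for every further level.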

Once you have no bound on the nesting level (there can be N nested levels), you need a context-free grammar (or a pushdown automaton). This is quite a hard topic to explain, because it requires some background in automata theory, grammars, and how regexes relate to regular grammars, so I suggest you read up on it to understand my answer. If you don't have much time to resolve this issue, just present the arguments I have given to whoever asked you to implement this, and make your program handle brackets nested at most one level deep, for example.

Answer 2

Score: 2


You can split on a comma and space and then use lastIndexOf and substring to divide the parts.

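    // Split into "text(textInBrackets)" parts on the ", " separator.
    String[] parts = text.split(", ");
    String[][] result = new String[parts.length][2];
    for (int i = 0; i < parts.length; i++) {
        String part = parts[i];
        int lastIdx = part.lastIndexOf('(');
        if (lastIdx == -1) {
            // No bracket: only the left element is set; the right stays null.
            result[i][0] = part;
        } else {
            // Left element before the last '(', right element between it and the final ')'.
            result[i] = new String[] { part.substring(0, lastIdx), part.substring(lastIdx + 1, part.length() - 1) };
        }
    }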
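Note that `lastIndexOf('(')` finds the innermost opening bracket when the right element itself contains brackets, so a part such as "elems(sub(x)elem)" would be cut in the wrong place. The following is one possible variant, not the answerer's code: it splits on commas outside brackets and scans backward from the trailing `)` to find its matching `(`. The names `PairSplitter`, `splitPairs`, and `matchingOpen` are hypothetical, and the sketch assumes balanced brackets and that the right element, if present, ends the part.

```java
import java.util.ArrayList;
import java.util.List;

class PairSplitter {
    // Hypothetical sketch: split on top-level commas, then take the
    // trailing balanced bracket group of each part as the right element.
    static String[][] splitPairs(String text) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(text.substring(start, i).trim());
                start = i + 1;
            }
        }
        parts.add(text.substring(start).trim());

        String[][] result = new String[parts.size()][2];
        for (int p = 0; p < parts.size(); p++) {
            String part = parts.get(p);
            int open = matchingOpen(part);
            if (open == -1) {
                result[p][0] = part; // no trailing bracket group; right stays null
            } else {
                result[p][0] = part.substring(0, open).trim();
                result[p][1] = part.substring(open + 1, part.length() - 1);
            }
        }
        return result;
    }

    // Index of the '(' that matches a ')' at the end of part, or -1.
    static int matchingOpen(String part) {
        if (!part.endsWith(")")) {
            return -1;
        }
        int depth = 0;
        for (int i = part.length() - 1; i >= 0; i--) {
            char c = part.charAt(i);
            if (c == ')') {
                depth++;
            } else if (c == '(' && --depth == 0) {
                return i;
            }
        }
        return -1; // unbalanced input
    }

    public static void main(String[] args) {
        String[][] pairs = splitPairs("texttext(text)text(subtext), elems(sub(x)elem)");
        System.out.println(pairs[0][0] + " | " + pairs[0][1]); // texttext(text)text | subtext
        System.out.println(pairs[1][0] + " | " + pairs[1][1]); // elems | sub(x)elem
    }
}
```

Counting depth instead of using a regex is what lets this handle any nesting level, at the cost of rejecting nothing: malformed input is not reported, only split as well as possible.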


huangapple
  • Posted on July 24, 2020, 23:06:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63076337.html