Split one sentence with 「 and square brackets into multiple

Question

I have sentences of this pattern in the dictionary text data:

I have 「an absolute [a deadly] abhorrence of 「laziness [greasy food].

Is there a way that I can split it into 4 sentences as follows to make it more searchable in the dictionary (using Perl)?

I have an absolute abhorrence of laziness.

I have an absolute abhorrence of greasy food.

I have a deadly abhorrence of laziness.

I have a deadly abhorrence of greasy food.

Answer 1

Score: 4

An interesting problem. Here is one solution.

For now, replace the opening bracket 「 with < and adjust the sentence.† Take an example string:

word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end

  1. Split the string into tokens: substrings containing alternatives, <...[...], and the substrings of plain words around them. Then break each alternatives-substring into its two alternatives and put them in an arrayref. So we'll have an array with:

    ('word', ['a A1', 'b b1'], 'and more', ['a A2', 'b b2'], 'but', ['a A3', 'b b3'], 'end')

  2. Identify the indices of the alternatives (1, 3, 5 here).

  3. Create all combinations of these indices (as a set, so find the set of all subsets, the power set). For the indices in a subset we take the first alternative when composing a sentence; for those not in the subset we take the second (or the other way round).

  4. Go through the tokens array and print, selecting the alternatives as described above.

I use Algorithm::Combinatorics for combinations but there are of course other libraries.

A program with the test sentence introduced above (and only ASCII characters):

use warnings;
use strict;
use feature 'say';

use List::Util qw(any none);
use Algorithm::Combinatorics qw(subsets);

my $str = q(word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end);
say $str;

# Split on the <...[...] groups (kept thanks to the capturing parens),
# then turn each group into an arrayref holding its two alternatives
my @tokens =
    map { /^</ ? [ /<([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(<[^\[]+ \[[^\]]+\])/x, $str;
#say "@tokens";

# Indices of the tokens that are alternatives (arrayrefs)
my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;
#say "@idx";

# All subsets of those indices (the power set)
my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;    # indices at which to take the first alternative
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}

This makes big simplifying assumptions, given all the kinds of sentence structure and detail found in natural languages. There is plenty of room for code improvement, and there's a bit of cleanup to do (extra spaces, for one), but it does print all combinations of the alternative phrases.
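As for the extra spaces, one possible cleanup is to collect the chosen pieces of each sentence and normalize the whitespace before printing, instead of printing piece by piece. The following is only a sketch: `tidy_sentence` is a hypothetical helper name, and the punctuation list is an assumption that suits the example sentence rather than text in general.

```perl
use strict;
use warnings;

# Hypothetical helper: join the chosen pieces of one sentence, then
# normalize the spacing that the tokenizer leaves behind.
sub tidy_sentence {
    my @pieces = @_;
    my $s = join ' ', @pieces;
    $s =~ s/\s+/ /g;               # collapse runs of whitespace
    $s =~ s/^\s+|\s+$//g;          # trim both ends
    $s =~ s/\s+([.,;:!?])/$1/g;    # no space before closing punctuation
    return $s;
}

# Pieces as the tokenizer would produce them for the question's sentence
print tidy_sentence('I have ', 'an absolute ', ' abhorrence of ', 'laziness ', '.'), "\n";
```

In the loop above, this would mean pushing each selected token onto an array and calling the helper once per subset.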

The library can generate one item at a time: when invoked in scalar context its functions return an iterator, on which ->next gives the next item. This is important for very large sets of items.
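To illustrate the one-item-at-a-time idea without depending on the library, here is a hand-rolled stand-in. It is not the library's implementation: `subset_iterator` is a hypothetical name, and it returns a plain closure rather than an object with a ->next method. Each call builds just one subset from the bits of a counter, so the full power set never sits in memory.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a library iterator: returns a closure that
# yields one subset (as an arrayref) per call, and undef when exhausted.
sub subset_iterator {
    my @items = @_;
    my $mask  = 0;
    my $limit = 1 << scalar @items;    # 2**N subsets in total
    return sub {
        return if $mask >= $limit;
        # Bit $_ of the counter decides whether item $_ is in this subset
        my @subset = map { $items[$_] } grep { $mask & (1 << $_) } 0 .. $#items;
        $mask++;
        return \@subset;
    };
}

# Indices of the alternatives in the example token array
my $next = subset_iterator(1, 3, 5);
while ( my $subset = $next->() ) {
    print "[@$subset]\n";
}
```

The order of the subsets differs from what Algorithm::Combinatorics produces, which does not matter here since every subset is visited exactly once.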


† Here is a program with the sentence given in the question. (The solution above uses the ASCII character < instead of 「, as some systems still have problems with Unicode. Other than that the program is the same.)

use warnings;
use strict;
use feature 'say';

use List::Util qw(any none);    
use Algorithm::Combinatorics qw(subsets);

use utf8;
use open qw(:std :encoding(UTF-8));

my $str = q(I have 「an absolute [a deadly] abhorrence of 「laziness [greasy food].);
say $str;

my @tokens = 
    map { /^「/ ? [ /「([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(「[^\[]+ \[[^\]]+\])/x, $str;

my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;

my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}

Answer 2

Score: 3

First, parse the sentence into

my @def = (
   [ "I have " ],
   [ "an absolute", "a deadly" ],
   [ " abhorrence of " ],
   [ "laziness", "greasy food" ],
   [ "." ],
);

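This can be achieved using the following validating parser:

# Assumes `use utf8;` is in effect (the patterns contain a literal 「)
# and that $str holds the decoded sentence
my @def;
for ( $str ) {
   # A run of plain text becomes a single-element group
   / \G ( [^「]+ ) /xgc
      and push @def, [ $1 ];

   if ( / \G 「 /xgc ) {
      # After 「 we require: first alternative, a space, [second alternative]
      / \G ( [^「\[\]]+ ) [ ] \[ ( [^「\[\]]+ ) \] /xgc
         or die( "Bad sequence at offset ".( pos() - 1 )."\n" );

      push @def, [ $1, $2 ];
      redo;
   }

   /\G \z /xgc
      and last;

   die( "Should not happen" );
}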
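The parser relies on the `\G` anchor together with the `/gc` flags, which may be unfamiliar. As a separate, deliberately simpler illustration of that idiom (not the parser itself), here is a sketch of a tiny lexer over lowercase words and digits. The name `lex` and the token labels are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical mini-lexer to show the \G ... /gc idiom on its own:
# \G anchors each match at pos(), and /c keeps pos() when a match fails,
# so the next pattern can be tried at the same offset.
sub lex {
    my ($input) = @_;
    my @tokens;
    for ($input) {
        if    ( /\G ([a-z]+) /xgc ) { push @tokens, [ word => $1 ]; redo }
        elsif ( /\G ([0-9]+) /xgc ) { push @tokens, [ num  => $1 ]; redo }
        elsif ( /\G \z /xgc )       { last }
        # Nothing matched at the current offset: reject the input
        die "Unexpected character at offset " . ( pos() // 0 ) . "\n";
    }
    return @tokens;
}

print "$_->[0]: $_->[1]\n" for lex('ab12cd');
```

As in the parser above, `redo` restarts the block on the same string without resetting the match position, and anything that no pattern accepts ends in an error rather than being skipped silently.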

Then find the Cartesian product of these lists. This can be achieved using the following:

use Algorithm::Loops qw( NestedLoops );

my $iter = NestedLoops( \@def );
while ( my @parts = $iter->() ) {
   say join "", @parts;
}

or

use Algorithm::Loops qw( NestedLoops );

NestedLoops( \@def, sub { say join "", @_; } );
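
If installing Algorithm::Loops is not an option, the product can also be built by hand. The following is a minimal sketch using only core Perl: `product` is a hypothetical helper, not part of any module mentioned here, and unlike the iterator form of NestedLoops it builds the whole result in memory.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::Loops): builds the Cartesian
# product of a list of arrayrefs iteratively, one position at a time.
sub product {
    my @lists  = @_;
    my @result = ( [] );    # start with one empty combination
    for my $list (@lists) {
        # Extend every combination so far with every item of this list
        @result = map { my $prefix = $_; map { [ @$prefix, $_ ] } @$list } @result;
    }
    return @result;
}

my @def = (
    [ "I have " ],
    [ "an absolute", "a deadly" ],
    [ " abhorrence of " ],
    [ "laziness", "greasy food" ],
    [ "." ],
);

print join( "", @$_ ), "\n" for product(@def);
```

For the `@def` shown, this prints the four sentences in the order given in the question.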

huangapple
  • Posted on February 10, 2023, 03:54:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75403775.html