2023-02-10 03:54:37

Split one sentence with ｢ and square brackets into multiple

Question

I have sentences of this pattern in the dictionary text data:

I have ｢an absolute [a deadly] abhorrence of ｢laziness [greasy food].

Is there a way that I can split it into 4 sentences as follows to make it more searchable in the dictionary (using Perl)?

I have an absolute abhorrence of laziness.

I have an absolute abhorrence of greasy food.

I have a deadly abhorrence of laziness.

I have a deadly abhorrence of greasy food.

Answer 1

Score: 4

An interesting problem. Here is one solution.

For now, replace the opening bracket ｢ with < and adjust the sentence (a version using the original ｢ character is given at the end). Take an example string:

word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end

Split the string into tokens: substrings containing alternatives <...[...], and substrings with the groups of words around them. Once that is done, break each alternatives-substring into its two alternatives and put them in an arrayref. So we'll have an array with:
```
('word', ['a A1', 'b b1'], 'and more', ['a A2', 'b b2'],
    'but', ['a A3', 'b b3'], 'end')
```
Identify the indices of the alternatives (1, 3, 5 here).
Create all combinations of these indices (as a set, so find the set of all subsets, the power set). For the indices in a subset we take the first alternative when composing a sentence; for those not in the subset we take the second (or the other way round).
Go through the tokens array and print, selecting the alternatives as described above.
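The steps above can also be sketched with core Perl only, enumerating the power set with a bitmask instead of a combinatorics module. This is a minimal illustration under stated assumptions, not the answer's program: the helper name `expand_alternatives` is made up here, it expects the ASCII `<` marker introduced above, and it trims whitespace around each alternative.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper (not part of the answer): expand every "<first [second]"
# group in $str and return all resulting sentences.
sub expand_alternatives {
    my ($str) = @_;

    # Plain text stays a string; each group becomes [first, second], trimmed.
    my @tokens =
        map { /^</ ? [ /^<\s*([^\[]+?)\s*\[\s*([^\]]+?)\s*\]/ ] : $_ }
        split /(<[^\[]+\[[^\]]+\])/, $str;

    # Indices of the tokens that hold alternatives.
    my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0 .. $#tokens;
    my $n   = @idx;

    my @sentences;
    # Each mask in 0 .. 2**n - 1 encodes one subset of @idx:
    # bit k clear -> first alternative, bit k set -> second alternative.
    for my $mask (0 .. 2**$n - 1) {
        my %pick;
        $pick{ $idx[$_] } = ($mask >> $_) & 1 for 0 .. $#idx;

        my $s = '';
        for my $i (0 .. $#tokens) {
            my $t = $tokens[$i];
            $s .= ref $t ? $t->[ $pick{$i} ] : $t;
        }
        push @sentences, $s;
    }
    return @sentences;
}

say for expand_alternatives(
    'I have <an absolute [a deadly] abhorrence of <laziness [greasy food].'
);
```

Because the alternatives are trimmed and the surrounding text is kept verbatim, this variant does not produce the extra spaces that a plain print of each token would.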

I use Algorithm::Combinatorics for combinations but there are of course other libraries.

A program with the test sentence introduced above (ASCII characters only):

use warnings;
use strict;
use feature 'say';

use List::Util qw(any none);
use Algorithm::Combinatorics qw(subsets);

my $str = q(word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end);
say $str;

# Split into tokens; each <...[...] group becomes an arrayref of its two alternatives
my @tokens =
    map { /^</ ? [ /<([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(<[^\[]+ \[[^\]]+\])/x, $str;
#say "@tokens";

# Indices of the tokens that hold alternatives
my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;
#say "@idx";

# All subsets of those indices (the power set)
my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;    # indices for which the first alternative is used
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}

This makes great simplifications, considering all the kinds of sentence structure and detail found in natural languages. There is plenty of room for code improvement, and there's a bit of cleanup to do (extra spaces, for one), but it does print all combinations of the alternative phrases.

The library can generate one item at a time: when invoked in scalar context its functions return an iterator, on which ->next gives the next item. This is important for very large sets of items.
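The idea of producing one subset at a time can be illustrated with a plain closure in core Perl. This is only a sketch of the iterator pattern, not how Algorithm::Combinatorics is implemented, and the name `subset_iterator` is made up for this example.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: returns a closure that yields one subset of @items
# per call (as an arrayref) and undef once all 2**n subsets are exhausted.
sub subset_iterator {
    my @items = @_;
    my $n     = @items;
    my $mask  = 0;
    return sub {
        return if $mask >= 2**$n;
        # Bit k of the mask decides whether item k is in this subset.
        my @subset = map { $items[$_] } grep { ($mask >> $_) & 1 } 0 .. $n - 1;
        $mask++;
        return \@subset;
    };
}

my $next = subset_iterator(1, 3, 5);
while (defined(my $ss = $next->())) {
    say '[', join(',', @$ss), ']';
}
```

Only the current subset is held in memory at any time, which is the property that matters when the number of alternatives is large.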

Here is a program with the sentence given in the question. (The solution above uses the ASCII character < instead of ｢, as some systems still have problems with Unicode. Other than that, the program is the same.)

use warnings;
use strict;
use feature 'say';

use List::Util qw(any none);
use Algorithm::Combinatorics qw(subsets);

use utf8;
use open qw(:std :encoding(UTF-8));

my $str = q(I have ｢an absolute [a deadly] abhorrence of ｢laziness [greasy food].);
say $str;

my @tokens =
    map { /^｢/ ? [ /｢([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(｢[^\[]+ \[[^\]]+\])/x, $str;

my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;

my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}

Answer 2

Score: 3

First, parse into

my @def = (
   [ "I have " ],
   [ "an absolute", "a deadly" ],
   [ " abhorrence of " ],
   [ "laziness", "greasy food" ],
   [ "." ],
);

This can be achieved using the following validating parser:

# Assumes `use utf8;` is in effect and $str holds the decoded sentence
my @def;
for ( $str ) {
   # Plain text up to the next ｢ becomes a single-choice entry
   / \G ( [^｢]+ ) /xgc
      and push @def, [ $1 ];

   # ｢first [second] becomes a two-choice entry
   if ( / \G ｢ /xgc ) {
      / \G ( [^｢\[\]]+ ) [ ] \[ ( [^｢\[\]]+ ) \] /xgc
         or die( "Bad sequence at offset ".( pos() - 1 )."\n" );

      push @def, [ $1, $2 ];
      redo;
   }

   /\G \z /xgc
      and last;

   die( "Should not happen" );
}

Then find the Cartesian product of the alternatives. This can be achieved using the following:

use Algorithm::Loops qw( NestedLoops );

my $iter = NestedLoops( \@def );
while ( my @parts = $iter->() ) {
   say join "", @parts;
}

or

use Algorithm::Loops qw( NestedLoops );

NestedLoops( \@def, sub { say join "", @_; } );
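If installing a CPAN module is not an option, the same product can be computed with a few lines of core Perl. This is a minimal sketch rather than a replacement for Algorithm::Loops: the helper name `expand_def` is made up here, and it builds all sentences in memory at once instead of iterating.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: Cartesian product of a list of arrayrefs, with each
# combination joined into a single string.
sub expand_def {
    my @def = @_;
    my @sentences = ('');
    for my $alts (@def) {
        # Extend every partial sentence with every choice at this position.
        @sentences = map { my $head = $_; map { $head . $_ } @$alts } @sentences;
    }
    return @sentences;
}

my @def = (
    [ "I have " ],
    [ "an absolute", "a deadly" ],
    [ " abhorrence of " ],
    [ "laziness", "greasy food" ],
    [ "." ],
);

say for expand_def(@def);
```

Since the parser keeps the surrounding spaces inside the single-choice entries, joining the parts with an empty string yields clean sentences with no extra whitespace.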
