Split one sentence with 「 and square brackets into multiple
Question
I have sentences of this pattern in the dictionary text data:
I have 「an absolute [a deadly] abhorrence of 「laziness [greasy food].
Is there a way that I can split it into 4 sentences as follows to make it more searchable in the dictionary (using Perl)?
I have an absolute abhorrence of laziness.
I have an absolute abhorrence of greasy food.
I have a deadly abhorrence of laziness.
I have a deadly abhorrence of greasy food.
Answer 1
Score: 4
An interesting problem. Here is one solution.
For now, replace the opening bracket 「 with < and adjust the sentence accordingly. Take an example string:
word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end
- Split the string into tokens: substrings containing alternatives, <...[...], and substrings with the groups of words around them. Once we are here, break each alternatives-substring into the two alternatives and put them in an arrayref. So we'll have an array with: ('word', ['a A1', 'b b1'], 'and more', ['a A2', 'b b2'], 'but', ['a A3', 'b b3'], 'end')
- Identify the indices of the alternatives (1, 3, 5 here).
- Create all combinations of these indices (as a set, so find the set of all subsets, the power set). For the indices in a subset, take the first alternative when composing a sentence; for those not in the subset, take the second (or the other way round).
- Go through the tokens array and print, selecting the alternatives as described above.
I use Algorithm::Combinatorics for combinations but there are of course other libraries.
A program with the test sentence introduced above (ASCII characters only):
use warnings;
use strict;
use feature 'say';
use List::Util qw(any none);
use Algorithm::Combinatorics qw(subsets);

my $str = q(word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end);
say $str;

# Split into tokens; each <...[...] group becomes an arrayref of its two alternatives
my @tokens =
    map { /^</ ? [ /<([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(<[^\[]+ \[[^\]]+\])/x, $str;
#say "@tokens";

# Indices of the tokens that hold alternatives
my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;
#say "@idx";

# Power set of those indices: each subset says where to take the first alternative
my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}
This makes considerable simplifications, given all the kinds of sentence structure and details found in natural languages. There is plenty of room for code improvement, and there's a bit of cleanup to do (extra spaces, for one), but it does print all combinations of the alternative phrases.
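The cleanup of extra spaces could be sketched as follows. This is only one possible approach, not the answer's own code: the function name `expand_alternatives` is made up for illustration, a plain bitmask over the alternatives stands in for the library's `subsets` so that the sketch needs only core Perl, and the rule that removes a space before punctuation is an assumption that suits English-like text.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: expand every "<first [second]" group and
# return the resulting sentences with whitespace tidied up.
sub expand_alternatives {
    my ($str) = @_;

    # Plain text stays a string; each group becomes [first, second].
    my @tokens =
        map  { /^</ ? [ /<([^\[]+) \[([^\]]+)\]/ ] : $_ }
        grep { length }
        split /(<[^\[]+ \[[^\]]+\])/, $str;

    my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0 .. $#tokens;

    my @sentences;
    # Each mask is one combination; the first group is the highest bit,
    # so the first alternative of the first group is listed first.
    for my $mask ( 0 .. 2 ** scalar(@idx) - 1 ) {
        my %choice;
        $choice{ $idx[$_] } = ( $mask >> ( $#idx - $_ ) ) & 1 for 0 .. $#idx;

        my @parts;
        for my $i ( 0 .. $#tokens ) {
            my $tok = $tokens[$i];
            push @parts, ref $tok ? $tok->[ $choice{$i} ] : $tok;
        }

        my $s = join ' ', @parts;
        $s =~ s/\s+/ /g;             # collapse runs of whitespace
        $s =~ s/^ | $//g;            # trim both ends
        $s =~ s/ ([.,;:!?])/$1/g;    # assumption: no space before punctuation
        push @sentences, $s;
    }
    return @sentences;
}

say for expand_alternatives(
    'I have <an absolute [a deadly] abhorrence of <laziness [greasy food].');
```

Joining the parts first and normalizing whitespace afterwards keeps the selection logic separate from the formatting, so the punctuation rule can be changed without touching the combination loop.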
The library can generate one item at a time: when invoked in scalar context, its functions return an iterator, on which ->next gives the next item. This is important for very large sets of items.
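A minimal sketch of that iterator use, assuming the same index list (1, 3, 5) as in the example above and that Algorithm::Combinatorics is installed. The order in which subsets come out is not relied on here.

```perl
use strict;
use warnings;
use feature 'say';
use Algorithm::Combinatorics qw(subsets);

my @idx = (1, 3, 5);

# Scalar context: subsets() returns an iterator instead of the full list
my $iter = subsets( \@idx );

my $count = 0;
# Each item is an arrayref; the empty subset is [] and is still a true value
while ( my $ss = $iter->next ) {
    say "take the first alternative at: (@$ss)";
    ++$count;
}
say "$count subsets";
```

Only one subset is held in memory at a time, which matters because the number of subsets doubles with every additional group of alternatives.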
Here is a program with the sentence given in the question. (The solution above has ASCII (<) instead of the 「 character, as some systems still have problems with Unicode. Other than that, the program is the same.)
use warnings;
use strict;
use feature 'say';
use List::Util qw(any none);
use Algorithm::Combinatorics qw(subsets);

use utf8;
use open qw(:std :encoding(UTF-8));

my $str = q(I have 「an absolute [a deadly] abhorrence of 「laziness [greasy food].);
say $str;

my @tokens =
    map { /^「/ ? [ /「([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(「[^\[]+ \[[^\]]+\])/x, $str;

my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;

my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}
Answer 2
Score: 3
First, parse into
my @def = (
    [ "I have " ],
    [ "an absolute", "a deadly" ],
    [ " abhorrence of " ],
    [ "laziness", "greasy food" ],
    [ "." ],
);
This can be done with the following validating parser:
my @def;
for ( $str ) {
    # Plain text up to the next 「 (or the end of the string)
    / \G ( [^「]+ ) /xgc
        and push @def, [ $1 ];

    # A 「first [second] group
    if ( / \G 「 /xgc ) {
        / \G ( [^「\[\]]+ ) [ ] \[ ( [^「\[\]]+ ) \] /xgc
            or die( "Bad sequence at offset ".( pos() - 1 )."\n" );

        push @def, [ $1, $2 ];
        redo;
    }

    /\G \z /xgc
        and last;

    die( "Should not happen" );
}
Then find the Cartesian product of these lists. This can be achieved using the following:
use Algorithm::Loops qw( NestedLoops );

my $iter = NestedLoops( \@def );
while ( my @parts = $iter->() ) {
    say join "", @parts;
}
or
use Algorithm::Loops qw( NestedLoops );

NestedLoops( \@def, sub { say join "", @_; } );
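For reference, the two steps can be combined into one self-contained script that needs only core Perl. This is a sketch under assumptions rather than the answer's own code: the names `parse_alternatives` and `product` are made up for illustration, the parser is a slightly restructured variant of the one above, and a hand-rolled Cartesian product stands in for `NestedLoops`.

```perl
use strict;
use warnings;
use utf8;
use feature 'say';
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: turn "text 「a [b] text" into a list of arrayrefs,
# one per segment, each holding the choices available for that segment.
sub parse_alternatives {
    my ($str) = @_;
    my @def;
    for ($str) {
        while (1) {
            # Plain text up to the next 「 or the end of the string
            push @def, [ $1 ] if / \G ( [^「]+ ) /xgc;
            last if / \G \z /xgc;

            # A 「first [second] group; die if it is malformed
            / \G 「 ( [^「\[\]]+ ) [ ] \[ ( [^「\[\]]+ ) \] /xgc
                or die "Bad sequence at offset " . ( pos() // 0 ) . "\n";
            push @def, [ $1, $2 ];
        }
    }
    return @def;
}

# Hypothetical helper: Cartesian product of the choice lists, with the
# choices of earlier segments varying slowest.
sub product {
    my @results = ('');
    for my $choices (@_) {
        my @next;
        for my $head (@results) {
            push @next, $head . $_ for @$choices;
        }
        @results = @next;
    }
    return @results;
}

say for product( parse_alternatives(
    'I have 「an absolute [a deadly] abhorrence of 「laziness [greasy food].') );
```

Unlike the iterator form of `NestedLoops`, this builds every sentence in memory at once, which is fine for a handful of alternatives but grows quickly as more groups are added.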