How do you reconcile a list of tuples containing a tokenized string with the original string?

Question

I am trying to reconcile idx_tag_token, a list of tuples that each hold a token's character index, its label, and the token itself, with the original string word_string. I want to output a list of tuples, one per element of the original string when split on whitespace, each paired with label information from idx_tag_token.

I have written some code that finds a token's associated word in word_string based on the token's character index. I then build a list of tuples pairing each of these words with the token's label; this is word_tag_list. However, a word that was split into several tokens appears once per token, and I am unsure how to proceed from here to the desired output.

The conditions to update the labels are not complicated, but I can't work out the appropriate system here.

Any assistance would be truly appreciated.

The data:

word_string = "At London, the 12th in February, 1942, and for that that reason Mark's (3) wins, American parts"

idx_tag_token = [(0, 'O', 'At'),
                 (3, 'GPE-B', 'London'),
                 (9, 'O', ','),
                 (11, 'DATE-B', 'the'),
                 (15, 'DATE-I', '12th'),
                 (20, 'O', 'in'),
                 (23, 'DATE-B', 'February'),
                 (31, 'DATE-I', ','),
                 (33, 'DATE-I', '1942'),
                 (37, 'O', ','),
                 (39, 'O', 'and'),
                 (43, 'O', 'for'),
                 (47, 'O', 'that'),
                 (52, 'O', 'that'),
                 (57, 'O', 'reason'),
                 (64, 'PERSON-B', 'Mark'),
                 (68, 'O', "'s"),
                 (71, 'O', '('),
                 (72, 'O', '3'),
                 (73, 'O', ')'),
                 (75, 'O', 'wins'),
                 (79, 'O', ','),
                 (81, 'NORP-B', 'American'),
                 (90, 'O', 'parts')]

My code:

def find_word_from_index(idx, word_string):
    words = word_string.split()
    current_index = 0

    for word in words:
        start_index = current_index
        end_index = current_index + len(word) - 1
        if start_index <= idx <= end_index:
            return word
        current_index = end_index + 2
    return None


word_tag_list = []
for index, tag, _ in idx_tag_token:
    word = find_word_from_index(index, word_string)
    word_tag_list.append((word, tag))
word_tag_list

Current output:

[('At', 'O'),
 ('London,', 'GPE-B'),
 ('London,', 'O'),
 ('the', 'DATE-B'),
 ('12th', 'DATE-I'),
 ('in', 'O'),
 ('February,', 'DATE-B'),
 ('February,', 'DATE-I'),
 ('1942,', 'DATE-I'),
 ('1942,', 'O'),
 ('and', 'O'),
 ('for', 'O'),
 ('that', 'O'),
 ('that', 'O'),
 ('reason', 'O'),
 ("Mark's", 'PERSON-B'),
 ("Mark's", 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('wins,', 'O'),
 ('wins,', 'O'),
 ('American', 'NORP-B'),
 ('parts', 'O')]

Desired output:

[('At', 'O'),
 ('London,', 'GPE-B'),
 ('the', 'DATE-B'),
 ('12th', 'DATE-I'),
 ('in', 'O'),
 ('February,', 'DATE-B'),
 ('1942,', 'DATE-I'),
 ('and', 'O'),
 ('for', 'O'),
 ('that', 'O'),
 ('that', 'O'),
 ('reason', 'O'),
 ("Mark's", 'PERSON-B'),
 ('(3)', 'O'),
 ('wins,', 'O'),
 ('American', 'NORP-B'),
 ('parts', 'O')]

Answer 1

Score: 1

Try:

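def get_tokens(tokens):
    # Generator used as a coroutine: it receives a whitespace-split word
    # via send() and yields that word with the tag of its first token.
    it = iter(tokens)
    _, token_type, next_token = next(it)
    word = yield  # priming yield; the first word arrives here
    while True:
        if next_token == word:
            # The accumulated tokens spell the word: emit it, then wait for the next word.
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            # The word spans several tokens: append the next token's text, keep the first tag.
            _, _, tmp = next(it)
            next_token += tmp

it = get_tokens(idx_tag_token)
next(it)  # advance to the priming yield so that send() can be used
out = [it.send(w) for w in word_string.split()]

print(out)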

Prints:

[
    ("At", "O"),
    ("London,", "GPE-B"),
    ("the", "DATE-B"),
    ("12th", "DATE-I"),
    ("in", "O"),
    ("February,", "DATE-B"),
    ("1942,", "DATE-I"),
    ("and", "O"),
    ("for", "O"),
    ("that", "O"),
    ("that", "O"),
    ("reason", "O"),
    ("Mark's", "PERSON-B"),
    ("(3)", "O"),
    ("wins,", "O"),
    ("American", "NORP-B"),
    ("parts", "O"),
]
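
As an alternative, here is a minimal sketch that relies on the character indices stored in idx_tag_token instead of concatenating token text. It is not part of the answer above; the function name reconcile and the shortened sample data are hypothetical. It assumes the tokens are sorted by index, that every index falls inside a whitespace-delimited word, and that a word with no matching token should default to 'O' (that default is my assumption, not something stated in the question). Each word takes the tag of the first token that starts inside it, which is the rule the desired output appears to follow.

```python
import re

def reconcile(word_string, idx_tag_token):
    """Pair each whitespace-delimited word with the tag of its first token."""
    # (start, end, text) for each run of non-whitespace characters; end is exclusive.
    spans = [(m.start(), m.end(), m.group())
             for m in re.finditer(r"\S+", word_string)]
    first_tag = {}  # span position -> tag of the first token found in that span
    pos = 0
    for idx, tag, _ in idx_tag_token:
        # Tokens are assumed sorted, so the span pointer only moves forward.
        while pos < len(spans) and idx >= spans[pos][1]:
            pos += 1
        if pos == len(spans):
            break  # token index lies beyond the last word
        first_tag.setdefault(pos, tag)  # later tokens in the same word are ignored
    # Assumption: words with no token at all default to 'O'.
    return [(text, first_tag.get(i, "O")) for i, (_, _, text) in enumerate(spans)]

# Shortened, hypothetical sample in the same shape as the question's data.
sample_string = "At London, the 12th"
sample_tokens = [(0, "O", "At"), (3, "GPE-B", "London"), (9, "O", ","),
                 (11, "DATE-B", "the"), (15, "DATE-I", "12th")]
print(reconcile(sample_string, sample_tokens))
```

Because the match is made on offsets, repeated words such as "that that" are kept apart by position, and the result does not depend on the token text joining back up to exactly the original word.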

huangapple
  • Posted on May 21, 2023 at 22:20:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76300362.html