How do you reconcile a list of tuples containing a tokenized string with the original string?
Question
I am trying to reconcile idx_tag_token, a list of tuples that each hold a character index, a label, and a token from a tokenized string, with the original string word_string. I want to output a list of tuples, with each tuple containing one element of the original string when split on whitespace, along with the label information from idx_tag_token.
I have written some code that finds a token's associated word in word_string based on the character index. I then create a list of tuples with each of these words and the associated label. This is defined as word_tag_list. However, I am unsure how to proceed from there to create the desired output.
The conditions to update the labels are not complicated, but I can't work out the appropriate system here.
Any assistance would be truly appreciated.
The data:
word_string = "At London, the 12th in February, 1942, and for that that reason Mark's (3) wins, American parts"
idx_tag_token = [(0, 'O', 'At'),
(3, 'GPE-B', 'London'),
(9, 'O', ','),
(11, 'DATE-B', 'the'),
(15, 'DATE-I', '12th'),
(20, 'O', 'in'),
(23, 'DATE-B', 'February'),
(31, 'DATE-I', ','),
(33, 'DATE-I', '1942'),
(37, 'O', ','),
(39, 'O', 'and'),
(43, 'O', 'for'),
(47, 'O', 'that'),
(52, 'O', 'that'),
(57, 'O', 'reason'),
(64, 'PERSON-B', 'Mark'),
(68, 'O', "'s"),
(71, 'O', '('),
(72, 'O', '3'),
(73, 'O', ')'),
(75, 'O', 'wins'),
(79, 'O', ','),
(81, 'NORP-B', 'American'),
(90, 'O', 'parts')]
My code:
def find_word_from_index(idx, word_string):
    # Walk the whitespace-separated words, tracking each word's character span.
    words = word_string.split()
    current_index = 0
    for word in words:
        start_index = current_index
        end_index = current_index + len(word) - 1
        if start_index <= idx <= end_index:
            return word
        # Skip the single space that separates consecutive words.
        current_index = end_index + 2
    return None

word_tag_list = []
for index, tag, _ in idx_tag_token:
    word = find_word_from_index(index, word_string)
    word_tag_list.append((word, tag))
word_tag_list
Current output:
[('At', 'O'),
('London,', 'GPE-B'),
('London,', 'O'),
('the', 'DATE-B'),
('12th', 'DATE-I'),
('in', 'O'),
('February,', 'DATE-B'),
('February,', 'DATE-I'),
('1942,', 'DATE-I'),
('1942,', 'O'),
('and', 'O'),
('for', 'O'),
('that', 'O'),
('that', 'O'),
('reason', 'O'),
("Mark's", 'PERSON-B'),
("Mark's", 'O'),
('(3)', 'O'),
('(3)', 'O'),
('(3)', 'O'),
('wins,', 'O'),
('wins,', 'O'),
('American', 'NORP-B'),
('parts', 'O')]
Desired output:
[('At', 'O'),
('London,', 'GPE-B'),
('the', 'DATE-B'),
('12th', 'DATE-I'),
('in', 'O'),
('February,', 'DATE-B'),
('1942,', 'DATE-I'),
('and', 'O'),
('for', 'O'),
('that', 'O'),
('that', 'O'),
('reason', 'O'),
("Mark's", 'PERSON-B'),
('(3)', 'O'),
('wins,', 'O'),
('American', 'NORP-B'),
('parts', 'O')]
Answer 1
Score: 1
Try:
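def get_tokens(tokens):
    it = iter(tokens)
    _, token_type, next_token = next(it)
    # Wait for the first word to be sent in.
    word = yield
    while True:
        if next_token == word:
            # The accumulated tokens spell the word: emit it with the tag
            # of its first token, then start on the next token.
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            # Not a full word yet: glue on the next token's text.
            _, _, tmp = next(it)
            next_token += tmp

it = get_tokens(idx_tag_token)
next(it)  # prime the generator so it can receive values via send()
out = [it.send(w) for w in word_string.split()]
print(out)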
Prints:
[
("At", "O"),
("London,", "GPE-B"),
("the", "DATE-B"),
("12th", "DATE-I"),
("in", "O"),
("February,", "DATE-B"),
("1942,", "DATE-I"),
("and", "O"),
("for", "O"),
("that", "O"),
("that", "O"),
("reason", "O"),
("Mark's", "PERSON-B"),
("(3)", "O"),
("wins,", "O"),
("American", "NORP-B"),
("parts", "O"),
]
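An alternative that uses the character offsets directly, in case it helps. This is only a sketch: tags_by_word is a name made up here, and it assumes each word should take the tag of its first token, which is what the desired output in the question appears to follow. It also assumes every token's index falls inside some whitespace-separated word; a word with no matching token would get None.

```python
import re
from bisect import bisect_right


def tags_by_word(text, tokens):
    """Return one (word, tag) pair per whitespace-separated word in text."""
    # Start offset and text of every whitespace-delimited word.
    spans = [(m.start(), m.group()) for m in re.finditer(r"\S+", text)]
    starts = [start for start, _ in spans]
    tags = [None] * len(spans)
    for idx, tag, _ in tokens:
        # Last word whose start offset is <= the token's offset.
        pos = bisect_right(starts, idx) - 1
        # Keep only the first tag seen for each word.
        if pos >= 0 and tags[pos] is None:
            tags[pos] = tag
    return [(word, tag) for (_, word), tag in zip(spans, tags)]


# Shortened version of the data from the question.
demo_text = "At London, the 12th"
demo_tokens = [(0, 'O', 'At'), (3, 'GPE-B', 'London'), (9, 'O', ','),
               (11, 'DATE-B', 'the'), (15, 'DATE-I', '12th')]
print(tags_by_word(demo_text, demo_tokens))
# [('At', 'O'), ('London,', 'GPE-B'), ('the', 'DATE-B'), ('12th', 'DATE-I')]
```

Because the word spans come from re.finditer rather than from counting single spaces, this should also cope with runs of several spaces, as long as the indices in idx_tag_token refer to the same string.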