IndexError: Index out of range in self while implementing transformer model for translation

Question

I am trying to implement a Transformer model for a translation task, following some YouTube tutorials, but I am getting an index-out-of-range error. The problem seems to be with the input dimensions, but I can't figure out what it is.
Here is the code (Google Colab link)

You can find the datasets here

I tried changing the dimensions, but it didn't help, or I didn't do it correctly. I hope someone can help solve this problem. Thanks.

Answer 1

Score: 0

Here is the code with the suggested modifications:

# Give each special token its own unique string
START_TOKEN = "START"
PADDING_TOKEN = "PAD"
END_TOKEN = "END"

# english_vocabulary should then contain each special token exactly once,
# alongside the other characters of your vocabulary

# Temporary debugging aid in the forward method of the SentenceEmbedding class
def forward(self, x, start_token, end_token):
    x = self.batch_tokenize(x, start_token, end_token)
    print(torch.max(x))  # Largest token id in the batch; it must be smaller than the embedding table size
    x = self.embedding(x)
    pos = self.position_encoder().to(get_device())
    x = self.dropout(x + pos)
    return x  # Hand the embedded batch back to the caller

# Also add the backslash character '\\' to english_vocabulary to avoid KeyError: '\\'

# The rest of your code remains the same

These modifications address the index-out-of-range error described in the question. Integrate them into your existing code as needed.

I went through your code. According to your error trace, the error is raised in the forward call of SentenceEmbedding, during the encoder stage:
>      69 def forward(self, x, start_token, end_token): # sentence
>      70     x = self.batch_tokenize(x, start_token, end_token)
> ---> 71     x = self.embedding(x)
>      72     pos = self.position_encoder().to(get_device())
>      73     x = self.dropout(x + pos)

If you add print(torch.max(x)) just before the line x = self.embedding(x), you can see that x contains an id that is >= 68. The embedding table only accepts ids from 0 to 67, so PyTorch raises the error shown in the stack trace for any id of 68 or more.

This means that, while converting tokens to ids, you are assigning an id that is too large for the table.
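The check described above can be sketched without PyTorch. The helper name find_out_of_range_ids and the sample batch are hypothetical and not taken from the notebook; in the real model x is a tensor, so print(torch.max(x)) gives the same information. This is only a minimal sketch of what to look for, assuming the table size is 68 as in the trace.

```python
def find_out_of_range_ids(batch, num_embeddings):
    """Return (row, column, id) for every id that an embedding table
    with num_embeddings rows could not look up."""
    bad = []
    for row, ids in enumerate(batch):
        for col, token_id in enumerate(ids):
            # Valid ids are 0 .. num_embeddings - 1
            if not 0 <= token_id < num_embeddings:
                bad.append((row, col, token_id))
    return bad


# Hypothetical batch of token ids; 69 mimics the id produced by the bug.
batch = [[5, 12, 67], [69, 3, 0]]
print(find_out_of_range_ids(batch, num_embeddings=68))  # [(1, 0, 69)]
```

Knowing the position of the offending id makes it easier to trace it back to the token that produced it.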

To prove my point:

When you create english_to_index, your english_vocabulary contains three "" entries (START_TOKEN, PADDING_TOKEN, and END_TOKEN are all ""), so you end up generating { "": 69 }. This value is greater than len(english_to_index), which is 68.
Hence, you get IndexError: index out of range in self.
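The effect of the duplicate "" entries can be reproduced on a small scale. The shortened vocabulary below is made up for illustration, and the sketch assumes that english_to_index is built by enumerating english_vocabulary in a dict comprehension, which may differ in detail from the notebook.

```python
# Hypothetical, shortened vocabulary: the three special tokens are all "".
START_TOKEN = ""
PADDING_TOKEN = ""
END_TOKEN = ""

english_vocabulary = [START_TOKEN, "a", "b", "c", PADDING_TOKEN, END_TOKEN]

# Later duplicates overwrite earlier ones, so "" keeps only its last position.
english_to_index = {token: index for index, token in enumerate(english_vocabulary)}

print(len(english_vocabulary))  # 6
print(len(english_to_index))    # 4
print(english_to_index[""])     # 5
```

The dictionary has only 4 entries, yet "" maps to id 5. An embedding table sized from len(english_to_index) would therefore be too small for that id, which is the same situation as id 69 against a table of 68.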

Solution

As a solution, you can give these tokens unique tags (which is the generally recommended practice):

START_TOKEN = "START"
PADDING_TOKEN = "PAD"
END_TOKEN = "END"

This ensures that the generated dictionaries have the correct sizes.
Please find the working Google Colaboratory file, which includes a solution section, here.
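One way to guard against this class of bug is to refuse duplicate tokens when building the index. The helper build_index and the shortened vocabulary below are hypothetical and not part of the linked notebook; this is a sketch under the assumption that the vocabulary is a list of tokens.

```python
def build_index(vocabulary):
    """Map each token to its position, refusing duplicate tokens."""
    seen = set()
    duplicates = set()
    for token in vocabulary:
        if token in seen:
            duplicates.add(token)
        seen.add(token)
    if duplicates:
        raise ValueError(f"duplicate tokens in vocabulary: {sorted(duplicates)!r}")
    return {token: index for index, token in enumerate(vocabulary)}


START_TOKEN = "START"
PADDING_TOKEN = "PAD"
END_TOKEN = "END"

# Hypothetical, shortened vocabulary with unique special tokens.
english_vocabulary = [START_TOKEN, "a", "b", "c", PADDING_TOKEN, END_TOKEN]
english_to_index = build_index(english_vocabulary)

# With unique tokens, the largest id is always len(english_to_index) - 1.
print(len(english_to_index), max(english_to_index.values()))  # 6 5
```

Failing early with a clear message is usually easier to debug than an IndexError raised later inside the embedding lookup.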

I also added '\\' to english_vocabulary, because otherwise, after a few iterations, we get a KeyError: '\\'.
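Characters that would trigger such a KeyError can be found before training starts. The sketch below assumes character-level tokenization (the KeyError on a single character suggests this, but it is an assumption); missing_characters and the sample data are hypothetical.

```python
def missing_characters(sentences, vocabulary):
    """Return the set of characters that occur in the sentences
    but have no entry in the vocabulary."""
    known = set(vocabulary)
    return {ch for sentence in sentences for ch in sentence if ch not in known}


# Hypothetical data: the second sentence contains a single backslash.
vocabulary = ["START", " ", "a", "b", "c", "PAD", "END"]
sentences = ["a b", "c\\a"]

print(missing_characters(sentences, vocabulary))  # {'\\'}

# After adding the backslash, nothing is missing any more.
vocabulary.append("\\")
print(missing_characters(sentences, vocabulary))  # set()
```

Running a check like this over the whole dataset once reveals every character that needs to be added to the vocabulary or filtered out.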

Hope it helps.

huangapple
  • Posted on July 17, 2023, 15:41:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76702377.html