Redshift Loading Data Points To Different Bucket
Question
This post was going to be about my issues loading some simple CSV data into a Redshift table, but halfway through writing it I realised that, for whatever reason, when selecting the bucket in the COPY command, Redshift was pointing to the wrong one!
Can someone explain why this is the case? For context, the bucket I selected was
s3://soccer-project/Player
but Redshift defaulted to
s3://soccer-project/Player_Attributes
which is another file in my bucket.
I'm new to Redshift... can someone help me understand this?
Thanks
Answer 1
Score: 1
Only the top (leftmost) part of the S3 object path is the bucket. In both of your cases this is "soccer-project".
Now I expect that what you are showing is only part of the object name: "Player" and "Player_Attributes" in your question. These are not the full names of the objects. The full object names are these parts plus a slash and more text. Redshift is set up to take partial object names so that it can expand the set of files copied to include all object names that match the partial name. Correct me if I'm interpreting the question incorrectly.
To understand what is going on, you need to understand that S3 is an object store and not a file system. This means that all files are stored "flat" under each bucket. Only two things identify an object: the bucket name and the object name. There is no real hierarchy in the storage. However, to make things a little more organized when us humans look, S3 organizes the objects by looking at slashes in the object name and makes things seem hierarchical. But in reality everything after the bucket name and slash is the object name, including any slashes, "folder" names, or anything else you think has unique meaning. It is all the object name.
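To make the "flat" point concrete, here is a minimal Python sketch. It does not talk to S3 at all; the object keys are hypothetical (the question does not show the real ones), and `folder_view` is an illustrative helper, not an AWS API. It only shows how a folder-like view can be derived from plain key strings by splitting on the slash.

```python
from collections import defaultdict

# Hypothetical object keys; the real contents of the bucket are not shown in the question.
object_keys = [
    "Player/part-0000.csv",
    "Player/part-0001.csv",
    "Player_Attributes/part-0000.csv",
]


def folder_view(keys, delimiter="/"):
    """Group flat keys by the text before the first delimiter.

    This mimics how a console-style listing can present "folders",
    even though each key is just one string.
    """
    folders = defaultdict(list)
    for key in keys:
        head, _, rest = key.partition(delimiter)
        folders[head].append(rest)
    return dict(folders)


view = folder_view(object_keys)
for folder, names in view.items():
    print(folder, names)
```

The "folders" here exist only in the grouped view; the stored names are still the full strings in `object_keys`.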
Now to your situation: you likely have object names in your bucket that start with "Player" or "Player_Attributes", with the next character in the name being a slash. This is all just the first part of the object name. I'd also guess that your COPY command has a FROM clause like "s3://soccer-project/Player*". (Providing your COPY command in the question would really help clear up what is going on.) The '*' is a wildcard that matches all following characters in the object name, which will match "Player_Attributes". Note that COPY treats the FROM path as a key prefix, so "s3://soccer-project/Player" with no wildcard would pick up "Player_Attributes" objects in the same way. If all of this is correct, then you can fix this by changing the FROM clause to "s3://soccer-project/Player/" (slash added).
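The effect of the trailing slash can be sketched in a few lines of Python. This is only a simulation of prefix matching on strings, assuming the FROM path is treated as a key prefix. The keys are hypothetical and `keys_matching_prefix` is an illustrative helper, not part of Redshift or any AWS SDK.

```python
# Hypothetical object keys; the real contents of the bucket are not shown in the question.
object_keys = [
    "Player/part-0000.csv",
    "Player/part-0001.csv",
    "Player_Attributes/part-0000.csv",
]


def keys_matching_prefix(keys, prefix):
    """Return every key that starts with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


# "Player" is also the start of "Player_Attributes/...", so both groups match.
loose = keys_matching_prefix(object_keys, "Player")

# Adding the slash restricts the match to keys under "Player/".
strict = keys_matching_prefix(object_keys, "Player/")

print(loose)
print(strict)
```

With these sample keys, the prefix without the slash picks up the "Player_Attributes" object as well, while the prefix with the slash does not.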
As I said above, this is a best guess based on the partial info provided. Please update the question if this is incorrect.