英文:
AWS Athena table created in Glue catalog is empty
问题
我正在尝试在S3 Parquet文件上创建Athena表格。
我在Glue目录中创建了一个表格(数据目录->数据库->表格->添加表格)。
添加了S3位置并手动上传了模式。
当我在Athena中查询表格时,查询结果中没有记录 - 简单的select * from table
。
我搜索了问题并看到可能的解释是在位置文件夹下有多个不同的模式。
这对我来说是这种情况,我在同一个S3文件夹下有两种不同模式的Parquet文件,因为添加了一些字段 - 这意味着新模式包含了先前的模式。
数据按照日期和时间分区,其中一个时间具有新模式 - 当我尝试在这个单独的时间上创建表格时,它正常工作并返回了预期的记录。
有没有办法“合并”这些模式?
因为使用Crawlers我看到有一个更新模式的选项:
在数据存储中更新模式:
更新数据目录中的表格定义
我想要的是看到新列,并且在属于这些新列下的先前Parquet数据中有null值。
如果这不可能的话,在这种情况下我应该使用Glue吗?每天文件夹中都有新文件,我希望避免每天重新爬行文件夹。
提前感谢。
英文:
I am trying to create an Athena table over s3 parquet files.
I created a table in glue catalog (Data Catalog -> Databases -> Tables -> Add Table).
Added s3 location and uploaded a schema manually.
When I query the table in Athena there are no record in the query result - simple select * from table
.
I searched the problem and saw a possible explanation for this is having several different schemas under the location folder.
That is the case for me, I have under the same s3 folder parquet files with two different schemas due to addition of some fields - which means the new schema contains the previous.
The data is partitioned by dt and tm, and one of the tm's has a new schema - when I tried to create a table over this single tm it worked and returned records as expected.
Is there a way to "merge" the schemas?
Because using Crawlers I saw there's an option to update the schema:
Schema updates in the data store:
Update the table definition in the data catalog
What I would have wanted is to see the new columns, and to have nulls in the data belongs to previous parquets under these new columns.
I that's not possible, should I use glue in this situation? there are new files in the folder every day and I wanted to avoid re-crawling the folder every day.
Thanks in advance.
答案1
得分: 0
我认为在同一S3路径下拥有不同架构的文件是完全不可取的。
我建议将所有这些不同的文件存储在不同的S3位置。然后,您应该能够轻松地为每个位置构建一个表。
完成后,您始终可以通过使用视图将这些表合并为单个表的选项。
我找到了一些有用的AWS文档,对您可能有帮助。
https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-syncing
英文:
I believe that having different files with different schemas under the same S3 path is something totally inadvisable.
I would recommend to store all those different files in different S3 locations. After that, you should be able to build a table easily for each location.
Once that is done, you always have the option to merge those table into a single one by using a view.
I found some useful AWS documentation that can be helpful for you.
https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-syncing
答案2
得分: 0
Parquet格式支持模式演化(来源:https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html)。
您只需创建一个包括所有列的表,包括不在旧Parquet文件中但在最新文件中的新列。
如果模式中有一个列,但在Parquet文件中没有,那么该列将显示为null,这正是您希望的。即使在同一位置包含不同列定义的Parquet文件也适用。
另一种创建表的方法是直接在SQL中使用CREATE EXTERNAL TABLES:
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
SQL语法大致如下:
create external table nameofyourtable (
COLUMN1_NAME type,
COLUMN2_NAME type,
COLUMN3_NAME type,
...
COLUMNX_NAME type ) Stored as parquet location "s3://bucketname/.../parquetfilebucket"
英文:
Parquet format supports schema evolution (SOURCE : https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html).
You just need to create a table with all columns including the new columns that are not in your old parquet files, but in the newest files.
If a column in your schema but not in the parquet file, the column will appear as null which is what you were hoping for. This applies even if parquet files containing different column definitions are in the same location.
Another way you can create tables is using the CREATE EXTERNAL TABLES directly in SQL :
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
SQL syntax looks roughtly like this :
create external table nameofyourtable (
COLUMN1_NAME type,
COLUMN2_NAME type,
COLUMN3_NAME type,
...
COLUMNX_NAME type ) Stored as parquet location "s3://bucketname/.../parquetfilebucket"
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论