Easiest way to remap column headers in Glue/Athena?
Question
Data has headers like _col_0, _col_1, etc. I have a sample data file that has the correct column headers.
However, all the data is in snappy/parquet across ~250 files.
What is the easiest way to remap the column headers in Glue?
Thanks.
UPDATE:
So I tried John R's comment below: I went in and edited the table schema in Glue to rename the columns, but when I query the data, any column I edited is now missing its data.
I tried re-running the Glue job, and it overwrote the edited schema (makes sense), but the data is back.
So editing the column name in the schema causes that data to be dropped, or not applied to the column. I searched on Google but didn't see any related issues.
UPDATE #2:
If I rename the column back to its original name (which is in the snappy/parquet files) then the data comes back.
UPDATE #3:
I basically solved this by creating a view in Athena that renames the _col_0... columns to their correct names.
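For anyone with a lot of columns, writing that view by hand gets tedious. Here is a minimal sketch that builds the CREATE OR REPLACE VIEW statement from a list of header names. The view name, table name, and header values are hypothetical placeholders, and it assumes the Nth header corresponds to _col_N:

```python
def build_rename_view(view_name, table_name, new_names):
    """Return a CREATE OR REPLACE VIEW statement that aliases _col_N columns.

    Assumes the i-th entry of new_names is the intended name for _col_i.
    """
    select_items = [
        f'"_col_{i}" AS "{name}"' for i, name in enumerate(new_names)
    ]
    columns = ",\n  ".join(select_items)
    return (
        f'CREATE OR REPLACE VIEW "{view_name}" AS\n'
        f"SELECT\n  {columns}\n"
        f'FROM "{table_name}"'
    )


# Hypothetical header names, e.g. taken from the first line of the sample file.
header = ["order_id", "customer_id", "amount"]
statement = build_rename_view("orders_named", "orders_raw", header)
print(statement)
```

The printed statement can be pasted into the Athena query editor. The underlying files and the Glue table stay untouched, so re-running the Glue job should not undo the renaming.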
Answer 1
Score: 1
I would say that you can't rename columns of tables stored in Parquet, because the schema is contained in the files themselves. If you add a file with a different header, then Glue will treat it in one of three ways when running crawlers: ignore the changes, add new columns, or create new tables, depending on how the crawler is set up.
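To see why a renamed column comes back empty, here is a toy model in plain Python. It is not Athena's code; it only assumes that table columns are matched to the columns inside each Parquet file by name (which, as far as I know, is Athena's default for Parquet), so a name that doesn't exist in the file resolves to NULL:

```python
# Toy model only: each "file" row maps the column names embedded in the
# Parquet file to values. The values are made up for illustration.
file_rows = [
    {"_col_0": 1, "_col_1": "a"},
    {"_col_0": 2, "_col_1": "b"},
]


def read_by_name(rows, table_columns):
    """Resolve each table column by name; names missing from the file give None."""
    return [{col: row.get(col) for col in table_columns} for row in rows]


# Table schema that matches the names stored in the files.
original = read_by_name(file_rows, ["_col_0", "_col_1"])

# Same table after renaming _col_0 to a hypothetical "order_id" in the catalog.
renamed = read_by_name(file_rows, ["order_id", "_col_1"])

print(original[0])
print(renamed[0])
```

In this model the renamed column is all None while the untouched column still has data, which matches what the question describes after editing the schema in Glue.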
What you can do is recreate all the files in another bucket with the desired column names and create another table from them, then delete the old files if you want.
Although this might sound a little difficult, it is not. You can use Athena's UNLOAD command (also available on Redshift) to do so. This is also a good moment to (re)partition your files if you see fit. I will leave an example here; you can then customize the command for your own needs. Check the docs for further reference: UNLOAD - Amazon Athena
UNLOAD (SELECT _col_0 AS new_col0_name, _col_1 AS new_col1_name FROM table)
TO 's3://yourbucket/your_partitioned_table/'
WITH (format = 'PARQUET', compression = 'SNAPPY', partitioned_by = ARRAY['new_col1_name'])
Redshift's UNLOAD is more robust, so if you need more options, like controlling the maximum file size, you can get them by using Redshift Spectrum to query the data directly from S3.
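If there are many columns, the Athena UNLOAD statement above can be generated rather than typed. This is only a sketch: the table name, bucket path, and column mapping are hypothetical, and it puts the partition columns at the end of the SELECT list because, to my understanding, Athena expects them last when partitioned_by is used:

```python
def build_unload(table, mapping, s3_path, partition_cols=()):
    """Build an Athena UNLOAD statement that renames columns.

    mapping: dict of old column name -> new column name.
    partition_cols: new column names to partition by; they are moved to the
    end of the SELECT list, in the order given.
    """
    new_to_old = {new: old for old, new in mapping.items()}
    ordered = [(old, new) for old, new in mapping.items()
               if new not in partition_cols]
    ordered += [(new_to_old[new], new) for new in partition_cols]

    select_list = ", ".join(f'"{old}" AS "{new}"' for old, new in ordered)
    options = ["format = 'PARQUET'", "compression = 'SNAPPY'"]
    if partition_cols:
        quoted = ", ".join(f"'{col}'" for col in partition_cols)
        options.append(f"partitioned_by = ARRAY[{quoted}]")
    with_clause = ", ".join(options)

    return (
        f'UNLOAD (SELECT {select_list} FROM "{table}")\n'
        f"TO '{s3_path}'\n"
        f"WITH ({with_clause})"
    )


# Hypothetical names, for illustration only.
unload = build_unload(
    "raw_table",
    {"_col_0": "region", "_col_1": "amount"},
    "s3://yourbucket/renamed/",
    partition_cols=["region"],
)
print(unload)
```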
Answer 2
Score: 0
A similar approach to what @JohnRotenstein mentioned is to "edit" the table's DDL by dropping and re-creating the table in Athena, which would allow you to query the data with the new column names.
- Go to the Athena console and right-click the table name
- Select "Generate table DDL"; this gives you the code needed to modify the table (copy the code and save it for later use)
- Right-click the table name and select "Delete table"
- In the query editor, paste the code from step two, manually alter the column names, and run the code
- Query the table and you should see all the data with the new column names
Whether this is useful depends on how your Glue job updates the table later: the new definition might end up being overwritten the next time you run your Glue job.
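The manual edit in step four can be scripted when there are many columns. A rough sketch, assuming the generated DDL wraps column names in backticks; the DDL text and the new names below are a shortened, hypothetical example:

```python
def rename_ddl_columns(ddl, mapping):
    """Replace backtick-quoted column names in a DDL string.

    Matching on the backticks keeps `_col_1` from also matching `_col_10`.
    """
    for old, new in mapping.items():
        ddl = ddl.replace(f"`{old}`", f"`{new}`")
    return ddl


# Shortened, hypothetical DDL in the style of "Generate table DDL" output.
generated_ddl = (
    "CREATE EXTERNAL TABLE `orders_raw`(\n"
    "  `_col_0` bigint,\n"
    "  `_col_1` string,\n"
    "  `_col_10` double)\n"
    "STORED AS PARQUET\n"
    "LOCATION 's3://yourbucket/orders/'"
)

new_ddl = rename_ddl_columns(
    generated_ddl,
    {"_col_0": "order_id", "_col_1": "customer", "_col_10": "amount"},
)
print(new_ddl)
```

This only automates the text edit. Whether the re-created table actually returns data under the new names still depends on how the columns are matched to the Parquet files, as the updates in the question show.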