英文:
create view from athena table using program
问题
我有100多张Athena表,例如table1,table2等等。
是否有任何方法可以在Athena中创建一个通用视图,以便在自动创建任何表时(使用任何程序或脚本)也创建视图?
例如,如果我在Athena中创建任何表MyTable
,那么也应创建视图v_MyTable
。
英文:
I've 100 plus tables in Athena, for example table1, table2 and so on.
Is there any way I could create generic view once I have any table in Athena automatically (using any program or script)?
For example, If I create any table MyTable
in Athena, so v_MyTable
should be created as view too.
答案1
得分: 1
你可以使用AWS-Glue或AWS-Lambda来实现这个目的。我更倾向于使用AWS-Glue,类似下面的Glue作业示例一样进行每日调度,但你也可以设计自己的解决方案。AWS-Lambda支持文件上传触发器,但不支持文件夹上传触发器,所以无论如何你都应该使用调度。
1- 创建一个Delta表,用于保存所有同时具有其自己视图的表名。但当开始时没有表-视图对时,它当然是空的。(顺便说一句,你可以使用普通的Glue Catalog表而不是Delta表,每次覆盖它。但是,Delta允许在S3上使用插入-更新-删除)。
2- 在你的Glue作业上创建一个sqlContext。
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
spark_session = glue_context.spark_session
sql_context = SQLContext(spark_session.sparkContext, spark_session)
3- 从所需的数据库中获取所有表名,类似下面的方式;
table_names_df = sqlContext.sql("show tables in <database_name>")
4- 从Delta表中获取具有相关视图的表名;
tables_with_view_df = sqlContext.sql("select * from <database_name.delta_table_name>")
5- 左连接这些表,以查找哪些表没有自己的视图。
tables_wo_view_df = (table_names_df.join(tables_with_view_df,
table_names_df.table_name==tables_with_view_df.table_name2,
"leftouter")
.where(col("table_name2").isNull())
.drop("table_name2"))
6- 创建视图,但最好使用循环,不要像下面这样(不要忘记将tables_wo_view_df转换为列表。例如,你可以使用rdd.flatMap())。
create_views_sql_script = """
CREATE VIEW <{database_name}.{table_name_wo_view}_view> AS
SELECT *
FROM <{database_name}.{table_name_wo_view}>
"""
ath = boto3.client('athena')
ath.start_query_execution(
QueryString=create_views_sql_script,
ResultConfiguration={'OutputLocation': 's3://.../'})
7- 最后使用tables_wo_view_df更新Delta表。
谢谢。
英文:
You can use AWS-Glue or AWS-Lambda for this purpose.
I'd prefer using AWS-Glue with daily schedule like below Glue Job example but you can design your own solution.
AWS-Lambda supports file-upload trigger but not folder-upload trigger so you should use a schedule anyway.
1- Create a delta-table which will keep all table names which have also its own view. But of course it will be empty at the beginning since there is no table-view pair at the beginning. (By the way you could use normal Glue Catalog table instead Delta Table by overwriting at each time. However, Delta provides of using Insert-Update-Delete on S3).
2- Create a sqlContext on your Glue Job.
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
spark_session = glue_context.spark_session
sql_context = SQLContext(spark_session.sparkContext, spark_session)
3- Get all the table names from the database desired like below;
table_names_df = sqlContext.sql("show tables in <database_name>")
4- Get the table names which have its relevant view from the Delta table;
tables_with_view_df = sqlContext.sql("select * from <database_name.delta_table_name>")
5- Left join these tables to find out which tables don't have its own view.
tables_wo_view_df = (table_names_df.join(tables_with_view_df,
table_names_df.table_name==tables_with_view_df.table_name2,
"leftouter")
.where(col("table_name2").isNull())
.drop("table_name2"))
6- Create the views but you'd better use a loop
unlike below (do not forget to convert tables_wo_view_df to a list. You can use rdd.flatMap() for example.
create_views_sql_script = """
CREATE VIEW <{database_name}.{table_name_wo_view}_view> AS
SELECT *
FROM <{database_name}.{table_name_wo_view}>
"""
ath = boto3.client('athena')
ath.start_query_execution(
QueryString=create_views_sql_script,
ResultConfiguration={'OutputLocation': 's3://.../'})
7- Finally update the Delta Table with tables_wo_view_df.
Thanks.
答案2
得分: 0
AWS Glue数据目录通过API调用可访问。
例如,可以调用get_tables()
(其中包括Views)和create_table()
。
您可以创建一个按计划运行的AWS Lambda函数。Lambda函数将调用API以查询现有的表(包括视图),并在必要时创建新视图。
英文:
The AWS Glue Data Catalog is accessible via API calls.
For example, there are calls to get_tables()
(which includes Views) and create_table()
.
You could create an AWS Lambda function that runs on a schedule. The Lambda function would call the APIs to query the existing tables (including views) and create new views when necessary.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论