创建视图从Athena表使用程序

huangapple go评论97阅读模式
英文:

create view from athena table using program

问题

我有100多张Athena表,例如table1,table2等等。

是否有任何方法可以在Athena中创建一个通用视图,以便在自动创建任何表时(使用任何程序或脚本)也创建视图?

例如,如果我在Athena中创建任何表MyTable,那么也应创建视图v_MyTable

英文:

I've 100 plus tables in Athena, for example table1, table2 and so on.

Is there any way I could create generic view once I have any table in Athena automatically (using any program or script)?

For example, If I create any table MyTable in Athena, so v_MyTable should be created as view too.

答案1

得分: 1

你可以使用AWS-Glue或AWS-Lambda来实现这个目的。我更倾向于使用AWS-Glue,类似下面的Glue作业示例一样进行每日调度,但你也可以设计自己的解决方案。AWS-Lambda支持文件上传触发器,但不支持文件夹上传触发器,所以无论如何你都应该使用调度。

1- 创建一个Delta表,用于保存所有同时具有其自己视图的表名。但当开始时没有表-视图对时,它当然是空的。(顺便说一句,你可以使用普通的Glue Catalog表而不是Delta表,每次覆盖它。但是,Delta允许在S3上使用插入-更新-删除)。

2- 在你的Glue作业上创建一个sqlContext。

spark_context = SparkContext()
glue_context = GlueContext(spark_context)
spark_session = glue_context.spark_session
sql_context = SQLContext(spark_session.sparkContext, spark_session)

3- 从所需的数据库中获取所有表名,类似下面的方式;

table_names_df = sqlContext.sql("show tables in <database_name>")

4- 从Delta表中获取具有相关视图的表名;

tables_with_view_df = sqlContext.sql("select * from <database_name.delta_table_name>")

5- 左连接这些表,以查找哪些表没有自己的视图。

tables_wo_view_df = (table_names_df.join(tables_with_view_df,
              table_names_df.table_name==tables_with_view_df.table_name2,
"leftouter")
.where(col("table_name2").isNull())
.drop("table_name2"))

6- 创建视图,但最好使用循环,不要像下面这样(不要忘记将tables_wo_view_df转换为列表。例如,你可以使用rdd.flatMap())。

create_views_sql_script = """
CREATE VIEW <{database_name}.{table_name_wo_view}_view> AS
SELECT *
FROM <{database_name}.{table_name_wo_view}>
"""
ath = boto3.client('athena')
ath.start_query_execution(
        QueryString=create_views_sql_script,
        ResultConfiguration={'OutputLocation': 's3://.../'})

7- 最后使用tables_wo_view_df更新Delta表。

谢谢。

英文:

You can use AWS-Glue or AWS-Lambda for this purpose.
I'd prefer using AWS-Glue with daily schedule like below Glue Job example but you can design your own solution.
AWS-Lambda supports file-upload trigger but not folder-upload trigger so you should use a schedule anyway.

1- Create a delta-table which will keep all table names which have also its own view. But of course it will be empty at the beginning since there is no table-view pair at the beginning. (By the way you could use normal Glue Catalog table instead Delta Table by overwriting at each time. However, Delta provides of using Insert-Update-Delete on S3).

2- Create a sqlContext on your Glue Job.

spark_context = SparkContext()
glue_context = GlueContext(spark_context)
spark_session = glue_context.spark_session
sql_context = SQLContext(spark_session.sparkContext, spark_session)

3- Get all the table names from the database desired like below;

table_names_df = sqlContext.sql(&quot;show tables in &lt;database_name&gt;&quot;)

4- Get the table names which have its relevant view from the Delta table;

tables_with_view_df = sqlContext.sql(&quot;select * from &lt;database_name.delta_table_name&gt;&quot;)

5- Left join these tables to find out which tables don't have its own view.

tables_wo_view_df = (table_names_df.join(tables_with_view_df,
              table_names_df.table_name==tables_with_view_df.table_name2,
&quot;leftouter&quot;)
.where(col(&quot;table_name2&quot;).isNull())
.drop(&quot;table_name2&quot;))

6- Create the views but you'd better use a loop
unlike below (do not forget to convert tables_wo_view_df to a list. You can use rdd.flatMap() for example.

create_views_sql_script = &quot;&quot;&quot;
CREATE VIEW &lt;{database_name}.{table_name_wo_view}_view&gt; AS
SELECT *
FROM &lt;{database_name}.{table_name_wo_view}&gt;
&quot;&quot;&quot;
ath = boto3.client(&#39;athena&#39;)
ath.start_query_execution(
        QueryString=create_views_sql_script,
        ResultConfiguration={&#39;OutputLocation&#39;: &#39;s3://.../&#39;})

7- Finally update the Delta Table with tables_wo_view_df.

Thanks.

答案2

得分: 0

AWS Glue数据目录通过API调用可访问。

例如,可以调用get_tables()(其中包括Views)和create_table()

您可以创建一个按计划运行的AWS Lambda函数。Lambda函数将调用API以查询现有的表(包括视图),并在必要时创建新视图。

英文:

The AWS Glue Data Catalog is accessible via API calls.

For example, there are calls to get_tables()(which includes Views) and create_table().

You could create an AWS Lambda function that runs on a schedule. The Lambda function would call the APIs to query the existing tables (including views) and create new views when necessary.

huangapple
  • 本文由 发表于 2023年3月7日 13:52:34
  • 转载请务必保留本文链接:https://go.coder-hub.com/75658449.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定