Create Hive Metastore Tables from Multiple Sub Folders on Blob storage
Question
I have Azure Delta tables on blob storage with the following folder structure:

```
Lvl1/Lvl2/db1/Table1
Lvl1/Lvl2/db1/Table2
Lvl1/Lvl2/db1/Table3
Lvl1/Lvl2/db2/Table1
Lvl1/Lvl2/db2/Table2
Lvl1/Lvl2/db2/Table3
Lvl1/Lvl2/db3/Table1
```
I want to create Hive Metastore table links for all of the above tables under a single database, so I created the database with the following command:

```python
spark.sql('CREATE DATABASE IF NOT EXISTS parentdb')
```
I am currently linking the tables one at a time with the following command:

```python
tablename = ...  # generated dynamically
spark.sql(f"CREATE TABLE IF NOT EXISTS parentdb.{tablename} USING DELTA LOCATION '{path}'")
```
I want Spark to read all the above table locations and create the tables, under those table names, within the single database created above. Browsing the Hive Metastore from the Databricks Data tab should then look like this:
```
Parent_db --> db1_table1
              db1_table2
              db1_table3
              db2_table1
              db2_table2
              db2_table3
              db3_table1
              ...
```
I can generate the dynamic table names with db1, db2, db3 ... The issue is only reading all the tables from the Delta location and creating them (reading all subfolders within the root folder). So all I want is to loop through the folders and create links for every table under the single database. Any help with this one, please.
Answer 1

Score: 0
I have reproduced the above and was able to get the tables stored in the Hive metastore database.

First, I have the same Delta tables in my blob storage, under the same paths, at my mount location.

Then use the code below to get the list of Delta table paths and build the lists of databases and tables:
```python
import glob

# List every <db>/<Table> folder under the mount. '/dbfs' exposes the mount on
# the driver's local filesystem; x[5:] strips that prefix so the paths can be
# used as Delta locations. Without recursive=True, '**' matches exactly one
# path segment, which fits the fixed two-level layout here.
paths = [x[5:] for x in glob.iglob('/dbfs/mnt/data/Lvl1/Lvl2/**/*')]
print("paths list : ", paths)

# The database and table names are the last two path segments.
dbs_list = [x[-2] for x in [y.split('/') for y in paths]]
print("dbs list : ", dbs_list)
table_list = [x[-1] for x in [y.split('/') for y in paths]]
print("table list : ", table_list)
```
Then use the code below to create the tables in the Hive metastore:

```python
spark.sql('CREATE DATABASE IF NOT EXISTS parentdb2')

# Register each Delta folder as an external table named <db>_<Table>.
for i in range(0, len(paths)):
    table_name = dbs_list[i] + '_' + table_list[i]
    spark.sql(f"CREATE TABLE IF NOT EXISTS parentdb2.{table_name} USING DELTA LOCATION '{paths[i]}'")
```
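As a quick sanity check (not part of the original answer), you can list what was registered in the new database:

```python
# Should list db1_table1, db1_table2, ..., db3_table1 under parentdb2.
spark.sql('SHOW TABLES IN parentdb2').show(truncate=False)
```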
**My Execution:**

![enter image description here](https://i.stack.imgur.com/pDwSs.png)

![enter image description here](https://i.stack.imgur.com/rXtfB.png)

**Result:**

![enter image description here](https://i.stack.imgur.com/Gq4cg.png)
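As an alternative sketch, the same enumeration can be done with Databricks' `dbutils.fs.ls` instead of going through the `/dbfs` local view (assuming the same `/mnt/data` mount point; `dbutils` is only available inside Databricks notebooks):

```python
# Walk the two fixed levels (<db>/<Table>) with dbutils.fs.ls. Directory
# entries carry a trailing '/' in .name, which strip('/') removes.
base = '/mnt/data/Lvl1/Lvl2'  # assumed mount point, same as above
for db_dir in dbutils.fs.ls(base):
    for table_dir in dbutils.fs.ls(db_dir.path):
        table_name = f"{db_dir.name.strip('/')}_{table_dir.name.strip('/')}"
        spark.sql(f"CREATE TABLE IF NOT EXISTS parentdb2.{table_name} "
                  f"USING DELTA LOCATION '{table_dir.path}'")
```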