What happens when we delete Spark managed tables?
Question

I recently started learning Spark and have been studying Spark managed tables. According to the docs, "Spark manages both the data and the metadata". Assume that I have a CSV file in S3 and I read it into a DataFrame like this:

df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3a://databricks-learning-s333/temp/flights.csv")
)

Then I created a Spark managed table in Databricks as follows:

spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

spark.sql("""CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT,
  distance INT, origin STRING, destination STRING)""")

df.write.saveAsTable("managed_us_delay_flights_tbl")

Now it is a Spark managed table, so Spark manages both the data and the metadata.

According to the docs, if we drop a managed table, Spark deletes both the metadata and the actual data (docs).

Here are my questions:

  1. The code below drops the Spark managed table. Does that mean it will delete my original data in S3? What exactly does it mean that Spark deletes the data and the metadata?

    spark.sql('DROP TABLE managed_us_delay_flights_tbl')
    
  2. I read here that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3. Does that mean Spark will convert the CSV to Delta format, or will it duplicate the data and write the copy in Delta format somewhere else?

  3. If I create a Spark managed table, does it use the same underlying storage location or a new one? Please explain in detail.

Answer 1

Score: 1

  1. Your CSV file is your source. In the code above, Spark reads the CSV file into a DataFrame and then writes the data into a Delta table, without touching your CSV source.
    When you drop the managed table, your source CSV is not affected.

  2. Spark uses the Delta format for the new table. Delta stores the data as Parquet files alongside a transaction log. That is the default on Databricks, but you can choose another format. It does not affect the source CSV, only your table destination.

  3. Managed tables are created under the folder of your database ("learn_spark_db"). You can find this root folder with:

     %sql DESCRIBE DATABASE learn_spark_db
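
To check where a particular table lives and whether it is managed, one option is to read the detail rows of DESCRIBE TABLE EXTENDED. The sketch below is only an illustration: the helper name `describe_table_storage` is made up for this example, and it assumes the output has the usual `col_name` / `data_type` columns with "Type", "Provider" and "Location" rows, which can vary between Spark and Databricks versions.

```python
def describe_table_storage(spark, table_name):
    """Return the Type, Provider and Location rows reported for a table.

    `spark` is an active SparkSession. The helper name and the set of
    row labels are assumptions made for this sketch.
    """
    rows = spark.sql(f"DESCRIBE TABLE EXTENDED {table_name}").collect()
    wanted = {"Type", "Provider", "Location"}
    # Column rows come first and the detailed table information comes last,
    # so if a data column happens to share one of these names, the later
    # detail row overwrites it in the dict.
    return {
        row["col_name"]: row["data_type"]
        for row in rows
        if row["col_name"] in wanted
    }


# Example call in a notebook (not executed here):
# describe_table_storage(spark, "learn_spark_db.managed_us_delay_flights_tbl")
```

For the table in the question, "Type" should read MANAGED and "Location" should point under the database folder rather than at the S3 path of the CSV.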
    

Answer 2

Score: 0

Q1. The code below drops the Spark managed table. Does that mean it will delete my original data in S3? What does it mean that Spark deletes the data and the metadata?

Ans. It will not delete your original S3 data. Because you are creating a managed table, the data is stored in DBFS under the /user/hive/warehouse/learn_spark_db.db/ folder. After the DROP statement runs, the data is deleted from the /user/hive/warehouse/learn_spark_db.db/ directory, not from S3.
If you provide a location when creating the table, it is treated as an unmanaged (external) table, and only the metadata is deleted when the table is dropped.
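
As a rough sketch of the unmanaged case, the helpers below register a table over CSV files that already exist at a given location and then drop it. The function names and the idea of passing the location as a parameter are assumptions for illustration; `CREATE TABLE ... USING CSV ... LOCATION` and `DROP TABLE IF EXISTS` are standard Spark SQL statements.

```python
def create_external_csv_table(spark, table_name, location):
    """Register an unmanaged (external) table over existing CSV files.

    `location` is assumed to be a directory that holds the CSV files.
    Because LOCATION is given, Spark records only metadata for the table.
    """
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        "USING CSV OPTIONS (header 'true', inferSchema 'true') "
        f"LOCATION '{location}'"
    )


def drop_table(spark, table_name):
    """Drop the table; for an external table only the metadata goes away."""
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
```

Dropping a table created this way removes its catalog entry while the files at `location` stay in place, which is the opposite of what happens to the warehouse copy of a managed table.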

Q2. I read here that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3. Does that mean it will convert the CSV to Delta format, or will it duplicate the data and write it in Delta format somewhere else?

Ans: It will not change the original data in S3. Instead, it writes a copy of the same data to DBFS under the /user/hive/warehouse/learn_spark_db.db/ location, in Delta format if you don't specify a format.
You can see the new data files using the Databricks utility:

dbutils.fs.ls("/user/hive/warehouse/learn_spark_db.db/")
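
If you do want to control the format, the DataFrameWriter lets you set it before calling saveAsTable. The following is a minimal sketch; the helper name `save_as_managed_table` and its parameters are hypothetical, while `mode`, `format` and `saveAsTable` are existing DataFrameWriter methods.

```python
def save_as_managed_table(df, table_name, fmt=None, mode="errorifexists"):
    """Write `df` as a managed table, optionally forcing a storage format.

    With fmt=None the session default applies (Delta on Databricks).
    `mode` takes the usual save modes: "errorifexists", "append",
    "overwrite" or "ignore".
    """
    writer = df.write.mode(mode)
    if fmt is not None:
        writer = writer.format(fmt)
    # No path or LOCATION is given, so the table stays managed and its
    # files are written under the database's warehouse folder.
    writer.saveAsTable(table_name)


# Example call in a notebook (not executed here):
# save_as_managed_table(df, "learn_spark_db.flights_parquet", fmt="parquet")
```

Either way the source CSV in S3 is only read; the chosen format applies to the new copy that backs the table.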

Q3. If I create a Spark managed table, does it use the same underlying storage location or a new one? Please explain in detail.

Ans: Whenever you create a Databricks resource (workspace), an underlying storage account is also created for storing data. It is typically exposed as the Databricks File System (DBFS).

DBFS (Databricks File System) is a distributed file system used by Databricks clusters. DBFS is an abstraction layer over cloud storage (e.g. S3 or Azure Blob Store), allowing external storage buckets to be mounted as paths in the DBFS namespace.

You can browse it in the UI or with the Databricks utility:

dbutils.fs.ls("/")

Now to your question: whenever anyone creates a managed table, both its metadata and its data are stored in the underlying Databricks-managed storage account, not in the location of the source files.

You can see this using:

dbutils.fs.ls("/user/hive/warehouse/")

huangapple
  • Posted on July 10, 2023, 11:00:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76650441.html