Error loading data from S3 bucket to Databricks External Table

Question

Using an example I found online, the code below throws an error because it cannot read from the S3 bucket. The problem is that I have to pass in the AWS credentials, which are stored in the variable S3_dir along with the bucket path. I am unable to get this to work.

    %sql
    DROP TABLE IF EXISTS mydb.bigtable;
    CREATE TABLE mydb.bigtable
    (
      id BIGINT,
      string1 STRING,
      numbers BIGINT,
    ) USING delta
    LOCATION '/user/hive/warehouse/mydb'
    SELECT * from delta.f"{S3_dir}";

Answer 1

Score: 1

In order to use any data source from storage (such as AWS S3 or Azure ADLS), you need to mount the S3 bucket in Databricks. Here is the step-by-step procedure:

Step 1: Create AWS Access Key and Secret Key for Databricks

  • Log in to your AWS account
  • Select Security credentials under your profile name [top right corner]
  • Look for Access keys (scroll down) -> create and download the generated CSV file

Note: This file contains your access key and secret key.
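
If you prefer to script this step rather than click through the console, a minimal sketch with boto3 might look like the following. The IAM user name "databricks-user" is a hypothetical placeholder, and your environment is assumed to already have credentials with permission to create access keys:

    # Minimal sketch: create an access key pair for an existing IAM user via boto3.
    # "databricks-user" is a hypothetical user name - replace it with your own IAM user.
    import boto3

    iam = boto3.client("iam")
    response = iam.create_access_key(UserName="databricks-user")

    access_key = response["AccessKey"]["AccessKeyId"]
    secret_key = response["AccessKey"]["SecretAccessKey"]

    # The secret key cannot be retrieved again later, so store it securely now.
    print(access_key, secret_key)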

Step 2: Log in to Databricks and run the following code

Create a notebook called “mount_s3_dbfs”

    import urllib.parse

    # 1. Add the necessary information
    ACCESS_KEY = "<access_key>"
    SECRET_KEY = "<secret_key>"
    S3_BUCKET = "<s3bucket_name>"

    # 2. URL-encode your secret key
    ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")

    # 3. Define the URL
    URL = "s3a://{}:{}@{}".format(ACCESS_KEY, ENCODED_SECRET_KEY, S3_BUCKET)

    # 4. Mount your S3 bucket with DBFS
    dbutils.fs.mount(URL, f"/mnt/{S3_BUCKET}")

    # 5. Print the mounted folder contents
    dbutils.fs.ls(f"/mnt/{S3_BUCKET}/")
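
One practical note: if you re-run the notebook, dbutils.fs.mount can fail because the mount point already exists. A small guard such as the following sketch (reusing the URL and S3_BUCKET variables from above) handles that case:

    # Minimal sketch: unmount first if the mount point already exists, then mount.
    mount_point = f"/mnt/{S3_BUCKET}"

    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.unmount(mount_point)

    dbutils.fs.mount(URL, mount_point)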

Note: Providing your credentials directly is not recommended; try uploading them to DBFS and reading them with Spark's CSV reader.
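
For example, a rough sketch of that approach, run in a Databricks notebook (where spark is predefined), assuming you uploaded the CSV from Step 1 to a hypothetical DBFS path /FileStore/tables/aws_keys.csv and that it keeps the AWS console's default column names:

    # Minimal sketch: load the credentials CSV from DBFS instead of hard-coding keys.
    # The path and the column names ("Access key ID", "Secret access key") are
    # assumptions - adjust them to match your uploaded file.
    import urllib.parse

    keys_df = spark.read.csv("/FileStore/tables/aws_keys.csv", header=True)
    row = keys_df.first()

    ACCESS_KEY = row["Access key ID"]
    SECRET_KEY = row["Secret access key"]
    ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")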

Step 3: Use Spark to read and write files from AWS

    %sql
    DROP TABLE IF EXISTS mydb.bigtable;
    CREATE TABLE mydb.bigtable
    (
      id BIGINT,
      string1 STRING,
      numbers BIGINT
    ) USING delta
    LOCATION '/user/hive/warehouse/mydb'
    SELECT * FROM delta.`path/to/s3file`;

Note: Use backticks (`) around your S3 file location.
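
To tie this back to the original question: once the bucket is mounted, the table can point at the mount path instead of an s3a:// URI with embedded credentials. Below is a rough Python equivalent of the SQL above, assuming the Delta files sit in a hypothetical folder named "bigtable" inside the mounted bucket (note the AS before SELECT, since the CTAS form takes its schema from the query):

    # Minimal sketch: create the external table from the mounted S3 path.
    # S3_BUCKET matches the mount from Step 2; "bigtable" is a hypothetical
    # folder name inside the bucket - replace it with your actual Delta path.
    source_path = f"/mnt/{S3_BUCKET}/bigtable"

    spark.sql("DROP TABLE IF EXISTS mydb.bigtable")
    spark.sql(f"""
        CREATE TABLE mydb.bigtable
        USING delta
        LOCATION '/user/hive/warehouse/mydb'
        AS SELECT * FROM delta.`{source_path}`
    """)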
