错误从S3存储桶加载数据到Databricks外部表

huangapple go评论65阅读模式
英文:

Error loading data from S3 bucket to Databricks External Table

问题

以下是代码的翻译部分:

使用我在网上找到的示例,下面的代码引发错误,因为它无法从S3存储桶中读取。问题在于我必须传递AWS凭据,这些凭据在变量S3_dir中找到,其中包含了存储桶的路径。我无法使这个工作。

%sql
DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
 id BIGINT,
  string1 STRING,
  numbers BIGINT,
) 使用 delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * from delta.f"{S3_dir}";
英文:

Using an example I found online, below code throws error as it cannot read from S3 bucket. Problem is I have to pass in the AWS credentials which is found in variable S3_dir with the bucket path. I am unable to get this to work.

%sql
DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
 id BIGINT,
  string1 STRING,
  numbers BIGINT,
) USING delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * from delta.f"{S3_dir}";

答案1

得分: 1

为了使用存储中的任何数据源(例如AWS S3、Azure ADLS),您需要在Databricks上挂载S3存储桶。以下是逐步的过程:

步骤1:为Databricks创建AWS访问密钥和秘密密钥

  • 登录到您的AWS帐户
  • 选择安全凭据 -> 在您的个人资料名称下 [右上角]
  • 查找访问密钥(向下滚动) -> 创建并下载生成CSV文件

注意:此文件包含您的访问密钥和秘密密钥

步骤2:登录到Databricks并运行以下代码

创建一个名为“mount_s3_dbfs”的笔记本

import urllib

# 1. 添加必要信息
ACCESS_KEY = <access_key>
SECRET_KEY = <secret_key>
S3_BUCKET = <s3bucket_name>

# 2. 编码您的秘密密钥
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")

# 3. 定义URL
URL = "s3a://{}:{}@{}".format(ACCESS_KEY, ENCODED_SECRET_KEY, S3_BUCKET)

# 4. 使用DBFS挂载您的S3存储桶
dbutils.fs.mount(URL, f"/mnt/{S3_BUCKET}")

# 5. 打印挂载的文件夹内容
dbutils.fs.ls(URL, f"/mnt/{S3_BUCKET}/")

注意:不建议直接提供您的凭据,尝试将其上传到您的DBFS并使用Spark的读取CSV方法进行读取。

步骤3:使用Spark从AWS读取和写入文件

%sql

DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
 id BIGINT,
  string1 STRING,
  numbers BIGINT
) USING delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * FROM delta.`path\to\s3file`;

注意:使用'`'(反引号)来定义您的S3文件位置。

英文:

In-order to use any data source from a storage(such as AWS S3, Azure ADLS), you need to mount the s3 bucket with Databricks. Here is the step by step procedure:

Step 1: Create AWS Access Key and Secret Key for Databricks

  • Log into your AWS account
  • Select security credentials -> under your profile name [top right
    corner]
  • Look for Access keys (Scroll down) -> Create and download the
    generate CSV file

Note: This file contains your access and secret key

Step 2: Login to Databricks and run the following code

Create a notebook called “mount_s3_dbfs”

import urllib

# 1. Add necessary information
ACCESS_KEY = &lt;access_key&gt;
SECRET_KEY = &lt;secret_key&gt;
S3_BUCKET = &lt;s3bucket_name&gt;

# 2. Encode your secret key
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe=&quot;&quot;)

# 3. Defining URL
URL = &quot;s3a://{}:{}@{}&quot;.format(ACCESS_KEY, ENCODED_SECRET_KEY, S3_BUCKET)

# 4. Mounting your S3 Bucket with DBFS
dbutils.fs.mount(URL,f&quot;/mnt/{S3_BUCKET}&quot;)

# 5. Print mounted folder content
dbutils.fs.ls(URL,f&quot;/mnt/{S3_BUCKET}/&quot;)

Note: Providing your credential directly is not recommended, try uploading it into your DBFS and read it using spark read csv method.

Step 3: Use spark to read and write file from AWS

%sql

DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
 id BIGINT,
  string1 STRING,
  numbers BIGINT,
) USING delta
LOCATION &#39;/user/hive/warehouse/mydb&#39;
SELECT * FROM delta.`path\to\s3file`;

Note: Use '`' (backtick) to define your s3 file location.

huangapple
  • 本文由 发表于 2023年2月24日 08:46:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/75551692.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定