Error loading data from S3 bucket to Databricks External Table
Question
Using an example I found online, the code below throws an error because it cannot read from the S3 bucket. The problem is that I have to pass in the AWS credentials, which are found in the variable S3_dir along with the bucket path. I am unable to get this to work.
%sql
DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
id BIGINT,
string1 STRING,
numbers BIGINT,
) USING delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * from delta.f"{S3_dir}";
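For context, S3_dir is assumed here to be an s3a:// URI that carries the bucket path and, as described above, the AWS credentials; a hypothetical sketch of such a value (every name below is a placeholder, not a value from the post):

# Hypothetical sketch of what S3_dir is assumed to hold; the keys, bucket,
# and prefix are placeholders only.
ACCESS_KEY = "<aws_access_key_id>"
SECRET_KEY = "<aws_secret_access_key>"
S3_dir = f"s3a://{ACCESS_KEY}:{SECRET_KEY}@<bucket-name>/path/to/delta-table"

Note that delta.f"{S3_dir}" mixes Python f-string syntax into a %sql cell, where it is not interpolated; the variable can only be substituted from a Python cell (for example via spark.sql with an f-string).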
Answer 1
Score: 1
In order to use a data source that lives in external storage (such as AWS S3 or Azure ADLS), you need to mount the S3 bucket in Databricks. Here is the step-by-step procedure:
Step 1: Create an AWS access key and secret key for Databricks
- Log in to your AWS account
- Select Security credentials under your profile name [top right corner]
- Look for Access keys (scroll down) -> create and download the generated CSV file
Note: This file contains your access key and secret key.
Step 2: Log in to Databricks and run the following code
Create a notebook called "mount_s3_dbfs":
import urllib.parse

# 1. Add the necessary information
ACCESS_KEY = "<access_key>"
SECRET_KEY = "<secret_key>"
S3_BUCKET = "<s3bucket_name>"

# 2. URL-encode your secret key
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")

# 3. Define the source URL
URL = "s3a://{}:{}@{}".format(ACCESS_KEY, ENCODED_SECRET_KEY, S3_BUCKET)

# 4. Mount your S3 bucket on DBFS
dbutils.fs.mount(URL, f"/mnt/{S3_BUCKET}")

# 5. Display the mounted folder's content
display(dbutils.fs.ls(f"/mnt/{S3_BUCKET}/"))
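Re-running the notebook typically fails at dbutils.fs.mount if the mount point already exists; a small guard such as the one sketched below avoids that:

# Sketch: unmount first if this mount point is already in use, then mount.
mount_point = f"/mnt/{S3_BUCKET}"
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)
dbutils.fs.mount(URL, mount_point)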
Note: Providing your credentials directly in the notebook is not recommended; try uploading the downloaded CSV to DBFS and reading it with Spark's CSV reader instead.
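A rough sketch of that suggestion, assuming the key file from Step 1 has been uploaded to DBFS (the DBFS path and the column headers written by the AWS console are assumptions):

# Sketch: read the downloaded access-key CSV from DBFS instead of hard-coding
# the credentials. The path and column names are assumptions.
keys_df = spark.read.csv("dbfs:/FileStore/tables/accessKeys.csv", header=True)
first_row = keys_df.first()
ACCESS_KEY = first_row["Access key ID"]       # assumed column header
SECRET_KEY = first_row["Secret access key"]   # assumed column header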
Step 3: Use Spark to read and write files from AWS
%sql
DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
id BIGINT,
string1 STRING,
numbers BIGINT
) USING delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * FROM delta.`path/to/s3file`;
Note: Use backticks (`) around the S3 file location.
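As written, the statement above combines a column list, a LOCATION, and a bare SELECT, which Spark SQL does not accept in a single CREATE TABLE; one way it could be run from a Python cell, assuming the bucket was mounted under /mnt/<s3bucket_name> in Step 2 and using CREATE TABLE ... AS SELECT, is sketched below (the source path is a placeholder):

# Sketch: create the Delta table from the mounted S3 data via CTAS.
# The source path is a placeholder for wherever the Delta files sit under
# the mount point created in Step 2; the LOCATION is kept from the SQL above.
source_path = "/mnt/<s3bucket_name>/path/to/delta-table"

spark.sql("DROP TABLE IF EXISTS mydb.bigtable")
spark.sql(f"""
    CREATE TABLE mydb.bigtable
    USING delta
    LOCATION '/user/hive/warehouse/mydb'
    AS SELECT * FROM delta.`{source_path}`
""")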
Comments