2023-05-21 05:46:00

How to read a csv file from google storage using duckdb

Question

I'm using DuckDB version 0.8.0.

I have a CSV file in Google Cloud Storage at gs://some_bucket/some_file.csv and want to load it with DuckDB.

In pandas I can do pd.read_csv("gs://some_bucket/some_file.csv"), but this doesn't seem to work in DuckDB. There is some documentation here: https://duckdb.org/docs/guides/import/s3_import.html, but I find it confusing because it's mainly aimed at S3 usage.

I guess that I have to run:

duckdb.sql("INSTALL httpfs;")
duckdb.sql("LOAD httpfs;")

From the documentation, I'm not sure what values these parameters should take:

SET s3_access_key_id='key_id';
SET s3_secret_access_key='access_key';

How do I load a CSV from Google Cloud Storage in DuckDB?

## Edit - approaches which haven't worked

I've created HMAC keys and downloaded them following the guide here: https://cloud.google.com/storage/docs/authentication/managing-hmackeys#gsutil_1

import duckdb
import os

duckdb.sql("LOAD httpfs;")

hmac_access = os.getenv('GOOGLE_HMAC_ACCESS_ID')
hmac_secret = os.getenv('GOOGLE_HMAC_SECRET')

duckdb.sql(f"SET s3_access_key_id='{hmac_access}';")
duckdb.sql(f"SET s3_secret_access_key='{hmac_secret}';")


################################################################################
# approach 1

# Doesn't work - fails with:
# 
# Traceback (most recent call last):
#   File "duck_test.py", line 18, in <module>
#     duckdb.sql("SELECT * FROM '{gcp_path_1}'").show()
# duckdb.CatalogException: Catalog Error: Table with name {gcp_path_1} does not exist!
# Did you mean "pg_am"?

# duckdb.sql(f"SELECT * FROM '{gcp_path_1}'").show()


################################################################################
# approach 2

# Fails with:

# Traceback (most recent call last):
#   File "duck_test.py", line 32, in <module>
#     duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")
# duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://some_bucket.s3.amazonaws.com/some_file.csv' (HTTP 400)

duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")
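
A side note on approach 1: the Catalog Error quoted there names the table as the literal text {gcp_path_1}, and the traceback line shows the query string without an f prefix. That suggests the failing run passed a plain string, so the placeholder was never interpolated. A minimal sketch of the difference, assuming a hypothetical value for gcp_path_1 (the question never shows how it was defined):

```python
# Hypothetical value; the question does not show the real definition.
gcp_path_1 = "s3://some_bucket/some_file.csv"

# Plain string: the braces are kept literally, so DuckDB would look for a
# table (or file) named {gcp_path_1}.
plain = "SELECT * FROM '{gcp_path_1}'"

# f-string: the variable is interpolated into the query text.
formatted = f"SELECT * FROM '{gcp_path_1}'"

print(plain)
print(formatted)
```

Printing the query string before handing it to duckdb.sql is a cheap way to catch this kind of slip.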


## Edit (working)

In the code above I forgot to set the S3 endpoint:
```python
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")
```

After setting this, both approaches read from storage.
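
The HTTP 400 in approach 2 was raised for a URL on s3.amazonaws.com, which is consistent with s3:// paths being resolved against the AWS endpoint until s3_endpoint is changed. The sketch below is only an illustration of that mapping, not DuckDB's actual code: s3_style_url is a hypothetical helper, and the bucket-as-subdomain URL layout is inferred from the error message in the question.

```python
def s3_style_url(path: str, endpoint: str = "s3.amazonaws.com") -> str:
    """Map s3://bucket/key to https://bucket.<endpoint>/key (illustration only)."""
    scheme, sep, rest = path.partition("://")
    if not sep or scheme != "s3":
        raise ValueError(f"expected an s3:// path, got {path!r}")
    bucket, _, key = rest.partition("/")
    return f"https://{bucket}.{endpoint}/{key}"


# With the default endpoint the request targets AWS, as in the traceback.
print(s3_style_url("s3://some_bucket/some_file.csv"))

# With the endpoint from the fix, the same path targets Google Cloud Storage.
print(s3_style_url("s3://some_bucket/some_file.csv", "storage.googleapis.com"))
```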

Answer 1

Score: 3

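To access your file, use s3 as the URL prefix.

If your GCS bucket is public:

duckdb.sql('INSTALL httpfs')
duckdb.sql('LOAD httpfs')
# Point s3:// URLs at Google Cloud Storage; otherwise requests go to
# s3.amazonaws.com, as in the HTTP 400 shown in the question.
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")

duckdb.sql("SELECT * FROM 's3://some_bucket/some_file.csv'").show()

If your GCS bucket is not public, first create an HMAC key: go to the Google Cloud console and follow the instructions. Then run the following commands:

duckdb.sql('INSTALL httpfs')
duckdb.sql('LOAD httpfs')
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")

# The key ID comes from the previous step (the HMAC key created in the Google Cloud console).
duckdb.sql("SET s3_access_key_id='<your-key-id>'")

# The secret comes from the same step.
duckdb.sql("SET s3_secret_access_key='<your-key>'")

duckdb.sql("SELECT * FROM 's3://some_bucket/some_file.csv'").show()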
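
The private-bucket steps above can be wrapped in a small helper. This is a sketch under assumptions, not an official recipe: gs_to_s3, hmac_settings, and read_gcs_csv are hypothetical names, the environment variables are the ones used in the question, and the credentials are interpolated directly into SQL, so they must not contain single quotes.

```python
import os

GCS_ENDPOINT = "storage.googleapis.com"


def gs_to_s3(url: str) -> str:
    """Rewrite gs://bucket/key as s3://bucket/key; leave other URLs unchanged."""
    if url.startswith("gs://"):
        return "s3://" + url[len("gs://"):]
    return url


def hmac_settings(key_id, secret):
    """Return the SET statements from the answer, refusing missing credentials."""
    if not key_id or not secret:
        # os.getenv returns None for unset variables; fail early instead of
        # sending the literal string 'None' as a credential.
        raise ValueError("HMAC key id and secret must both be set")
    return [
        f"SET s3_endpoint='{GCS_ENDPOINT}'",
        f"SET s3_access_key_id='{key_id}'",
        f"SET s3_secret_access_key='{secret}'",
    ]


def read_gcs_csv(url: str):
    """Load a CSV from a private GCS bucket and return a DuckDB relation."""
    import duckdb  # imported here so the helpers above work without DuckDB installed

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    for statement in hmac_settings(
        os.getenv("GOOGLE_HMAC_ACCESS_ID"), os.getenv("GOOGLE_HMAC_SECRET")
    ):
        con.execute(statement)
    return con.sql(f"SELECT * FROM read_csv('{gs_to_s3(url)}', AUTO_DETECT=TRUE)")
```

With both environment variables exported, read_gcs_csv("gs://some_bucket/some_file.csv").show() would be the equivalent of the final query above.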
