如何使用DuckDB从Google存储中读取CSV文件

huangapple go评论64阅读模式
英文:

How to read a csv file from google storage using duckdb

问题

I'm using duckdb version 0.8.0

I have a CSV file located in google storage gs://some_bucket/some_file.csv and want to load this using duckdb.

In pandas I can do pd.read_csv("gs://some_bucket/some_file.csv"), but this doesn't seem to work in duckdb. I see that there's some documentation here: https://duckdb.org/docs/guides/import/s3_import.html, but I find that confusing as it's mainly aimed at s3 usage.

I guess that I have to run:

duckdb.sql("INSTALL httpfs;")
duckdb.sql("LOAD httpfs;")

From the documentation, I'm not sure what the parameters for :

SET s3_access_key_id='key_id';
SET s3_secret_access_key='access_key';

Would be.

How do I load a csv from google storage in duckdb?

Edit - approaches which haven't worked

I've added hmac keys and downloaded them following guide here: https://cloud.google.com/storage/docs/authentication/managing-hmackeys#gsutil_1

import duckdb
import os

duckdb.sql("LOAD httpfs;")

hmac_access = os.getenv('GOOGLE_HMAC_ACCESS_ID')
hmac_secret = os.getenv('GOOGLE_HMAC_SECRET')

duckdb.sql(f"SET s3_access_key_id='{hmac_access}';")
duckdb.sql(f"SET s3_secret_access_key='{hmac_secret}';")


################################################################################
# approach 1

# Doesn't work - fails with:
# 
# Traceback (most recent call last):
#   File "duck_test.py", line 18, in <module>
#     duckdb.sql("SELECT * FROM '{gcp_path_1}'").show()
# duckdb.CatalogException: Catalog Error: Table with name {gcp_path_1} does not exist!
# Did you mean "pg_am"?

# duckdb.sql(f"SELECT * FROM '{gcp_path_1}'").show()


################################################################################
# approach 2

# Fails with:

# Traceback (most recent call last):
#   File "duck_test.py", line 32, in <module>
#     duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")
# duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://some_bucket.s3.amazonaws.com/some_file.csv' (HTTP 400)

duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")


## Edit (working) 

In the code above I forgot to set 
```python
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")

After setting this both approaches read from storage.

英文:

I'm using duckdb version 0.8.0

I have a CSV file located in google storage gs://some_bucket/some_file.csv and want to load this using duckdb.

In pandas I can do pd.read_csv(&quot;gs://some_bucket/some_file.csv&quot;), but this doesn't seem to work in duckdb. I see that there's some documentation here: https://duckdb.org/docs/guides/import/s3_import.html, but I find that confusing as it's mainly aimed at s3 usage.

I guess that I have to run:

duckdb.sql(&quot;INSTALL httpfs;&quot;)
duckdb.sql(&quot;LOAD httpfs;&quot;)

From the documentation, I'm not sure what the parameters for :

SET s3_access_key_id=&#39;key_id&#39;;
SET s3_secret_access_key=&#39;access_key&#39;;

Would be.

How do I load a csv from google storage in duckdb?

Edit - approaches which haven't worked

I've added hmac keys and downloaded them following guide here: https://cloud.google.com/storage/docs/authentication/managing-hmackeys#gsutil_1

import duckdb
import os

duckdb.sql(&quot;LOAD httpfs;&quot;)

hmac_access = os.getenv(&#39;GOOGLE_HMAC_ACCESS_ID&#39;)
hmac_secret = os.getenv(&#39;GOOGLE_HMAC_SECRET&#39;)

duckdb.sql(f&quot;SET s3_access_key_id=&#39;{hmac_access}&#39;;&quot;)
duckdb.sql(f&quot;SET s3_secret_access_key=&#39;{hmac_secret}&#39;;&quot;)


################################################################################
# approach 1

# Doesn&#39;t work - fails with:
# 
# Traceback (most recent call last):
#   File &quot;duck_test.py&quot;, line 18, in &lt;module&gt;
#     duckdb.sql(&quot;SELECT * FROM &#39;{gcp_path_1}&#39;&quot;).show()
# duckdb.CatalogException: Catalog Error: Table with name {gcp_path_1} does not exist!
# Did you mean &quot;pg_am&quot;?

# duckdb.sql(f&quot;SELECT * FROM &#39;{gcp_path_1}&#39;&quot;).show()


################################################################################
# approach 2

# Fails with:

# Traceback (most recent call last):
#   File &quot;duck_test.py&quot;, line 32, in &lt;module&gt;
#     duckdb.sql(f&quot;SELECT * from read_csv(&#39;{gcp_path_1}&#39;, AUTO_DETECT=TRUE);&quot;)
# duckdb.HTTPException: HTTP Error: HTTP GET error on &#39;https://some_bucket.s3.amazonaws.com/some_file.csv&#39; (HTTP 400)

duckdb.sql(f&quot;SELECT * from read_csv(&#39;{gcp_path_1}&#39;, AUTO_DETECT=TRUE);&quot;)

Edit (working)

In the code above I forgot to set

duckdb.sql(&quot;SET s3_endpoint=&#39;storage.googleapis.com&#39;&quot;)

After setting this both approaches read from storage.

答案1

得分: 3

为了访问您的文件,您需要将s3用作URL前缀。

  1. 如果您的GCS存储桶对公众开放
duckdb.sql('INSTALL httpfs')
duckdb.sql('LOAD httpfs')

duckdb.sql("SELECT * FROM 's3://some_bucket/some_file.csv'").show()
  1. 如果您的GCS存储桶不是公开的。
    首先,您需要设置HMAC密钥。前往Google Cloud控制台并按照说明操作。然后,执行以下命令
duckdb.sql('INSTALL httpfs')
duckdb.sql('LOAD httpfs')
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")

# 您将从在Google控制台中配置设置的前一步骤中获取key_id。
duckdb.sql("SET s3_access_key_id='<your-key-id>'") 

# 您将从在Google控制台中配置设置的前一步骤中获取secret_access_key。
duckdb.sql("SET s3_secret_access_key='<your-key>'") 

duckdb.sql("SELECT * FROM 's3://some_bucket/some_file.csv'").show()
英文:

To access your files, you need to use s3 as the URL prefix.

  1. If your GCS bucket is open to the public
duckdb.sql(&#39;INSTALL httpfs&#39;)
duckdb.sql(&#39;LOAD httpfs&#39;)

duckdb.sql(&quot;SELECT * FROM &#39;s3://some_bucket/some_file.csv&#39;&quot;).show()

  1. If your GCS bucket is not public.
    First, you need to set an HMAC key. Go to
    google cloud console and follow the instructions. Then, execute the following commands
duckdb.sql(&#39;INSTALL httpfs&#39;)
duckdb.sql(&#39;LOAD httpfs&#39;)
duckdb.sql(&quot;SET s3_endpoint=&#39;storage.googleapis.com&#39;&quot;)

# You will obtain the key_id from the previous step of 
# configuring settings in the Google Console.
duckdb.sql(&quot;SET s3_access_key_id=&#39;&lt;your-key-id&gt;&#39;&quot;) 

# You will obtain the secret_access_key from the previous step of 
# configuring settings in the Google Console.
duckdb.sql(&quot;SET s3_secret_access_key=&#39;&lt;your-key&gt;&#39;&quot;) 

duckdb.sql(&quot;SELECT * FROM &#39;s3://some_bucket/some_file.csv&#39;&quot;).show()

huangapple
  • 本文由 发表于 2023年5月21日 05:46:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/76297471.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定