
Read gzip from Azure StorageStreamDownloader

Question



I would like to read a gzip that I have downloaded from an Azure blob storage:

    myStorageStreamDownloaderObject = (
        blob_service_client
        .get_container_client('myContainer')
        .download_blob('myBlob.json.gzip')
    )

(Note that the file contains JSON and has the extension .gzip, not .gz.)

Attempt 1:

    import gzip

    contents = myStorageStreamDownloaderObject.readall()
    result = gzip.decompress(contents)

This yields:

    contents = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
    result = b''

Attempt 2:

    from io import BytesIO
    import pandas as pd

    with BytesIO() as input_blob:
        myStorageStreamDownloaderObject.readinto(input_blob)
        input_blob.seek(0)
        df = pd.read_csv(input_blob, compression='gzip')

This yields:

    EmptyDataError: No columns to parse from file

It does work with Spark:

    df = (
        spark.read.option("compression", "gzip")
        .schema(json_schema)
        .json([f"wasbs://{container_name}@geoexportpreprodanwb.blob.core.windows.net/{blob_name}"])
    )

This nicely returns a dataframe with 275 rows. But I am looking for a solution without Spark.
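For reference, the non-Spark pipeline is just gzip-decompress then JSON-parse. A minimal stdlib-only sketch (the payload below is fabricated for illustration; the real bytes would come from `readall()`):

```python
import gzip
import json

# Fabricated stand-in for the bytes returned by readall();
# a real blob download would supply these directly.
blob_bytes = gzip.compress(json.dumps({"rows": [1, 2, 3]}).encode("utf-8"))

# gzip does not care whether the file is named .gz or .gzip --
# it only inspects the magic bytes at the start of the stream.
data = json.loads(gzip.decompress(blob_bytes))
print(data)  # {'rows': [1, 2, 3]}
```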

Answer 1

Score: 1


> Read a gzip that I have downloaded from an Azure blob storage

You can use the code below to read gzipped JSON files from Azure Blob Storage.

Code:

    from azure.storage.blob import BlobServiceClient
    import gzip
    import json
    import pandas as pd
    from pandas import DataFrame

    # Set the connection string, container name, and blob name
    connection_string = "<Your-connection-string>"
    container_name = "test"
    blob_name = "student1.json.gzip"

    # Create a BlobServiceClient object
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # Get a BlobClient object for the gzip file
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # Download the gzip file as a stream, decompress it, and parse the JSON
    stream_downloader = blob_client.download_blob()
    contents = stream_downloader.readall()
    result = gzip.decompress(contents)
    json_data = json.loads(result)
    print(json_data)

Output:

    {'Register': {'Name': {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'AI', '5': 'robotics'}, 'ID': {'1': '0001', '2': '0002', '3': '0003', '4': '0004', '5': '0005'}, 'Domain': {'1': 'Microsoft', '2': 'Amazon', '3': 'google', '4': 'sysca', '5': 'IRobot'}, 'Rank': {'1': '0002', '2': '0001', '3': '0003', '4': '0004', '5': '0005'}}}


To view the contents as a table, wrap the parsed JSON in a pandas DataFrame, e.g. print(DataFrame(json_data)).

Output:

            Register
    Domain  {'1': 'Microsoft', '2': 'Amazon', '3': 'google...
    ID      {'1': '0001', '2': '0002', '3': '0003', '4': '...
    Name    {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'A...
    Rank    {'1': '0002', '2': '0001', '3': '0003', '4': '...

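If a proper tabular view is wanted rather than one column of nested dicts, the inner `Register` mapping can be expanded instead; a small sketch using the same shape of data as the output above (truncated to two entries):

```python
import pandas as pd

# Same shape as the parsed JSON above, truncated for brevity.
json_data = {'Register': {'Name': {'1': 'azure', '2': 'aws'},
                          'ID': {'1': '0001', '2': '0002'}}}

# Passing the inner dict-of-dicts to DataFrame makes the outer keys
# columns and the inner keys the index: one row per record.
df = pd.DataFrame(json_data['Register'])
print(df)
```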

Answer 2

Score: 0


The problem had a very basic cause: I did not realize that myStorageStreamDownloaderObject behaves like a generator and can only be read once.

My full code was something like this:

    contents = myStorageStreamDownloaderObject.readall()
    print(contents)
    # This yields: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
    contents = myStorageStreamDownloaderObject.readall()
    result = gzip.decompress(contents)
    print(result)
    # This yields: b''

But because myStorageStreamDownloaderObject behaves like a generator, I should not have called readall() twice; the second call returns empty bytes. The following does work:

    contents = myStorageStreamDownloaderObject.readall()
    print(contents)
    result = gzip.decompress(contents)
    print(result)
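The one-shot behavior can be reproduced with any consumable stream; a sketch using io.BytesIO as a stand-in for the StorageStreamDownloader:

```python
import gzip
from io import BytesIO

# BytesIO stands in for the downloader here: reading it consumes it.
stream = BytesIO(gzip.compress(b'{"rows": 275}'))

first = stream.read()   # the full gzip payload
second = stream.read()  # the stream is already exhausted

print(second)                  # b''
print(gzip.decompress(first))  # b'{"rows": 275}'
```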

(My apologies for inaccurately summarizing my code in the question.)

Posted by huangapple on 2023-05-22 04:10:42.
Please keep the original link when republishing: https://go.coder-hub.com/76301740.html