Read gzip from Azure StorageStreamDownloader

Question

I would like to read a gzip file that I have downloaded from Azure Blob Storage:

myStorageStreamDownloaderObject = (
    blob_service_client
    .get_container_client('myContainer')
    .download_blob('myBlob.json.gzip')
)

(Note that the file contains JSON and has the extension .gzip, not .gz.)

Attempt 1:

import gzip
contents = myStorageStreamDownloaderObject.readall()
result = gzip.decompress(contents)

This yields:

contents = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
result   = b''

Attempt 2:

from io import BytesIO
import pandas as pd

with BytesIO() as input_blob:
    myStorageStreamDownloaderObject.readinto(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob, compression='gzip')

This yields:

EmptyDataError: No columns to parse from file

It does work with Spark:

df = (
    spark.read
    .option("compression", "gzip")
    .schema(json_schema)
    .json([f"wasbs://{container_name}@geoexportpreprodanwb.blob.core.windows.net/{blob_name}"])
)

This nicely returns a dataframe with 275 rows. But I am looking for a solution without Spark.

Answer 1

Score: 1

> Read a gzip that I have downloaded from an Azure blob storage

You can use the code below to read gzip-compressed JSON files from Azure Blob Storage.

Code:

from azure.storage.blob import BlobServiceClient
import gzip
import pandas as pd
from pandas import DataFrame
import json

# Set the connection string and container name
connection_string = "<Your-connection-string>"
container_name = "test"
blob_name = "student1.json.gzip"

# Create a BlobServiceClient object
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

# Get a BlobClient object for the gzip file
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

# Download the gzip file as a stream
stream_downloader = blob_client.download_blob()
contents = stream_downloader.readall()
result = gzip.decompress(contents)
json_data = json.loads(result)
print(json_data)

Output:

{'Register': {'Name': {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'AI', '5': 'robotics'}, 'ID': {'1': '0001', '2': '0002', '3': '0003', '4': '0004', '5': '0005'}, 'Domain': {'1': 'Microsoft', '2': 'Amazon', '3': 'google', '4': 'sysca', '5': 'IRobot'}, 'Rank': {'1': '0002', '2': '0001', '3': '0003', '4': '0004', '5': '0005'}}}
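
The decompress-and-parse step can also be wrapped in a small helper. The following is only a sketch: the name load_gzip_json is hypothetical, and the sample bytes are built locally as a stand-in for what stream_downloader.readall() would return. It checks the gzip magic bytes first, which shows that the file extension (.gzip or .gz) plays no role and turns an empty download into an explicit error.

```python
import gzip
import json


def load_gzip_json(data: bytes):
    """Decompress gzip bytes and parse the result as JSON (hypothetical helper)."""
    # gzip streams start with the magic bytes 0x1f 0x8b, whatever the
    # file is called (.gz, .gzip, ...).
    if data[:2] != b"\x1f\x8b":
        raise ValueError("data is empty or not gzip-compressed")
    return json.loads(gzip.decompress(data).decode("utf-8"))


# Local stand-in for stream_downloader.readall(); no Azure access is needed.
sample = gzip.compress(
    json.dumps({"Register": {"Name": {"1": "azure"}}}).encode("utf-8")
)
parsed = load_gzip_json(sample)
print(parsed["Register"]["Name"]["1"])
```

Raising on empty input is a design choice for this sketch; gzip.decompress itself returns empty bytes for empty input, which is easy to overlook.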

To view the data as a table, you can wrap it in a pandas DataFrame, for example with print(DataFrame(json_data)).

Output:

                                                 Register
Domain  {'1': 'Microsoft', '2': 'Amazon', '3': 'google...
ID      {'1': '0001', '2': '0002', '3': '0003', '4': '...
Name    {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'A...
Rank    {'1': '0002', '2': '0001', '3': '0003', '4': '...
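
In the table above each cell still holds a nested dict, because the records sit one level down under 'Register'. As a rough sketch, assuming the same shape as the output shown earlier (shortened here to two records, with variable names chosen for illustration), the nested object can be reshaped into one flat record per key using only the standard library:

```python
# Same shape as the 'Register' object printed above (shortened to two records).
register = {
    "Name": {"1": "azure", "2": "aws"},
    "ID": {"1": "0001", "2": "0002"},
    "Domain": {"1": "Microsoft", "2": "Amazon"},
    "Rank": {"1": "0002", "2": "0001"},
}

# Collect every record key (sorted as strings), then build one flat dict per record.
keys = sorted({key for column in register.values() for key in column})
records = [
    {"key": key, **{name: column.get(key) for name, column in register.items()}}
    for key in keys
]

for record in records:
    print(record)
```

With pandas, passing json_data['Register'] instead of json_data to DataFrame should give a comparable table with one row per record.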

Answer 2

Score: 0

The problem had a very basic cause: I did not realize that myStorageStreamDownloaderObject behaves like a generator, so its content can only be read once.

My full code was something like this:

contents = myStorageStreamDownloaderObject.readall()
print(contents)
# This yields: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...

contents = myStorageStreamDownloaderObject.readall()
result = gzip.decompress(contents)
print(result)
# This yields: b''

But because myStorageStreamDownloaderObject is like a generator, I should not have called readall() on it twice: the second call returned nothing. The following does work:

contents = myStorageStreamDownloaderObject.readall()
print(contents)
result = gzip.decompress(contents)
print(result)

(My apologies for inaccurately summarizing my code in the question.)
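
The read-once behaviour described in this answer can be reproduced without Azure. The following is a minimal sketch that uses io.BytesIO as a stand-in for the downloader (an assumption for illustration; it is not the Azure SDK object). Once the stream has been read to the end, a second read returns empty bytes, and gzip.decompress on empty bytes returns b'' instead of raising an error, which matches the empty result seen in the question.

```python
import gzip
import io
import json

# Stand-in for the downloaded blob: gzip-compressed JSON held in memory.
# io.BytesIO is only an illustration here, not the Azure SDK object.
payload = gzip.compress(json.dumps({"rows": 275}).encode("utf-8"))
stream = io.BytesIO(payload)

first = stream.read()    # consumes the whole stream
second = stream.read()   # nothing is left to read

first_result = gzip.decompress(first)
second_result = gzip.decompress(second)  # decompressing empty input

print(first[:2])                 # gzip magic bytes
print(second)                    # empty bytes
print(second_result)             # empty bytes again, no exception
print(json.loads(first_result))  # the original JSON content
```

Reading once and keeping the bytes in a variable, as in the working code in this answer, avoids the problem.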

huangapple
  • Published on May 22, 2023, 04:10:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76301740.html