
Read gzip from Azure StorageStreamDownloader

Question



I would like to read a gzip that I have downloaded from an Azure blob storage:

    myStorageStreamDownloaderObject = (
        blob_service_client
        .get_container_client('myContainer')
        .download_blob('myBlob.json.gzip')
    )

(Note that the file contains JSON and has the extension .gzip, not .gz.)

Attempt 1:

    import gzip

    contents = myStorageStreamDownloaderObject.readall()
    result = gzip.decompress(contents)

This yields:

    contents = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
    result = b''

Attempt 2:

    from io import BytesIO
    import pandas as pd

    with BytesIO() as input_blob:
        myStorageStreamDownloaderObject.readinto(input_blob)
        input_blob.seek(0)
        df = pd.read_csv(input_blob, compression='gzip')

This yields:

    EmptyDataError: No columns to parse from file

It does work with Spark:

    df = (
        spark.read.option("compression", "gzip")
        .schema(json_schema)
        .json([f"wasbs://{container_name}@geoexportpreprodanwb.blob.core.windows.net/{blob_name}"])
    )

This nicely returns a dataframe with 275 rows. But I am looking for a solution without Spark.
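For reference, the non-Spark pipeline is just gzip-decompress then JSON-parse. A minimal stdlib-only sketch (the payload below is fabricated for illustration; the real bytes would come from `readall()`):

```python
import gzip
import json

# Fabricated stand-in for the bytes returned by readall();
# a real blob download would supply these directly.
blob_bytes = gzip.compress(json.dumps({"rows": [1, 2, 3]}).encode("utf-8"))

# gzip does not care whether the file is named .gz or .gzip --
# it only inspects the magic bytes at the start of the stream.
data = json.loads(gzip.decompress(blob_bytes))
print(data)  # {'rows': [1, 2, 3]}
```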

Answer 1

Score: 1


> Read a gzip that I have downloaded from an Azure blob storage

You can use the code below to read gzipped JSON files from Azure Blob Storage.

Code:

    from azure.storage.blob import BlobServiceClient
    import gzip
    import json
    import pandas as pd
    from pandas import DataFrame

    # Set the connection string, container name, and blob name
    connection_string = "<Your-connection-string>"
    container_name = "test"
    blob_name = "student1.json.gzip"

    # Create a BlobServiceClient object
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # Get a BlobClient object for the gzip file
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # Download the gzip file as a stream, decompress it, and parse the JSON
    stream_downloader = blob_client.download_blob()
    contents = stream_downloader.readall()
    result = gzip.decompress(contents)
    json_data = json.loads(result)
    print(json_data)

Output:

    {'Register': {'Name': {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'AI', '5': 'robotics'}, 'ID': {'1': '0001', '2': '0002', '3': '0003', '4': '0004', '5': '0005'}, 'Domain': {'1': 'Microsoft', '2': 'Amazon', '3': 'google', '4': 'sysca', '5': 'IRobot'}, 'Rank': {'1': '0002', '2': '0001', '3': '0003', '4': '0004', '5': '0005'}}}


To view the contents as a table, wrap the parsed JSON in a pandas DataFrame, e.g. print(DataFrame(json_data)).

Output:

            Register
    Domain  {'1': 'Microsoft', '2': 'Amazon', '3': 'google...
    ID      {'1': '0001', '2': '0002', '3': '0003', '4': '...
    Name    {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'A...
    Rank    {'1': '0002', '2': '0001', '3': '0003', '4': '...

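If a proper tabular view is wanted rather than one column of nested dicts, the inner `Register` mapping can be expanded instead; a small sketch using the same shape of data as the output above (truncated to two entries):

```python
import pandas as pd

# Same shape as the parsed JSON above, truncated for brevity.
json_data = {'Register': {'Name': {'1': 'azure', '2': 'aws'},
                          'ID': {'1': '0001', '2': '0002'}}}

# Passing the inner dict-of-dicts to DataFrame makes the outer keys
# columns and the inner keys the index: one row per record.
df = pd.DataFrame(json_data['Register'])
print(df)
```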

Answer 2

Score: 0


The problem had a very basic cause: I did not realize that myStorageStreamDownloaderObject behaves like a generator and can only be read once.

My full code was something like this:

    contents = myStorageStreamDownloaderObject.readall()
    print(contents)
    # This yields: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
    contents = myStorageStreamDownloaderObject.readall()
    result = gzip.decompress(contents)
    print(result)
    # This yields: b''

But because myStorageStreamDownloaderObject behaves like a generator, I should not have called readall() twice; the second call returns empty bytes. The following does work:

    contents = myStorageStreamDownloaderObject.readall()
    print(contents)
    result = gzip.decompress(contents)
    print(result)
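The one-shot behavior can be reproduced with any consumable stream; a sketch using io.BytesIO as a stand-in for the StorageStreamDownloader:

```python
import gzip
from io import BytesIO

# BytesIO stands in for the downloader here: reading it consumes it.
stream = BytesIO(gzip.compress(b'{"rows": 275}'))

first = stream.read()   # the full gzip payload
second = stream.read()  # the stream is already exhausted

print(second)                  # b''
print(gzip.decompress(first))  # b'{"rows": 275}'
```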

(My apologies for inaccurately summarizing my code in the question.)

Posted by huangapple on 2023-05-22 04:10:42.
Please keep the original link when republishing: https://go.coder-hub.com/76301740.html