Read gzip from Azure StorageStreamDownloader
Question
I would like to read a gzip file that I have downloaded from Azure Blob Storage:
myStorageStreamDownloaderObject = (
    blob_service_client
    .get_container_client('myContainer')
    .download_blob('myBlob.json.gzip')
)
(Note that the file contains JSON and has the extension .gzip, not .gz.)
Attempt 1:
import gzip
contents = myStorageStreamDownloaderObject.readall()
result = gzip.decompress(contents)
This yields:
contents = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
result = b''
Attempt 2:
from io import BytesIO
import pandas as pd
with BytesIO() as input_blob:
    myStorageStreamDownloaderObject.readinto(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob, compression='gzip')
This yields:
EmptyDataError: No columns to parse from file
It does work with Spark:
df = (
    spark.read.option("compression", "gzip")
    .schema(json_schema)
    .json([f"wasbs://{container_name}@geoexportpreprodanwb.blob.core.windows.net/{blob_name}"])
)
This returns a DataFrame with 275 rows, but I am looking for a solution without Spark.
Answer 1
Score: 1
> Read a gzip that I have downloaded from an Azure blob storage
You can use the code below to read a gzip-compressed JSON file from Azure Blob Storage.
Code:
from azure.storage.blob import BlobServiceClient
import gzip
import pandas as pd
from pandas import DataFrame
import json
# Set the connection string and container name
connection_string = "<Your-connection-string>"
container_name = "test"
blob_name = "student1.json.gzip"
# Create a BlobServiceClient object
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
# Get a BlobClient object for the gzip file
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
# Download the blob and read its full contents into memory as bytes
stream_downloader = blob_client.download_blob()
contents = stream_downloader.readall()
# Decompress the gzip bytes and parse the resulting JSON
result = gzip.decompress(contents)
json_data = json.loads(result)
print(json_data)
Output:
{'Register': {'Name': {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'AI', '5': 'robotics'}, 'ID': {'1': '0001', '2': '0002', '3': '0003', '4': '0004', '5': '0005'}, 'Domain': {'1': 'Microsoft', '2': 'Amazon', '3': 'google', '4': 'sysca', '5': 'IRobot'}, 'Rank': {'1': '0002', '2': '0001', '3': '0003', '4': '0004', '5': '0005'}}}
To view the data as a table, you can load it into a pandas DataFrame, for example with print(DataFrame(json_data)).
Output:
Register
Domain {'1': 'Microsoft', '2': 'Amazon', '3': 'google...
ID {'1': '0001', '2': '0002', '3': '0003', '4': '...
Name {'1': 'azure', '2': 'aws', '3': 'gcp', '4': 'A...
Rank {'1': '0002', '2': '0001', '3': '0003', '4': '...
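The decompress-and-parse steps can be tried without an Azure account. Here is a minimal sketch, assuming the blob's bytes are a gzip-compressed JSON document. The sample payload and the helper name parse_gzip_json are made up for illustration, and gzip.compress stands in for the bytes that readall() would return. Note that gzip.decompress works on bytes, so the .gzip file extension plays no role here.

```python
import gzip
import json


def parse_gzip_json(raw: bytes):
    """Decompress gzip bytes and parse the result as JSON."""
    return json.loads(gzip.decompress(raw))


# Hypothetical payload; in the answer above these bytes come from readall().
sample = {"Register": {"Name": {"1": "azure", "2": "aws"}}}
raw = gzip.compress(json.dumps(sample).encode("utf-8"))

parsed = parse_gzip_json(raw)
print(parsed["Register"]["Name"]["1"])
```

If this works locally but the Azure version returns empty bytes, the download step is the more likely culprit than the decompression.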
Answer 2
Score: 0
The problem had a very basic cause: I did not realize that myStorageStreamDownloaderObject behaves like a generator, so its contents can only be read once.
My full code was something like this:
contents = myStorageStreamDownloaderObject.readall()
print(contents)
# This yields: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
contents = myStorageStreamDownloaderObject.readall()
result = gzip.decompress(contents)
print(result)
# This yields: b''
But because myStorageStreamDownloaderObject behaves like a generator, I should not have called readall() on it twice: the second call returns empty bytes. The following does work:
contents = myStorageStreamDownloaderObject.readall()
print(contents)
result = gzip.decompress(contents)
print(result)
(My apologies for inaccurately summarizing my code in the question.)
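The effect can be reproduced offline. This is a minimal sketch that uses io.BytesIO as a stand-in for the downloader. That is an assumption: it only mimics the read-once behaviour described above, not the Azure API, and the payload is made up.

```python
import gzip
import io

# Hypothetical payload, compressed in memory instead of downloaded.
payload = gzip.compress(b'{"hello": "world"}')
stream = io.BytesIO(payload)  # stand-in for the downloader, not the Azure class

first = stream.read()   # consumes the whole stream
second = stream.read()  # nothing is left to read, so this is b''

print(gzip.decompress(first))   # the original JSON bytes
print(gzip.decompress(second))  # empty input decompresses to empty output
```

Keeping the bytes from the first read in a variable and reusing them avoids the problem.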