June 30, 2023 00:14:53

How to read parquet files from Azure Blobs into Pandas DataFrame with columns projection on server-side?

Question

Following this question: https://stackoverflow.com/questions/63351478/how-to-read-parquet-files-from-azure-blobs-into-pandas-dataframe

Is it possible to perform a column projection on the Parquet file at server level before downloading it, to be more efficient? I.e. I would like to select only the desired columns before downloading the file.

At the moment I am connecting to Azure services only by a connection string, if that helps, and using the Python client library.

Answer 1

Score: 1

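> Is it possible to perform a column projection on the parquet file at server level before downloading it to be more efficient? I.e. I would like to filter only desired columns before downloading the file.

To read the desired columns from a Parquet file in Azure Blob Storage, you can use the Python code below:

Code: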

import pyarrow.parquet as pq
from azure.storage.blob import BlobServiceClient
import pandas as pd  # pandas must be installed for table.to_pandas()

# Set up the BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string('your connection string')

# Get a reference to the Parquet file in Azure Blob Storage
blob_container_client = blob_service_client.get_container_client('test1')
blob_client = blob_container_client.get_blob_client('samplepar.parquet')

# Define the list of columns to read from the Parquet file
columns = ['title', 'salary', 'birthdate', 'id1', 'id2']

# SQL-style projection query (not used below: the blob is downloaded in full)
columns_query = ", ".join([f"[{column}]" for column in columns])
query = f"SELECT {columns_query} FROM BlobStorage"

# Download the whole blob to a local file
local_path = "sample1.parquet"
with open(local_path, "wb") as file:
    blob_client.download_blob().readinto(file)

# Read the file and keep only the requested columns that actually exist in it
table = pq.read_table(local_path)
available_columns = [column for column in columns if column in table.column_names]
print(available_columns)
if available_columns:
    table = table.select(available_columns)
    df = table.to_pandas()
    print(df)
else:
    print("Error: None of the specified columns are present in the Parquet file.")

**Output:**

['title', 'salary', 'birthdate']
                      title     salary  birthdate
0          Internal Auditor   49756.53   3/8/1971
1             Accountant IV  150280.17  1/16/1968
2       Structural Engineer  144972.51   2/1/1960
3    Senior Cost Accountant   90263.05   4/8/1997


**Downloaded file:**

[![enter image description here][1]][1]

[![enter image description here][2]][2]

  [1]: https://i.stack.imgur.com/Mhny3.png
  [2]: https://i.stack.imgur.com/NsHW2.png

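Note that the code above first downloads the entire Parquet file from Azure Blob Storage, then selects the requested columns locally and converts the result to a Pandas DataFrame, so the projection is not performed on the server. Requested columns that are missing from the file (here `id1` and `id2`) are skipped, and the error message is printed only if none of the requested columns are present.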
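Because the code above transfers the whole blob, one possible way to move fewer bytes is to let pyarrow read the file through a seekable object that fetches byte ranges on demand. Parquet stores each column in separate chunks and keeps the schema in a footer, so a reader that asks for specific columns only needs those ranges. The sketch below is an illustration under assumptions and is not part of the original answer: `RangeReader` and `open_blob_for_ranged_reads` are hypothetical names, and the Azure part relies on `download_blob(offset=..., length=...)` and `get_blob_properties().size` from the Python SDK. This is still client-side projection over ranged reads, not a server-side query.

```python
import io
from typing import Callable


class RangeReader(io.RawIOBase):
    """Seekable, read-only file object backed by ranged fetches (hypothetical helper)."""

    def __init__(self, size: int, fetch: Callable[[int, int], bytes]):
        super().__init__()
        self._size = size
        self._fetch = fetch  # fetch(offset, length) -> bytes
        self._pos = 0
        self.bytes_fetched = 0  # total bytes actually transferred

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer) -> int:
        # Never request bytes past the end of the blob.
        length = min(len(buffer), self._size - self._pos)
        if length <= 0:
            return 0
        data = self._fetch(self._pos, length)
        n = len(data)
        buffer[:n] = data
        self._pos += n
        self.bytes_fetched += n
        return n


def open_blob_for_ranged_reads(blob_client) -> RangeReader:
    """Wrap an azure.storage.blob.BlobClient (assumed) in a RangeReader."""
    size = blob_client.get_blob_properties().size

    def fetch(offset: int, length: int) -> bytes:
        # Each call downloads only the requested byte range of the blob.
        return blob_client.download_blob(offset=offset, length=length).readall()

    return RangeReader(size, fetch)
```

With a real blob client, something like `pq.read_table(open_blob_for_ranged_reads(blob_client), columns=['title', 'salary'])` should then request only the footer and the selected column chunks. This has not been checked against a live storage account here, and every read becomes its own HTTP request, so some buffering may be worthwhile for files with many small column chunks.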
