
Is there a native S3 filesystem implementation for Apache Arrow Java?

Question


I'm working with Apache Arrow in Java and I want to know whether the Java library provides a native S3 filesystem implementation like the S3FileSystem available in the Python implementation of Arrow (pyarrow). I have gone through the Arrow Java IPC documentation and do not see any such implementation there.

In Python, using pyarrow, one can read a table from S3 like this:

import pyarrow.parquet as pq
from pyarrow import fs

# using a URI -> filesystem is inferred
pq.read_table("s3://my-bucket/data.parquet")

# using a path and an explicit filesystem
s3 = fs.S3FileSystem()  # pass region, credentials, etc. as needed
pq.read_table("my-bucket/data.parquet", filesystem=s3)

I want to know if similar functionalities are implemented for Google Cloud Storage File System (GcsFileSystem) and Hadoop Distributed File System (HDFS) as well.

If there is no native implementation available in Java, is there any upcoming or beta release planned to provide these functionalities in Java?

Answer 1

Score: 1

Arrow Java does not appear to provide a purely native FileSystem implementation for cloud providers.

Another option is to use the Arrow Java Dataset module, which offers a factory that supports reading data from external file systems through the JNI-backed FileSystemDatasetFactory class.
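
To run this, the project needs the Arrow Dataset module on the classpath. A minimal Maven sketch (the version shown is illustrative; use a current Arrow release, and note that an allocator backend such as arrow-memory-netty is typically also required at runtime):

<!-- Arrow Dataset module (version illustrative) -->
<dependency>
    <groupId>org.apache.arrow</groupId>
    <artifactId>arrow-dataset</artifactId>
    <version>12.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.arrow</groupId>
    <artifactId>arrow-memory-netty</artifactId>
    <version>12.0.0</version>
    <scope>runtime</scope>
</dependency>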

We are going to use these S3/GS URIs for the demo:

- aws s3 ls s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet
- gsutil ls gs://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet

Let's use this example, based on the Arrow Java Dataset cookbook, for testing:

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class DatasetModule {
    public static void main(String[] args) {
        String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // AWS S3
        // String uri = "hdfs://{hdfs_host}:{port}/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // HDFS
        // String uri = "gs://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // Google Cloud Storage
        ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
        try (
                BufferAllocator allocator = new RootAllocator();
                DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
                Dataset dataset = datasetFactory.finish();
                Scanner scanner = dataset.newScan(options);
                ArrowReader reader = scanner.scanBatches()
        ) {
            Schema schema = scanner.schema();
            System.out.println(schema);
            while (reader.loadNextBatch()) {
                System.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
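
ScanOptions can also carry a column projection, which helps when only a few columns of a wide Parquet file are needed. A minimal sketch, assuming the NYC taxi schema (the column names below are illustrative):

import java.util.Optional;
import org.apache.arrow.dataset.scanner.ScanOptions;

// Scan only the listed columns instead of the whole file
ScanOptions projected = new ScanOptions(/*batchSize*/ 32768,
        Optional.of(new String[] {"passenger_count", "trip_distance"}));

The underlying Arrow C++ S3 filesystem can also accept options such as the bucket region as URI query parameters (for example ?region=us-east-2), although the exact set of supported parameters depends on the Arrow version.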

Consider:

  • S3 support is included by default
  • GCS: for errors like "Got GCS URI but Arrow compiled without GCS support", the underlying Arrow C++ libraries need to be built with -DARROW_GCS=ON
  • HDFS is also supported
