Is there any difference in performance when connecting to S3 via the S3 API versus via the Hadoop Filesystem?


Question



I want to create a Java utility to read S3 bucket information.
We can connect to S3 either via the native S3 APIs or via the Hadoop filesystem layer.

Approach 1: Using S3 APIs

    // Build an AmazonS3 client from static credentials and a region
    AmazonS3 s3client = AmazonS3ClientBuilder
            .standard()
            .withCredentials(new AWSStaticCredentialsProvider(credentials))
            .withRegion(Regions.valueOf(region))
            .build();
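For illustration, reading bucket contents with this client might look like the sketch below. The bucket name `my-bucket` and key `data/report.csv` are placeholders, not values from the question; the calls (`listObjectsV2`, `getObject`) are standard AWS SDK for Java v1 methods.

```java
// List the objects in a bucket, then read one of them via the SDK client.
// "my-bucket" and "data/report.csv" are placeholder names.
ListObjectsV2Result listing = s3client.listObjectsV2("my-bucket");
for (S3ObjectSummary summary : listing.getObjectSummaries()) {
    System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
}

try (S3Object object = s3client.getObject("my-bucket", "data/report.csv");
     BufferedReader reader = new BufferedReader(
             new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}
```

Closing the `S3Object` (it is `Closeable`) matters: it releases the underlying HTTP connection back to the client's pool.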

Approach 2: Using Hadoop Filesystem:

    // Configure the S3A connector on a Hadoop Configuration
    Configuration configuration = new Configuration();
    configuration.set("fs.s3a.access.key", "XXXXXXXXXXX");
    configuration.set("fs.s3a.secret.key", "XXXXXXXXXXX");
    configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    configuration.set("fs.s3a.endpoint", "http://127.0.0.1:8080");
    UserGroupInformation.setConfiguration(configuration);
    // Obtain a FileSystem instance bound to the bucket
    FileSystem fileSystem = new Path("s3a://" + bucketName).getFileSystem(configuration);
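A hedged sketch of the equivalent read through the S3A `FileSystem`, assuming the configuration above; the key `data/report.csv` is again a placeholder. `open`, `listStatus`, and `FSDataInputStream` are standard Hadoop `FileSystem` API members.

```java
// Read an object through the Hadoop FileSystem API.
// "data/report.csv" is a placeholder key.
Path objectPath = new Path("s3a://" + bucketName + "/data/report.csv");
try (FSDataInputStream in = fileSystem.open(objectPath);
     BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}

// Directory-style listing of the bucket contents
for (FileStatus status : fileSystem.listStatus(new Path("s3a://" + bucketName))) {
    System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
}
```

Note that S3A emulates a directory tree on top of S3's flat key space, so listing and rename operations translate into multiple S3 requests behind the scenes.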

When should we use which approach? Which one is more efficient for reading data?

In my observation, the filesystem route is slower, but I have not found any documentation supporting the performance difference.

Answer 1

Score: 0


Performance shouldn't be the only factor. If you want higher performance, or at least better file-operation consistency guarantees, look into S3Guard.

But if you have to create a Java client that will only ever talk to S3, and never needs to integrate with the Hadoop ecosystem or use other Hadoop-compatible filesystems (HDFS, GCS, ADLS, etc.), then you should use the plain AWS SDK.

If you're trying to run a mocked S3 service (or MinIO) on 127.0.0.1, that is not a proper benchmark against the real S3 service.

huangapple
  • Published 2023-02-14 21:05:09
  • Please keep this link when reposting: https://go.coder-hub.com/75448265.html