Is there any difference in performance when we connect to S3 via the S3 API versus via the Hadoop Filesystem?
Question
I want to create a Java utility to read S3 bucket information.
We can connect to S3 either via the native S3 APIs or via the Hadoop filesystem abstraction.
Approach 1: Using the S3 APIs
// Create an AmazonS3 client (AWS SDK for Java v1).
// "credentials" and "region" are assumed to be initialized elsewhere.
AmazonS3 s3client = AmazonS3ClientBuilder
    .standard()
    .withCredentials(new AWSStaticCredentialsProvider(credentials))
    .withRegion(Regions.valueOf(region))
    .build();
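For illustration, a minimal sketch of reading bucket contents with this client; the bucket name is a placeholder, and ObjectListing/S3ObjectSummary come from com.amazonaws.services.s3.model:

// List the objects in a bucket using the client built above.
// "my-bucket" is a placeholder name, not from the original question.
ObjectListing listing = s3client.listObjects("my-bucket");
for (S3ObjectSummary summary : listing.getObjectSummaries()) {
    System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
}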
Approach 2: Using the Hadoop Filesystem:
configuration.set("fs.s3a.access.key","XXXXXXXXXXX");
configuration.set("fs.s3a.secret.key","XXXXXXXXXXX");
configuration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem");
configuration.set("fs.s3a.endpoint","http://127.0.0.1:8080");
UserGroupInformation.setConfiguration(configuration);
fileSystem = new Path("s3a://"+ bucketName).getFileSystem(configuration);
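A comparable sketch on the Hadoop side, using the fileSystem obtained above; FileStatus and Path come from org.apache.hadoop.fs:

// List the objects in the bucket through the Hadoop FileSystem abstraction.
FileStatus[] statuses = fileSystem.listStatus(new Path("s3a://" + bucketName + "/"));
for (FileStatus status : statuses) {
    System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
}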
Do we know when to use which approach? Which approach is more efficient for reading data?
In my observation, the filesystem route is slower, but I have not found any documentation supporting the performance difference.
Answer 1
Score: 0
Performance shouldn't be the only factor. If you want higher performance, or at least better file operation consistency guarantees, look into S3Guard.
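For reference, S3Guard (in Hadoop releases that still ship it) is switched on through additional s3a settings on the same configuration shown above; a minimal sketch, where the DynamoDB table name and region are placeholder values:

// Enable S3Guard with a DynamoDB metadata store.
// The table name and region below are placeholders, not from the answer.
configuration.set("fs.s3a.metadatastore.impl",
    "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");
configuration.set("fs.s3a.s3guard.ddb.table", "my-s3guard-table");
configuration.set("fs.s3a.s3guard.ddb.region", "us-east-1");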
But if you have to create a Java client that will only ever talk to S3, and never needs to integrate with the Hadoop ecosystem or use other Hadoop-compatible filesystems (HDFS, GCS, ADLS, etc.), then you should use the plain AWS SDK.
If you're trying to run some mocked S3 service (or MinIO) on 127.0.0.1, then that's not a proper benchmark against a real S3 service.
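Incidentally, when pointing s3a at a local MinIO or mock endpoint like the one in the question, two extra settings are usually needed; a hedged sketch:

// Local/mock endpoints typically require path-style access over plain HTTP.
configuration.set("fs.s3a.path.style.access", "true");
configuration.set("fs.s3a.connection.ssl.enabled", "false");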
Comments