Is there any difference in performance when connecting to S3 via the S3 API versus via the Hadoop Filesystem?


Question



I want to create a Java utility to read S3 bucket information.
We can connect to S3 either via the native S3 APIs or via the Hadoop filesystem layer.

Approach 1: Using S3 APIs

    // Build an AmazonS3 client from static credentials and a region
    AmazonS3 s3client = AmazonS3ClientBuilder
            .standard()
            .withCredentials(new AWSStaticCredentialsProvider(credentials))
            .withRegion(Regions.valueOf(region))
            .build();
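For illustration, reading bucket contents with this client might look like the sketch below. The bucket name `my-bucket` and key `data/report.csv` are placeholders, not values from the question; the calls (`listObjectsV2`, `getObject`) are standard AWS SDK for Java v1 methods.

```java
// List the objects in a bucket, then read one of them via the SDK client.
// "my-bucket" and "data/report.csv" are placeholder names.
ListObjectsV2Result listing = s3client.listObjectsV2("my-bucket");
for (S3ObjectSummary summary : listing.getObjectSummaries()) {
    System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
}

try (S3Object object = s3client.getObject("my-bucket", "data/report.csv");
     BufferedReader reader = new BufferedReader(
             new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}
```

Closing the `S3Object` (it is `Closeable`) matters: it releases the underlying HTTP connection back to the client's pool.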

Approach 2: Using Hadoop Filesystem:

    // Configure the S3A connector on a Hadoop Configuration
    Configuration configuration = new Configuration();
    configuration.set("fs.s3a.access.key", "XXXXXXXXXXX");
    configuration.set("fs.s3a.secret.key", "XXXXXXXXXXX");
    configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    configuration.set("fs.s3a.endpoint", "http://127.0.0.1:8080");
    UserGroupInformation.setConfiguration(configuration);
    // Obtain a FileSystem instance bound to the bucket
    FileSystem fileSystem = new Path("s3a://" + bucketName).getFileSystem(configuration);
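A hedged sketch of the equivalent read through the S3A `FileSystem`, assuming the configuration above; the key `data/report.csv` is again a placeholder. `open`, `listStatus`, and `FSDataInputStream` are standard Hadoop `FileSystem` API members.

```java
// Read an object through the Hadoop FileSystem API.
// "data/report.csv" is a placeholder key.
Path objectPath = new Path("s3a://" + bucketName + "/data/report.csv");
try (FSDataInputStream in = fileSystem.open(objectPath);
     BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}

// Directory-style listing of the bucket contents
for (FileStatus status : fileSystem.listStatus(new Path("s3a://" + bucketName))) {
    System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
}
```

Note that S3A emulates a directory tree on top of S3's flat key space, so listing and rename operations translate into multiple S3 requests behind the scenes.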

When should we use which approach? Which one is more efficient for reading data?

In my observation, the filesystem route is slower, but I have not found any documentation supporting the performance difference.

Answer 1

Score: 0


Performance shouldn't be the only factor. If you want higher performance, or at least better file-operation consistency guarantees, look into S3Guard.

But if you have to create a Java client that will only ever talk to S3, and never needs to integrate with the Hadoop ecosystem or use other Hadoop-compatible filesystems (HDFS, GCS, ADLS, etc.), then you should use the plain AWS SDK.

If you're trying to run a mocked S3 service (or MinIO) on 127.0.0.1, that is not a proper benchmark against the real S3 service.

huangapple
  • Published 2023-02-14 21:05:09
  • Please keep this link when reposting: https://go.coder-hub.com/75448265.html