How to read a Parquet file from S3 without Spark? (Java)

Question

Currently, I am using the Apache ParquetReader to read local Parquet files, which looks something like this:

  ParquetReader<GenericData.Record> reader = null;
  Path path = new Path("userdata1.parquet");
  try {
      reader = AvroParquetReader.<GenericData.Record>builder(path).withConf(new Configuration()).build();
      GenericData.Record record;
      while ((record = reader.read()) != null) {
          System.out.println(record);
      }
  } finally {
      if (reader != null) {
          reader.close();
      }
  }

However, I am trying to access a Parquet file on S3 without downloading it first. Is there a way to parse an InputStream directly with the Parquet reader?

Answer 1

Score: 6

Yes, the latest versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to access the S3 filesystem directly.

The HadoopInputFile path should be constructed as s3a://bucket-name/prefix/key, with the authentication credentials (access key and secret key) supplied through the following properties (see the sketch at the end of this answer):

  • fs.s3a.access.key
  • fs.s3a.secret.key

You will also need these dependent libraries:

  • hadoop-common JAR
  • aws-java-sdk-bundle JAR

Read more: Relevant configuration properties
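
Tying these pieces together, here is a minimal sketch of the setup described above. The class name, bucket, key, and credential values are placeholders; the later answers show essentially the same approach end to end:

  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetReader;
  import org.apache.parquet.hadoop.ParquetReader;
  import org.apache.parquet.hadoop.util.HadoopInputFile;
  import org.apache.parquet.io.InputFile;
  import java.io.IOException;

  public class S3ParquetReadSketch {
      public static void main(String[] args) throws IOException {
          // Supply credentials through the fs.s3a.* properties listed above
          // (placeholder values; other s3a credential providers also work).
          Configuration conf = new Configuration();
          conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
          conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

          // s3a://bucket-name/prefix/key, resolved by the s3a client from hadoop-aws.
          Path path = new Path("s3a://bucket-name/prefix/key");
          InputFile file = HadoopInputFile.fromPath(path, conf);

          // ParquetReader is Closeable, so try-with-resources closes it for us;
          // records are read from S3 without downloading the file to local disk first.
          try (ParquetReader<GenericRecord> reader =
                   AvroParquetReader.<GenericRecord>builder(file).build()) {
              GenericRecord record;
              while ((record = reader.read()) != null) {
                  System.out.println(record);
              }
          }
      }
  }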

Answer 2

Score: 2

I got it working with the following dependencies:

  compile 'org.slf4j:slf4j-api:1.7.5'
  compile 'org.slf4j:slf4j-log4j12:1.7.5'
  compile 'org.apache.parquet:parquet-avro:1.12.0'
  compile 'org.apache.avro:avro:1.10.2'
  compile 'com.google.guava:guava:11.0.2'
  compile 'org.apache.hadoop:hadoop-client:2.4.0'
  compile 'org.apache.hadoop:hadoop-aws:3.3.0'
  compile 'org.apache.hadoop:hadoop-common:3.3.0'
  compile 'com.amazonaws:aws-java-sdk-core:1.11.563'
  compile 'com.amazonaws:aws-java-sdk-s3:1.11.563'

Example:

  Path path = new Path("s3a://yours3path");
  Configuration conf = new Configuration();
  conf.set("fs.s3a.access.key", "KEY");
  conf.set("fs.s3a.secret.key", "SECRET");
  conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
  conf.setBoolean("fs.s3a.path.style.access", true);
  conf.setBoolean(org.apache.parquet.avro.AvroReadSupport.READ_INT96_AS_FIXED, true);
  InputFile file = HadoopInputFile.fromPath(path, conf);
  ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
  GenericRecord record;
  while ((record = reader.read()) != null) {
      System.out.println(record);
  }

Answer 3

Score: 1

Just adding on top of @franklinsijo's answer: for those new to S3, note that the access key and secret key are set on the Hadoop Configuration. Here is a snippet of code that might be useful:

  public static void main(String[] args) throws IOException {
      String PATH_SCHEMA = "s3a://xxx/xxxx/userdata1.parquet";
      Path path = new Path(PATH_SCHEMA);
      Configuration conf = new Configuration();
      conf.set("fs.s3a.access.key", "xxxxx");
      conf.set("fs.s3a.secret.key", "xxxxx");
      InputFile file = HadoopInputFile.fromPath(path, conf);
      ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
      GenericRecord record;
      while ((record = reader.read()) != null) {
          System.out.println(record.toString());
      }
  }

My imports:

  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.parquet.hadoop.ParquetReader;
  import org.apache.parquet.avro.AvroParquetReader;
  import org.apache.parquet.hadoop.util.HadoopInputFile;
  import org.apache.parquet.io.InputFile;
  import java.io.IOException;
  import org.apache.hadoop.fs.Path;