How to read a Parquet file from S3 without Spark in Java?
Question
Currently, I am using the Apache ParquetReader to read local Parquet files, like this:
    ParquetReader<GenericData.Record> reader = null;
    Path path = new Path("userdata1.parquet");
    try {
        reader = AvroParquetReader.<GenericData.Record>builder(path).withConf(new Configuration()).build();
        GenericData.Record record;
        while ((record = reader.read()) != null) {
            System.out.println(record);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
However, I want to read a Parquet file from S3 without downloading it first. Is there a way to parse an InputStream directly with the Parquet reader?
Answer 1
Score: 6
Yes, recent versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to access S3 directly.
The Path passed to HadoopInputFile should have the form s3a://bucket-name/prefix/key, and the access key and secret key should be set in the Hadoop configuration using these properties:
fs.s3a.access.key
fs.s3a.secret.key
Additionally, you will need these dependent libraries:
hadoop-common JAR
aws-java-sdk-bundle JAR
Read more: Relevant configuration properties
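As a rough illustration of the two pieces mentioned above, here is a minimal sketch that builds the s3a://bucket-name/prefix/key location and pairs the credentials with the two property names. It uses only the Java standard library. The class S3aSettings and its methods are hypothetical helpers, not part of Hadoop or Parquet, and reading the credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is an assumption made here to avoid hard-coding secrets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hadoop or Parquet.
final class S3aSettings {

    // Builds a location of the form s3a://bucket-name/prefix/key.
    static String location(String bucket, String key) {
        if (bucket == null || bucket.isEmpty()) {
            throw new IllegalArgumentException("bucket must not be empty");
        }
        if (key == null || key.isEmpty()) {
            throw new IllegalArgumentException("key must not be empty");
        }
        // Drop leading slashes so exactly one separator follows the bucket name.
        String cleanKey = key.replaceAll("^/+", "");
        return "s3a://" + bucket + "/" + cleanKey;
    }

    // Pairs the credentials with the two property names read by the s3a client.
    static Map<String, String> credentials(String accessKey, String secretKey) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.s3a.access.key", accessKey);
        props.put("fs.s3a.secret.key", secretKey);
        return props;
    }

    public static void main(String[] args) {
        // Assumption: credentials are supplied through environment variables.
        Map<String, String> env = System.getenv();
        Map<String, String> props = credentials(
                env.getOrDefault("AWS_ACCESS_KEY_ID", ""),
                env.getOrDefault("AWS_SECRET_ACCESS_KEY", ""));
        System.out.println(location("bucket-name", "prefix/key"));
        // Print only the property names so that secrets never reach the logs.
        props.keySet().forEach(System.out::println);
    }
}
```

Each entry of the returned map could then be copied onto a Hadoop Configuration with conf.set(name, value) before calling HadoopInputFile.fromPath(path, conf).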
Answer 2
Score: 2
I got it working with the following dependencies:
compile 'org.slf4j:slf4j-api:1.7.5'
compile 'org.slf4j:slf4j-log4j12:1.7.5'
compile 'org.apache.parquet:parquet-avro:1.12.0'
compile 'org.apache.avro:avro:1.10.2'
compile 'com.google.guava:guava:11.0.2'
compile 'org.apache.hadoop:hadoop-client:2.4.0'
compile 'org.apache.hadoop:hadoop-aws:3.3.0'
compile 'org.apache.hadoop:hadoop-common:3.3.0'
compile 'com.amazonaws:aws-java-sdk-core:1.11.563'
compile 'com.amazonaws:aws-java-sdk-s3:1.11.563'
Example:
    Path path = new Path("s3a://yours3path");
    Configuration conf = new Configuration();
    // Credentials used by the s3a client
    conf.set("fs.s3a.access.key", "KEY");
    conf.set("fs.s3a.secret.key", "SECRET");
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    conf.setBoolean("fs.s3a.path.style.access", true);
    // Read INT96 columns as fixed-length byte arrays
    conf.setBoolean(org.apache.parquet.avro.AvroReadSupport.READ_INT96_AS_FIXED, true);
    InputFile file = HadoopInputFile.fromPath(path, conf);
    // try-with-resources closes the reader once all records are read
    try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build()) {
        GenericRecord record;
        while ((record = reader.read()) != null) {
            System.out.println(record);
        }
    }
Answer 3
Score: 1
Just adding to @franklinsijo's answer: if you are new to S3, note that the access key and secret key are set on the Hadoop Configuration.
Here is a snippet of code that might be useful:
    public static void main(String[] args) throws IOException {
        String PATH_SCHEMA = "s3a://xxx/xxxx/userdata1.parquet";
        Path path = new Path(PATH_SCHEMA);
        Configuration conf = new Configuration();
        // Credentials used by the s3a client
        conf.set("fs.s3a.access.key", "xxxxx");
        conf.set("fs.s3a.secret.key", "xxxxx");
        InputFile file = HadoopInputFile.fromPath(path, conf);
        // try-with-resources closes the reader once all records are read
        try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
My imports:
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.io.InputFile;
    import java.io.IOException;
    import org.apache.hadoop.fs.Path;