Question

I am trying to find a way to convert the pyarrow schema of a Parquet file on S3 into a usable Glue schema.

For context, I have several S3 locations holding Parquet files that are not laid out in a way a crawler can parse, so I want to create custom Glue tables by calling the Glue Catalog API with an explicit schema.

I thought about using pyarrow to read a Parquet file and then creating the Glue table from that schema, but I'm running into compatibility issues between the two type systems.

Another option would be to use a crawler, but I'm not sure how to use a crawler just to find out the schema of a location.

Any suggestions?

Thanks.

Answer 1

Score: 1

Glue Crawler supports Parquet files in S3. Point the crawler at the location of your Parquet files, and it should be able to infer the schema automatically.
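As a rough illustration of using a crawler only to discover a schema, the sketch below creates a crawler for one S3 path, waits for the run to finish, and reads back the columns it registered. It assumes `glue` is a `boto3.client("glue")` passed in by the caller, that the IAM role can read the path, and that the crawler has already left the `READY` state when `start_crawler` returns. The function name and its parameters are hypothetical, not part of the original answer.

```python
import time


def crawl_and_list_columns(glue, crawler_name, role_arn, database, s3_path,
                           poll_seconds=15):
    """Run a one-off crawler and return {table_name: [(column, glue_type), ...]}."""
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=crawler_name)

    # Assumption: the crawler reports READY again once the run has finished.
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # Collect the columns of every table in the target database, page by page.
    tables = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page["TableList"]:
            columns = table["StorageDescriptor"]["Columns"]
            tables[table["Name"]] = [(c["Name"], c["Type"]) for c in columns]
        if "NextToken" not in page:
            break
        kwargs["NextToken"] = page["NextToken"]
    return tables
```

Pointing the crawler at a scratch database keeps the inferred tables separate from the ones you define by hand, so the result can be inspected and then discarded.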

More info in the docs.
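If the crawler cannot cope with the layout, the approach from the question (read the schema with pyarrow, then call the Glue Catalog API) might look roughly like the sketch below. The type table is an assumption: it maps the string form of common pyarrow types (`str(field.type)`) to Glue/Hive type names and deliberately raises for anything it does not cover, such as nested or unsigned types. All function names here are hypothetical.

```python
import re

# Assumed mapping from str(pyarrow type) to Glue (Hive) type names.
ARROW_TO_GLUE = {
    "bool": "boolean",
    "int8": "tinyint",
    "int16": "smallint",
    "int32": "int",
    "int64": "bigint",
    "float": "float",
    "double": "double",
    "string": "string",
    "large_string": "string",
    "binary": "binary",
    "date32[day]": "date",
}


def glue_type(arrow_type):
    """Translate str(pyarrow_type) into a Glue type; raise for unmapped types."""
    if arrow_type in ARROW_TO_GLUE:
        return ARROW_TO_GLUE[arrow_type]
    if arrow_type.startswith("timestamp["):
        return "timestamp"
    decimal = re.fullmatch(r"decimal128\((\d+), (\d+)\)", arrow_type)
    if decimal:
        return f"decimal({decimal.group(1)},{decimal.group(2)})"
    raise ValueError(f"no Glue mapping for pyarrow type {arrow_type!r}")


def table_input(name, location, columns):
    """Build a Glue TableInput for a Parquet table from (name, arrow_type) pairs."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": glue_type(t)} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def columns_from_parquet(bucket_and_key):
    """Read only the schema of one Parquet object; needs pyarrow and AWS credentials."""
    import pyarrow.fs as pafs
    import pyarrow.parquet as pq

    with pafs.S3FileSystem().open_input_file(bucket_and_key) as f:
        schema = pq.read_schema(f)
    return [(n, str(t)) for n, t in zip(schema.names, schema.types)]
```

The resulting dictionary can then be passed to `boto3.client("glue").create_table(DatabaseName=..., TableInput=...)`. Raising on unmapped types is a deliberate choice: it surfaces the compatibility gaps up front instead of registering a table with a wrong column type.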
