Running R in an AWS Glue job
Question
Imagine you had a set of R scripts that form an ETL pipeline that you wanted to run as an AWS Glue job. AWS Glue supports Python and Scala.
Is it possible to call an R script as a Python subprocess (or a bash script that wraps a set of R scripts) within an AWS Glue job running in a container with Python and R dependencies?
If so, please outline the steps required and key considerations.
Answer 1
Score: 1
As Glue doesn't natively support running R scripts, you can consider the following as an alternative:
- Customise your own Docker image
- Push the image to ECR
- Configure the compute resources and schedule using AWS Batch
Example folder structure
.
├── Dockerfile
└── scripts
    └── rtest.R
Example Dockerfile based on https://hub.docker.com/r/rocker/tidyverse
FROM rocker/tidyverse:4.2.2
WORKDIR /scripts
COPY scripts/* /scripts/
RUN chmod 755 ./*
# Install additional R libraries
Example commands to push the image to ECR
aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com
docker build -t rdev .
docker tag rdev:latest aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest
docker push aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest
Ref: https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
Then follow this guide to set up AWS Batch on Fargate, then create and run a job: https://docs.aws.amazon.com/batch/latest/userguide/getting-started-fargate.html
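Once a job queue and job definition exist, submitting a run can be scripted. The following is a minimal sketch using boto3's Batch client, not part of the guide above. The job name, queue name, and job definition name are hypothetical placeholders, and the command assumes the image keeps its scripts in /scripts as in the Dockerfile.

```python
from typing import Any, Dict


def build_submit_job_request(
    job_name: str, job_queue: str, job_definition: str, script: str
) -> Dict[str, Any]:
    """Build the keyword arguments for the Batch SubmitJob call.

    All names passed in are placeholders chosen by the caller; the
    /scripts path mirrors the WORKDIR used in the example Dockerfile.
    """
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Override the container command so the job runs one R script.
        "containerOverrides": {"command": ["Rscript", f"/scripts/{script}"]},
    }


def submit_r_job(job_name: str, job_queue: str, job_definition: str, script: str) -> str:
    """Submit the job to AWS Batch and return its job ID."""
    # Imported here so the request builder can be used without boto3 installed.
    import boto3

    client = boto3.client("batch")
    request = build_submit_job_request(job_name, job_queue, job_definition, script)
    response = client.submit_job(**request)
    return response["jobId"]
```

For example, submit_r_job("rtest-run", "r-etl-queue", "r-etl-job-def", "rtest.R") would queue one run of rtest.R, assuming AWS credentials are configured and a queue and job definition with those (made-up) names exist.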
Answer 2
Score: 0
It is not possible.
While it is possible to run custom code in Glue, Glue is based on Spark, so only Scala and Python are supported. As for calling R through a Python subprocess, that does not seem to be an option either, as mentioned in the documentation:
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.
As @Isc commented, I would recommend using Docker with ECS to run batch ETL jobs using R.
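Inside such a container (not inside Glue), the Python-subprocess idea from the question does work, because R is installed in the image. Here is a minimal sketch of a Python entry point that runs a set of R scripts in order and stops at the first failure. The function name and the script paths are illustrative assumptions; it only relies on Rscript being on the PATH.

```python
import subprocess
from typing import List, Sequence


def run_scripts(
    scripts: Sequence[str],
    interpreter: Sequence[str] = ("Rscript", "--vanilla"),
) -> List[str]:
    """Run each script in order with the given interpreter.

    Returns the list of scripts that completed. Raises RuntimeError on
    the first script that exits with a non-zero status, so later
    pipeline stages never run on top of a failed stage.
    """
    completed = []
    for script in scripts:
        result = subprocess.run(
            [*interpreter, script],
            capture_output=True,  # keep stdout/stderr for error reporting
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(
                f"{script} failed with exit code {result.returncode}:\n{result.stderr}"
            )
        completed.append(script)
    return completed
```

Calling run_scripts(["/scripts/rtest.R"]) from the container's entry point would run the example script. The interpreter is a parameter so that the wrapper can be exercised on a machine without R.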