Running R in an AWS Glue job

Question

Imagine you had a set of R scripts that form an ETL pipeline that you wanted to run as an AWS Glue job. AWS Glue supports Python and Scala.

Is it possible to call an R script as a Python subprocess (or a bash script that wraps a set of R scripts) within an AWS Glue job running in a container with Python and R dependencies?

If so, please outline the steps required and key considerations.

Answer 1

Score: 1

As Glue doesn't natively support running R scripts, you can consider the following as an alternative:

  1. Customise your own Docker image
  2. Push the image to ECR
  3. Configure the compute resources and schedule using AWS Batch

Example folder structure

.
├── Dockerfile
└── scripts
    └── rtest.R

Example Dockerfile based on https://hub.docker.com/r/rocker/tidyverse

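FROM rocker/tidyverse:4.2.2
WORKDIR /scripts
# Copy every R script into the image; the trailing slashes mark the destination as a directory
COPY scripts/ /scripts/
RUN chmod 755 ./*
# Install additional R libraries here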

Example commands to push the image to ECR

aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com

docker build -t rdev .

docker tag rdev:latest aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest

docker push aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest

Ref: https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
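
The placeholders in the commands above (`aws_account_id`, `region`) have to be filled in consistently in several places. As a minimal sketch, the helper below assembles the same four commands from one set of values. The function names are hypothetical, and the account ID and region in the usage note are illustrative values, not real ones.

```python
from typing import List


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the full ECR image URI in the form used by the commands above."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def push_commands(account_id: str, region: str, repository: str,
                  local_image: str = "rdev", tag: str = "latest") -> List[str]:
    """Return the login, build, tag and push commands as strings (nothing is executed)."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    uri = ecr_image_uri(account_id, region, repository, tag)
    return [
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker build -t {local_image} .",
        f"docker tag {local_image}:{tag} {uri}",
        f"docker push {uri}",
    ]
```

Calling `push_commands("123456789012", "eu-west-1", "dev")` returns the four command strings with the placeholders substituted, ready to paste into a shell.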

Then follow this guide to set up AWS Batch on Fargate and to create and run a job: https://docs.aws.amazon.com/batch/latest/userguide/getting-started-fargate.html
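
Once a job queue and job definition exist, the job can also be submitted from code rather than the console. The following is a rough sketch using the boto3 Batch client's `submit_job` call. It assumes boto3 is installed and AWS credentials are configured. The job name, queue name and job definition name are hypothetical and must match whatever was created by following the guide.

```python
from typing import Any, Dict, Sequence


def build_submit_job_request(job_name: str, job_queue: str, job_definition: str,
                             command: Sequence[str]) -> Dict[str, Any]:
    """Assemble the keyword arguments for the Batch submit_job call."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Override the container command so the image runs the chosen R script.
        "containerOverrides": {"command": list(command)},
    }


def submit_r_job(request: Dict[str, Any], region: str) -> str:
    """Submit the job to AWS Batch and return its job ID."""
    import boto3  # imported here so the request builder works without boto3 installed

    client = boto3.client("batch", region_name=region)
    response = client.submit_job(**request)
    return response["jobId"]
```

For example, `build_submit_job_request("r-etl", "r-queue", "r-jobdef", ["Rscript", "/scripts/rtest.R"])` produces a request that runs the script copied into the image by the Dockerfile above.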

Answer 2

Score: 0

It is not possible

While it is possible to run custom code in Glue, it is based on Spark, so only Scala and Python are supported. As for calling R from a Python subprocess, that does not seem to be an option, judging by this note in the documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

As @Isc commented, I would recommend using Docker with ECS to run batch ETL jobs using R.
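
In such a container, the wrapper pattern from the question can still be used. Below is a minimal sketch of a Python entry point that runs a sequence of R scripts through `Rscript` and stops at the first failure. It assumes the image has both Python 3 and R installed, which the answers above do not confirm, and the function names are hypothetical.

```python
import subprocess
from typing import List, Sequence


def build_command(script: str, args: Sequence[str] = (), rscript: str = "Rscript") -> List[str]:
    """Build the argument list for running one R script non-interactively."""
    # --vanilla skips saved workspaces and user profiles for a reproducible run.
    return [rscript, "--vanilla", script, *args]


def run_pipeline(scripts: Sequence[str], rscript: str = "Rscript") -> None:
    """Run each R script in order; raise if any script exits with a non-zero status."""
    for script in scripts:
        result = subprocess.run(build_command(script, rscript=rscript),
                                capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(
                f"{script} failed with exit code {result.returncode}:\n{result.stderr}"
            )
        # Forward the script's output so it appears in the container logs.
        print(result.stdout, end="")
```

Calling `run_pipeline(["/scripts/rtest.R"])` inside the container would run the example script. Raising on a non-zero exit code makes the Python process fail too, so the container run is reported as failed.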

Posted by huangapple on 2023-05-25 03:43:42
When reposting, please keep the link to this article: https://go.coder-hub.com/76326916.html