YARN CPU usage and the htop results on the workers are inconsistent. I am running a Spark cluster on Dataproc

Question

I am running a Dataproc-managed Spark cluster.

  • OS = Ubuntu 18.04
  • Spark version = 3.3.0

My cluster configuration is as follows:

  • Master
    • Memory = 7.5 GiB
    • Cores = 2
    • Primary disk size = 32 GB
  • Workers
    • Cores = 16
    • RAM = 16 GiB
    • Available to YARN = 13536 MiB
    • Primary disk size = 32 GB

Necessary imports:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

I start the SparkSession as follows (note the change to maxPartitionBytes):

spark = SparkSession.builder.\
config("spark.executor.cores","15").\
config("spark.executor.instances","2").\
config("spark.executor.memory","12100m").\
config("spark.dynamicAllocation.enabled", False).\
config("spark.sql.adaptive.enabled", False).\
config("spark.sql.files.maxPartitionBytes","10g").\
getOrCreate()
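As a rough check on these settings, the sketch below estimates the YARN container size requested for each executor. It assumes Spark's documented default for spark.executor.memoryOverhead (10% of executor memory, with a 384 MiB floor) and ignores any rounding that the YARN scheduler applies to container sizes; the function name is hypothetical and not part of any Spark API.

```python
# Sketch: approximate YARN container size for one executor.
# Assumption: default memory overhead of 10% of executor memory, minimum 384 MiB.
# Not modelled: YARN rounding to its allocation increment, PySpark memory settings.
def executor_container_mib(executor_memory_mib, overhead_percent=10, min_overhead_mib=384):
    overhead = max(executor_memory_mib * overhead_percent // 100, min_overhead_mib)
    return executor_memory_mib + overhead


container = executor_container_mib(12100)   # spark.executor.memory = 12100m
yarn_available = 13536                      # "Available to YARN" per worker, in MiB
print("container MiB:", container)
print("headroom MiB:", yarn_available - container)
```

Under these assumptions one executor container fills almost all of the memory a worker offers to YARN, so anything else that needs a container on the same node has very little room left.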

I have a CSV file that takes up ~40 GiB on disk.

I read it in and cache it with the following:

df_covid = spark.read.csv("gs://xxxxxxxx.appspot.com/spark_datasets/covid60g.csv",
                          header=True, inferSchema=False)
df_covid.cache()
df_covid.count()
df_covid.rdd.getNumPartitions()
#output: 30
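The partition count of 30 is consistent with how Spark sizes file splits: the split size is capped not only by maxPartitionBytes but also by the total input size divided by the default parallelism. The sketch below reproduces that arithmetic in plain Python. It assumes the file is exactly 40 GiB, that the default parallelism equals the 30 executor cores (2 executors with 15 cores each), and the default spark.sql.files.openCostInBytes of 4 MiB; the function name is hypothetical.

```python
# Sketch of Spark's split-size rule for file sources, in plain Python.
# Assumptions: one splittable 40 GiB file, default parallelism of 30,
# and the default open cost of 4 MiB.
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimated_splits(file_bytes, default_parallelism, max_partition_bytes,
                     open_cost_bytes=4 * MIB):
    total_bytes = file_bytes + open_cost_bytes
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    # Ceiling division: the last split may be smaller than the others.
    return -(-file_bytes // max_split_bytes)


print(estimated_splits(40 * GIB, 30, 10 * GIB))
print(estimated_splits(40 * GIB, 30, 128 * MIB))
```

With the 10g setting, the per-core bound (about 1.33 GiB) is the one that applies, so the input is cut into one split per available core rather than into 10 GiB pieces.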

After that, the Storage tab shows the following:

[Image: Spark UI Storage tab]

10.3 GiB deserialized in memory and 3.9 GiB serialized on disk.

Now I want to check the CPU usage in my YARN UI and compare it with the htop results on the individual workers. The issue is:

  1. The Dataproc YARN UI has a min_alignment_period of 1 minute: the data points within each minute are combined into a single point. I therefore built a relatively heavy sequence of transformations that runs for more than a minute per partition, so that other work that might consume time (such as loading data from storage into execution memory) does not dominate the measurement.
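To illustrate why the workload needs to span more than one alignment window, here is a minimal sketch that averages per-second utilization samples into 60-second points. The averaging rule and the sample values are assumptions made for illustration; the source only says that the points in each minute are combined.

```python
# Sketch: combine per-second CPU samples into one point per alignment period.
# Assumption: points are combined by taking the mean; the samples are made up.
def align_mean(samples, period=60):
    windows = [samples[i:i + period] for i in range(0, len(samples), period)]
    return [sum(w) / len(w) for w in windows]


# Hypothetical short burst: 30 s idle, 45 s at 100%, 45 s idle.
short_burst = [0.0] * 30 + [100.0] * 45 + [0.0] * 45
# Hypothetical long run: 30 s idle, 150 s at 100%, 60 s idle.
long_run = [0.0] * 30 + [100.0] * 150 + [0.0] * 60

print(align_mean(short_burst))
print(align_mean(long_run))
```

A burst shorter than the window never shows up at full height, while a run that covers at least one whole window does, which is the reason for making each partition take more than a minute.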

I use the following transformations:

@udf(returnType=StringType())
def f1(x):
    out = ''
    for i in x:
        out += chr(ord(i)+1)
    return out

@udf(returnType=StringType())
def f2(x):
    out = ''
    for i in x:
        out += chr(ord(i)-1)
    return out

df_covid = df_covid.withColumn("_catted", F.concat_ws('',*df_covid.columns))

for i in range(10):
    df_covid = df_covid.withColumn("_catted", f1(F.col("_catted")))
    df_covid = df_covid.withColumn("_catted", f2(F.col("_catted")))
df_covid = df_covid.withColumn("esize1", F.length(F.split("_catted", "e").getItem(1)))
df_covid = df_covid.withColumn("asize1", F.length(F.split("_catted", "a").getItem(1)))
df_covid = df_covid.withColumn("isize1", F.length(F.split("_catted", "i").getItem(1)))
df_covid = df_covid.withColumn("nsize1", F.length(F.split("_catted", "n").getItem(1)))
df_covid = df_covid.filter((df_covid.esize1 > 5) & (df_covid.asize1 > 5) & (df_covid.isize1 > 5) & (df_covid.nsize1 > 5))
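For readers who want to see what this pipeline computes without starting Spark, the sketch below mirrors the logic in plain Python. The helper names are hypothetical stand-ins for f1, f2 and the split/getItem/length chain, and returning None for a missing element is an assumption about how the out-of-range getItem(1) behaves with default (non-ANSI) settings.

```python
# Plain-Python sketch of the per-row work done by the UDFs and the size columns.
# shift_up / shift_down mirror f1 / f2; size_after_first mirrors
# F.length(F.split(col, sep).getItem(1)). All three names are hypothetical.
def shift_up(text):
    return ''.join(chr(ord(c) + 1) for c in text)


def shift_down(text):
    return ''.join(chr(ord(c) - 1) for c in text)


def size_after_first(text, sep):
    parts = text.split(sep)
    # Assumption: a missing second element yields null (None), so the row
    # would not pass the "> 5" filter.
    return len(parts[1]) if len(parts) > 1 else None


row = "covid-19 benefit"
round_trip = shift_down(shift_up(row))
print(round_trip == row)
print(size_after_first(row, "e"))
```

Each f1/f2 pair returns the original string, so the ten loop iterations do not change the data; they only add CPU work, which is what the experiment needs.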

Now I call an action to start the computations:

df_covid.count()

I monitor htop on my two worker nodes. About a minute after calling the action, htop on both workers shows all cores fully utilized, and they remain fully utilized for about 3-4 minutes.

[Images: htop screenshots from the worker nodes]

As the load averages in the images show, my cores are going full tilt and all 16 cores are completely utilized. The uptime in the screenshots also shows that the cores are fully utilized for well over 2 minutes; in practice, they stay that way for 3+ minutes.

My issue is that the CPU utilization reported by the YARN metrics in Dataproc monitoring does not agree with this. The following are the CPU utilization charts from the same period:

[Images: CPU utilization charts from Dataproc monitoring]

They show a maximum CPU usage of ~70%.

What is the reason for the discrepancy between the YARN monitoring and htop? I have seen YARN CPU utilization go above 90% for other people, and a quick Google search shows the same. How is that achieved?

Answer 1

Score: 0

Spark's fixed costs take up a significant proportion of the tiny cluster that I was running my queries on. After scaling the cluster up to 12 worker nodes of the same configuration, the CPU usage is 93.5%.
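One way to sanity-check this explanation is a back-of-the-envelope model. It assumes that the fixed cost is a constant number of core-equivalents that does not grow with the cluster, which the source does not state; the function names are hypothetical, and only the 70%, 2-worker, 12-worker and 16-core figures come from the post.

```python
# Rough model: a fixed amount of unreported capacity spread over a growing cluster.
# Assumption: the fixed cost is constant in absolute terms (core-equivalents).
def idle_core_equivalents(utilization, workers, cores_per_worker):
    return (1 - utilization) * workers * cores_per_worker


def projected_utilization(idle_cores, workers, cores_per_worker):
    return 1 - idle_cores / (workers * cores_per_worker)


# ~70% peak on 2 workers with 16 cores each, as reported in the question.
idle = idle_core_equivalents(0.70, 2, 16)
projection = projected_utilization(idle, 12, 16)
print(f"fixed cost in core-equivalents: {idle:.1f}")
print(f"projected utilization on 12 workers: {projection:.1%}")
```

Under that assumption the model projects roughly 95% on 12 workers, which is in the same range as the 93.5% observed, so a fixed overhead is at least a plausible reading of the numbers.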

Posted by huangapple on June 12, 2023 at 21:09:00
Original link: https://go.coder-hub.com/76456997.html