Why is RU consumption higher than the ratio of provisioned throughput to autoscale max throughput?

Question

Why is RU consumption higher than ratio of provisioned throughput to autoscale max throughput?

What I'm seeing:

  • Autoscale max throughput is 220k
  • Provisioned throughput is only 153k
  • But RU consumption is 100%!

How come RU consumption is 100% when the provisioned throughput is nowhere close to Autoscale max throughput? Is RU consumption based on provisioned throughput even in autoscale mode? If that's the case, then doesn't that mean that RU consumption can't be used to determine if the database/container's autoscale max throughput is set too low or too high?

How do I determine if I have over or under-provisioned when in autoscale mode?

Answer 1

Score: 1

A Normalized RU consumption metric reading of 100% just means that at least one physical partition used its entire RU budget (220k / number of physical partitions) in at least one second.
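
As a quick back-of-the-envelope illustration of that budget (the physical partition count here is an assumption, since it isn't given in the question; you can read the real value from the partition metrics in the portal):

```python
# Per-partition, per-second RU budget under the autoscale max throughput.
autoscale_max_rus = 220_000
physical_partitions = 10            # hypothetical value for illustration

per_partition_budget = autoscale_max_rus / physical_partitions
print(f"Each physical partition may spend {per_partition_budget:,.0f} RU in any one second")
# A single second in which ONE partition spends its full budget is enough for
# the Normalized RU consumption metric to report 100%.
```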

Autoscale only raises the provisioned throughput when the average Normalized RU consumption over a sliding window of a few seconds warrants it. Specifically, the documentation states:

> Azure Cosmos DB only scales the RU/s to the maximum throughput when
> the normalized RU consumption is 100% for a sustained, continuous
> period of time in a 5-second interval
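
Here is a minimal sketch of the difference between what the metric shows and when that rule actually triggers a scale-up, using made-up per-second samples (this is my simplification of the documented behaviour, not Cosmos DB's real implementation):

```python
# Hypothetical Normalized RU consumption samples (one per second) for a partition.
samples = [100, 35, 20, 100, 40, 100, 100, 55, 30, 100]

def metric_reports_100(window):
    """The portal metric is driven by the busiest second, so one spike is enough."""
    return max(window) >= 100

def would_scale_to_max(window, sustain=5):
    """Per the docs, scale to max only after `sustain` consecutive seconds at 100%."""
    run = 0
    for value in window:
        run = run + 1 if value >= 100 else 0
        if run >= sustain:
            return True
    return False

print(metric_reports_100(samples))   # True  - the graph shows 100%
print(would_scale_to_max(samples))   # False - never 100% for 5 seconds in a row
```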

As the billing for autoscale is based on the peak provisioned RU/s reached in any one hour, using the Normalized RU consumption metric directly could have a fairly horrible multiplicative effect on the bill for that hour (a single busy partition for an atypical single second would then have an effect that is multiplied out by the number of partitions in the collection * the number of seconds in an hour * the number of regions the account is geo-replicated to).
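
To put rough numbers on that multiplication (both counts below are assumptions, purely for illustration):

```python
# How far a single hot partition-second would be "amplified" if it alone drove the
# billed peak: across every partition, every second of the hour, and every region.
physical_partitions = 10     # assumed
seconds_per_hour = 3_600
replicated_regions = 2       # assumed

amplification = physical_partitions * seconds_per_hour * replicated_regions
print(f"One hot partition-second would be paid for as {amplification:,} partition-seconds")
```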

Regarding whether you are over-provisioned or under-provisioned: you do have some flat lines on your graph where you reached the autoscale minimum, and you never reached the autoscale maximum in the time period covered (though you came quite close to it on May 8th), so on that basis you are maybe somewhat over-provisioned. It really depends on how much of a premium you are happy to pay to reduce the risk of seeing throttling.

I don't find Normalized RU consumption a very useful metric on its own, because it does not help distinguish between a collection performing periodic expensive operations and one under sustained pressure. For example, if you have a collection at 40,000 RU and 80 physical partitions, the "per partition per second" budget is 500 RU. If the documents are large and you have wildcard indexing, it is possible that a single insert costs that much, so a consistent trickle of inserts can make the collection appear permanently maxed out on that metric. (You can split this metric by physical partition, though, and check whether only specific partitions are peaking while others are much more idle, and whether it is consistently the same "hot" partition, as is also shown in the heatmap in the "classic" metrics.)
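
As a concrete version of that trickle-of-inserts scenario (all of the numbers are the illustrative ones from the paragraph above):

```python
# 40,000 RU/s spread over 80 physical partitions leaves only 500 RU/s each,
# so one expensive insert per second per partition pins the metric at 100%.
provisioned_rus = 40_000
physical_partitions = 80
per_partition_budget = provisioned_rus / physical_partitions   # 500 RU/s

insert_cost_ru = 500          # a large document with wildcard indexing (illustrative)
inserts_per_second = 1

utilisation = 100 * insert_cost_ru * inserts_per_second / per_partition_budget
print(f"Normalized RU consumption for that partition: {utilisation:.0f}%")  # 100%
```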

Another way of looking at it is that you have a max throughput of 220k per second, so in theory you can sustain 13,200,000 RU per minute. You can look at the Request Units used metric to see what your peak minute was and how close you are to this theoretical max. If you never get anywhere near it, then (as long as your work is evenly distributed across partitions) you might conclude that there is scope to scale down, and just retry any throttled requests, since any peaks are likely very transient.
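
A rough way to do that comparison (the peak-minute figure below is a placeholder; take the real number from the Request Units metric at one-minute granularity):

```python
# Compare the busiest observed minute against the ceiling the autoscale max
# throughput could sustain for a full minute.
autoscale_max_rus = 220_000
ceiling_per_minute = autoscale_max_rus * 60          # 13,200,000 RU/minute

observed_peak_minute_ru = 4_000_000                  # placeholder from the metrics blade

used = observed_peak_minute_ru / ceiling_per_minute
print(f"Peak minute used {used:.0%} of the ceiling ({1 - used:.0%} headroom)")
```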
