Flexibly set floating point number precision at compile time

Question

I have a C++ program that can be compiled for single or double precision floating point numbers. Similarly to what is explained here (https://stackoverflow.com/questions/14511910/switching-between-float-and-double-precision-at-compile-time), I have a header file which defines:

typedef double dtype;

or:

typedef float dtype;

depending on whether single or double precision is required by the user. When declaring variables and arrays I always use the data type dtype, so the correct precision is used throughout the code.

My question is how I can, in a similar fashion, set the data type of hard-coded numbers in the code, as in this example:

dtype var1 = min(var0, 3.65);

As far as I know, 3.65 is by default double precision and will be single precision if I write:

dtype var1 = min(var0, 3.65f);

But is there a way to define a literal, for instance like this:

dtype var1 = min(var0, 3.65_dt);

that can be defined as either float or double at compile time, so that hard-coded numbers in the code also have the right precision?

Currently, I cast the number to dtype like this:

dtype var1 = min(var0, (dtype)3.65);

but I am concerned that this might create overhead in the case of single precision, since the program might actually create a double precision number which is then cast to a single precision number. Is this indeed the case?

Answer 1

Score: 1

You can do this with a macro that appends an f suffix for float, as with #define foo(x) x##f, and does not for double, as with #define foo(x) x.
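As a minimal sketch of that macro approach, the following ties the literal macro to the same switch that selects dtype. The names USE_SINGLE_PRECISION, DT, and cap_at_limit are hypothetical and chosen only for illustration; they do not appear in the question.

```cpp
#include <algorithm>
#include <type_traits>

// Hypothetical build switch: compile with -DUSE_SINGLE_PRECISION for float.
#ifdef USE_SINGLE_PRECISION
typedef float dtype;
#define DT(x) x##f   // DT(3.65) expands to the float literal 3.65f
#else
typedef double dtype;
#define DT(x) x      // DT(3.65) expands to the double literal 3.65
#endif

// The literal's type follows dtype, so no conversion is needed at the call site.
static_assert(std::is_same<decltype(DT(3.65)), dtype>::value,
              "DT(x) must produce a literal of type dtype");

// Illustrative helper: both arguments of std::min have type dtype,
// so template argument deduction succeeds in either build.
dtype cap_at_limit(dtype var0) {
    return std::min(var0, DT(3.65));
}
```

The argument must contain a decimal point or an exponent: DT(3) would expand to 3f in the single precision build, which is not a valid literal.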

While you can also coerce constants to become float values with casts or various induced conversions, this creates a double-rounding process: The literal in source text is first converted to double and then converted to float. In about one instance in 2^29, this produces a different result than if the literal is directly converted to float.

(2^29 is due to the difference in the numbers of bits in the significands of the formats commonly used for float and double, 24 and 53. This assumes a uniform distribution for the bit patterns in the representation. Practical data may have a different distribution.)
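To make the double-rounding effect concrete, here is a small sketch with a value constructed to trigger it. It assumes IEEE-754 binary32 and binary64 formats, round-to-nearest-even, a compiler that rounds literals correctly, and C++17 for the hexadecimal floating literal. The names direct, via_double, and double_rounding_differs are made up for this example.

```cpp
// Exact value of the literal: 1 + 2^-24 + 2^-60.
// It lies just above the midpoint between the adjacent floats 1 and 1 + 2^-23.

// Direct conversion: with the f suffix the source text is rounded straight
// to float, and since the value is above the midpoint it rounds up.
constexpr float direct = 0x1.000001000000001p0f;

// Double rounding: the literal is first rounded to double, which drops the
// 2^-60 term and leaves the exact midpoint 1 + 2^-24. The cast to float then
// breaks the tie toward the even significand, which is 1.
constexpr float via_double = static_cast<float>(0x1.000001000000001p0);

// Reports whether the two conversion paths gave different floats.
bool double_rounding_differs() {
    return direct != via_double;
}
```

Under those assumptions, direct is 1 + 2^-23 while via_double is 1, so the two paths disagree by one unit in the last place. Ordinary constants such as 3.65 are very unlikely to hit this case.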

Posted by huangapple on March 4, 2023, 04:11:06