How to handle NaN values when writing to Parquet in Go?

Question

I am trying to write to a Parquet file in Go. While writing to this file, I may encounter NaN values. Since NaN is defined neither among the primitive types nor among the logical types, how do I handle this value in Go? Does any existing schema work for it?

I am using the parquet GO library from here. You can find an example of code that uses a JSON schema to write to Parquet with this library here.

Answer 1

Score: 0


The issue was discussed at length in xitongsys/parquet-go issue 281, with the recommendation being to

> use OPTIONAL type.
> Even if you don't assign a value (as in your code), the non-pointer value will be assigned a default value.
> So parquet-go doesn't know whether it's null or the default value.

However:

> What it comes down to is that I cannot use the OPTIONAL type; in other words, I cannot convert my structure to use pointers.
> I have tried to use repetitiontype=OPTIONAL as a tag, but this leads to some weird behavior.
> I would expect that tag to behave the same way as the omitempty tag in the Golang standard library, i.e. if the value is not present then it is not put into the JSON.
>
> The reason this is important is that if the field is missing or not set, then once it is encoded to parquet there is no way of telling whether the value was 0 or just not set, in the case of int64.

This illustrates the issue:

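package main

import (
	"encoding/json"
	"log"
	"os"
)

// The omitempty tag applies to all three fields declared on the line.
type Salary struct {
	Basic, HRA, TA float64 `json:",omitempty"`
}

// Age has no omitempty tag, so it is always emitted, even when it is 0.
type Employee struct {
	FirstName, LastName, Email string `json:",omitempty"`
	Age                        int
	MonthlySalary              []Salary `json:",omitempty"`
}

func main() {
	// Only Email and one salary component are set; everything else
	// keeps its zero value.
	data := Employee{
		Email: "mark@gmail.com",
		MonthlySalary: []Salary{
			{
				Basic: 15000.00,
			},
		},
	}

	file, err := json.MarshalIndent(data, "", " ")
	if err != nil {
		log.Fatal(err)
	}

	// os.WriteFile replaces the deprecated ioutil.WriteFile.
	if err := os.WriteFile("test.json", file, 0o644); err != nil {
		log.Fatal(err)
	}
}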

which produces the following JSON:

{
 "Email": "mark@gmail.com",
 "Age": 0,
 "MonthlySalary": [
  {
   "Basic": 15000
  }
 ]
}

> As you can see, the items in the struct that have the omitempty tag and that are not assigned do not appear in the JSON, i.e. HRA and TA.
> On the other hand, Age does not have this tag and hence it is still included in the JSON.
>
> This is problematic, as all fields in the struct are assigned memory when this Golang library writes to parquet, so if you have a big struct that is only sparsely populated it will still take the full amount of memory.
> It is a bigger problem when the file is read again, as there is no way of knowing whether the value that was put in the parquet file was the empty value or was just not assigned.
>
> I am happy to help implement an omitempty tag for this library if I can convince you of the value of having it.

That echoes issue 403 "No option to omitempty when not using pointers".
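
For comparison, here is a minimal sketch of how encoding/json itself separates "not set" from "set to zero" once a field is a pointer. The `Person` type and the `encode` helper are hypothetical names introduced only for this illustration, and the sketch says nothing about how parquet-go treats the same struct.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Person holds Age as a pointer: nil means "not set", while a pointer
// to 0 means "explicitly zero".
type Person struct {
	Age *int `json:",omitempty"`
}

// encode marshals p and returns the JSON text, or the error message.
func encode(p Person) string {
	b, err := json.Marshal(p)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := 0
	fmt.Println(encode(Person{}))           // prints {}
	fmt.Println(encode(Person{Age: &zero})) // prints {"Age":0}
}
```

With a pointer, omitempty drops the field only when it is nil, so an explicit 0 survives the round trip. This is the distinction the quoted issue asks for, without being able to switch to pointers.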

huangapple
  • Posted on June 20, 2022 at 23:57:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72689835.html