Schema-less JSON to Apache Beam "Row" type?
Question
Is there a way to convert arbitrary schema-less JSON strings into Apache Beam "Row" types using the Java SDK? If not, is it possible to derive a Beam Schema type from an existing Object?
I've found the documentation for JsonToRow and ParseJsons, but they require either a Schema or a POJO class to be provided in order to work. I also found that you can read JSON strings into a BigQuery TableRow, but there doesn't seem to be a way to convert a TableRow into a Row without already having a schema.
Answer 1
Score: 2
No, this is not possible, as Row (and the frameworks that use it) requires the schema to be known at pipeline construction time. One option is to read a small portion of your data at construction time, infer the schema from it, and use that schema to invoke your JsonToRow transform.
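The sampling idea above could be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not a Beam API: it assumes the sample records have already been parsed into flat `Map<String, Object>` values (for example by a JSON parser of your choice), and the class `SampleSchemaInference` and its methods are hypothetical names. The type names it produces mirror Beam's `Schema.FieldType` constants, so the result could then be used to build a real `Schema` by hand and pass it to JsonToRow.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: infers flat field types from a small sample of parsed JSON objects.
class SampleSchemaInference {

    // Maps a parsed JSON scalar to a type name mirroring Beam's Schema.FieldType constants.
    static String typeOf(Object value) {
        if (value instanceof String) return "STRING";
        if (value instanceof Boolean) return "BOOLEAN";
        if (value instanceof Integer || value instanceof Long) return "INT64";
        if (value instanceof Number) return "DOUBLE";
        // Nested objects and arrays are out of scope for this flat sketch.
        throw new IllegalArgumentException("Unsupported value: " + value);
    }

    // Reconciles two observed types for the same field; only INT64/DOUBLE is widened.
    static String merge(String previous, String seen) {
        if (previous.equals(seen)) return previous;
        boolean numeric = (previous.equals("INT64") && seen.equals("DOUBLE"))
                || (previous.equals("DOUBLE") && seen.equals("INT64"));
        if (numeric) return "DOUBLE";
        throw new IllegalStateException("Conflicting types: " + previous + " vs " + seen);
    }

    // Returns field name -> type name, in order of first appearance across the sample.
    static Map<String, String> inferFields(List<Map<String, Object>> sample) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map<String, Object> record : sample) {
            for (Map.Entry<String, Object> entry : record.entrySet()) {
                if (entry.getValue() == null) continue; // null carries no type information
                fields.merge(entry.getKey(), typeOf(entry.getValue()), SampleSchemaInference::merge);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("id", 1);
        first.put("score", 2);
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("id", 2);
        second.put("score", 2.5);
        second.put("name", "x");
        System.out.println(inferFields(List.of(first, second)));
    }
}
```

Fields that are missing from some sample records (such as `name` here) would most likely need to be declared nullable in the resulting schema, and records outside the sample may still fail to match whatever was inferred.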
Answer 2
Score: 2
Unfortunately, the generic answer appears to be "no", although there are some specific situations where the answer may be "yes".
The issue is that Schemas aren't 100% compatible with JSON data types, specifically because of the ARRAY field type.
In JSON, the elements of a list may have different data types, but Beam Schemas require every element of an ARRAY to have the same type. That type can be another ROW, or even a logical type, but all elements must share it.
Unfortunately, using a ROW to replace an ARRAY doesn't entirely work. Although ROW fields are positional, they're also named, which makes them closer to a MAP. Furthermore, if your dataset contains JSON lists of differing lengths, each Row ends up with a different Schema, which will have undesirable consequences.
So if your JSON data doesn't use lists of arbitrary types, you should be OK. That said, Beam doesn't provide any utilities for deriving schemas from JSON, so you'll need to create that solution yourself.
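As a rough illustration of the ARRAY constraint described above, a home-grown inference step could first check whether a JSON list is homogeneous before mapping it to an ARRAY field. The sketch below is standard-library Java only and is not part of Beam; it assumes the list has already been parsed into Java objects (`String`, `Boolean`, `Number`, `Map`, `List`), and `ArrayElementType`, `kindOf`, and `uniformKind` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: checks whether a parsed JSON list could map to a single-typed ARRAY.
class ArrayElementType {

    // Classifies a parsed JSON value into a coarse kind.
    static String kindOf(Object value) {
        if (value == null) return "NULL";
        if (value instanceof String) return "STRING";
        if (value instanceof Boolean) return "BOOLEAN";
        if (value instanceof Number) return "NUMBER";
        if (value instanceof Map) return "ROW";
        if (value instanceof List) return "ARRAY";
        throw new IllegalArgumentException("Unsupported value: " + value);
    }

    // Returns the kind shared by all non-null elements, or empty when the
    // list mixes kinds or contains no typed element to infer from.
    static Optional<String> uniformKind(List<?> jsonArray) {
        String kind = null;
        for (Object element : jsonArray) {
            String current = kindOf(element);
            if (current.equals("NULL")) continue; // nulls do not constrain the element type
            if (kind == null) {
                kind = current;
            } else if (!kind.equals(current)) {
                return Optional.empty(); // mixed list: no single ARRAY element type exists
            }
        }
        return Optional.ofNullable(kind);
    }

    public static void main(String[] args) {
        System.out.println(uniformKind(List.of(1, 2, 3)));
        System.out.println(uniformKind(List.of(1, "two", true)));
    }
}
```

The check is deliberately shallow: two nested objects with different fields both count as "ROW" here, so a fuller solution would need to recurse into nested rows and arrays and compare their inferred schemas as well.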
Answer 3
Score: 1
I'm running into this exact situation at work, and I second everything you said about the limited JSON processing options. It turns out that Beam does provide a utility, BigQueryUtils [1], that can convert a BigQuery TableRow/TableSchema to a Beam Row/Schema. If you choose that as your schema and proceed with JsonToRow, you'll need to bridge the data type mismatch between JSON and the Beam data types produced by BigQueryUtils.fromTableSchema (BYTES, DATETIME, and the logical types [2] generated by BigQueryUtils).
1: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.html
2: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/schemas/logicaltypes/SqlTypes.html