Empty Column not being listed in S3 select in Databricks
Question
I'm querying a JSON file in S3 with multiple columns:
SELECT a, b, c FROM json.`s3://my-bucket/file.json.gz`
And the file looks like this:
{a: {}, b: 0, c: 1}
{a: {}, b: 1, c: 2}
{a: {}, b: 2, c: 3}
The query above fails and returns
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `a` cannot be resolved. Did you mean one of the following? [`b`, `c`]
And when I perform
SELECT * FROM json.`s3://my-bucket/file.json.gz`
I get only the columns b and c.
Is there a way I can also get column a, and see that it is an empty JSON object?
Answer 1
Score: 2
Can you use Python or Scala syntax?
You need to impose a schema on the JSON file when reading it, and as far as I know that's not possible through SQL queries alone.
The solution using Python syntax would look like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Not sure what the data type for column a is supposed to be, so apply the correct data type.
schema = StructType([
    StructField('a', StringType(), True),
    StructField('b', IntegerType(), True),
    StructField('c', IntegerType(), True),
])

df = spark.read.schema(schema).json('s3://my-bucket/file.json.gz')
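If you still want to query the data with SQL afterwards, a minimal follow-up sketch (assuming the df from the snippet above and a hypothetical view name my_json) is to register the DataFrame as a temporary view and run the original SELECT against it; with column a read as StringType, the empty object should come back as the raw text {}:

# Assumes df from the snippet above, with the explicit schema already applied.
df.createOrReplaceTempView('my_json')  # hypothetical view name

# The original query now resolves column a; as a string it should show the raw '{}' text.
spark.sql('SELECT a, b, c FROM my_json').show(truncate=False)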
Comments