Databricks PySpark: java.lang.ArrayStoreException: java.util.HashMap
Question
I am having an issue creating a data frame with PySpark from data gathered through an API in Databricks.
I am able to connect to the API, and using the requests library I gather the data:
import requests

def gather_data():
    url = "www.data.com"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        print("error")

gather_data()
Doing this I am able to see my data as JSON, although when trying to read it using a function like:
data = gather_data()
df = spark.read.option("multiline","true").json(data)
I get the error: java.lang.ArrayStoreException: java.util.HashMap
The data looks like:
Out[72]: [{'name': 'Johnny', 'id_number': 12, 'birthday': '2023-04-03 08:00:00.0'},
 {'name': 'Dave', 'id_number': 56, 'birthday': '2023-04-03 08:00:01.0'}]
The same happens when I use:
- type(gather_data())
- df = spark.read.json(data)
- df = spark.createDataFrame([Row(**i) for i in data])
At first I assumed I was missing the multiline option, but that was not the case.
Typically when working with CSV files, I can just read them straight into a data frame, but that isn't the case here.
Through forums I have seen that the response.json() method returns what seems to be a dict type, although I can't confirm this, as the type() method won't complete on it.
I have gone through multiple answers and can't seem to find one that will work yet.
I have seen other answers define the schema and its types, but I am using many datetime stamps and would like to preserve the data; I am unsure how to accomplish this with a very large dataset.
Answer 1
Score: 0
The problem arises because you're not returning the JSON as text; you're calling the response.json() function, which returns a parsed Python representation of that JSON. spark.read.json expects a path, a list of paths, or an RDD of JSON strings, so handing it a list of Python dicts is what triggers the ArrayStoreException.
You have the following choices:
- Just create a dataframe directly from your parsed object using createDataFrame (see the fuller sketch after this list for handling the timestamps):
data = gather_data()
df = spark.createDataFrame(data)  # pass the list of dicts directly: one row per dict
- Don't parse the data, and use spark.read.json instead; for that you'll need to replace response.json() with response.text in your gather_data function:
data = gather_data()  # now a raw JSON string, since gather_data returns response.text
df = spark.read.json(sc.parallelize([data]))
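For completeness, here is a minimal sketch of the first option extended to handle the datetime stamps the question asks about. It is only a sketch: the field names and sample values are copied from the question's Out[72] output rather than from the real API, and the to_timestamp pattern assumes the single trailing fractional-second digit shown there:

from pyspark.sql.functions import to_timestamp

# Sample records copied from the question's Out[72] output (assumed shape).
data = [
    {'name': 'Johnny', 'id_number': 12, 'birthday': '2023-04-03 08:00:00.0'},
    {'name': 'Dave', 'id_number': 56, 'birthday': '2023-04-03 08:00:01.0'},
]

# One row per dict; Spark infers name/birthday as strings and id_number as long.
df = spark.createDataFrame(data)

# Cast the birthday strings to real timestamps; 'S' matches the single
# fractional-second digit in values like '2023-04-03 08:00:00.0'.
df = df.withColumn("birthday", to_timestamp("birthday", "yyyy-MM-dd HH:mm:ss.S"))
df.printSchema()  # birthday is now a timestamp column, so the values are preserved

Because the cast is applied per column rather than per record, the same lines work unchanged on a very large dataset; no row-by-row schema definition is needed.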
Comments