Databricks PySpark: java.lang.ArrayStoreException: java.util.HashMap

Question

I am having an issue creating a DataFrame with PySpark from data gathered from an API in Databricks.

I am able to connect to the API, and I gather the data using the requests library:

import requests

def gather_data():
    url = "https://www.data.com"  # placeholder URL
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        print("error")


gather_data()

Doing this I am able to see my data as JSON, although when I try to read it using a function like:

data = gather_data()
df = spark.read.option("multiline","true").json(data)

I get the error: java.lang.ArrayStoreException: java.util.HashMap

The data looks like:

Out[72]: [{'name': 'Johnny', 'id_number': 12, 'birthday': '2023-04-03 08:00:00.0'},
 {'name': 'Dave', 'id_number': 56, 'birthday': '2023-04-03 08:00:01.0'}]

The same happens when I use:

- type(gather_data())
- df = spark.read.json(data)
- df = spark.createDataFrame([Row(**i) for i in data])

At first I assumed I was missing the multiline option, but that was not the case.

Typically when working with CSV files, I can read them straight into a DataFrame, but that isn't the case here.

Through forums I have seen that the response.json() method seems to return a dict type, although I can't confirm this, as calling type() on it won't complete.

I have gone through multiple answers and can't seem to find one that will work yet.

I have seen other approaches that define the schema and its types, but I am using many datetime stamps and would like to preserve the data; I am unsure how to accomplish this with a very large dataset.
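
A sketch of that schema-based approach for the sample records shown above (the field names come from the sample; reading birthday as a string and then casting it with to_timestamp is an assumption about the format):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_timestamp

# Explicit schema for the sample records; birthday arrives as text in the JSON
schema = StructType([
    StructField("name", StringType(), True),
    StructField("id_number", IntegerType(), True),
    StructField("birthday", StringType(), True),
])

df = spark.createDataFrame(data, schema=schema)
# Cast the string column to a real timestamp so the values are preserved
df = df.withColumn("birthday", to_timestamp("birthday"))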

Answer 1

Score: 0

The problem arises because you're not returning the JSON as text; the response.json() function returns a parsed Python representation of that JSON (here, a list of dicts) instead.
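
A quick way to confirm this (a sketch, assuming data holds the return value of gather_data()):

data = gather_data()
print(type(data))  # a parsed Python object, e.g. list or dict, not a JSON string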

You have the following choices:

  • Just create a DataFrame from your parsed object using createDataFrame:

    data = gather_data()
    df = spark.createDataFrame(data)

  • Don't parse the data and use spark.read.json instead, but you'll need to replace response.json() with response.text in your gather_data function (see the end-to-end sketch after this list):

    data = gather_data()
    df = spark.read.json(sc.parallelize([data]))
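
A minimal end-to-end sketch of the second option, assuming a Databricks notebook where spark and sc are already defined, and keeping the placeholder URL from the question:

import requests

def gather_data():
    url = "https://www.data.com"  # placeholder endpoint from the question
    response = requests.get(url)
    if response.status_code == 200:
        return response.text  # raw JSON string instead of a parsed object
    else:
        print("error")

data = gather_data()
# spark.read.json accepts an RDD of JSON strings, so wrap the payload in one
df = spark.read.json(sc.parallelize([data]))
df.printSchema()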
