Databricks PySpark: java.lang.ArrayStoreException: java.util.HashMap
Question
I am having an issue creating a data frame with PySpark from data gathered through an API in Databricks.
I am able to connect to the API, and using the requests library I gather the data:
import requests

def gather_data():
    url = "www.data.com"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        print("error")

gather_data()
Doing this I am able to see my data as JSON, although when trying to read it using a function like:
data = gather_data()
df = spark.read.option("multiline","true").json(data)
I get the error: java.lang.ArrayStoreException: java.util.HashMap
The data looks like:
Out[72]: [{'name': 'Johnny', 'id_number': 12, 'birthday': '2023-04-03 08:00:00.0'},
 {'name': 'Dave', 'id_number': 56, 'birthday': '2023-04-03 08:00:01.0'}]
The same happens when I use:
- type(gather_data())
- df = spark.read.json(data)
- df = spark.createDataFrame([Row(**i) for i in data])
At first I assumed I was missing the multiline option, but that was not the case.
Typically when working with CSV files, I can just read them straight into a data frame, but that isn't the case here.
Through forums I have seen that the response.json() method returns what seems to be a dict type, although I can't confirm this, as the type() method won't complete on it.
I have gone through multiple answers and can't seem to find one that will work yet.
I have seen other answers define the schema and its types, but I am using many datetime stamps and would like to preserve the data; I am unsure how to accomplish this with a very large dataset.
Answer 1
Score: 0
The problem arises because you're not returning the JSON as text; you're calling the response.json() function, which returns a parsed Python representation of that JSON. spark.read.json expects a path, a list of paths, or an RDD of JSON strings, so handing it a list of Python dicts is what triggers the ArrayStoreException.
You have the following choices:
- Just create a dataframe directly from your parsed object using createDataFrame (see the fuller sketch after this list for handling the timestamps):
data = gather_data()
df = spark.createDataFrame(data)  # pass the list of dicts directly: one row per dict
- Don't parse the data, and use spark.read.json instead; for that you'll need to replace response.json() with response.text in your gather_data function:
data = gather_data()  # now a raw JSON string, since gather_data returns response.text
df = spark.read.json(sc.parallelize([data]))
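For completeness, here is a minimal sketch of the first option extended to handle the datetime stamps the question asks about. It is only a sketch: the field names and sample values are copied from the question's Out[72] output rather than from the real API, and the to_timestamp pattern assumes the single trailing fractional-second digit shown there:

from pyspark.sql.functions import to_timestamp

# Sample records copied from the question's Out[72] output (assumed shape).
data = [
    {'name': 'Johnny', 'id_number': 12, 'birthday': '2023-04-03 08:00:00.0'},
    {'name': 'Dave', 'id_number': 56, 'birthday': '2023-04-03 08:00:01.0'},
]

# One row per dict; Spark infers name/birthday as strings and id_number as long.
df = spark.createDataFrame(data)

# Cast the birthday strings to real timestamps; 'S' matches the single
# fractional-second digit in values like '2023-04-03 08:00:00.0'.
df = df.withColumn("birthday", to_timestamp("birthday", "yyyy-MM-dd HH:mm:ss.S"))
df.printSchema()  # birthday is now a timestamp column, so the values are preserved

Because the cast is applied per column rather than per record, the same lines work unchanged on a very large dataset; no row-by-row schema definition is needed.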
Comments