Spark Streaming - Accessing an array of custom case class objects in a Spark SQL dataframe
Question
In my Spark Streaming query I would like to use a case class called URL with 3 string members as follows:
url: string
domain: string
topLevelDomain: string
I would like to create a DataFrame where one of the members is an array of URL objects. The schema is as follows:
root
|-- AccountId: integer (nullable = true)
|-- url1: struct (nullable = true)
| |-- url: string (nullable = true)
| |-- domain: string (nullable = true)
| |-- topLevelDomain: string (nullable = true)
|-- finalURLs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- url: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- topLevelDomain: string (nullable = true)
The column finalURLs is an array of URL objects.
Later, I would like to apply a transformation to this column that converts it to a list of strings, each of which is either the domain or the topLevelDomain, depending on the values in the other columns.
First of all, is it possible to have a column that is an array of case class objects? If yes, how can the above transformation be applied to reduce it to an array of strings?
Answer 1
Score: 0
If you want to work with case classes, you need to convert the DataFrame to a typed Dataset. When you do, you have to map the entire record to a case class, not just the URL columns. Something like this would work:
case class URL(url: String, domain: String, topLevelDomain: String)
case class MyRow(AccountId: Int, url1: URL, finalURLs: Seq[URL])
// needs import spark.implicits._ in scope for the encoders
df.as[MyRow].map { case MyRow(accountId, url1, finalURLs) =>
  (accountId, url1, finalURLs.map { case URL(url, domain, topLevelDomain) => /*your logic here*/ })
}
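To make the /*your logic here*/ placeholder a bit more concrete, here is a minimal sketch of the per-row logic written as a plain function. The question does not say which condition decides between domain and topLevelDomain, so useTopLevelDomain below is a purely hypothetical rule, and the names FinalUrls and pickHosts are made up for this illustration. Swap in your own condition.

```scala
// Same case classes as in the answer above.
case class URL(url: String, domain: String, topLevelDomain: String)
case class MyRow(AccountId: Int, url1: URL, finalURLs: Seq[URL])

object FinalUrls {
  // HYPOTHETICAL rule, for illustration only: fall back to topLevelDomain
  // when url1 is missing or its domain is null/empty.
  // Replace this with the real condition on your other columns.
  def useTopLevelDomain(row: MyRow): Boolean =
    Option(row.url1).flatMap(u => Option(u.domain)).forall(_.isEmpty)

  // Reduce the array of URL objects to an array of strings,
  // picking one field per element based on the row-level rule.
  def pickHosts(row: MyRow): Seq[String] = {
    val topLevel = useTopLevelDomain(row)
    // finalURLs is nullable in the schema, so treat null as empty.
    Option(row.finalURLs).getOrElse(Seq.empty[URL])
      .map(u => if (topLevel) u.topLevelDomain else u.domain)
  }
}
```

With import spark.implicits._ in scope, this should plug into the snippet above as df.as[MyRow].map(r => (r.AccountId, FinalUrls.pickHosts(r))), which yields a Dataset of (Int, Seq[String]). Keeping the rule in a plain function also lets you unit-test it without starting a Spark session.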