Spark Streaming - Accessing an array of custom case class objects in a Spark SQL dataframe

Question

In my Spark Streaming query I would like to use a case class called URL with 3 string members as follows:

  url: string            
  domain: string         
  topLevelDomain: string 

I would like to create a DataFrame where one of the members is an array of URL objects. Schema as follows:

root
 |-- AccountId: integer (nullable = true)
 |-- url1: struct (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- domain: string (nullable = true)
 |    |-- topLevelDomain: string (nullable = true)
 |-- finalURLs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- topLevelDomain: string (nullable = true)

The column finalURLs is an array of URL objects.

Later, I would like to transform this column into a list of strings, where each string is either the domain or the topLevelDomain, depending on the values in the other columns.

First of all, is it possible to have a column that is an array of case class objects? If so, how can the above transformation be applied to reduce it to an array of strings?

Answer 1

Score: 0

If you want to work with case classes, you need to convert the DataFrame to a Dataset. When you do so, you have to convert the entire record, not just the URL columns. Something like this would work:

// Needs import spark.implicits._ in scope (assuming a SparkSession named spark) for the encoders.
case class URL(url: String, domain: String, topLevelDomain: String)
case class MyRow(AccountId: Int, url1: URL, finalURLs: Seq[URL])

df.as[MyRow].map { case MyRow(accountId, url1, finalURLs) =>
  (accountId, url1, finalURLs.map { case URL(url, domain, topLevelDomain) => /* your logic here */ })
}
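
To make the placeholder above more concrete, here is a minimal sketch of the per-row logic written as plain Scala, without Spark, so it can be tried on its own. The question does not say which column values decide between domain and topLevelDomain, so the rule in pick is purely hypothetical, as are the names FinalUrls, pick and toStrings and the sample URLs.

```scala
// Same shapes as the case classes in the answer.
case class URL(url: String, domain: String, topLevelDomain: String)
case class MyRow(AccountId: Int, url1: URL, finalURLs: Seq[URL])

object FinalUrls {
  // Hypothetical rule (not from the question): rows whose url1 has the
  // top-level domain "com" keep each element's domain; all other rows
  // keep each element's topLevelDomain.
  def pick(row: MyRow, u: URL): String =
    if (row.url1.topLevelDomain == "com") u.domain else u.topLevelDomain

  // Reduces one row to (AccountId, sequence of strings).
  def toStrings(row: MyRow): (Int, Seq[String]) =
    (row.AccountId, row.finalURLs.map(u => pick(row, u)))

  def main(args: Array[String]): Unit = {
    // Made-up sample row, only for illustration.
    val row = MyRow(
      1,
      URL("http://a.example.com/x", "a.example.com", "com"),
      Seq(
        URL("http://b.example.org/y", "b.example.org", "org"),
        URL("http://c.example.net/z", "c.example.net", "net")
      )
    )
    println(toStrings(row))
  }
}
```

With import spark.implicits._ in scope, the same function could presumably be plugged into the typed map from the answer, for example df.as[MyRow].map(row => FinalUrls.toStrings(row)), which should yield a Dataset whose second column is an array of strings. Since this is a stateless per-row operation, the same approach is expected to carry over to a streaming Dataset, though that is an assumption worth checking against your Spark version.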

huangapple
  • Posted on January 6, 2020 at 19:09:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59611014.html