Get a single column's values as a flat list in Apache Spark using Java

Question


I am new to Java and Apache Spark and I'm trying to figure out how to get the values of a single column from a Dataset in Spark as a flat list.

    Dataset<Row> sampleData = sparkSession.read()
                              .....
                              .option("query", "SELECT COLUMN1, column2 from table1")
                              .load();

    List<Row> columnsList = sampleData.select("COLUMN1")
        .where(sampleData.col("COLUMN1").isNotNull()).collectAsList();

    String result = StringUtils.join(columnsList, ", ");
    // The result I am getting is:
    // [15230321], [15306791], [15325784], [15323326], [15288338], [15322001], [15307950], [15298286], [15327223]
    // What I want is:
    // 15230321, 15306791, ...

How do I achieve this in Spark using Java?
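The brackets come from each Row's toString(), which wraps the row's fields in [...]; joining the collected List<Row> therefore joins those bracketed strings. A minimal stdlib-only sketch reproduces the behavior, with single-element lists standing in for Spark Row objects (an illustrative assumption, not Spark's API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class WhyBrackets {
    public static void main(String[] args) {
        // Stand-in for the collected List<Row>: each "row" prints as [value].
        List<List<String>> rows = List.of(List.of("15230321"), List.of("15306791"));

        // Joining the rows themselves keeps each row's bracketed toString():
        String joinedRows = rows.stream()
                .map(Object::toString)
                .collect(Collectors.joining(", "));
        System.out.println(joinedRows); // prints [15230321], [15306791]

        // Extracting the single field first gives the flat list:
        String flat = rows.stream()
                .map(row -> row.get(0))
                .collect(Collectors.joining(", "));
        System.out.println(flat); // prints 15230321, 15306791
    }
}
```

Extracting the field value before joining is the essence of both answers below.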

Answer 1

Score: 1


A Spark Row can be converted to a String with an Encoder:

    List<String> result = sampleData.select("COLUMN1").as(Encoders.STRING()).collectAsList();
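Once collectAsList() has produced a List<String>, no third-party helper is needed to build the comma-separated result; String.join from java.lang does it. A small stand-alone sketch (the list literal stands in for the collected values):

```java
import java.util.List;

public class JoinExample {
    public static void main(String[] args) {
        // Stand-in for the List<String> returned by collectAsList().
        List<String> columnValues = List.of("15230321", "15306791", "15325784");

        // String.join produces the flat comma-separated string directly,
        // without commons-lang's StringUtils.join.
        String result = String.join(", ", columnValues);
        System.out.println(result); // prints 15230321, 15306791, 15325784
    }
}
```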

Answer 2

Score: 1


I am pasting the answer in Scala; you can convert it into Java, as online tools are available.

Also, I am not creating the String result the way you specified, because that would require creating the table and running the query as in your process. Instead, I replicate the problem variable directly:

    import org.apache.spark.sql.Row
    val a = List(Row("123"), Row("222"), Row("333"))

Printing a gives:

    List([123], [222], [333])

So apply a simple map operation along with the mkString method to flatten the list:

    a.map(x => x.mkString(","))

which gives:

    List(123, 222, 333)

which I assume is your expectation.

Let me know if this sorts out your issue.
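For reference, the Scala map/mkString step above translates to Java streams. In this stand-alone sketch, each row is represented as a list of its field values, a stand-in for Spark's Row used purely for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

public class MkStringExample {
    public static void main(String[] args) {
        // Stand-in for List(Row("123"), Row("222"), Row("333")):
        // each inner list holds one row's field values.
        List<List<String>> rows = List.of(List.of("123"), List.of("222"), List.of("333"));

        // Equivalent of a.map(x => x.mkString(",")) in Java streams.
        List<String> flat = rows.stream()
                .map(row -> String.join(",", row))
                .collect(Collectors.toList());

        System.out.println(flat); // prints [123, 222, 333]
    }
}
```

With a real List<Row> from collectAsList(), the same shape applies, mapping each row with row.getString(0).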

huangapple
  • Posted on 2020-04-06 19:26:49
  • Please retain this link when reposting: https://go.coder-hub.com/61058707.html