How does Spark in Java filter the values in a list in a Dataset?

Question

I have two classes. NewsArticle has String id, String title, and List<ContentItem> contents; ContentItem has String content, String subtype, and String url.

I want to keep only the content whose subtype equals "paragraph" and splice it into one long string. (The url is not needed.)

The NewsArticle Dataset looks like this:

 1, "Title", [{htt..., paragraph, rem...},{htt..., paragraph, rem...},{htt..., paragraph, rem...}]

where the columns are id, title, List<ContentItem>.

I extracted the contents column; each row is one article and looks like this:

[{http..., others, con...},{http..., paragraph, rem...},{http..., paragraph, rem...}]

where each item is url, subtype, content.

Now I want each article (row) to look like:

1, "Title", "this is the content whose subtype equals paragraph"

Can anyone help me do this in Java?
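
To pin down the transformation being asked for, here is a minimal sketch on plain Java objects, independent of Spark. The class name ParagraphJoiner, the helper joinParagraphs, the single-space separator, and the sample values are all assumptions made for illustration; only the three ContentItem fields come from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParagraphJoiner {

    // Mirrors the ContentItem class from the question (fields only, for brevity).
    public static class ContentItem {
        final String content;
        final String subtype;
        final String url;

        public ContentItem(String content, String subtype, String url) {
            this.content = content;
            this.subtype = subtype;
            this.url = url;
        }
    }

    // Keeps only the items whose subtype is "paragraph" and joins their
    // content in list order. The " " separator is an assumption.
    public static String joinParagraphs(List<ContentItem> contents) {
        return contents.stream()
                .filter(item -> "paragraph".equals(item.subtype))
                .map(item -> item.content)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical sample data shaped like one article's contents list.
        List<ContentItem> contents = Arrays.asList(
                new ContentItem("caption text", "others", "http://example.com/a"),
                new ContentItem("first paragraph", "paragraph", "http://example.com/b"),
                new ContentItem("second paragraph", "paragraph", "http://example.com/c"));
        System.out.println(joinParagraphs(contents));
    }
}
```

Running main prints "first paragraph second paragraph". The same per-article logic is what a Spark solution has to express over the contents column.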

Answer 1

Score: 1

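This would work, assuming the array column is named items (use contents for the Dataset in the question). It explodes the array so that each ContentItem becomes its own row, keeps the rows whose subtype is paragraph, and selects the content field. Note that it returns one row per matching paragraph rather than a single concatenated string per article:

// requires: import org.apache.spark.sql.functions;
df
     // one output row per element of the items array
     .withColumn("newContent", functions.explode(functions.col("items")))
     // keep only the paragraph items
     .filter("newContent.subtype = 'paragraph'")
     .selectExpr("id", "title", "newContent.content as content")
     .show(false);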

Input:

+---+--------------------------------------------------------------------------------------------------------+-----+
|id |items                                                                                                   |title|
+---+--------------------------------------------------------------------------------------------------------+-----+
|id |[[Content1, subtype1, someurl], [ContentOfParagraph, paragraph, someurl], [Content2, subtype2, someurl]]|Title|
+---+--------------------------------------------------------------------------------------------------------+-----+

Output:

+---+-----+------------------+
|id |title|content           |
+---+-----+------------------+
|id |Title|ContentOfParagraph|
+---+-----+------------------+
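
If an article has several paragraph items and a single string per article is wanted, as the question asks, one possible variation is to filter and join inside the array instead of exploding it. The sketch below is not part of the original answer. It assumes Spark 2.4 or later (for the SQL higher-order functions filter and transform, and for array_join), the items column name from the sample input, and a single space as separator; the class and method names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.expr;

public class ParagraphConcat {

    // df is assumed to have columns id, title and items, where items is an
    // array of structs with fields content, subtype and url.
    static Dataset<Row> concatParagraphs(Dataset<Row> df) {
        return df.select(
                df.col("id"),
                df.col("title"),
                // 1. filter: keep struct elements whose subtype is 'paragraph'
                // 2. transform: project each remaining struct to its content
                // 3. array_join: concatenate the strings with a space
                expr("array_join("
                        + "transform("
                        + "filter(items, x -> x.subtype = 'paragraph'), "
                        + "x -> x.content), "
                        + "' ')").alias("content"));
    }
}
```

Because this is a plain select, every article stays in the result as exactly one row and the paragraphs keep their order within the array. The explode-and-filter version, by contrast, drops articles that have no paragraph items.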

huangapple
  • Posted on February 18, 2023, 23:55:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75494536.html