2023-02-18 23:55:46

How can I filter the values of a list column in a Spark Dataset using Java?

Question

I have two classes. NewsArticle has String id, String title, and List<ContentItem> contents. ContentItem has String content, String subtype, and String url.

I want to keep only the items whose subtype equals "paragraph" and concatenate their content into one long string (the url is not needed).

Here is what the NewsArticle Dataset looks like:

 1, "TiTle", [{htt..., paragraph, rem...},{htt..., paragraph, rem...},{htt..., paragraph, rem...}]

The columns are id, title, and List<ContentItem>.

I extracted the contents column, where each row is one article. It looks like this:

[{http..., others, con...},{http..., paragraph, rem...},{http..., paragraph, rem...}]

The fields are url, subtype, and content.

Now I want each article (row) to look like:

1, "Title", "this is content which subtype equals paragraph"

Can anyone help me do this in Java?

Answer 1

Score: 1

This would work:

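// Requires: import org.apache.spark.sql.functions;
df
     // One output row per element of the "items" array
     .withColumn("newContent", functions.explode(functions.col("items")))
     // Keep only the elements whose subtype is "paragraph"
     .filter("newContent.subtype == 'paragraph'")
     // Project the id, the title, and the paragraph text
     .selectExpr("id", "title", "newContent.content as content")
     .show(false);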

Input:

+---+--------------------------------------------------------------------------------------------------------+-----+
|id |items                                                                                                   |title|
+---+--------------------------------------------------------------------------------------------------------+-----+
|id |[[Content1, subtype1, someurl], [ContentOfParagraph, paragraph, someurl], [Content2, subtype2, someurl]]|Title|
+---+--------------------------------------------------------------------------------------------------------+-----+

Output:

+---+-----+------------------+
|id |title|content           |
+---+-----+------------------+
|id |Title|ContentOfParagraph|
+---+-----+------------------+
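
Note that the explode approach above yields one row per matching paragraph, whereas the question asks for a single concatenated string per article. The per-article logic can be sketched in plain Java as below. This is only an illustration under assumptions: the class name ParagraphJoiner, the helper joinParagraphs, and the single-space separator are all hypothetical and do not come from the question or the answer. ContentItem mirrors the fields listed in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class ParagraphJoiner {

    // Mirrors the ContentItem class described in the question.
    public static class ContentItem {
        final String content;
        final String subtype;
        final String url;

        public ContentItem(String content, String subtype, String url) {
            this.content = content;
            this.subtype = subtype;
            this.url = url;
        }
    }

    // Hypothetical helper: keeps the items whose subtype is "paragraph"
    // and joins their content with a single space (separator is an assumption).
    public static String joinParagraphs(List<ContentItem> contents) {
        if (contents == null) {
            return "";
        }
        return contents.stream()
                .filter(Objects::nonNull)
                .filter(item -> "paragraph".equals(item.subtype))
                .map(item -> item.content)
                .filter(Objects::nonNull)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<ContentItem> contents = Arrays.asList(
                new ContentItem("Content1", "subtype1", "someurl"),
                new ContentItem("First paragraph.", "paragraph", "someurl"),
                new ContentItem("Second paragraph.", "paragraph", "someurl"));
        // Prints: First paragraph. Second paragraph.
        System.out.println(joinParagraphs(contents));
    }
}
```

With a typed Dataset<NewsArticle>, a helper like this could be called per row inside a map function. Staying with the DataFrame API, the exploded rows could instead be regrouped by id and title and joined with concat_ws over collect_list, though the element order is then not guaranteed.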
