Pyspark Compare column strings, grouping if alphabetic character sets are same, but avoid similar words?

I'm working on a project where I have a pyspark dataframe of two columns (a word and its count) of types string and bigint respectively. The dataset is dirty in that some words have a non-letter character attached to them (e.g. 'date', '_date', '!date' and 'date,' are all separate items but should all be just 'date').

print(dirty_df.schema)
# StructType([StructField('count', LongType(), True), StructField('word', StringType(), True)])
dirty_df.show()
+------+------+
| count|  word|
+------+------+
|32375 |  date|
|359   | _date|
|306   | !date|
|213   | date,|
|209   |  snap|
|204   | ^snap|
|107   | +snap|
|12    | snap?|

I need to reduce the dataframe such that date, _date, !date, and date, all become just 'date', with their counts updated to match. The problem is that I need to avoid merging with similar words like 'dates', 'dating', 'dated', 'todate', etc.

Goal

+------+------+
| count|  word|
+------+------+
|33253 |  date|
|532   |  snap|
+------+------+

Any thoughts on how I could approach this?

Answer 1

Score: 2

You can use regexp_replace to remove any special characters.

from pyspark.sql import functions as F

df = (df.withColumn('word', F.regexp_replace('word', '[^a-zA-Z]', ''))
      .groupby('word')
      .agg(F.sum('count').alias('count')))

The regex [^a-zA-Z] matches any character other than a lower- or upper-case letter (the leading ^ acts as a not operator inside the character class).
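Because only non-letter characters are stripped, similar words such as 'dates' or 'dating' keep their own spelling and stay in their own groups, which addresses the concern about accidental merging. A minimal runnable sketch, assuming a SparkSession named spark and a small hypothetical sample that includes 'dates':

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample: 'dates' must survive as its own group.
df = spark.createDataFrame(
    [(32375, 'date'), (359, '_date'), (100, 'dates')],
    ['count', 'word'])

(df.withColumn('word', F.regexp_replace('word', '[^a-zA-Z]', ''))
   .groupby('word')
   .agg(F.sum('count').alias('count'))
   .show())
#+-----+-----+
#| word|count|
#+-----+-----+
#| date|32734|
#|dates|  100|
#+-----+-----+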

Answer 2

Score: 2

Use the regexp_replace function to replace all special characters ([^a-zA-Z] matches all characters other than letters).

Example:

# note: importing sum from pyspark.sql.functions shadows Python's built-in sum
from pyspark.sql.functions import col, expr, regexp_replace, sum

df = spark.createDataFrame([(32375,'date'),(359,'_date'),(306,'[date'),(213,'date]'),(209,'snap'),(204,'_snap'),(107,'[snap'),(12,'snap]')],['count','word'])
df.withColumn("word", regexp_replace(col("word"), "[^a-zA-Z]", "")).groupBy("word").agg(sum(col("count")).alias("count")).show(10, False)
#+----+-----+
#|word|count|
#+----+-----+
#|date|33253|
#|snap|532  |
#+----+-----+

Another way:

If you want to remove only specific characters, use the translate function instead. translate works character by character and does no regex matching, so translate(word, "(_|]|[)", "") deletes every occurrence of each listed character ( _ | ] [ ).

df.withColumn("word",expr('translate(word,"(_|]|[)","")')).groupBy("word").agg(sum(col("count")).alias("count")).show(10,False)

#+----+-----+
#|word|count|
#+----+-----+
#|date|33253|
#|snap|532  |
#+----+-----+
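Since translate substitutes, or with an empty replacement deletes, each listed character independently, the same cleanup can also be written with the DataFrame API; a minimal sketch assuming the df built above, with the character list trimmed to just the offenders in the sample data:

from pyspark.sql.functions import col, sum, translate

# Each character in "_[]" is deleted independently; translate does not
# treat its arguments as a regex.
df.withColumn("word", translate(col("word"), "_[]", "")) \
  .groupBy("word") \
  .agg(sum(col("count")).alias("count")) \
  .show(10, False)
#+----+-----+
#|word|count|
#+----+-----+
#|date|33253|
#|snap|532  |
#+----+-----+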
