Pyspark Compare column strings, grouping if alphabetic character sets are same, but avoid similar words?

huangapple go评论81阅读模式
英文:

Pyspark Compare column strings, grouping if alphabetic character sets are same, but avoid similar words?

问题

我正在处理一个项目,其中我有一个包含两列(字符串,字符串计数)的pyspark dataframe,分别是字符串和bigint。数据集存在一些问题,因为有些单词附带了非字母字符(例如,'date','_date','!date' 和 'date,' 都是不同的项,但应该只是 'date')。

我需要缩减这个数据框,以便 date,_date,!date 和 date, 都变成 'date',并更新它们的计数。问题是:我需要避免与类似单词(如 'dates','dating','dated','todate' 等)合并。

目标

+------+------+
| 计数 | 单词 |
+------+------+
| 33253 | date |
| 532 | snap |
有没有关于如何处理这个问题的想法?

英文:

I'm working on a project where I have a pyspark dataframe of two columns (string, string count) that are string and bigint respectively. The dataset is dirty such that some words have a non-letter character attached to them (ex. 'date', '_date', '!date' and 'date,' are all separate items but should be just 'date')

print(dirty_df.schema)
output---> StructType([StructField('count', LongType(), True), StructField('word', StringType(), True)])
dirty_df.show()
+------+------+
| count|  word|
+------+------+
|32375 |  date|
|359   | _date|
|306   | !date|
|213   | date,|
|209   |  snap|
|204   | ^snap|
|107   | +snap|
|12    | snap?|

I need to reduce the dataframe such that date, _date, !date, and date, are all just 'date' with their counts being updated to match. Problem is: I need to avoid joining on similar words like'dates', 'dating', 'dated', 'todate', etc.

Goal

+------+------+
| count|  word|
+------+------+
|33253 |  date|
|532   |  snap|

Any thoughts on how I could approach this?

答案1

得分: 2

You can use regex_replace to remove any special characters.

df = (df.withColumn('word', F.regexp_replace('word', '[^a-zA-Z]', ''))
      .groupby('word')
      .agg(F.sum('count').alias('count')))

[^a-zA-Z], 此正则表达式将匹配除小写和大写字母以外的任何字符。

英文:

You can use regex_replace to remove any special characters.

df = (df.withColumn('word', F.regexp_replace('word', '[^a-zA-Z]', ''))
      .groupby('word')
      .agg(F.sum('count').alias('count')))

[^a-zA-Z], this regex will match any characters other(^ not operator) than lower and upper case alphabets.

答案2

得分: 2

Use regexp_replace 函数来替换所有特殊字符([^a-zA-Z] 替换除字母之外的所有字符)。

示例:

df = spark.createDataFrame([(32375,'date'),(359,'_date'),(306,'[date'),(213,'date]'),(209,'snap'),(204,'_snap'),(107,'[snap'),(12,'snap]')],['count','word'])
df.withColumn("word",regexp_replace(col("word"),"[^a-zA-Z]","")).groupBy("word").agg(sum(col("count")).alias("count")).show(10,False)
#+----+-----+
#|word|count|
#+----+-----+
#|date|33253|
#|snap|532  |
#+----+-----+

另一种方式:

如果您只想替换特定字符,可以使用**translate** 函数。

df.withColumn("word",expr('translate(word,"(_|]|[)","")')).groupBy("word").agg(sum(col("count")).alias("count")).show(10,False)

#+----+-----+
#|word|count|
#+----+-----+
#|date|33253|
#|snap|532  |
#+----+-----+
英文:

Use regexp_replace function and replace all special characters([^a-zA-Z] replace all characters other than alphabets).

Example:

df = spark.createDataFrame([(32375,'date'),(359,'_date'),(306,'[date'),(213,'date]'),(209,'snap'),(204,'_snap'),(107,'[snap'),(12,'snap]')],['count','word'])
df.withColumn("word",regexp_replace(col("word"),"[^a-zA-Z]","")).groupBy("word").agg(sum(col("count")).alias("count")).show(10,False)
#+----+-----+
#|word|count|
#+----+-----+
#|date|33253|
#|snap|532  |
#+----+-----+

Other way:

If you want to replace only specific characters then use translate function

df.withColumn("word",expr('translate(word,"(_|]|[)","")')).groupBy("word").agg(sum(col("count")).alias("count")).show(10,False)

#+----+-----+
#|word|count|
#+----+-----+
#|date|33253|
#|snap|532  |
#+----+-----+

huangapple
  • 本文由 发表于 2023年4月10日 22:17:13
  • 转载请务必保留本文链接:https://go.coder-hub.com/75977889.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定