How to solve this PySpark code block using regexp

Question

I have this CSV file:

[screenshot of the CSV file]

but when I run my notebook, the regex step shows an error:

    from pyspark.sql.functions import regexp_replace

    path = "dbfs:/FileStore/df/test.csv"
    dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
    dff.show(truncate=False)
    #dffs_headers = dff.dtypes
    for i in dffs_headers:
        columnLabel = i[0]
        print(columnLabel)
        newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
        dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', '')).drop(newColumnLabel)
        if columnLabel != newColumnLabel:
            dff = dff.drop(columnLabel)
    dff.show(truncate=False)

As a result, I am getting this:

[screenshot of the result]

Can anyone improve this code? It would be a great help.

The expected output is:

|‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡|

But I am getting:

‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡

The second column shows a truncated value.
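
For reference, the intent of the "‡‡,‡‡" delimiter can be illustrated in plain Python, without Spark. This is only a sketch under stated assumptions: the sample row is a shortened version of the one in the question, and split_row is a hypothetical helper, not part of PySpark. Splitting on the full separator leaves one ‡‡ at the start of the first field and one at the end of the last field, which is what the regexp_replace call is meant to strip.

```python
# Plain-Python illustration of splitting a row on the multi-character delimiter.
# Assumption: the row below is a shortened sample based on the question's data.
DELIMITER = "‡‡,‡‡"


def split_row(line):
    """Split one raw line on the delimiter (hypothetical helper, not a PySpark API)."""
    return line.split(DELIMITER)


row = "‡‡123456‡‡,‡‡Version2‡‡,‡‡2010-12-16 00:01:48.020000000‡‡"
fields = split_row(row)

# The inner markers are consumed by the delimiter; only the outermost ones remain.
print(fields)
```

The first and last fields still carry a ‡‡ marker after the split, while the middle field is already clean.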

Answer 1

Score: 1

You need to import a function before you can use it. Running the line below in a cell before the regexp_replace call should fix this issue:

from pyspark.sql.functions import regexp_replace

Answer 2

Score: 0

This is the working answer:

    from pyspark.sql.functions import regexp_replace

    path = "dbfs:/FileStore/df/test.csv"
    dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)

    # dtypes is a list of (column name, type) pairs; it must be assigned before the loop
    dffs_headers = dff.dtypes
    for i in dffs_headers:
        columnLabel = i[0]
        # Remove the ‡‡ markers from the column name
        newColumnLabel = columnLabel.replace('‡‡', '')
        # Strip a leading or trailing ‡‡ from the values and store them under the clean name
        dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^‡‡|‡‡$', ''))
        # Drop the original column only if it was actually renamed
        if columnLabel != newColumnLabel:
            dff = dff.drop(columnLabel)
    dff.show(truncate=False)
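
The pattern used above can be checked outside Spark. The sketch below uses Python's re module as a stand-in for regexp_replace. That is an assumption: Spark evaluates Java regular expressions, but for a simple anchored pattern like this one the two should behave the same. strip_markers is a hypothetical helper name, and the backslashes from the original pattern are omitted because ‡ is not a regex metacharacter.

```python
import re

# Matches a ‡‡ marker at the very start or the very end of a string.
MARKER_PATTERN = re.compile(r"^‡‡|‡‡$")


def strip_markers(value):
    """Remove a leading and/or trailing ‡‡ marker (hypothetical helper)."""
    return MARKER_PATTERN.sub("", value)


# Header names and a value as they might look after splitting on "‡‡,‡‡".
samples = ["‡‡Id", "Version", "Date‡‡", "‡‡123456‡‡"]
cleaned = [strip_markers(s) for s in samples]

for raw, clean in zip(samples, cleaned):
    print(repr(raw), "->", repr(clean))
```

Only markers at the edges are removed, so a ‡‡ that appears in the middle of a value would be left untouched.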

huangapple
  • Posted on January 9, 2023 at 10:08:38
  • Please keep this link when reposting: https://go.coder-hub.com/75052606.html