英文:
How to solve this Pyspark Code Block using Regexp
问题
我有这个CSV文件
但是当我运行我的笔记本时,正则表达式显示一些错误
from pyspark.sql.functions import regexp_replace
path = "dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
dff.show(truncate=False)
for i in dffs_headers:
columnLabel = i[0]
print(columnLabel)
newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$','')).drop(newColumnLabel)
if columnLabel != newColumnLabel:
dff = dff.drop(columnLabel)
dff.show(truncate=False)
因此,我得到这个结果
可以有人改进这个代码吗,这将是一个很大的帮助。
预期输出是
|��123456��,��Version2��,��All questions have been answered accurately and the guidance in the questionnaire was understood and followed��,��2010-12-16 00:01:48.020000000��|
但我得到了
��Id��,��Version��,��Questionnaire��,��Date��
第二列显示了截断的值
英文:
I have this CSV file
but when I am running my notebook regex shows some error
from pyspark.sql.functions import regexp_replace
path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
dff.show(truncate=False)
#dffs_headers = dff.dtypes
for i in dffs_headers:
columnLabel = i[0]
print(columnLabel)
newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$','')).drop(newColumnLabel)
if columnLabel != newColumnLabel:
dff = dff.drop(columnLabel)
dff.show(truncate=False)
As and a result I am getting this
Can anyone improvise this code, it will be a great help.
Expected output is
|��123456��,��Version2��,��All questions have been answered accurately and the guidance in the questionnaire was understood and followed��,��2010-12-16 00:01:48.020000000��|
But I am getting
��Id��,��Version��,��Questionnaire��,��Date��
Second column is showing Truncated value
答案1
得分: 1
从 pyspark.sql.functions 导入 regexp_replace 库,然后在 regexp_replace 调用之前将下面的代码放入单元格中,应该可以解决这个问题。
英文:
You will need to import the libraries you want to use first, to use them. The below code in a cell before the regexp_replace call should fix this issue
from pyspark.sql.functions import regexp_replace
答案2
得分: 0
这是工作答案:
from pyspark.sql.functions import regexp_replace
path = "dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
#dffs_headers = dff.dtypes
for i in dffs_headers:
columnLabel = i[0]
newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))
if columnLabel != newColumnLabel:
dff = dff.drop(columnLabel)
dff.show(truncate=False)
英文:
This is working asnwer
from pyspark.sql.functions import regexp_replace
path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
#dffs_headers = dff.dtypes
for i in dffs_headers:
columnLabel = i[0]
newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))
if columnLabel != newColumnLabel:
dff = dff.drop(columnLabel)
dff.show(truncate=False)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论