January 9, 2023, 10:08:38

How to solve this Pyspark Code Block using Regexp

Question

I have this CSV file, but when I run my notebook, the regex step produces errors:

from pyspark.sql.functions import regexp_replace

path = "dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)

dff.show(truncate=False)
#dffs_headers = dff.dtypes

for i in dffs_headers:
  columnLabel = i[0]
  print(columnLabel)
  newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
  
  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$','')).drop(newColumnLabel)
  
  if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)
    dff.show(truncate=False)

As a result, I am getting this.

Can anyone improve this code? It would be a great help.

The expected output is:

|��123456��,��Version2��,��All questions have been answered accurately and the guidance in the questionnaire was understood and followed��,��2010-12-16 00:01:48.020000000��|

But I am getting:

��Id��,��Version��,��Questionnaire��,��Date��

The second column shows a truncated value.
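One possible reading of the output above, offered as an assumption that the question does not confirm: every ‡ appears as a single replacement character (�), which is what a decoder produces when a file saved in a single-byte encoding such as Windows-1252 is read as UTF-8. The sketch below reproduces that effect with Python's standard codecs only. The header text is taken from the question; the cp1252 guess is hypothetical.

```python
# Assumption: the CSV was saved as Windows-1252 (cp1252),
# where the character ‡ is stored as the single byte 0x87.
header = "‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡"

raw_bytes = header.encode("cp1252")

# 0x87 is not a valid UTF-8 start byte, so decoding as UTF-8 turns
# each of those bytes into U+FFFD (the replacement character).
as_utf8 = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were written in restores ‡‡.
as_cp1252 = raw_bytes.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

If this is what happened, the `‡‡,‡‡` delimiter would never match the decoded text, so each line would stay in a single column. Passing the file's real encoding to the reader's `encoding` option in place of 'UTF-8' would be the thing to try; whether that resolves the output here is untested.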

Answer 1

Score: 1

You need to import a function before you can use it. Running the import below in a cell before the regexp_replace call should fix this issue:

from pyspark.sql.functions import regexp_replace

Answer 2

Score: 0

This is the working answer:

from pyspark.sql.functions import regexp_replace

path = "dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)

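# dff.dtypes is a list of (column name, type) tuples
dffs_headers = dff.dtypes

for i in dffs_headers:
  columnLabel = i[0]
  # Column name with the ‡‡ markers removed
  newColumnLabel = columnLabel.replace('‡‡','')
  
  # Strip a leading or trailing ‡‡ from every value in the column
  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))
  
  # Drop the original column only when the cleaned name differs
  if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)
  dff.show(truncate=False)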
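To see what the pattern in this answer does to individual values, here is a minimal sketch that uses Python's `re` module instead of Spark. With `‡‡,‡‡` as the delimiter, only the first field of a row keeps a leading `‡‡` and only the last keeps a trailing one, which is why the pattern is anchored at both ends. `strip_markers` and the sample list are hypothetical names for illustration. Spark's `regexp_replace` uses Java regular expressions, which should treat this simple pattern the same way, but that is not verified here.

```python
import re

# Same idea as the answer's pattern, written without the backslashes
# (‡ is not a regex metacharacter): a ‡‡ at the very start of the
# string, or a ‡‡ at the very end.
MARKER_PATTERN = re.compile(r"^‡‡|‡‡$")


def strip_markers(value: str) -> str:
    """Remove a leading and/or trailing ‡‡ from value (hypothetical helper)."""
    return MARKER_PATTERN.sub("", value)


# First field, middle field, last field, and a value with ‡‡ in the middle.
samples = ["‡‡123456", "Version2", "2010-12-16 00:01:48.020000000‡‡", "a‡‡b"]
cleaned = [strip_markers(s) for s in samples]
print(cleaned)
```

A `‡‡` in the middle of a value is left alone, so only the markers at the edges of a row are removed.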
