get a string between each / within string

Question

I have a column Values that needs to be split after every backslash (\) occurrence. I need to fetch the word between each occurrence of \ and create a new column for it. How do I do this in PySpark (Databricks)? Any help is appreciated.

Answer 1

Score: 1


From your input, I assume this is your DataFrame:

+------+------+------+---------------------+
|FieldA|FieldB|FieldC|Values               |
+------+------+------+---------------------+
|1     |a     |hello |\abc\def\ghi\jk-l\mno|
|2     |b     |you   |\I\like\to\Code      |
|3     |b     |there |\Th-at\works         |
+------+------+------+---------------------+
  1. Replace all the unwanted characters in the Values column with |:

from pyspark.sql.functions import regexp_replace, split

# each backslash becomes |; the extra replaces cover values where the
# backslash was swallowed by an escape sequence such as \t or \n
df = df \
    .withColumn("Values", regexp_replace("Values", "\\\\", "|")) \
    .withColumn("Values", regexp_replace("Values", "\\a", "|a")) \
    .withColumn("Values", regexp_replace("Values", "\\t", "|t")) \
    .withColumn("Values", regexp_replace("Values", "\\s", "|s")) \
    .withColumn("Values", regexp_replace("Values", "\\n", "|n"))
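As a plain-Python illustration (using str.replace rather than Spark's Java regex) of why the extra replaces are needed: in a non-raw string literal, sequences like \a and \t are parsed as control characters, so no literal backslash survives for the first replace to find.

```python
# Plain-Python sketch (not PySpark): "\a" and "\t" in a non-raw string
# literal become the bell and tab control characters, so a backslash
# replace alone would miss them.
s = "\abc\tdef"
print("\\" in s)  # False: no literal backslash is present
print(s.replace("\a", "|a").replace("\t", "|t"))  # |abc|tdef
```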
  2. Now split the Values column on the delimiter |:

df = df.withColumn("Values", split("Values", r"\|"))
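The effect of the split can be checked outside Spark with plain Python (splitting directly on the backslash for brevity); note the empty first element produced by the leading delimiter, which the later loop filters out.

```python
# Plain-Python sketch of the split behaviour (not PySpark):
# the leading delimiter produces an empty first element.
raw = r"\abc\def\ghi\jk-l\mno"
print(raw.split("\\"))  # ['', 'abc', 'def', 'ghi', 'jk-l', 'mno']
```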
  3. Extract the records as a list of dictionaries:

records = df.rdd.map(lambda row: row.asDict()).collect()
  4. To insert records with a varying number of columns, the data has to be sent as key-value pairs:

output_records = []
for record in records:
    values = record["Values"]
    words = [value for value in values if len(value) > 0]

    for i, word in enumerate(words):
        column_name = f"col_{i+1}"
        if record.get(column_name) is None:
            record[column_name] = word

    del record["Values"]
    output_records.append(record)
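The loop above can be exercised on a single sample record without Spark (setdefault here plays the role of the get-then-assign check):

```python
# Plain-Python run of the record-flattening loop on one sample record.
record = {"FieldA": 2, "FieldB": "b", "FieldC": "you",
          "Values": ["", "I", "like", "to", "Code"]}
words = [v for v in record["Values"] if len(v) > 0]  # drop empty strings
for i, word in enumerate(words):
    record.setdefault(f"col_{i+1}", word)  # only set if not already present
del record["Values"]
print(record)
# {'FieldA': 2, 'FieldB': 'b', 'FieldC': 'you', 'col_1': 'I', 'col_2': 'like', 'col_3': 'to', 'col_4': 'Code'}
```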

This is what output_records looks like:

[
{'FieldA': 1, 'FieldB': 'a', 'FieldC': 'hello', 'col_1': 'abc', 'col_2': 'def', 'col_3': 'ghi', 'col_4': 'jk-l', 'col_5': 'mno'},
{'FieldA': 2, 'FieldB': 'b', 'FieldC': 'you', 'col_1': 'I', 'col_2': 'like', 'col_3': 'to', 'col_4': 'Code'},
{'FieldA': 3, 'FieldB': 'b', 'FieldC': 'there', 'col_1': 'Th-at', 'col_2': 'works'}
]
  5. Now create a Spark DataFrame from output_records:

spark.createDataFrame(output_records).show()

Output:

+------+------+------+-----+-----+-----+-----+-----+
|FieldA|FieldB|FieldC|col_1|col_2|col_3|col_4|col_5|
+------+------+------+-----+-----+-----+-----+-----+
|     1|     a| hello|  abc|  def|  ghi| jk-l|  mno|
|     2|     b|   you|    I| like|   to| Code| null|
|     3|     b| there|Th-at|works| null| null| null|
+------+------+------+-----+-----+-----+-----+-----+

Answer 2

Score: 0


Here are my 2 cents:

from pyspark.sql.functions import *  # brings col, split, array_remove into scope
import pyspark.sql.functions as F

data = [(1, 'a', 'hello', r'\abc\def\ghi\jk-l\mno'),
        (2, 'b', 'you', r'\I\like\to\Code'),
        (3, 'b', 'there', r'\Th-at\works')]

df = spark.createDataFrame(data, ['FieldA', 'FieldB', 'FieldC', 'Values'])

# Split the column, remove the blanks, create dynamic columns based on the array

df = df.withColumn('split_col', F.split(col('Values'), r'\\'))
df = df.withColumn('split_array', array_remove(df['split_col'], ''))
df = df.withColumn('cnt', F.size('split_array'))

# widest row determines the column count; avoid shadowing the built-in max()
max_cols = df.agg(F.max('cnt')).first()[0]

textcols = [F.col('split_array')[i].alias(f'col{i+1}') for i in range(max_cols)]

df.select([F.col('FieldA'), F.col('FieldB'), F.col('FieldC')] + textcols).show()
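The dynamic-column idea above (split, drop blanks, pad to the widest row) can be sketched in plain Python, without Spark:

```python
# Plain-Python sketch of the dynamic-column padding (not PySpark).
values = [r"\abc\def\ghi\jk-l\mno", r"\I\like\to\Code", r"\Th-at\works"]
arrays = [[p for p in v.split("\\") if p] for v in values]  # split, drop blanks
width = max(len(a) for a in arrays)                         # widest row -> column count
rows = [a + [None] * (width - len(a)) for a in arrays]      # pad short rows with None
print(rows[2])  # ['Th-at', 'works', None, None, None]
```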

Check the sample output (attached as an image in the original post).

huangapple
  • Posted on March 9, 2023 at 23:38:46
  • Please keep this link when reposting: https://go.coder-hub.com/75686848.html