Record delimiter in pyspark SparkContext

Question
I would like to change the newline record delimiter to "\u0001" in PySpark. How can I do that? When I do the following, it still uses the newline "\n" delimiter. Thanks!
from pyspark import SparkContext, SparkConf
# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')
conf.set("textinputformat.record.delimiter", "\u0002")
# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)
rdd = sc.textFile("MY_PATH")
Answer 1
Score: 1
I found the following answer that worked for me:
from pyspark import SparkContext, SparkConf

path = "<MY_PATH>"

# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

# TextInputFormat emits (byte offset, line) records, so the key class is
# LongWritable and the value class is Text
input_format_class = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
key_class = "org.apache.hadoop.io.LongWritable"
value_class = "org.apache.hadoop.io.Text"

# pass the record delimiter through the Hadoop job configuration, which is
# where TextInputFormat actually reads it (setting it on SparkConf has no
# effect on sc.textFile)
rconf = {"textinputformat.record.delimiter": "\u0002"}
rdd = sc.newAPIHadoopFile(path,
                          input_format_class,
                          key_class,
                          value_class,
                          conf=rconf)
Comments