Record delimiter in pyspark SparkContext

Question
I would like to change the newline record delimiter to "\u0001" in PySpark. How can I do that? When I do the following, it still uses the newline "\n" delimiter. Thanks!
from pyspark import SparkContext, SparkConf
# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')
conf.set("textinputformat.record.delimiter", "\u0002")
# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)
rdd = sc.textFile("MY_PATH")
Answer 1
Score: 1
I found the following answer that worked for me:
from pyspark import SparkContext, SparkConf

path = "<MY_PATH>"

# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

# TextInputFormat emits (byte offset, line) records, so the key class is
# LongWritable and the value class is Text
input_format_class = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
key_class = "org.apache.hadoop.io.LongWritable"
value_class = "org.apache.hadoop.io.Text"

# pass the record delimiter through the Hadoop job configuration, which is
# where TextInputFormat actually reads it (setting it on SparkConf has no
# effect on sc.textFile)
rconf = {"textinputformat.record.delimiter": "\u0002"}
rdd = sc.newAPIHadoopFile(path,
                          input_format_class,
                          key_class,
                          value_class,
                          conf=rconf)
Comments