Record delimiter in PySpark SparkContext

Question

I would like to change the record delimiter from the newline to "\u0002" in PySpark. How can I do that? With the following code, the file is still split on the newline "\n" delimiter. Thanks!

from pyspark import SparkContext, SparkConf

# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')
conf.set("textinputformat.record.delimiter", "\u0002")

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

rdd = sc.textFile("MY_PATH")

Answer 1

Score: 1

I found the following answer that worked for me:

from pyspark import SparkContext, SparkConf

path = "MY_PATH"  # placeholder: replace with the input file path

# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

# TextInputFormat yields (byte offset, record text) pairs
input_format_class = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
key_class = "org.apache.hadoop.io.LongWritable"
value_class = "org.apache.hadoop.io.Text"

# pass the delimiter in the Hadoop configuration used for this read
rconf = {"textinputformat.record.delimiter": "\u0002"}
rdd = sc.newAPIHadoopFile(path,
                          input_format_class,
                          key_class,
                          value_class,
                          conf=rconf)

# keep only the record text, dropping the byte-offset keys
rdd = rdd.map(lambda kv: kv[1])
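
To check that the delimiter is actually being honored, it can help to build a small input file and compare what Spark returns against a plain Python split. The sketch below is only an illustration under some assumptions: the helper names (write_sample, split_records, check_with_spark) and the sample records are made up for this example, and the Spark part assumes a local PySpark installation. It uses LongWritable as the key class and Text as the value class, since TextInputFormat emits (byte offset, record) pairs.

```python
import os
import tempfile

DELIMITER = "\u0002"  # same delimiter as in the answer above


def write_sample(path, records, delimiter=DELIMITER):
    """Write the records to path, separated by the custom delimiter."""
    # newline="" keeps embedded "\n" characters untouched on every platform
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(delimiter.join(records))


def split_records(text, delimiter=DELIMITER):
    """Pure-Python reference for the split the delimiter setting should give."""
    return text.split(delimiter)


def check_with_spark(records):
    """Hypothetical helper: round-trip the records through newAPIHadoopFile."""
    # imported here so the helpers above can be used without Spark installed
    from pyspark import SparkContext

    path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    write_sample(path, records)

    sc = SparkContext("local[*]", "delimiter-check")
    try:
        pairs = sc.newAPIHadoopFile(
            path,
            "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable",
            "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": DELIMITER},
        )
        # drop the byte-offset keys and compare the record text
        return pairs.map(lambda kv: kv[1]).collect() == records
    finally:
        sc.stop()
```

On a machine with PySpark installed, calling check_with_spark(["first\nrecord", "second record", "third"]) should return True if the custom delimiter is in effect. The first record deliberately contains a "\n", so a read that still splits on newlines would not match.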

huangapple
  • Posted on April 17, 2023, 02:45:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76029711.html