How can we load a non-delimited text file using Spark Scala and save it as a CSV file where column lengths are given dynamically?

Question

For example: if the schema has three columns (name, address, and age) and a line in the file is 92 characters long, where the first 50 characters are the name, the next 40 are the address, and the last 2 are the age, and these column lengths may vary and will be given dynamically, how do we read the file, make it delimited, and save it as a text file?

I could not figure out how to approach this at all.

Answer 1

Score: 1

Your question has two parts.

Read file

I assume you have a file named input.txt like this:

    1 John 100
    2 Jack 200
    3 Jonah300
    10JJ   400
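
Each line is exactly 10 characters: 2 for id, 5 for name, and 3 for value. Note the padding spaces in "10JJ   400", where the name "JJ" is padded out to 5 characters.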

I also assume these are the columns:

    // (column name, length)
    val columns = Vector(("id", 2), ("name", 5), ("value", 3))
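
Since the question says the column lengths are given dynamically, this vector could be built at runtime rather than hardcoded. A minimal sketch, assuming a hypothetical spec string such as "name:50,address:40,age:2" (the format and the helper name are assumptions, not from the question):

    // Hypothetical helper: parse "name:50,address:40,age:2" into Vector(("name", 50), ...)
    def parseColumnSpec(spec: String): Vector[(String, Int)] =
      spec.split(",").toVector.map { field =>
        val Array(name, length) = field.split(":")
        (name.trim, length.trim.toInt)
      }

    // e.g. val columns = parseColumnSpec(args(0))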

You should find the starting position of each column:

    case class ColumnInfo(name: String, length: Int, position: Int)

    // Running sum of lengths gives each column's 1-based start position
    val columnInfos = columns.tail.foldLeft(Vector(ColumnInfo(columns.head._1, columns.head._2, 1))) { (acc, current) =>
      acc :+ ColumnInfo(current._1, current._2, acc.last.position + acc.last.length)
    }

This will be the result:

    Vector(ColumnInfo(id,2,1), ColumnInfo(name,5,3), ColumnInfo(value,3,8))
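
As a design note, the fold is just a running sum of the lengths, so the same start positions could also be computed with scanLeft. A sketch of this equivalent alternative (not from the original answer):

    // Scan over the lengths starting at 1, dropping the final sum
    val starts = columns.scanLeft(1) { case (pos, (_, len)) => pos + len }.init
    val columnInfosAlt = columns.zip(starts).map { case ((name, len), pos) =>
      ColumnInfo(name, len, pos)
    }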

Now, you can read and parse this file using the code below. (spark.read.text loads each line as a single string column named "value", which is why substring is applied to col("value").)

    import org.apache.spark.sql.functions.{col, substring, trim}

    val sparkCols = columnInfos map { columnInfo =>
      trim(substring(col("value"), columnInfo.position, columnInfo.length)) as columnInfo.name
    }

    val df = spark.read
      .text("input.txt")
      .select(sparkCols: _*)

    df.show()

This will be the result:

    +---+-----+-----+
    | id| name|value|
    +---+-----+-----+
    |  1| John|  100|
    |  2| Jack|  200|
    |  3|Jonah|  300|
    | 10|   JJ|  400|
    +---+-----+-----+

Save file

You can save the file using this code:

    df.repartition(1).write.option("header", true).csv("output.csv")
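
Note that Spark writes output.csv as a directory containing part files, not a single file; repartition(1) moves all rows into one partition so that only one part file is produced.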

The resulting CSV will contain:

    id,name,value
    1,John,100
    2,Jack,200
    3,Jonah,300
    10,JJ,400
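
For reference, the steps above can be combined into one self-contained program. A minimal sketch, assuming a local SparkSession for testing (the session setup and object name are additions, not part of the original answer):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, substring, trim}

    object FixedWidthToCsv {

      case class ColumnInfo(name: String, length: Int, position: Int)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("fixed-width-to-csv")
          .master("local[*]") // assumption: local run for testing
          .getOrCreate()

        // Column names and lengths; in practice these would be supplied at runtime
        val columns = Vector(("id", 2), ("name", 5), ("value", 3))

        // Running sum of lengths gives each column's 1-based start position
        val columnInfos = columns.tail.foldLeft(
          Vector(ColumnInfo(columns.head._1, columns.head._2, 1))) { (acc, current) =>
          acc :+ ColumnInfo(current._1, current._2, acc.last.position + acc.last.length)
        }

        // Slice and trim each fixed-width field out of the single "value" column
        val sparkCols = columnInfos.map { ci =>
          trim(substring(col("value"), ci.position, ci.length)) as ci.name
        }

        val df = spark.read.text("input.txt").select(sparkCols: _*)
        df.repartition(1).write.option("header", true).csv("output.csv")

        spark.stop()
      }
    }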
