How can we load a non-delimited text file using Spark Scala and save it as a CSV file where column lengths are given dynamically?
Question
For example: if we have three columns in the schema, name, address, and age, and a line in the file has 92 characters where the first 50 are the name, the next 40 are the address, and the last 2 are the age, and these column lengths might vary and will be given dynamically, how do we read the file, make it delimited, and save it as a text file?
I could not get the idea at all.
Answer 1
Score: 1
Your question has two parts.
Read file
I assume you have a file named input.txt like this:
1 John 100
2 Jack 200
3 Jonah300
10JJ 400
I also assume these are the columns:
// Column Name, Length
val columns = Vector(("id", 2), ("name", 5), ("value", 3))
You should find the starting position of each column:
case class ColumnInfo(name: String, length: Int, position: Int)

// Walk the columns left to right, accumulating 1-based start positions:
// each column starts where the previous one ends.
val columnInfos = columns.tail.foldLeft(Vector(ColumnInfo(columns.head._1, columns.head._2, 1))) { (acc, current) =>
  acc :+ ColumnInfo(current._1, current._2, acc.last.position + acc.last.length)
}
This will be the result:
Vector(ColumnInfo(id,2,1), ColumnInfo(name,5,3), ColumnInfo(value,3,8))
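The same positions can also be computed with scanLeft, which carries the running offset for you. A minimal alternative sketch that could replace the foldLeft step above:

val columnInfos = columns.scanLeft(ColumnInfo("", 0, 1)) { (prev, cur) =>
  // Each column starts at the previous column's position plus its length.
  ColumnInfo(cur._1, cur._2, prev.position + prev.length)
}.tail // drop the dummy seed element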
Now, you can read and parse this file using this code:
import org.apache.spark.sql.functions.{col, substring, trim}

// spark.read.text exposes each line as a single string column named "value",
// and substring uses 1-based positions, so ColumnInfo.position works directly.
val sparkCols = columnInfos map { columnInfo =>
  trim(substring(col("value"), columnInfo.position, columnInfo.length)) as columnInfo.name
}

val df = spark.read
  .text("input.txt")
  .select(sparkCols: _*)

df.show()
This will be the result:
+---+-----+-----+
| id| name|value|
+---+-----+-----+
| 1| John| 100|
| 2| Jack| 200|
| 3|Jonah| 300|
| 10| JJ| 400|
+---+-----+-----+
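Every parsed column is a string at this point. If you need typed columns before writing, you can cast them; a minimal sketch, assuming "value" holds integers:

// Assumption: "value" is numeric; cast it to Int before further processing.
val typed = df.withColumn("value", col("value").cast("int"))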
Save file
You can save the file using this code:

// repartition(1) collapses the output to a single part file;
// csv() writes a directory named output.csv containing it.
df.repartition(1).write.option("header", true).csv("output.csv")
This will be the result:
id,name,value
1,John,100
2,Jack,200
3,Jonah,300
10,JJ,400
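Since the column lengths are given dynamically, the hardcoded columns vector can be built at runtime instead. A minimal sketch, assuming the schema arrives as a comma-separated spec such as "name:50,address:40,age:2" (a hypothetical format, not part of the original question):

// Hypothetical spec format: "name:50,address:40,age:2"
def parseSchema(spec: String): Vector[(String, Int)] =
  spec.split(",").toVector.map { field =>
    val Array(name, length) = field.split(":")
    (name.trim, length.trim.toInt)
  }

val columns = parseSchema("name:50,address:40,age:2")
// columns: Vector((name,50), (address,40), (age,2))

Everything downstream (columnInfos, sparkCols, and the write) stays unchanged.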