How can we load a non-delimited text file using Spark Scala and save it as a CSV file where column lengths are given dynamically?


Question


For example: if we have three columns in the schema, name, address, and age, and a line in the file has 92 characters, where the first 50 are the name, the next 40 are the address, and the last 2 characters are the age, and if these column lengths might vary and will be given dynamically, how do we read the file, make it delimited, and save it as a text file?

I could not get the idea of how to approach this at all.

Answer 1

Score: 1


Your question has two parts.

Read file

I assume you have a file named input.txt like this:

1 John 100
2 Jack 200
3 Jonah300
10JJ   400

I also assume that these are the columns:

// Column Name, Length
val columns = Vector(("id", 2), ("name", 5), ("value", 3))

You should find the starting position of each column:

case class ColumnInfo(name: String, length: Int, position: Int)

// Fold over the remaining columns: each column starts where the previous
// one ends (position + length), with the first column at position 1.
val columnInfos = columns.tail.foldLeft(Vector(ColumnInfo(columns.head._1, columns.head._2, 1))) { (acc, current) =>
  acc :+ ColumnInfo(current._1, current._2, acc.last.position + acc.last.length)
}

This will be the result:

Vector(ColumnInfo(id,2,1), ColumnInfo(name,5,3), ColumnInfo(value,3,8))
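
Equivalently, a short sketch that computes the same starting positions with scanLeft, as a running sum of the lengths:

// scanLeft(1) yields the running 1-based start offsets 1, 3, 8
// (plus a trailing 11 that zip drops, since zip stops at the shorter side).
val starts = columns.scanLeft(1) { (pos, c) => pos + c._2 }
val columnInfosAlt = columns.zip(starts).map { case ((name, len), pos) =>
  ColumnInfo(name, len, pos)
}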

Now, you can read and parse this file using this code:

import org.apache.spark.sql.functions.{col, substring, trim}

// spark.read.text loads each line into a single string column named "value".
// substring uses 1-based positions, so ColumnInfo.position maps directly onto it.
val sparkCols = columnInfos map { columnInfo =>
  trim(substring(col("value"), columnInfo.position, columnInfo.length)) as columnInfo.name
}

val df = spark.read
  .text("input.txt")
  .select(sparkCols: _*)

df.show()

This will be the result:

+---+-----+-----+
| id| name|value|
+---+-----+-----+
|  1| John|  100|
|  2| Jack|  200|
|  3|Jonah|  300|
| 10|   JJ|  400|
+---+-----+-----+
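
Since the column lengths are given dynamically, the columns vector itself can be built at runtime instead of being hard-coded. A minimal sketch, assuming a hypothetical "name:length" spec string (for example, passed in as an application argument):

// Hypothetical runtime spec; the "name:length,..." format is an assumption.
val spec = "name:50,address:40,age:2"

val dynamicColumns = spec.split(",").toVector.map { field =>
  val Array(name, len) = field.split(":")
  (name.trim, len.trim.toInt)
}
// dynamicColumns: Vector((name,50), (address,40), (age,2))

Everything else stays the same: feed dynamicColumns into the columnInfos computation above.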

Save file

You can save the file using this code:

// repartition(1) produces a single part file; note that "output.csv" will be
// a directory containing that part file, as Spark writes one file per partition.
df.repartition(1).write.option("header", true).csv("output.csv")

This will be the result:

id,name,value
1,John,100
2,Jack,200
3,Jonah,300
10,JJ,400
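
If you literally need a delimited text file rather than CSV, one sketch is to join the columns into a single string column and use the plain text writer (the "|" delimiter here is an arbitrary choice):

import org.apache.spark.sql.functions.concat_ws

// The text writer requires exactly one string column, so join all
// columns with the chosen delimiter first.
val delimited = df.select(concat_ws("|", df.columns.map(col): _*) as "value")
delimited.repartition(1).write.text("output_txt")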
