Process two very large CSV files at the same time: replace some columns in one file with columns from the other file


Question


I want to replace some of the columns in one CSV file with column values from another CSV file; the two files cannot fit in memory together. Language constraints: Java, Scala. No framework constraints.

One of the files is a key-value mapping, and the other file has a large number of columns. We have to replace the values in the large CSV file with the values from the key-value mapping file.

Answer 1

Score: 2


Assuming that all the key-value mappings fit in memory, load them into a map and then process the big file in a streaming fashion:

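import java.io.{File, PrintWriter}
import scala.io.Source

// Construct a simple key-value map from the small file
val kv_file = Source.fromFile("key_values.csv")
val kv: Map[String, String] =
  try kv_file.getLines().map { line =>
    val cols = line.split(";")
    cols(0) -> cols(1)
  }.toMap
  finally kv_file.close()

// "big_file.csv" is a placeholder name for the large input file
val big_file = Source.fromFile("big_file.csv")
val writer = new PrintWriter(new File("processed_big_file.csv"))

big_file.getLines().foreach { line =>
  // this is the key-value replace logic (as I understood):
  // any field that matches a key is replaced by the mapped value.
  // The limit -1 keeps trailing empty fields.
  val processed_cols = line.split(";", -1).map { s => kv.getOrElse(s, s) }

  // println rather than write, so every row keeps its line terminator
  writer.println(processed_cols.mkString(";"))
}

// close files
big_file.close()
writer.close()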

If the key-value mapping cannot be fully loaded, you could load the key-value file into memory one part at a time and still stream the big file for each part. Of course, you then have to iterate over the files several times until all the keys have been processed.
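
The multi-pass idea could be sketched roughly as follows. This is only one possible reading of it, not a tested production solution: `ChunkedReplace`, `replaceInChunks`, `chunkSize` and the `.tmp` file are hypothetical names, and the sketch assumes well-formed two-column mapping lines, unique keys, and values that contain neither `;` nor line breaks. Each pass reads the original big file and the previous pass's output in lockstep and looks up the original field, so a value written by an earlier pass is never replaced again by a later chunk.

```scala
import java.io.{File, PrintWriter}
import java.nio.file.{Files, Paths, StandardCopyOption}
import scala.io.Source

object ChunkedReplace {
  // Hypothetical helper: replaces fields of bigPath using the mapping in kvPath,
  // keeping at most chunkSize mappings in memory at any time.
  def replaceInChunks(kvPath: String, bigPath: String, outPath: String, chunkSize: Int): Unit = {
    val tmpPath = outPath + ".tmp"
    // Start with an output that is a plain copy of the input.
    Files.copy(Paths.get(bigPath), Paths.get(outPath), StandardCopyOption.REPLACE_EXISTING)

    val kvSource = Source.fromFile(kvPath)
    try {
      // grouped(chunkSize) consumes the mapping file lazily, one chunk at a time.
      kvSource.getLines().grouped(chunkSize).foreach { chunkLines =>
        val chunk: Map[String, String] = chunkLines.map { line =>
          val cols = line.split(";", -1)
          cols(0) -> cols(1)
        }.toMap

        val original = Source.fromFile(bigPath)
        val previous = Source.fromFile(outPath)
        val writer = new PrintWriter(new File(tmpPath))
        try {
          // Walk the original file and the previous pass's output in lockstep.
          original.getLines().zip(previous.getLines()).foreach { case (origLine, prevLine) =>
            val origCols = origLine.split(";", -1)
            val prevCols = prevLine.split(";", -1)
            // Look up the ORIGINAL field; otherwise keep what earlier passes produced.
            val outCols = origCols.zip(prevCols).map { case (o, p) => chunk.getOrElse(o, p) }
            writer.println(outCols.mkString(";"))
          }
        } finally {
          original.close()
          previous.close()
          writer.close()
        }
        // The finished pass becomes the input of the next one.
        Files.move(Paths.get(tmpPath), Paths.get(outPath), StandardCopyOption.REPLACE_EXISTING)
      }
    } finally kvSource.close()
  }
}
```

A call such as `ChunkedReplace.replaceInChunks("key_values.csv", "big_file.csv", "processed_big_file.csv", 1000000)` would then read the big file once per chunk of one million mappings. The trade-off is more disk I/O in exchange for bounded memory; if the passes become too many, sorting both files by key or using an on-disk key-value store may be worth considering instead.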

huangapple
  • Published on May 5, 2020 at 20:56:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61613713.html