Ranking specific elements within a data structure - is there a more efficient way?

Question

I am ranking certain groups of elements within a .csv file. My program works. However...

I am seeking advice on how to improve the efficiency of a program I have written. I do not seek a review of my code (Stackoverflow ref.), nor am I requesting that someone write code for me. All I am asking is: "Is there a more efficient way? And if so, what?"

I have a program that takes multiple .csv files, modifies them and adds extra data. These files are then saved. Below is a representation of the input data:

ISBN, Shop, Cost, ReviewScore,
9780008305796, A Bookshop, 11.99, 4.8,
9781787460966, A Bookshop, 6.99, 4.3,
9781787460966, Lots of books, 5.99, 4.4,
9781838770013, A Bookshop, 6.99, 3.8,
9780008305796, The bookseller, 13.99, 4.1,
9780008305796, Lots of books, 16.99, 4.3,

Note: each .csv file is normally thousands of lines long. An ISBN can appear 1 to 20 times. The .csv is not ordered by any column.

My program works as follows (pseudocode):

  1. load csv into String[][]
  2. iterate through String[][] to create a map with k = ISBN, v = number of occurrences of that ISBN
  3. iterate through String[][]
    3.1 get the ISBN's count from the map, then save each line that has that ISBN (stop when the count is reached)
    3.2 rank the price and reviews of the saved lines, and save the lines into another variable
    3.3 delete the key
    3.4 go back to 3 until there are no keys left
  4. save into .csv
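
As an illustration only, step 2 of the outline above could be sketched in Java roughly as follows. The class and method names (IsbnCountSketch, countIsbns) are hypothetical, and the sketch assumes the ISBN sits in column 0 and that the header row has already been dropped.

```java
import java.util.HashMap;
import java.util.Map;

class IsbnCountSketch {
    // Step 2: map each ISBN (assumed to be column 0) to its number of occurrences.
    static Map<String, Integer> countIsbns(String[][] rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the existing count
            counts.merge(row[0].trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"9780008305796", "A Bookshop", "11.99", "4.8"},
            {"9781787460966", "A Bookshop", "6.99", "4.3"},
            {"9781787460966", "Lots of books", "5.99", "4.4"},
            {"9780008305796", "The bookseller", "13.99", "4.1"},
        };
        // HashMap iteration order is unspecified, so look the keys up directly
        Map<String, Integer> counts = countIsbns(rows);
        System.out.println("9780008305796 -> " + counts.get("9780008305796"));
        System.out.println("9781787460966 -> " + counts.get("9781787460966"));
    }
}
```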

The data will now look like this:

ISBN, Shop, Cost, ReviewScore, CostRank, ReviewRank
9780008305796, A Bookshop, 11.99, 4.8, 1, 1
9781787460966, A Bookshop, 6.99, 4.3, 2, 2
9781787460966, Lots of books, 5.99, 4.4, 1, 1
9781838770013, A Bookshop, 6.99, 3.8, 1, 1
9780008305796, The bookseller, 13.99, 4.1, 2, 3
9780008305796, Lots of books, 16.99, 4.3, 3, 2

This program does not depend on the type of data structure the .csv is loaded into. It could be a List, List of Lists, Collection etc.
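
To make the two rank columns concrete, here is a minimal Java sketch of how the ranks for a single ISBN group appear to be derived from the output table: the cheapest cost gets CostRank 1 and the highest review score gets ReviewRank 1. The class name GroupRankSketch and the tie handling (equal values share a rank) are assumptions; the question does not say how ties should be treated.

```java
import java.util.Arrays;

class GroupRankSketch {
    // Rank of each value = 1 + the number of values in the group that are strictly better.
    // Equal values therefore share a rank (an assumption, not stated in the question).
    static int[] rank(double[] values, boolean lowerIsBetter) {
        int[] ranks = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int better = 0;
            for (double other : values) {
                if (lowerIsBetter ? other < values[i] : other > values[i]) {
                    better++;
                }
            }
            ranks[i] = better + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // The three rows for ISBN 9780008305796 as they appear in the output table
        double[] costs = {11.99, 13.99, 16.99};
        double[] reviews = {4.8, 4.1, 4.3};
        System.out.println(Arrays.toString(rank(costs, true)));    // [1, 2, 3]
        System.out.println(Arrays.toString(rank(reviews, false))); // [1, 3, 2]
    }
}
```

The nested loop is quadratic in the group size, which should be harmless here because a group holds at most about 20 rows.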

Answer 1

Score: 1

You /could/ do it in a single pass; the code would look something like this:

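  // Needs java.util.HashMap and java.util.Map. IsbnData and the two helper
  // methods are placeholders that you would define yourself.
  Map<String, IsbnData> dataStore = new HashMap<>();
  for (String[] row : rows) {
     IsbnData datum = dataStore.get(row[0]); // or whatever the index of ISBN is
     if (datum == null) {
        // first row seen for this ISBN
        datum = createIsbnDataFromRow(row);
     } else {
        // ISBN already seen: fold this row into the existing entry
        datum = updateDatumWithMoreData(datum, row);
     }

     dataStore.put(row[0], datum);
  }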

The main benefit of this is that instead of having to deal with String[] you have nicely structured classes, and the code is easier to read.

The code /may/ run faster, but that's probably irrelevant, since the program is much more likely to run out of memory before speed matters. (Don't confuse this with the program being slow: it may well be slow, but that is due to reading and parsing the CSV files. The speed gained by passing over the data fewer times after parsing is negligible.)
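
For readers who want something runnable, the single-pass grouping idea might be fleshed out along the following lines. This is only a sketch under stated assumptions: the Offer class stands in for the answer's undefined IsbnData, the column order is taken from the sample data, equal values share a rank, and the header row is assumed to have been removed already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SinglePassRanking {
    // Hypothetical row type; the fields mirror the CSV columns in the question.
    static final class Offer {
        final String isbn, shop;
        final double cost, review;
        int costRank, reviewRank;

        Offer(String[] row) {
            isbn = row[0].trim();
            shop = row[1].trim();
            cost = Double.parseDouble(row[2].trim());
            review = Double.parseDouble(row[3].trim());
        }
    }

    // One pass groups the rows by ISBN; each (small) group is then sorted twice to assign ranks.
    static List<Offer> rank(String[][] rows) {
        List<Offer> all = new ArrayList<>();
        Map<String, List<Offer>> byIsbn = new LinkedHashMap<>();
        for (String[] row : rows) {
            Offer o = new Offer(row);
            all.add(o);
            byIsbn.computeIfAbsent(o.isbn, k -> new ArrayList<>()).add(o);
        }
        for (List<Offer> group : byIsbn.values()) {
            // Cheapest first; an equal cost reuses the previous rank (assumed tie rule)
            group.sort(Comparator.comparingDouble((Offer o) -> o.cost));
            for (int i = 0; i < group.size(); i++) {
                Offer o = group.get(i);
                Offer prev = i > 0 ? group.get(i - 1) : null;
                o.costRank = (prev != null && prev.cost == o.cost) ? prev.costRank : i + 1;
            }
            // Best review first, same tie rule
            group.sort(Comparator.comparingDouble((Offer o) -> o.review).reversed());
            for (int i = 0; i < group.size(); i++) {
                Offer o = group.get(i);
                Offer prev = i > 0 ? group.get(i - 1) : null;
                o.reviewRank = (prev != null && prev.review == o.review) ? prev.reviewRank : i + 1;
            }
        }
        return all; // still in the original file order
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"9780008305796", "A Bookshop", "11.99", "4.8"},
            {"9781787460966", "A Bookshop", "6.99", "4.3"},
            {"9781787460966", "Lots of books", "5.99", "4.4"},
            {"9781838770013", "A Bookshop", "6.99", "3.8"},
            {"9780008305796", "The bookseller", "13.99", "4.1"},
            {"9780008305796", "Lots of books", "16.99", "4.3"},
        };
        for (Offer o : rank(rows)) {
            System.out.println(o.isbn + ", " + o.shop + ", " + o.cost + ", "
                    + o.review + ", " + o.costRank + ", " + o.reviewRank);
        }
    }
}
```

Grouping costs one pass over the rows, and each sort touches at most about 20 rows, so the total work grows roughly linearly with file size. As the answer notes, reading and parsing the files is still likely to dominate the run time.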

  • Posted by huangapple on July 24, 2020 at 22:35:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63075808.html