英文:
Best Practices for Performing Calculations on ArrayList With Million or More Elements
问题
我在进行一项面试测试,对于解决这个问题没有真正的方法。我希望您们能帮助我找出解决这种问题的最佳方法。
问题涉及一个最大容量为 Integer.MAX_VALUE 的 ArrayList。
ArrayList<User> arr = new ArrayList<User>(Integer.MAX_VALUE);
这只是假设,他们还提到了
arr.ensureCapacity(Integer.MAX_VALUE); // 没有问题
"User" 对象据说包含 Int 值 a 和 b。
问题是,计算每个 "User" 的 c 值的最佳方法,其中 c 是 a、b 值相乘的结果。
我的回答是将列表分成较小的列表,然后并行迭代所有较小的列表。在进行计算时,我将结果添加到一个结果列表中,其中包含 c 的值。类似这样:
List<User> firstNElementsList = list.stream().limit(n).collect(Collectors.toList());
我不知道什么是一个好的 N 值。我只是说 N 可以是任意的,比如 1000、10000 或 100000。这样的话,就会有 10 个要处理的列表。
我未能通过测试,所以这个答案是不够的。有更好的想法吗?
英文:
I was doing an interview test and had no real approach to solving the problem. I am hoping you guys can help me figure out what are the best methodologies for this sort of problem.
The question consisted of an ArrayList at max capacity, which is Integer.MAX_VALUE.
ArrayList<User> arr = new ArrayList<User>(Integer.MAX_VALUE);
This was hypothetical, they also mentioned
arr.ensureCapacity(Integer.MAX_VALUE); // shows no issues
The User
object was said to contain Int values a and b.
The question was, what is the best way of calculating the value of c for every "User" where c is the result of multiplying a,b values.
My answer was to break up the list into smaller lists, and then iterate through all the smaller lists in parallel. As I make calculations I then add the result to a results list with the value of c. Something like this,
List<User> firstNElementsList = list.stream().limit(n).collect(Collectors.toList());
I did not know what would be a good size for N. I just said N can be arbitrary such as 1000 or 10000 or 100000. The ladder would be 10 lists to process.
I failed the test, so this answer was not sufficient enough. Any better ideas?
答案1
得分: 1
使用并行流处理,并保持轻量级,使用IntStream
来将结果收集为int[]
:
int[] cs = arr.parallelStream().mapToInt(u -> u.getA() * u.getB()).toArray();
请注意,使用并行处理时,结果的顺序可能不会与输入的原始顺序一致,但这并没有被规定为要求,只是要"收集所有的a * b
";并没有要求您知道每个c
值来自哪个User
。
尽管没有明确规定,或者a
和b
的值是任意的,u.getA() * u.getB()
可能会导致算术溢出,因此更安全的方法是使用long
值来存储结果:
long[] cs = arr.parallelStream().mapToLong(u -> u.getA() * u.getB()).toArray();
作为面试官,我希望候选人在这一点上要求澄清,并且如果答案是"是",则提供第二个选项,并解释如果确保a * b
永远不会溢出int
,则第一个选项稍微轻量级的理由。
英文:
Use parallel stream processing, and to keep it lightweight, use an IntStream
to collect the result as int[]
:
int[] cs = arr.parallelStream().mapToInt(u -> u.getA() * u.getB()).toArray();
Note that when using parallel processing, the order of results may not align with the original order of the input, but this was not stated as a requirement, which was just to "collect all a * b
"; it did not say you had to know which User
each value of c
came from.
Although not stated, or arbitrary values of a
and b
, u.getA() * u.getB()
may result in arithmetic overflow, so the safer approach is to use long
values for the result:
long[] cs = arr.parallelStream().mapToLong(u -> u.getA() * u.getB()).toArray();
As an interviewer, I would hope the candidate would ask for clarification on this point and offer the second option if the answer was "yes", and justify the slightly lighter weight first option if there was a guaranteee that a * b
never overflows int
.
答案2
得分: 1
被解析的数据的大致大小为 2^31 * sizeof(User)。假设只有字段 int A 和 int B,则大约为 17.17GB。正如 Bohemian 指出的,因为整数乘法可能会导致 long 类型,输出数组 c 的大小也大约是相同的大小,即 2^31 * 8 字节 = 17.17GB。
一些可能有用的注释包括:
- 每个 User 可以独立处理。例如:数据可以分割成 1GB 的块,在多台机器上进行处理,然后进行聚合。同样,每台机器上的数据集可以并行处理。
- 计算 C 的操作相对较便宜。计算 C 的另一种方法是根据需要动态解析值(这也将节省 17GB 的空间)。或者,如果计算 C 的成本很高,可以在计算后对其进行缓存,但只在需要时计算。
问题似乎是在询问如何为每个用户最佳计算值 'c',而不仅仅是询问每个用户的计算 'c'。这可能表明问题是关于如何在大型数据集中计算派生字段的方法。因此,可以接受并值得考虑一种懒惰的方法(时间与空间的权衡)。
话虽如此,如果需要一次性预先计算 C,并且只有一台机器,使用并行流是一个合理的方法。在背后,这将在机器上的核心数上分布工作,并且非常适用于计算操作。
英文:
The rough size of the data to be parsed is 2^31 * sizeof(User). Assuming only fields int A and int B, this is roughly 17.17GB. As noted by Bohemian, because integer multiplication may result in a long, the size of the output array c is also roughly the same size 2^31 * 8 bytes = 17.17GB.
Some notes which may be useful include:
- Each User can be processed independently. Ex: the data can be split into 1GB chunks and processed on several machines then aggregated. Similarly, the data set on each machine can be processed in parallel.
- The operation to calculate C is relatively inexpensive. An alternative to calculating C is dynamically resolving the value as needed (which will also save 17GB of space). Alternatively, if C is expensive to calculate it can be cached after calculating but only calculated as needed.
The question seems to ask how to best calculate the value of 'c' for every user rather than just asking for the calculated 'c' for every user. This suggests maybe the question is asking how to approach calculating derived fields in a large data-set. As such, a lazy approach may be acceptable and worth considering (time vs space trade-off).
With that said, if C needs to be pre-calculated all at once and there is only 1 machine, using a parallel stream is a reasonable approach. Behind the scenes, this will distribute the work across the number of cores on the machine and is well suited to computation operations.
答案3
得分: 0
依我之见,你的回答并没有错,但存在一些问题,这些问题可能会导致面试官拒绝你的答案:
- 你说你会同时迭代列表的不同块,但你提供的是顺序代码,而不是并行代码:
List<User> firstNElementsList = list.stream().limit(n).collect(Collectors.toList());
。 - 在计算完
c
后,你想将结果添加到一个results
列表中。实际上,这会抵消你之前进行的并行操作的好处。原因是并行线程的执行顺序是不可预测的,因此这些线程会陷入竞争条件,试图在相同的时间内修改同一资源(即results
列表),从而可能使资源处于不一致的状态。这意味着为了实现你描述的内容,你需要在将c
保存到results
列表时同步这些并行线程。实际上,这使得整个过程再次变成了顺序操作,因为每个线程在修改results
列表之前都需要等待前一个线程完成。
依我之见,一个更快的方法是重用 user
对象,并在其中为 c
创建一个占位符,在计算后可以将其存储在其中。如果不允许这样做,可以创建一个包含未来值 c
占位符的对象的 results
列表。这样,在计算每个 c
后,你还可以(并行地)从 results
列表中检索相应的 c
占位符,并将其存储在那里。无需对此进行同步,因为你并未对 results
列表本身进行修改,而是对其中的各个对象进行修改。
英文:
In my opinion, your answer is not wrong, but has a couple of issues that might've led the interviewer to deny your answer:
- You said you would iterate the different chunks of the list in parallel yet you provide sequential code instead
List<User> firstNElementsList = list.stream().limit(n).collect(Collectors.toList());
. - After calculating
c
you wanted to add the result into aresults
list. This will actually negate the benefits of the parallelization you did earlier. The reason for this is that parallel thread's execution order is unpredictable so these threads end up in a race condition trying to modify the same resource (theresults
list) potentially at the same time putting the resource in an inconsistent state. This means that to achieve what you described, you need to synchronize these parallel threads at the time of savingc
into theresults
list. Effectively making the whole thing sequential again as each thread needs to wait for the previous thread to finish before modifying theresults
list.
In my opinion, a faster approach would be to reuse the the user
object and have a place holder for c
in it where you can store it after calculation. If that's not allowed, create the results
list of objects that contain a placeholder for the future value of c
. This way, after calculating each c
, you can also (in parallel) retrieve the respective placeholder of c
from the results
list and store it there. No need to synchronize this as you are not making modifications to the results
list itself but to the individual objects in it.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论