August 2, 2020 18:51:37

Fastest way to find files modified in last 'x' minutes

Question

I have a requirement to find files modified in the last 10 minutes in a directory. The directory keeps getting updated and it will have around 50k-60k files every time. I'm using the below code to get the files:

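import java.io.File
import java.time.Instant
import java.time.temporal.ChronoUnit.MINUTES

val dir = new File("/path/to/dir")
// Filter on the name first; this needs only the directory listing.
val files = dir.listFiles.toList.filter(f => f.getName.matches("some filter"))
// isFile, exists and lastModified each query the file system for every file.
files.filter(f => f.isFile && f.exists &&
    Instant.ofEpochMilli(f.lastModified).plus(10, MINUTES).isAfter(Instant.now))
    .toList.sortBy(_.lastModified)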

This takes around 20-30 minutes to run. But I want to get the results in less than 10 minutes.

I even tried running this in our hadoop cluster using spark. This is the spark code:

import org.apache.spark.{SparkConf, SparkContext}

val sparkConfig = new SparkConf()
    .setAppName("findRecentFiles")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.shuffle.compress", "true")
    .set("spark.rdd.compress", "true")
val sc = new SparkContext(sparkConfig)
// `files` is the name-filtered list from the first snippet.
val rdd = sc.parallelize(files)
rdd.filter(f => f.isFile && f.exists &&
    Instant.ofEpochMilli(f.lastModified).plus(10, MINUTES).isAfter(Instant.now))
    .collect.toList.sortBy(_.lastModified)

Still, it takes the same time. And one thing I noticed is that filtering based on the file name is fast. But adding the lastModified filter makes it slow. Is there any better way so that I can get the results faster?

UPDATE
I updated the spark configs and now I'm able to get the results in less than 10 minutes. Earlier, I was running the jar like this:

spark-submit myJar.jar

I changed it to this:

spark-submit --deploy-mode client --queue SomeNonDefaultQueue --executor-memory 16g --num-executors 10 --executor-cores 1 --master yarn myJar.jar

I also removed set("spark.rdd.compress", "true") from the code because it increases CPU time, as explained here: https://spark.apache.org/docs/2.3.0/configuration.html#compression-and-serialization

Answer 1

Score: 1

The problem is that the stat() call that fetches the last-modified time comes after a linear search through the directory to look up each name. If you can change the directory layout, add subdirectories (computed from the file name) and try to keep the number of entries in each subdirectory to about 1000.
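
One way to compute such a subdirectory from the file name is to hash the name into a fixed number of buckets. The following is only a sketch: the `Buckets` object, its method names, and the three-digit directory naming are assumptions made for illustration, not part of the original setup.

```scala
import java.nio.file.Path

// Hypothetical helper; names and layout are illustrative only.
object Buckets {
  // Number of subdirectories needed to keep roughly `perBucket` entries in each.
  def bucketCountFor(totalFiles: Int, perBucket: Int): Int =
    math.max(1, (totalFiles + perBucket - 1) / perBucket)

  // Map a file name to <root>/<bucket>/<fileName>, where the bucket is derived
  // from the name's hash, so the same name always lands in the same bucket.
  def bucketFor(root: Path, fileName: String, bucketCount: Int): Path = {
    val bucket = Math.floorMod(fileName.hashCode, bucketCount)
    root.resolve(f"$bucket%03d").resolve(fileName)
  }
}
```

With around 60k files and a target of 1000 entries per subdirectory, `Buckets.bucketCountFor(60000, 1000)` gives 60 buckets. Whatever writes the files would have to use the same mapping as the reader.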

Otherwise, create a map of name:lastModified and use the WatchService to update the map whenever an event is fired.
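
A minimal sketch of the map-plus-WatchService idea, under some assumptions: the `ModifiedIndex` class and its method names are hypothetical, the map records the time at which each event was received rather than the file's own lastModified value, and the map would still need to be seeded once from a full listing at startup.

```scala
import java.nio.file.{FileSystems, Path, WatchEvent}
import java.nio.file.StandardWatchEventKinds.{ENTRY_CREATE, ENTRY_DELETE, ENTRY_MODIFY}
import scala.collection.concurrent.TrieMap

// Hypothetical helper class; the name and methods are illustrative only.
class ModifiedIndex {
  // file name -> time (epoch millis) at which the last change event was seen
  private val times = TrieMap.empty[String, Long]

  // Apply one event to the map: deletes remove the entry, anything else updates it.
  def record(name: String, kind: WatchEvent.Kind[_], atMillis: Long): Unit =
    if (kind == ENTRY_DELETE) times.remove(name) else times.put(name, atMillis)

  // Names changed at or after the cutoff, oldest first. No stat() calls are made here.
  def modifiedSince(cutoffMillis: Long): List[String] =
    times.toList.filter(_._2 >= cutoffMillis).sortBy(_._2).map(_._1)

  // Blocks indefinitely, feeding events from `dir` into the map; run it on its own thread.
  def watch(dir: Path): Unit = {
    val watcher = FileSystems.getDefault.newWatchService()
    dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE)
    while (true) {
      val key = watcher.take()
      val events = key.pollEvents().iterator()
      while (events.hasNext) {
        val ev = events.next()
        ev.context() match {
          case rel: Path => record(rel.toString, ev.kind(), System.currentTimeMillis())
          case _ => // OVERFLOW: events were dropped, so a full rescan would be needed
        }
      }
      key.reset()
    }
  }
}
```

Once `watch` is running in a background thread, the "last 10 minutes" query becomes `index.modifiedSince(System.currentTimeMillis() - 10 * 60 * 1000)`, which only reads the in-memory map.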
