How to extract only the files from the last 2 days from tFTPFileList, based on modified time, without storing them in a tBufferOutput component (Talend job)


Question


At the moment I iterate through all 5,000 files in the folder, store them in a tBufferOutput, read them back with a tBufferInput, sort them by mtime (the modified time on the FTP site) in descending order, and extract only the top 10 files.

Because the job iterates through all 5,000 files at once, it is time consuming and causes unnecessary latency with the remote FTP site.

Is there a simpler way, without iterating, to get just the latest 10 files from the FTP site directly, sorted by mtime in descending order, and then perform operations on them?

My current Talend job flow is as described above; please advise on any other method that would improve the job's performance.

Basically, I don't want to iterate over every file on the FTP site. Instead, I want to get the top 10 directly from the remote FTP site (tFTPFileList), check them against the database, and download them later.

In short: without iterating, can I get just the latest 10 files using the modified timestamp in descending order alone?

Alternatively, I want to extract the files from the last 3 days from the remote FTP site.

The file names follow this format: A_B_C_D_E_20200926053617.csv

Approach B: with Java
I tried the following tJava code for flow B:

Date lastModifiedDate = TalendDate.parseDate("EEE MMM dd HH:mm:ss zzz yyyy", row2.mtime_string);
Date current_date = TalendDate.getCurrentDate();
System.out.println(lastModifiedDate);
System.out.println(current_date);
System.out.println(((String)globalMap.get("tFTPFileList_1_CURRENT_FILE")));
if (TalendDate.diffDate(current_date, lastModifiedDate, "dd") <= 1) {
    output_row.abs_path = input_row.abs_path;
    System.out.println(output_row.abs_path);
}

Now tlogrow3 prints NULL values throughout; please advise.

Answer 1

Score: 2


Define 3 String context variables, mask1, mask2 and mask3.

In a tJava, compute the file mask (with wildcards) for each of the 3 days, starting at the current date:

Date currentDate = TalendDate.getCurrentDate();
Date currentDateMinus1 = TalendDate.addDate(currentDate, -1, "dd");
Date currentDateMinus2 = TalendDate.addDate(currentDate, -2, "dd");

context.mask1 = "*" + TalendDate.formatDate("yyyyMMdd", currentDate) + "*.csv";
context.mask2 = "*" + TalendDate.formatDate("yyyyMMdd", currentDateMinus1) + "*.csv";
context.mask3 = "*" + TalendDate.formatDate("yyyyMMdd", currentDateMinus2) + "*.csv";
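
For anyone who wants to check the mask logic outside Talend, here is a minimal plain-Java sketch of the same idea. It uses java.time instead of the TalendDate routines, and the class name FileMasks and the method masksFor are made up for this illustration. It assumes, as in the question, that the date appears in the file name as yyyyMMdd.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Talend: builds the same kind of masks as the tJava code.
final class FileMasks {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // One wildcard mask per day, newest first: today, today-1, ..., today-(days-1).
    public static List<String> masksFor(LocalDate today, int days) {
        List<String> masks = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            masks.add("*" + today.minusDays(i).format(DAY) + "*.csv");
        }
        return masks;
    }

    public static void main(String[] args) {
        // A real job would pass LocalDate.now(); a fixed date keeps the example reproducible.
        for (String mask : masksFor(LocalDate.of(2020, 9, 26), 3)) {
            System.out.println(mask);
        }
    }
}
```

With the fixed date above, this prints the masks for 20200926, 20200925 and 20200924. Using a loop rather than three hand-written variables makes it easy to change the number of days, although in the job each mask still has to end up in its own context variable.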

Then, in the tFTPFileList, use the 3 context variables (context.mask1, context.mask2 and context.mask3) as the file masks, so that only the files from today and the 2 previous days are retrieved.
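
If only the latest 10 files are needed, as asked in the question, the names returned by the masked listing could then be ranked by the timestamp embedded in the file name. The sketch below is plain Java, not a Talend component. The class LatestFiles and its methods are hypothetical, and it assumes every relevant name ends in _yyyyMMddHHmmss.csv and that this timestamp agrees with the FTP modified time.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: ranks file names by the timestamp embedded in the name.
final class LatestFiles {
    // Assumed layout from the question: <anything>_yyyyMMddHHmmss.csv
    private static final Pattern STAMP = Pattern.compile("_(\\d{14})\\.csv$");

    // Returns the 14-digit timestamp, or null when the name does not match the layout.
    public static String stampOf(String name) {
        Matcher m = STAMP.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    // Keeps the n names with the newest embedded timestamps, newest first.
    // Fixed-width digit strings sort lexicographically in chronological order.
    public static List<String> latest(List<String> names, int n) {
        return names.stream()
                .filter(name -> stampOf(name) != null)
                .sorted(Comparator.comparing(LatestFiles::stampOf, Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listed = Arrays.asList(
                "A_B_C_D_E_20200924010101.csv",
                "A_B_C_D_E_20200926053617.csv",
                "A_B_C_D_E_20200925120000.csv");
        latest(listed, 2).forEach(System.out::println);
    }
}
```

This only reorders names that have already been listed, so it does not avoid the listing itself. The file masks are what keep that listing small.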

Posted on October 8, 2020, 16:30:34
Source: https://go.coder-hub.com/64258690.html
