How to extract only the files from the last 2 days from tFTPFileList, based on modified time, without storing them in a tBufferOutput component (Talend job)
Question
Right now I iterate through all 5k files in the folder, store them in a tBufferOutput, read them back with tBufferInput, sort them by mtime (the modified time on the FTP site) in descending order, and extract only the top 10 files.
Because the job iterates through all 5k files at once, it is time-consuming and causes unnecessary latency with the remote FTP site.
Is there a simpler way, without iterating, to get the latest 10 files directly from the FTP site, sorted by mtime in descending order, and then perform operations on them?
My Talend job flow looks like this at the moment. I would welcome any other approach that improves the job's performance.
Basically, I don't want to iterate over every file on the FTP site. Instead, I want to get the top 10 directly from the remote FTP through tFTPFileList, run checks against the database, and download them later.
In short: is there any way, without iterating, to get just the latest 10 files using only the modified timestamp in descending order?
OR
I want to extract the last 3 days' files from the remote FTP site.
The filename is in this format: A_B_C_D_E_20200926053617.csv
Approach B: with Java
I tried the tJava code below for flow B:
Date lastModifiedDate = TalendDate.parseDate("EEE MMM dd HH:mm:ss zzz yyyy", row2.mtime_string);
Date current_date = TalendDate.getCurrentDate();
System.out.println(lastModifiedDate);
System.out.println(current_date);
System.out.println(((String)globalMap.get("tFTPFileList_1_CURRENT_FILE")));
if (TalendDate.diffDate(current_date, lastModifiedDate, "dd") <= 1) {
    output_row.abs_path = input_row.abs_path;
    System.out.println(output_row.abs_path);
}
Now tlogrow3 prints NULL values for every row. Please advise.
Answer 1
Score: 2
Define 3 context variables (mask1, mask2 and mask3, as used in the code below).
In a tJava, compute the file mask (with wildcards) for each of the 3 days, starting at the current date:
Date currentDate = TalendDate.getCurrentDate();
Date currentDateMinus1 = TalendDate.addDate(currentDate, -1, "dd");
Date currentDateMinus2 = TalendDate.addDate(currentDate, -2, "dd");
context.mask1 = "*" + TalendDate.formatDate("yyyyMMdd", currentDate) + "*.csv";
context.mask2 = "*" + TalendDate.formatDate("yyyyMMdd", currentDateMinus1) + "*.csv";
context.mask3 = "*" + TalendDate.formatDate("yyyyMMdd", currentDateMinus2) + "*.csv";
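To check the masks outside Talend, the same computation can be sketched in plain Java. This is only a minimal sketch and not part of the original answer: the class DateMasks and the method buildMasks are hypothetical names, java.time stands in for the TalendDate routines, and the date is fixed to the one in the question's example filename so that the result is reproducible.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not a Talend routine).
class DateMasks {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Builds one glob mask per day, newest first:
    // today, today - 1, ..., today - (days - 1).
    public static List<String> buildMasks(LocalDate today, int days) {
        List<String> masks = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            masks.add("*" + today.minusDays(i).format(DAY) + "*.csv");
        }
        return masks;
    }

    public static void main(String[] args) {
        // Assumption: a fixed date instead of LocalDate.now(), to keep the run repeatable.
        for (String mask : buildMasks(LocalDate.of(2020, 9, 26), 3)) {
            System.out.println(mask);
        }
    }
}
```

Passing a different day count gives a wider or narrower window without adding more variables by hand, although each mask would still need its own context variable in the Talend job.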
Then, in the tFTPFileList, use the 3 context variables (context.mask1, context.mask2 and context.mask3) as the filemasks,
so that only the files from today and the 2 previous days are retrieved.
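For the "latest 10" part of the question, the timestamp embedded in the filename can be used once the masks have narrowed the list. As far as I know, standard FTP has no command to sort or limit a listing on the server, so this selection still happens on the client. The sketch below is an assumption-laden illustration rather than Talend code: LatestFiles, stamp and latest are hypothetical names, the glob matching uses java.nio rather than tFTPFileList's own matcher, and it assumes every matching name ends in a 14-digit yyyyMMddHHmmss stamp before ".csv", as in the example filename.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only (not a Talend routine).
class LatestFiles {
    // Returns the 14 characters before the extension (assumed to be yyyyMMddHHmmss),
    // or an empty string when the name is too short to contain them.
    public static String stamp(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 14 ? "" : name.substring(dot - 14, dot);
    }

    // Keeps the names that match at least one glob mask, newest stamp first, up to limit.
    public static List<String> latest(List<String> names, List<String> masks, int limit) {
        List<PathMatcher> matchers = masks.stream()
            .map(m -> FileSystems.getDefault().getPathMatcher("glob:" + m))
            .collect(Collectors.toList());
        return names.stream()
            .filter(n -> matchers.stream().anyMatch(pm -> pm.matches(Paths.get(n))))
            .sorted(Comparator.comparing(LatestFiles::stamp).reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "A_B_C_D_E_20200926053617.csv",
            "A_B_C_D_E_20200925120000.csv",
            "A_B_C_D_E_20200920120000.csv",
            "A_B_C_D_E_20200926101500.csv");
        List<String> masks = Arrays.asList("*20200926*.csv", "*20200925*.csv", "*20200924*.csv");
        latest(names, masks, 10).forEach(System.out::println);
    }
}
```

Sorting on the stamp as a string works because yyyyMMddHHmmss sorts lexicographically in the same order as chronologically. This relies on the filename timestamp rather than the server's mtime, so it only fits if the two agree closely enough for the job's purpose.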