In Hadoop HDFS, delete several files older than x days and with spaces in the name (Not like UNIX)
Question
I have hundreds of thousands of files in a Hadoop directory and I need to clean them up. I'm looking to delete files that are more than 3 months old, and I'm trying to delete the files in that directory that meet that condition in batches of a thousand, but I'm having problems. Among the multitude of files there are some that have a space in the name, like "hello word.csv". I've tried building the batches with arrays in unix and also by writing the output to a file, but one way or another the files are not recognized when doing hdfs dfs -rm -f.
With this, I find those files:
list_files=$(hdfs dfs -ls "${folder_in}" | awk '!/^d/ {print $0}' | awk -v days=${dias} '!/^d/ && $6 < strftime("%Y-%m-%d", systime() - days * 24 * 60 * 60) { print substr($0, index($0,$8)) }')
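(For reference, the substr($0, index($0, $8)) part of that awk is what preserves the full path even when it contains spaces; a tiny sketch with a made-up -ls style line:)
# Sketch (made-up hdfs dfs -ls style line): print everything from field 8 onward.
echo '-rw-r--r--   3 hdfs hdfs     1234 2024-01-15 10:30 /folder/hello word.csv' |
awk '{ print substr($0, index($0, $8)) }'
# prints: /folder/hello word.csv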
I wanted to delete the HDFS files in batches by loading an array in the shellscript as follows:
while IFS="" read -r file; do
    files+=("${file}")
    echo -e "\"$file\"" >> "${PATH_TMP}/file_del_proof.tmp"   # append with >>; with > only the last name would remain in the file
done <<< "$list_files"
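(Side note, as a minimal sketch with made-up names: once the names are in a bash array, expanding it as "${files[@]}" with the quotes keeps each name as a single argument, spaces included, whereas an unquoted expansion splits them.)
# Minimal sketch (hypothetical paths): quoted vs. unquoted array expansion.
files=("/folder/hello word.csv" "/folder/hello word2.csv")
printf 'arg: %s\n' "${files[@]}"    # 2 arguments, spaces preserved
printf 'arg: %s\n' ${files[@]}      # 4 arguments, split on the spaces
# so a whole batch could be removed in one call with:
# hdfs dfs -rm -f -skipTrash "${files[@]}"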
With the following script, I tried to delete the HDFS files:
total_lines=$(wc -l < "${PATH_TMP}/file_del_proof.tmp")
start_line=1
while [ $start_line -le $total_lines ]; do
    end_line=$((start_line + batch_size - 1))
    end_line=$((end_line > total_lines ? total_lines : end_line))
    hdfs dfs -rm -f -skipTrash $(awk -v end_line=${end_line} -v start_line=${start_line} 'NR >= start_line && NR <= end_line' "${PATH_TMP}/file_del_proof.tmp")
    start_line=$((end_line + 1))
done
The problem is that the list includes files with spaces in the name. I cannot find an automatic way to delete the files older than a certain age in HDFS, because some names contain spaces, and when deleting, if the files are called, for example, "hello word.csv", "hello word2.csv" and "hello word3.csv", only the "hello" part of each line is taken:
hdfs dfs -rm /folder/hello
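(For illustration only, with a hypothetical name: the splitting happens because the unquoted $(...) expansion is word-split by the shell before hdfs ever sees the path.)
# Illustration (hypothetical name): unquoted expansion is split on the space.
line='/folder/hello word.csv'
hdfs dfs -rm -f $line       # hdfs receives two paths: /folder/hello and word.csv
hdfs dfs -rm -f "$line"     # hdfs receives one path:  /folder/hello word.csv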
Someone gave me the idea, in order to delete the 3 oldest months, to first move the 3 most recent months to a temporary folder, delete everything left in the original folder, and then move everything back from the temporary folder. But even then, when I try to move the names with spaces, the mv fails and those files are not moved.
Does anyone have any suggestions? The idea I was given was to rename the files that have spaces, replacing the spaces with underscores (_) in HDFS, but I wanted to see if anyone knew of another option to delete them without that preprocessing of changing the names.
Answer 1
Score: 1
It feels like using hdfs dfs -stat instead of hdfs dfs -ls would be a better choice; example:
$ hdfs dfs -stat '%Y,%F,%n/' some/dir/*
1391807842598,regular file,hello world.csv/
1388041686026,directory,someDir/
1388041686026,directory,otherDir/
1391807875417,regular file,File2.txt/
1391807842724,regular file,File one, two, three!.txt/
Remark: I added a trailing / to the output format so that awk can use it as the record separator (it's a character that can't appear in a filename). Also, using , as the field separator makes it possible to accurately get the first two fields (the filename might have commas in it, but you can just strip the first two fields from the record).
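To make the parsing concrete, here is a small standalone sketch of how awk splits those /-terminated, ,-separated records (the sample records are made up, not real hdfs dfs -stat output):
# Standalone sketch with made-up records in the '%Y,%F,%n/' format.
printf '1391807842598,regular file,hello world.csv/1388041686026,directory,someDir/' |
awk '
    BEGIN { RS = "/"; FS = "," }
    $2 == "regular file" {
        mtime = $1                      # modification time in epoch milliseconds
        name  = $0
        sub(/^([^,]*,){2}/, "", name)   # drop the first two fields; commas inside the name survive
        printf("mtime=%s name=%s\n", mtime, name)
    }
'
# prints: mtime=1391807842598 name=hello world.csv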
Now, all that's left to do is to select the files whose "modification time" is smaller than time-of-day - N days, rebuild their full paths and output them as a NUL-delimited list for xargs -0 to process:
#!/bin/bash

folder_in=some/path
dias=30

printf '%s\0' "$folder_in"/* |
xargs -0 hdfs dfs -stat '%Y,%F,%n/' |
awk -v days="$dias" '
    BEGIN {
        RS = "/";
        basepath = ARGV[1];
        delete ARGV[1];
        srand();
        modtime = (srand() - days * 86400) * 1000
    }
    $2 == "regular file" && $1 < modtime {
        sub(/^([^,]*,){2}/,"");
        printf("%s%c", basepath "/" $0, 0)
    }
' "$folder_in" |
xargs -0 hdfs dfs -rm -f
notes:
- Because you're dealing with "hundreds of thousands of files", I'm using the bash builtin printf for expanding the * glob. FYI, any non-builtin command would fail with an Argument list too long error.
- As a consequence of using / as the record separator in awk, each $1 has a leading \n character; it doesn't matter because $1 is used as a number, so the newline is implicitly ignored. Additionally, the last record will be a single \n character, which is filtered out by the $2 == "regular file" condition.
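If the batches of a thousand mentioned in the question are still wanted, xargs can also take care of the batching; a minimal sketch of the final stage only, assuming the NUL-delimited list produced by the awk stage above has been redirected to an intermediate file first (the file name and the -skipTrash flag are taken from the question, not from the pipeline above):
# Sketch: delete in batches of at most 1000 names per hdfs invocation,
# reading the NUL-delimited list from a (hypothetical) intermediate file.
xargs -0 -n 1000 hdfs dfs -rm -f -skipTrash < "${PATH_TMP}/file_del_proof.nul"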