In Hadoop HDFS, delete several files older than x days and with spaces in the name (Not like UNIX)

Question

I have hundreds of thousands of files in a Hadoop directory and I need to clean them out. I want to delete the files that are more than 3 months old, and I am trying to delete them in batches of a thousand files matching that condition, but I am running into problems. Among the multitude of files there are some with spaces in the name, like "hello word.csv". I have tried building the batches with arrays in UNIX or by writing the output to a file, but one way or another the files are not recognized when running hdfs dfs -rm -f.

This is how I find those files:

list_files=$(hdfs dfs -ls "${folder_in}" | awk '!/^d/ {print $0}' | awk -v days=${dias} '!/^d/ && $6 < strftime("%Y-%m-%d", systime() - days * 24 * 60 * 60) { print substr($0, index($0,$8)) }')

I wanted to delete the HDFS files in batches by loading an array in the shell script as follows:

while IFS="" read -r file; do
    files+=("${file}")
    echo -e "\"$file\"" >> ${PATH_TMP}/file_del_proof.tmp
done <<< "$list_files"

With the following script, I tried to delete the HDFS files:

total_lines=$(wc -l < "${PATH_TMP}/file_del_proof.tmp")
start_line=1
while [ $start_line -le $total_lines ]; do
    end_line=$((start_line + batch_size - 1))
    end_line=$((end_line > total_lines ? total_lines : end_line))
    hdfs dfs -rm -f -skipTrash $(awk -v end_line=${end_line} -v start_line=${start_line} 'NR >= start_line && NR <= end_line' "${PATH_TMP}/file_del_proof.tmp")
    start_line=$((end_line + 1))
done

The problem is that some of the files in that list have spaces in their names. I cannot find an automatic way to delete the files older than a certain age in HDFS, because when the unquoted list is expanded it gets split on whitespace. For example, if the files are called "hello word.csv", "hello word2.csv" and "hello word3.csv", for one of the lines it effectively runs only:

hdfs dfs -rm /folder/hello
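
To illustrate (the /folder path is just the example from the command above), a whole batch expands to roughly the following, and HDFS then treats every whitespace-separated word as a separate path, so it is asked to delete /folder/hello, word.csv, word2.csv, and so on instead of the three intended files:

hdfs dfs -rm -f -skipTrash /folder/hello word.csv /folder/hello word2.csv /folder/hello word3.csv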

Someone gave me the idea to delete the oldest 3 months differently: first move the most recent 3 months to a temporary folder, delete everything left in the original folder, and then move the files back from the temporary folder. But the mv fails in the same way: the files with spaces in their names are not moved.

Does anyone have any suggestions? The other idea I was given was to rename the affected files in HDFS, replacing the spaces with underscores (_), but I wanted to see if anyone knew of another way to delete them without that renaming preprocessing step.

Answer 1

Score: 1

It feels like using hdfs dfs -stat instead of hdfs dfs -ls would be a better choice; example:

$ hdfs dfs -stat '%Y,%F,%n/' some/dir/*
1391807842598,regular file,hello world.csv/
1388041686026,directory,someDir/
1388041686026,directory,otherDir/
1391807875417,regular file,File2.txt/
1391807842724,regular file,File one, two, three!.txt/

Remark: I added a trailing / to the output format so that awk can use it as the record separator (it is a character that cannot appear in a filename). Also, using , as the field separator makes it possible to get the first two fields accurately (the filename itself may contain commas, but you can simply strip the first two fields from the record).
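
For example, here is a quick stand-alone check of that stripping logic on one of the sample records above; it is only an illustration, not part of the pipeline below:

printf '1391807842724,regular file,File one, two, three!.txt/' |
awk 'BEGIN { RS = "/" } { sub(/^([^,]*,){2}/, ""); print }'
# prints: File one, two, three!.txt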

Now, all that is left to do is to select the files whose "modification time" is smaller than the time of day minus N days, rebuild their full paths, and output them as a NUL-delimited list for xargs -0 to process:

#!/bin/bash

folder_in=some/path
dias=30

printf '%s\0' "$folder_in"/* |
xargs -0 hdfs dfs -stat '%Y,%F,%n/' |
awk -v days="$dias" '
    BEGIN {
        RS = "/";
        basepath = ARGV[1];
        delete ARGV[1];
        srand();
        modtime = (srand() - days * 86400) * 1000
    }
    $2 == "regular file" && $1 < modtime {
        sub(/^([^,]*,){2}/,"");
        printf("%s%c", basepath "/" $0, 0)
    }
' "$folder_in" |
xargs -0 hdfs dfs -rm -f

Notes:

  • Because you're dealing with "hundreds of thousands of files", I'm using the bash builtin printf for expanding the * glob. FYI, any non-builtin command would fail with an "Argument list too long" error.

  • As a consequence of using / as the record separator in awk, each $1 has a leading \n character; it doesn't matter because $1 is used as a number, so the newline is implicitly ignored. Additionally, the last record will be a single \n character, which is filtered out by the $2 == "regular file" condition.
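
If you still want to cap each hdfs dfs -rm invocation at roughly a thousand paths, as in your original batching loop, a small tweak (my addition, assuming GNU xargs) is to let xargs do the batching with -n; even without it, xargs already splits the arguments into several invocations whenever the system's argument-length limit would be exceeded. The last line of the script would then become:

xargs -0 -n 1000 hdfs dfs -rm -f

(and you can append -skipTrash, as in your original command, if you want to bypass the trash).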
