In Hadoop HDFS, delete several files older than x days and with spaces in the name (Not like UNIX)
Question
I have hundreds of thousands of files in a Hadoop directory and I need to clean them up. I'm looking to delete files that are more than 3 months old, and I'm trying to delete the files in that directory that meet that condition in batches of a thousand, but I'm having problems. Among the multitude of files there are some that have a space in the name, like "hello word.csv". I've tried building the batches with arrays in unix and also by writing the output to a file, but one way or another the files are not recognized when doing hdfs dfs -rm -f.
With this, I find those files:
list_files=$(hdfs dfs -ls "${folder_in}" | awk '!/^d/ {print $0}' | awk -v days=${dias} '!/^d/ && $6 < strftime("%Y-%m-%d", systime() - days * 24 * 60 * 60) { print substr($0, index($0,$8)) }')
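(For reference, the substr($0, index($0, $8)) part of that awk is what preserves the full path even when it contains spaces; a tiny sketch with a made-up -ls style line:)
# Sketch (made-up hdfs dfs -ls style line): print everything from field 8 onward.
echo '-rw-r--r--   3 hdfs hdfs     1234 2024-01-15 10:30 /folder/hello word.csv' |
awk '{ print substr($0, index($0, $8)) }'
# prints: /folder/hello word.csv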
I wanted to delete the HDFS files in batches by loading an array in the shellscript as follows:
while IFS="" read -r file; do
    files+=("${file}")
    echo -e "\"$file\"" >> "${PATH_TMP}/file_del_proof.tmp"   # append with >>; with > only the last name would remain in the file
done <<< "$list_files"
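(Side note, as a minimal sketch with made-up names: once the names are in a bash array, expanding it as "${files[@]}" with the quotes keeps each name as a single argument, spaces included, whereas an unquoted expansion splits them.)
# Minimal sketch (hypothetical paths): quoted vs. unquoted array expansion.
files=("/folder/hello word.csv" "/folder/hello word2.csv")
printf 'arg: %s\n' "${files[@]}"    # 2 arguments, spaces preserved
printf 'arg: %s\n' ${files[@]}      # 4 arguments, split on the spaces
# so a whole batch could be removed in one call with:
# hdfs dfs -rm -f -skipTrash "${files[@]}"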
With the following script, I tried to delete the HDFS files:
total_lines=$(wc -l < "${PATH_TMP}/file_del_proof.tmp")
start_line=1
while [ $start_line -le $total_lines ]; do
    end_line=$((start_line + batch_size - 1))
    end_line=$((end_line > total_lines ? total_lines : end_line))
    hdfs dfs -rm -f -skipTrash $(awk -v end_line=${end_line} -v start_line=${start_line} 'NR >= start_line && NR <= end_line' "${PATH_TMP}/file_del_proof.tmp")
    start_line=$((end_line + 1))
done
The problem is that the list includes files with spaces in the name. I cannot find an automatic way to delete the files older than a certain age in HDFS, because some names contain spaces, and when deleting, if the files are called, for example, "hello word.csv", "hello word2.csv" and "hello word3.csv", only the "hello" part of each line is taken:
hdfs dfs -rm /folder/hello
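(For illustration only, with a hypothetical name: the splitting happens because the unquoted $(...) expansion is word-split by the shell before hdfs ever sees the path.)
# Illustration (hypothetical name): unquoted expansion is split on the space.
line='/folder/hello word.csv'
hdfs dfs -rm -f $line       # hdfs receives two paths: /folder/hello and word.csv
hdfs dfs -rm -f "$line"     # hdfs receives one path:  /folder/hello word.csv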
Someone gave me the idea, in order to delete the 3 oldest months, to first move the 3 most recent months to a temporary folder, delete everything left in the original folder, and then move everything back from the temporary folder. But even then, when I try to move the names with spaces, the mv fails and those files are not moved.
Does anyone have any suggestions? The idea I was given was to rename the files that have spaces, replacing the spaces with underscores (_) in HDFS, but I wanted to see if anyone knew of another option to delete them without that preprocessing of changing the names.
Answer 1
Score: 1
It feels like using hdfs dfs -stat instead of hdfs dfs -ls would be a better choice; example:
$ hdfs dfs -stat '%Y,%F,%n/' some/dir/*
1391807842598,regular file,hello world.csv/
1388041686026,directory,someDir/
1388041686026,directory,otherDir/
1391807875417,regular file,File2.txt/
1391807842724,regular file,File one, two, three!.txt/
Remark: I added a trailing / to the output format so that awk can use it as the record separator (it's a character that can't appear in a filename). Also, using , as the field separator makes it possible to accurately get the first two fields (the filename might have commas in it, but you can just strip the first two fields from the record).
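To make the parsing concrete, here is a small standalone sketch of how awk splits those /-terminated, ,-separated records (the sample records are made up, not real hdfs dfs -stat output):
# Standalone sketch with made-up records in the '%Y,%F,%n/' format.
printf '1391807842598,regular file,hello world.csv/1388041686026,directory,someDir/' |
awk '
    BEGIN { RS = "/"; FS = "," }
    $2 == "regular file" {
        mtime = $1                      # modification time in epoch milliseconds
        name  = $0
        sub(/^([^,]*,){2}/, "", name)   # drop the first two fields; commas inside the name survive
        printf("mtime=%s name=%s\n", mtime, name)
    }
'
# prints: mtime=1391807842598 name=hello world.csv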
Now, all that's left to do is to select the files whose "modification time" is smaller than time-of-day - N days, rebuild their full paths and output them as a NUL-delimited list for xargs -0 to process:
#!/bin/bash

folder_in=some/path
dias=30

printf '%s\0' "$folder_in"/* |
xargs -0 hdfs dfs -stat '%Y,%F,%n/' |
awk -v days="$dias" '
    BEGIN {
        RS = "/";
        basepath = ARGV[1];
        delete ARGV[1];
        srand();
        modtime = (srand() - days * 86400) * 1000
    }
    $2 == "regular file" && $1 < modtime {
        sub(/^([^,]*,){2}/,"");
        printf("%s%c", basepath "/" $0, 0)
    }
' "$folder_in" |
xargs -0 hdfs dfs -rm -f
notes:
- Because you're dealing with "hundreds of thousands of files", I'm using the bash builtin printf for expanding the * glob. FYI, any non-builtin command would fail with an Argument list too long error.
- As a consequence of using / as the record separator in awk, each $1 has a leading \n character; it doesn't matter because $1 is used as a number, so the newline is implicitly ignored. Additionally, the last record will be a single \n character, which is filtered out by the $2 == "regular file" condition.
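If the batches of a thousand mentioned in the question are still wanted, xargs can also take care of the batching; a minimal sketch of the final stage only, assuming the NUL-delimited list produced by the awk stage above has been redirected to an intermediate file first (the file name and the -skipTrash flag are taken from the question, not from the pipeline above):
# Sketch: delete in batches of at most 1000 names per hdfs invocation,
# reading the NUL-delimited list from a (hypothetical) intermediate file.
xargs -0 -n 1000 hdfs dfs -rm -f -skipTrash < "${PATH_TMP}/file_del_proof.nul"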