How to use git diff --name-only with non-ASCII file names
Question
I have a pre-commit hook that runs
files=`git diff --cached --name-only --diff-filter=ACMR | grep -E "$extension_regex"`
and performs some formatting on those files before committing.
However, some of my files have non-ASCII characters in their names, and I realized those files weren't being formatted.
After some debugging, I found that this was because git diff output those file names with escaped characters and surrounded by double quotes, for example:
"\341\203\236\341\203\220\341\203\240\341\203\220\341\203\233\341\203\224\341\203\242\341\203\240\341\203\224\341\203\221\341\203\230.ext"
I tried to modify the regex pattern to accept names surrounded with quotes, and even tried removing those quotes, but anywhere I try to access the file it can't be found, for example:
$ cat $file
cat: '"136130130130133134132130134131130.ext"': No such file or directory
$ file="${file:1:${#file}-2}"
$ cat $file
cat: '136130130130133134132130134131130.ext': No such file or directory
How do I handle files with non-ASCII characters in their names?
Answer 1
Score: 2
To deal with non-ASCII characters in paths, you can use the -z option to get NUL termination instead of the C string literal quoting.
files=$(
git diff -z --cached --name-only --diff-filter=ACMR \
| grep -Ez "$extension_regex" \
| tr \\0 \\n
)
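Once the names arrive unquoted, the loop that consumes them also has to keep each path intact, which the unquoted `cat $file` in the question does not. A minimal sketch, assuming a hypothetical helper named `for_each_path` (not part of Git) and, like the snippet above, assuming no file name contains a newline:

```shell
#!/bin/sh
# for_each_path is a hypothetical helper name used only for illustration.
# It reads NUL-terminated paths on stdin, converts them to one path per line,
# and runs the given command once per path, passing the path as the last argument.
for_each_path() {
  tr '\0' '\n' | while IFS= read -r path; do
    # Skip empty records (for example, when the input is empty).
    [ -n "$path" ] || continue
    # Quote "$path" so spaces and non-ASCII bytes reach the command unchanged;
    # redirect stdin so the command cannot swallow the remaining path list.
    "$@" "$path" </dev/null
  done
}

# Possible use in a pre-commit hook (my_formatter is a placeholder command):
#   git diff -z --cached --name-only --diff-filter=ACMR \
#     | grep -Ez "$extension_regex" \
#     | for_each_path my_formatter
```

`IFS=` and `read -r` keep leading whitespace and backslashes in the name as they are, so the path handed to the command is byte-for-byte what Git reported.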
UTF-8 is still not completely universal and may never be; filesystems are so disparate that anything beyond ASCII is not entirely portable. Git plays it annoyingly safe by defaulting to C string literal conventions for anything that won't round-trip in ASCII, but that choice does give it a safe round trip, which basically nothing else has (at least not yet), so there's that.
If you're not worried about completely unconstrained file names, in particular if you don't need to handle file names containing their own \n's (newlines), you can hike the tr up a step and remove the -z option on the grep, or drop the -z option entirely and turn core.quotepath off. From the command line:
git -c core.quotepath=false diff --name-only | grep etc
or in the configs.
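As a sketch of the core.quotepath route applied to the hook's original command, the per-invocation override can be wrapped in a small function. The name `staged_paths` is hypothetical and chosen only for illustration:

```shell
#!/bin/sh
# staged_paths is a hypothetical helper name, not part of Git.
# Prints staged (added/copied/modified/renamed) paths one per line, with
# non-ASCII bytes left unescaped because core.quotepath is off for this call.
staged_paths() {
  git -c core.quotepath=false diff --cached --name-only --diff-filter=ACMR
}

# To make the setting persistent for one repository instead:
#   git config core.quotepath false
```

Note that turning core.quotepath off only stops the escaping of non-ASCII bytes; names containing double quotes, backslashes, or control characters are still quoted, which is why the -z form remains the safer choice for unconstrained names.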