How to sort os.walk(path) in alphanumeric order, with duplicates coming after the original file, using Python 3?

Question

In Python 3 (specifically 3.10.6), how can you change the order in which os.walk(path) lists the files it finds? Given this list of files:

  1. IMG0001.jpg
  2. IMG0002.jpg
  3. IMG0002(1).jpg
  4. IMG0002(2).jpg
  5. IMG0003.jpg

How would you get them in that order, with each (n) duplicate coming right after its original file? Currently, os.walk(path) returns the list in this order:

  1. IMG0001.jpg
  2. IMG0002(1).jpg
  3. IMG0002(2).jpg
  4. IMG0002.jpg
  5. IMG0003.jpg

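I suppose the main issue is that the default sort ranks ( (and also -) ahead of the . in the extension. If that is what is happening here, how would you change which special characters come before others?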

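I've tried sorted(files), but that gives the same order os.walk(path) already produces. With sorted(files, reverse=True), the originals do come before their duplicates, but multiple duplicates are now in reverse order and so are all the originals, i.e.: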

  1. IMG0003.jpg
  2. IMG0002.jpg
  3. IMG0002(2).jpg
  4. IMG0002(1).jpg
  5. IMG0001.jpg
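
The order reported in the question matches plain code-point comparison: ( is U+0028 and - is U+002D, and both are below . at U+002E, so IMG0002(1).jpg compares less than IMG0002.jpg. A minimal sketch that reproduces this, where the variable names are chosen only for illustration:

```python
# The five file names from the question, in the desired order.
names = [
    "IMG0001.jpg",
    "IMG0002.jpg",
    "IMG0002(1).jpg",
    "IMG0002(2).jpg",
    "IMG0003.jpg",
]

# Code points of the characters involved: "(" and "-" both come before ".".
code_points = {ch: ord(ch) for ch in "(-."}
print(code_points)

# Default string sorting compares character by character, so at the first
# differing position "(" wins over ".", and duplicates land before the original.
default_order = sorted(names)
print(default_order)
```

Because the comparison is fixed by code points, the practical way to change it is to sort on a derived key rather than on the raw string.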

If you want to sort the files the way you describe, you can do it with a custom sort function. Here is a sample:

    import os
    import re

    def custom_sort(file_name):
        base_name, extension = os.path.splitext(file_name)
        # A trailing "(n)" on the base name marks the n-th duplicate.
        match = re.fullmatch(r"(.*)\((\d+)\)", base_name)
        if match:
            return (match.group(1), int(match.group(2)), extension)
        # Originals get copy number 0, so they sort before their duplicates.
        return (base_name, 0, extension)

    path = "your_directory_path"
    files = os.listdir(path)
    sorted_files = sorted(files, key=custom_sort)
    for file in sorted_files:
        print(file)

This should sort the files in the order you expect.

Answer 1

Score: 3

String ordering is lexicographic, so you'll need a custom sort key if you want something different. It's a little trickier than expected, but something like this should work:

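    import os
    import re

    def key(fname):
        # Separate "IMG0002(1)" from ".jpg".
        basename, ext = os.path.splitext(fname)
        # Originals have no "(n)" suffix and get version 0.
        v = 0
        if m := re.match(r"(.*)\((\d+)\)$", basename):
            basename, v = m.groups()
            v = int(v)
        # Compare by base name, then extension, then duplicate number.
        return basename, ext, v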

Now you should be able to use something like files.sort(key=key).
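
To tie this back to os.walk: it yields a (root, dirs, files) tuple for each directory, and files is an ordinary list, so it can be sorted with the key before use. os.walk itself makes no promise about order. A minimal sketch, with the key function repeated so the block runs on its own; walk_sorted is a hypothetical helper name, not part of the answer:

```python
import os
import re


def key(fname):
    # Same key as in the answer: (base name, extension, duplicate number).
    basename, ext = os.path.splitext(fname)
    v = 0
    if m := re.match(r"(.*)\((\d+)\)$", basename):
        basename, v = m.groups()
        v = int(v)
    return basename, ext, v


def walk_sorted(top):
    """Yield file paths under top, ordering each directory's files by key()."""
    for root, dirs, files in os.walk(top):
        # Sorting dirs in place also fixes the order in which os.walk descends.
        dirs.sort()
        for name in sorted(files, key=key):
            yield os.path.join(root, name)


if __name__ == "__main__":
    # Assumption: the current directory is a reasonable place to try this out.
    for file_path in walk_sorted("."):
        print(file_path)
```

Sorting a copy with sorted(files, key=key) leaves the list owned by os.walk untouched; files.sort(key=key) would work equally well here.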

Answer 2

Score: 1

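Using pathlib.Path for more awareness of filename semantics, build a tuple with the special cases as leading elements and the filename as the last element. Sort the list of tuples, but keep only the last element of each.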

    def test():
        from pathlib import Path

        def filenamesort(inp: list[str]):
            """Build a list of custom tuples from the filename list, sort it and
            return the rightmost field of each tuple, which is the filename.
            """

            def tupleize(v):
                """Return a sort key based on Path.stem for the filename.

                Special case: split the stem at the last `(`.
                    IMG0002(1).jpg => ('IMG0002', 1, 'IMG0002(1).jpg')
                Normal case: return the stem and a copy number of 0.
                    IMG0002.jpg => ('IMG0002', 0, 'IMG0002.jpg')
                The last element, least significant for sorting, is the filename.
                To be more robust, foo(xxx).jpg should be ignored, as xxx is not
                numeric; as written, int() raises ValueError for it.
                """
                pa = Path(v)
                stem = pa.stem
                if stem.endswith(")"):
                    lead, seq = stem.rsplit("(", maxsplit=1)
                    return (lead, int(seq.rstrip(")")), v)
                else:
                    # 0 sorts before any copy number, so the original comes first
                    return (stem, 0, v)

            li = [tupleize(v) for v in inp]
            # sort the list, then return the last position in each tuple: the filename proper
            return [v[-1] for v in sorted(li)]

        def fmt(sin: str):
            res = [v for line in sin.splitlines() if (v := line.strip())]
            return res

        inp = fmt("""
            IMG0001.jpg
            IMG0002(1).jpg
            IMG0002(2).jpg
            IMG0002(11).jpg
            IMG0002.jpg
            IMG0003.jpg
        """)
        exp = fmt("""
            IMG0001.jpg
            IMG0002.jpg
            IMG0002(1).jpg
            IMG0002(2).jpg
            IMG0002(11).jpg
            IMG0003.jpg
        """)
        dataexp = [
            (inp, exp),
        ]
        for inp, exp in dataexp:
            for f in [filenamesort]:
                got = f(inp)
                msg = f"\n{f.__name__} for {str(inp):100.100} \nexp :{exp}:\ngot :{got}:\n"
                if exp == got:
                    print(f"✅! {msg}")
                else:
                    print(f"❌! {msg}")

    test()

Output:

    ✅!
    filenamesort for ['IMG0001.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0002.jpg', 'IMG0003.jpg']
    exp :['IMG0001.jpg', 'IMG0002.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0003.jpg']:
    got :['IMG0001.jpg', 'IMG0002.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0003.jpg']:

I've checked that tupleize (the function that builds the tuple) can be used as the key parameter of a sort.

That is, this works as well: sorted(inp, key=tupleize)
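
The IMG0002(11).jpg entry in the test data appears to be there to exercise the int() conversion: compared as text, (11) sorts before (2). A small sketch of the difference, where copy_number is a hypothetical helper written for this illustration and assumes the name has an extension:

```python
import re

names = ["IMG0002(11).jpg", "IMG0002(2).jpg", "IMG0002(1).jpg"]


def copy_number(name):
    """Return the number in a "(n)" placed right before the extension, else 0."""
    m = re.search(r"\((\d+)\)\.[^.]*$", name)
    return int(m.group(1)) if m else 0


# Text comparison looks at one character at a time, so "(11)" precedes "(2)".
as_text = sorted(names)
# Numeric comparison orders the copies 1, 2, 11.
as_numbers = sorted(names, key=copy_number)

print(as_text)
print(as_numbers)
```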

Wim was right: this fails when the files have different extensions. The fix is the adjustment below:

    pa = Path(v)
    stem = pa.stem
    if stem.endswith(")"):
        lead, seq = stem.rsplit("(", maxsplit=1)
        return (lead, pa.suffix, int(seq.rstrip(")")), v)
    else:
        # 0 sorts before any copy number, so the original comes first
        return (stem, pa.suffix, 0, v)
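
To see what the adjustment changes, the two key shapes can be compared on names that share a stem but differ in extension. This is a sketch under assumptions: split_copy, key_without_suffix and key_with_suffix are hypothetical names, and split_copy also skips non-numeric parentheses such as foo(xxx), which the answer only mentions as a possible improvement.

```python
from pathlib import Path


def split_copy(stem):
    """Split 'IMG0002(1)' into ('IMG0002', 1); any other stem gets copy number 0."""
    if stem.endswith(")") and "(" in stem:
        lead, seq = stem.rsplit("(", maxsplit=1)
        digits = seq[:-1]  # drop the closing ")"
        if digits.isdecimal():
            return lead, int(digits)
    return stem, 0


def key_without_suffix(name):
    # Shape of the first version: (lead, copy number, filename).
    lead, n = split_copy(Path(name).stem)
    return (lead, n, name)


def key_with_suffix(name):
    # Shape of the adjusted version: the extension is compared before the number.
    pa = Path(name)
    lead, n = split_copy(pa.stem)
    return (lead, pa.suffix, n, name)


names = ["IMG0002(1).png", "IMG0002.png", "IMG0002(1).jpg", "IMG0002.jpg"]

# Copies are grouped by number first, so the .png original splits the .jpg pair.
before = sorted(names, key=key_without_suffix)
# Each original is followed directly by its own copies.
after = sorted(names, key=key_with_suffix)

print(before)
print(after)
```

Placing pa.suffix ahead of the copy number is what keeps each original next to its own duplicates.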

huangapple
  • Posted on March 1, 2023 at 11:49:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75599415.html