2023-02-18 16:19:57

Randomly splitting 1 file from many files based on ID

Question

In my dataset, I have a large number of images in jpg format, named [ID]_[Cam]_[Frame].jpg. The dataset contains many IDs, and each ID has a different number of images. I want to randomly move 1 image from each ID into a separate set of images. The problem is that the IDs in the dataset aren't always consecutive (some numbers are skipped). In the example below, the set of files has no ID 2 or 3.

Is there any Python code to do this?

Before

TrainSet
- 00000000_0001_00000000.jpg
- 00000000_0001_00000001.jpg
- 00000000_0002_00000001.jpg
- 00000001_0001_00000001.jpg
- 00000001_0002_00000001.jpg
- 00000001_0002_00000002.jpg
- 00000004_0001_00000001.jpg
- 00000004_0002_00000001.jpg

After

TrainSet
- 00000000_0001_00000000.jpg
- 00000000_0001_00000002.jpg
- 00000001_0002_00000001.jpg
- 00000001_0001_00000001.jpg
- 00000004_0001_00000001.jpg
ValidationSet
- 00000000_0001_00000001.jpg
- 00000001_0001_00000002.jpg
- 00000004_0001_00000002.jpg

Answer 1

Score: 0

You need to sort, using a dictionary as the data structure.
For example:

myDict = {'a': '00000000_0001_00000000.jpg', 'b': '00000000_0001_00000001.jpg'}

myKeys = list(myDict.keys())
myKeys.sort()
sorted_dict = {i: myDict[i] for i in myKeys}

print(sorted_dict)
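
To connect the sorting idea to the actual question, here is a minimal sketch that sorts the file names by their ID prefix and then takes one random file per ID with itertools.groupby. The function name pick_one_per_id and the seed parameter are made up for illustration, and the sketch assumes every file name follows the [ID]_[Cam]_[Frame].jpg pattern.

```python
import random
from itertools import groupby


def file_id(name):
    """Return the ID part of an [ID]_[Cam]_[Frame].jpg file name."""
    return name.split("_")[0]


def pick_one_per_id(filenames, seed=None):
    """Return a dict mapping each ID to one randomly chosen file name.

    Hypothetical helper, not part of any library.
    """
    rng = random.Random(seed)
    # groupby only merges adjacent items, so sort by ID first.
    ordered = sorted(filenames, key=file_id)
    return {
        fid: rng.choice(list(group))
        for fid, group in groupby(ordered, key=file_id)
    }


files = [
    "00000000_0001_00000000.jpg",
    "00000000_0001_00000001.jpg",
    "00000000_0002_00000001.jpg",
    "00000001_0001_00000001.jpg",
    "00000001_0002_00000001.jpg",
    "00000001_0002_00000002.jpg",
    "00000004_0001_00000001.jpg",
    "00000004_0002_00000001.jpg",
]
picked = pick_one_per_id(files, seed=0)
print(picked)
```

Gaps in the ID sequence (no 2 or 3 here) don't matter, because the groups come from the IDs that actually occur in the file names.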

Answer 2

Score: 0

In this case, I would use a dictionary with the ID as the key and the list of file names with that ID as the value. Then randomly pick one file from each list in the dict.

import os
from random import choice
from pathlib import Path
import shutil

source_folder = "SOURCE_FOLDER"

dest_folder = "DEST_FOLDER"

dir_list = os.listdir(source_folder)

# Map each ID to the list of file names that start with it
ids = {}

for f in dir_list:
    f_id = f.split("_")[0]
    ids[f_id] = [f, *ids.get(f_id, [])]

# Create the destination folder if it does not exist yet
Path(dest_folder).mkdir(parents=True, exist_ok=True)

# Move one randomly chosen file per ID to the destination folder
for files in ids.values():
    random_file = choice(files)
    shutil.move(
        os.path.join(source_folder, random_file), os.path.join(dest_folder, random_file)
    )

In your case, replace SOURCE_FOLDER with TrainSet and DEST_FOLDER with ValidationSet.
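
If you would rather check the split before any file is moved, the same grouping idea can be computed purely in memory. This is only a sketch under assumptions: split_by_id and its seed parameter are hypothetical names, and every file name is assumed to start with the ID followed by an underscore.

```python
import random
from collections import defaultdict


def split_by_id(filenames, seed=None):
    """Return (train, validation) with one random file per ID in validation.

    Hypothetical helper: it only computes the split, it does not touch the disk.
    """
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for name in filenames:
        by_id[name.split("_")[0]].append(name)

    # One randomly chosen file per ID goes to the validation set.
    validation = [rng.choice(names) for names in by_id.values()]
    chosen = set(validation)
    # Everything that was not chosen stays in the training set.
    train = [name for name in filenames if name not in chosen]
    return train, validation


files = [
    "00000000_0001_00000000.jpg",
    "00000000_0001_00000001.jpg",
    "00000000_0002_00000001.jpg",
    "00000001_0001_00000001.jpg",
    "00000001_0002_00000001.jpg",
    "00000001_0002_00000002.jpg",
    "00000004_0001_00000001.jpg",
    "00000004_0002_00000001.jpg",
]
train, validation = split_by_id(files, seed=42)
print(len(train), len(validation))  # 5 3
```

The names in validation can then be passed to shutil.move exactly as in the loop above. Passing a fixed seed makes the split repeatable between runs.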

Answer 3

Score: 0

Here's a Pandas DataFrame solution that avoids moving the files between folders. The str.extract method extracts the text matching a regex pattern into new columns of a DataFrame. The file names are then grouped by the values in the newly created f_id column. The groupby.sample method returns a random sample from each group, and the random_state parameter makes the result reproducible.

import numpy as np
import pandas as pd

# Load file names into a data frame
data = [
    {"fname": "00000000_0001_00000000.jpg"},
    {"fname": "00000000_0001_00000001.jpg"},
    {"fname": "00000000_0002_00000001.jpg"},
    {"fname": "00000001_0001_00000001.jpg"},
    {"fname": "00000001_0002_00000001.jpg"},
    {"fname": "00000001_0002_00000002.jpg"},
    {"fname": "00000004_0001_00000001.jpg"},
    {"fname": "00000004_0002_00000001.jpg"},
]
df = pd.DataFrame(data)

# Extract 'f_id' from 'fname' string
df = df.join(df["fname"].str.extract(r'^(?P<f_id>\d+)_'))

sample_size = 1  # sample size
state_seed = 43  # reproducible
group_list = ["f_id"]

# Add 'validation' column
df["validation"] = 0
# Increment 'validation' by 1 for selected samples
df["validation"] = df.groupby(group_list).sample(n=sample_size, random_state=state_seed)["validation"].add(1)
# Reset 'NaN' values to 0
df["validation"] = df["validation"].fillna(0).astype(np.int8)

The result is a DataFrame with a value of 1 in the validation column for the selected file names.

	fname	f_id	validation
0	00000000_0001_00000000.jpg	00000000	0
1	00000000_0001_00000001.jpg	00000000	1
2	00000000_0002_00000001.jpg	00000000	0
3	00000001_0001_00000001.jpg	00000001	1
4	00000001_0002_00000001.jpg	00000001	0
5	00000001_0002_00000002.jpg	00000001	0
6	00000004_0001_00000001.jpg	00000004	0
7	00000004_0002_00000001.jpg	00000004	1
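
To turn the flag column into the two file lists the question asks for, one option is to filter on it. The sketch below rebuilds the frame in a slightly shorter way (marking sampled rows via their index instead of add/fillna, which is a swapped-in variation, not the code above) and assumes pandas 1.1 or newer for groupby.sample. The names validation_files and train_files are illustrative.

```python
import pandas as pd

files = [
    "00000000_0001_00000000.jpg",
    "00000000_0001_00000001.jpg",
    "00000000_0002_00000001.jpg",
    "00000001_0001_00000001.jpg",
    "00000001_0002_00000001.jpg",
    "00000001_0002_00000002.jpg",
    "00000004_0001_00000001.jpg",
    "00000004_0002_00000001.jpg",
]
df = pd.DataFrame({"fname": files})

# Pull the leading digits (the ID) out of each file name.
df["f_id"] = df["fname"].str.extract(r"^(\d+)_", expand=False)

# Sample one row per ID; sampled rows keep their original index labels.
picked = df.groupby("f_id").sample(n=1, random_state=43)

# Flag the sampled rows with 1 and all other rows with 0.
df["validation"] = df.index.isin(picked.index).astype("int8")

# Split the file names according to the flag.
validation_files = df.loc[df["validation"] == 1, "fname"].tolist()
train_files = df.loc[df["validation"] == 0, "fname"].tolist()

print(validation_files)
print(train_files)
```

In a real run, the files list could come from os.listdir on the image folder, as in the previous answer.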
