Function to copy files from S3 bucket to another adds extra files
Question
I have an S3 bucket 'data' with a single directory '20230225' which contains JSON and video files. Within '20230225', I created a subdirectory 'metadata' into which I wanted to move the JSON files, so as to have the JSON and video files in separate directories.

I wrote a function to copy the JSON files to another directory, and it seemed to work on a small sample of the data. However, when I ran the function on all of the JSON files in '20230225', it took much longer than I expected. I interrupted the execution, and when I counted the files in the destination directory, there were many more JSON files than were supposed to be there.

Here is the function code. Is there anything in there that could add extra files?

I'm thinking it could be because the `Prefix` filter on the source folder matches all of its subdirectories as well, and the only subdirectory of the source folder is actually the destination folder, so maybe the function got stuck in a loop trying to copy the files from the destination folder that it had already copied.

However, even if that's the case, shouldn't it just overwrite those files rather than add extra ones?
```python
import boto3
import botocore.config
import boto3.s3.transfer as s3transfer  # assumed import for the s3transfer name used below

def copy_json_files(s3_bucket: str, source_folder: str, dest_folder: str):
    """
    Parameters:
    - s3_bucket (str): The name of the S3 bucket.
    - source_folder (str): The name of the source folder.
    - dest_folder (str): The name of the destination folder.

    Returns:
    - int: The number of files copied.
    """
    s3 = boto3.resource('s3')
    src_bucket = s3.Bucket(s3_bucket)

    # Create destination prefix
    dest_prefix = dest_folder.strip('/') + '/' if dest_folder else ''

    # Configure S3 transfer manager
    botocore_config = botocore.config.Config(max_pool_connections=200)
    s3client = boto3.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(use_threads=True, max_concurrency=140)

    # Create S3 transfer manager
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)

    copied_files = 0
    for obj in src_bucket.objects.filter(Prefix=source_folder):
        # Exclude objects in subdirectories of the source folder
        if '/' in obj.key[len(source_folder):]:
            continue
        # Exclude objects already in the destination folder
        if obj.key.startswith(dest_prefix):
            continue
        if obj.key.endswith('.json'):
            # Form the destination key by replacing the source folder
            # name with the destination folder name
            dest_key = obj.key.replace(source_folder, dest_prefix, 1)
            copy_source = {
                'Bucket': s3_bucket,
                'Key': obj.key
            }
            s3t.copy(
                copy_source=copy_source,
                bucket=s3_bucket,
                key=dest_key
            )
            copied_files += 1

    # Close the transfer manager
    s3t.shutdown()
    return copied_files
```
The function I used to check the number of files was:
```python
def count_files(s3_bucket, s3_dir):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(s3_bucket)
    count = 0
    for obj in bucket.objects.filter(Prefix=s3_dir):
        count += 1
    return count
```
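Note that this counts every key under `s3_dir`, including everything in the 'metadata' sub-folder, since the `Prefix` filter matches all deeper keys. A minimal sketch of a top-level-only count, using a `Delimiter='/'` listing (the helper name is hypothetical):

```python
import boto3

def count_top_level_files(s3_bucket: str, s3_dir: str) -> int:
    """Count only the objects directly under s3_dir, ignoring sub-folders."""
    bucket = boto3.resource('s3').Bucket(s3_bucket)
    prefix = s3_dir.strip('/') + '/'
    # With Delimiter='/', keys containing a further '/' are rolled up into
    # CommonPrefixes and not returned as objects, so sub-folder contents
    # (e.g. 'metadata/') are excluded from the count.
    return sum(1 for _ in bucket.objects.filter(Prefix=prefix, Delimiter='/'))
```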
Answer 1
Score: 1
Objects in the sub-folders would be included in the object listing.

For example, if the source has one object and your code is run, it would copy that object to the sub-directory. The next time it is run, it would copy BOTH objects to the sub-folder, since `src_bucket.objects.filter(Prefix=source_folder)` will include all sub-folders.

If you only wish to copy objects in the 'top' of the source folder, then you will either need to:

- Move the destination somewhere else (that is, not a sub-folder of the source), or
- Add some logic that checks that the object to be copied is not in a sub-folder -- such as taking everything in the key before the last `/` and comparing it to the name of the source folder (see the sketch below).
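A minimal sketch of that second option, assuming the same bucket layout as the question (the function name and the plain `Bucket.copy` call are illustrative, not the asker's transfer-manager setup):

```python
import boto3

def copy_top_level_json(s3_bucket: str, source_folder: str, dest_folder: str) -> int:
    """Copy only the .json objects sitting directly in source_folder."""
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(s3_bucket)
    source_prefix = source_folder.strip('/') + '/'
    dest_prefix = dest_folder.strip('/') + '/'

    copied = 0
    for obj in bucket.objects.filter(Prefix=source_prefix):
        # Everything before the last '/' must equal the source folder
        # itself, i.e. the key must not live in a sub-folder.
        parent = obj.key.rsplit('/', 1)[0] + '/'
        if parent != source_prefix or not obj.key.endswith('.json'):
            continue
        dest_key = dest_prefix + obj.key.rsplit('/', 1)[1]
        bucket.copy({'Bucket': s3_bucket, 'Key': obj.key}, dest_key)
        copied += 1
    return copied
```

Because the destination is still a sub-folder of the source, the parent check alone keeps re-runs from picking up the already-copied objects.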