Function to copy files from S3 bucket to another adds extra files

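Question

I have an S3 bucket 'data' with a single directory '20230225', which contains JSON and video files. Within '20230225', I created a subdirectory 'metadata' into which I wanted to move the JSON files, so that the JSON and video files would be in separate directories.

I wrote a function to copy the JSON files to another directory, and it seemed to work on a small sample of the data. However, when I ran it on all of the JSON files in '20230225', it took much longer than I expected. I interrupted the execution, and when I counted the files in the destination directory, there were many more JSON files than there should have been.

Here is the function code. Is there anything in it that could add extra files?

I'm thinking it could be because the source folder listing includes all of its subdirectories as well, and the source folder's only subdirectory is the destination folder, so maybe the function got stuck in a loop, copying files from the destination folder that it had already copied there.

However, even if that's the case, shouldn't it just overwrite those files rather than add extra ones?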

```python
import boto3
import botocore.config
import boto3.s3.transfer as s3transfer


def copy_json_files(s3_bucket: str, source_folder: str, dest_folder: str):
    """
    Parameters:
    - s3_bucket (str): The name of the S3 bucket.
    - source_folder (str): The name of the source folder.
    - dest_folder (str): The name of the destination folder.

    Returns:
    - int: The number of files copied.
    """

    s3 = boto3.resource('s3')
    src_bucket = s3.Bucket(s3_bucket)

    # Create destination prefix
    dest_prefix = dest_folder.strip('/') + '/' if dest_folder else ''

    # Configure S3 transfer manager
    botocore_config = botocore.config.Config(max_pool_connections=200)
    s3client = boto3.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(use_threads=True, max_concurrency=140)

    # Create S3 transfer manager
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)

    copied_files = 0

    for obj in src_bucket.objects.filter(Prefix=source_folder):
        # Exclude objects in subdirectories of source folder
        if '/' in obj.key[len(source_folder):]:
            continue

        # Exclude objects already in the destination folder
        if obj.key.startswith(dest_prefix):
            continue

        if obj.key.endswith('.json'):
            # Form destination key by replacing source folder name with destination folder name
            dest_key = obj.key.replace(source_folder, dest_prefix, 1)

            copy_source = {
                'Bucket': s3_bucket,
                'Key': obj.key
            }

            s3t.copy(
                copy_source=copy_source,
                bucket=s3_bucket,
                key=dest_key
            )

            copied_files += 1

    # Close transfer manager
    s3t.shutdown()

    return copied_files
```

The function I used to check the number of files was:

```python
def count_files(s3_bucket, s3_dir):

    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(s3_bucket)

    count = 0
    for obj in bucket.objects.filter(Prefix=s3_dir):
        count += 1

    return count
```

Answer 1

Score: 1

Objects in the sub-folders would be included in the object listing.

For example, if the source folder has one object and your code is run, it would copy that object to the sub-directory. The next time it is run, it would copy BOTH objects to the sub-folder, since `src_bucket.objects.filter(Prefix=source_folder)` returns every key that starts with the prefix, including keys in sub-folders.
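To see why a prefix listing picks up the sub-folder, here is a minimal local sketch that imitates prefix matching on a plain list of keys. The key names and the `list_keys` helper are hypothetical and only stand in for the real listing; the point is that keys are flat strings, so a prefix matches at any depth.

```python
def list_keys(all_keys, prefix):
    """Hypothetical stand-in for a prefix listing: keys are flat strings,
    so a prefix matches everything that starts with it, at any depth."""
    return [key for key in all_keys if key.startswith(prefix)]


# Made-up bucket contents after one copy run has already happened.
keys = [
    "20230225/a.json",
    "20230225/a.mp4",
    "20230225/metadata/a.json",
]

# The copy sitting in the 'metadata' sub-folder is matched as well.
matched = list_keys(keys, "20230225/")
print(matched)
```

Any filtering of sub-folder keys therefore has to happen in your own code (or by choosing a destination outside the source prefix).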

If you only wish to copy objects at the 'top' level of the source folder, then you will need to either:

  • Move the destination somewhere else (that is, not a sub-folder of the source), or
  • Add some logic that checks that the object to be copied is not in a sub-folder, such as taking everything in the key before the last `/` and comparing it to the name of the source folder.
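The second option could be sketched roughly as follows. `is_top_level` is a hypothetical helper name, and the sample keys are made up for illustration; the check is just the comparison described above, written so that it works whether or not the source folder is given with a trailing `/`.

```python
def is_top_level(key: str, source_folder: str) -> bool:
    """Return True if `key` sits directly inside `source_folder`.

    Everything before the last '/' is the key's parent "folder"; it must
    equal the source folder name (compared without surrounding slashes).
    """
    parent, _, name = key.rpartition("/")
    # An empty name means the key is a folder placeholder such as '20230225/'.
    return name != "" and parent == source_folder.strip("/")


# Made-up keys for illustration.
keys = [
    "20230225/",
    "20230225/a.json",
    "20230225/a.mp4",
    "20230225/b.json",
    "20230225/metadata/a.json",
]

top_level_json = [
    key for key in keys
    if is_top_level(key, "20230225/") and key.endswith(".json")
]
print(top_level_json)
```

Inside the copy loop, the same check would be applied to `obj.key` before calling the transfer manager, so that keys already under the destination sub-folder are never copied again.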

huangapple
  • Published on May 29, 2023, 03:21:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76353235.html