英文:
Snowsql put command skip modified file
问题
在 c:\myFolder 中,我目前有文件 myFile1.csv。
我在 SnowSQL 中运行了以下命令:
put file://C:\myFolder\myFile*.csv @my_csv_stage auto_compress=true;
然后,在 Snowflake 中运行了以下命令将 csv 文件加载到表格 MYCSVTABLE 中:
COPY INTO MYCSVTABLE
FROM @my_csv_stage
FILE_FORMAT = (FORMAT_NAME = mycsvformat)
PATTERN = '.*myFile.*[.]CSV.gz'
ON_ERROR = 'skip_file';
C:\myFolder\myFile1.csv 今天已经被修改,但是 "LIST @my_csv_stage" 仍然显示文件的 last_modified 为昨天。
因此,我执行了另一个命令 "put file://C:\myFolder\myFile*.csv @my_csv_stage auto_compress=true;",但是状态下显示为 "SKIPPED"。
https://docs.snowflake.com/en/sql-reference/sql/put 上说:
"To overwrite an already-staged file, you must modify the file you are uploading so that its contents are different from the staged file, which results in a new checksum for the newly-staged file."
所以,对我来说,即使没有 overwrite=true,如果文件自上次暂存以来发生了变化,它应该会覆盖文件,而不是跳过文件。
当我在 put file 中使用 overwrite=true,并且在 c:\myFolder 中有其他文件 myFile2.csv 和 myFile3.csv 时,它也会覆盖那些没有被更改的其他文件 myFile2.csv 和 myFile3.csv。
因此,当我执行 COPY INTO 时,它会再次加载 myFile2.csv 和 myFile3.csv 的数据,这不是我想要的。我只想加载已更改的文件的数据,也就是 myFile1.csv。
英文:
In c:\myFolder, currently I have file myFile1.csv.
I ran the following command in SnowSQL
put file://C:\myFolder\myFile*.csv @my_csv_stage auto_compress=true;
Then, I ran the following command in Snowflake to load the csv file to table MYCSVTABLE
COPY INTO MYCSVTABLE
FROM @my_csv_stage
FILE_FORMAT = (FORMAT_NAME = mycsvformat)
PATTERN = '.*myFile.*[.]CSV.gz'
ON_ERROR = 'skip_file';
C:\myFolder\myFile1.csv was modified today, but "LIST @my_csv_stage" still shows the file last_modified as yesterday.
So, I execute another "put file://C:\myFolder\myFile*.csv @my_csv_stage auto_compress=true;", but under status it says "SKIPPED".
https://docs.snowflake.com/en/sql-reference/sql/put says this
"To overwrite an already-staged file, you must modify the file you are uploading so that its contents are different from the staged file, which results in a new checksum for the newly-staged file."
So, it sounds to me that even without the overwrite=true, if the files changed since the last time it was staged, it should overwrite the file, and NOT Skip the file.
When I use overwrite=true in my put file, and I have other files myFile2.csv and myFile3.csv in c:\myFolder, it also override the other files myFile2.csv and myFile3.csv that were not changed.
So, when I execute COPY INTO, it loads the data for myFile2.csv and myFile3.csv again, which I do not want to do. I only want to load data for files that are changed, in this case myFile1.csv.
COPY INTO MYCSVTABLE
FROM @my_csv_stage
FILE_FORMAT = (FORMAT_NAME = mycsvformat)
PATTERN = '.*myFile.*[.]CSV.gz'
ON_ERROR = 'skip_file';
Thank you
答案1
得分: 1
你需要使用OVERWRITE = TRUE,如在PUT命令的文档中所解释的。
我不确定你是基于什么依据认为“只有当校验和不同的情况下PUT命令才会放置文件”,因为文档中没有提到这一点。
英文:
You need to use OVERWRITE = TRUE, as explained in the documentation for the PUT command.
I’m not sure on what basis you thought “the put command will put the file if the checksum is different” as this is not mentioned in the documentation?
答案2
得分: 1
指的是与NickW指出的相同点:
https://docs.snowflake.com/en/sql-reference/sql/put
OVERWRITE = TRUE | FALSE
指定Snowflake在上传过程中是否覆盖同名的现有文件:TRUE:覆盖同名的现有文件。
FALSE:不覆盖同名的现有文件。
请注意,对阶段执行的LIST操作是在后台执行的,这可能会影响PUT操作的性能。
如果尝试PUT文件因目标阶段存在同名文件而失败,有以下选项:
将数据从现有文件加载到一个或多个表中,然后从阶段中删除文件。然后将包含新数据或更新数据的文件PUT到阶段。
重命名本地文件,然后再次尝试PUT操作。
在PUT语句中设置OVERWRITE = TRUE。只有在实际上可以安全地覆盖尚未加载到Snowflake中的数据的文件时才这样做。
请注意,如果您的Snowflake帐户托管在Google Cloud Platform上,PUT语句不会识别OVERWRITE参数设置为TRUE时。PUT操作始终会覆盖目标阶段中的任何现有文件,使用您上传的本地文件。
以下客户端支持Snowflake帐户托管在Amazon Web Services或Microsoft Azure上的OVERWRITE选项:
SnowSQL
Snowflake ODBC驱动程序
Snowflake JDBC驱动程序
Python的Snowflake连接器
支持的值:TRUE,FALSE。
默认值:FALSE。
但使用注意事项部分中提到:
该命令会忽略您尝试上传到同一阶段的任何重复文件。重复文件是与已在阶段中的文件同名的未经修改的文件。
要覆盖已在阶段中的文件,您必须修改要上传的文件,以使其内容与阶段文件不同,从而为新的阶段文件生成新的校验和。
因此,这暗示了会创建校验和,但也可能与覆盖选项互动。
英文:
Same points as NickW notes:
https://docs.snowflake.com/en/sql-reference/sql/put
> OVERWRITE = TRUE | FALSE
Specifies whether Snowflake overwrites an existing file with the same name during upload:
>
> TRUE: An existing file with the same name is overwritten.
>
>FALSE: An existing file with the same name is not overwritten.
>
>Note that a LIST operation on the stage is performed in the background, which can affect the performance of the PUT operation.
>
>If attempts to PUT a file fail because a file with the same name exists in the target stage, the following options are available:
>
>Load the data from the existing file into one or more tables, and remove the file from the stage. Then PUT a file with new or updated data to the stage.
>
>Rename the local file, and then attempt the PUT operation again.
>
> Set OVERWRITE = TRUE in the PUT statement. Do this only if it is actually safe to overwrite a file with data that might not yet have been loaded into Snowflake.
>
> Note that if your Snowflake account is hosted on Google Cloud Platform, PUT statements do not recognize when the OVERWRITE parameter is set to TRUE. A PUT operation always overwrites any existing files in the target stage with the local files you are uploading.
>
>The following clients support the OVERWRITE option for Snowflake accounts hosted on Amazon Web Services or Microsoft Azure:
>
>SnowSQL
>
>Snowflake ODBC Driver
>
>Snowflake JDBC Driver
>
>Snowflake Connector for Python
>
>Supported values: TRUE, FALSE.
>
>Default: FALSE.
but the Usage Notes section does say:
>The command ignores any duplicate files you attempt to upload to the same stage. A duplicate file is an unmodified file with the same name as an already-staged file.
>
>To overwrite an already-staged file, you must modify the file you are uploading so that its contents are different from the staged file, which results in a new checksum for the newly-staged file.
So this does imply there is checksums created, but it also possible interplays with the OVERRIDE option.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论