How can I load several CSVs with the same columns but in different column orders into a single table using Azure Data Factory?

huangapple go评论57阅读模式
英文:

How can I load several CSVs with the same columns but in different column orders into a single table using Azure Data Factory?

问题

我在blob存储中有一组包含200个CSV文件的文件夹,每个文件都包含相同的90列和相同的列标题。然而,在其中一些文件中,这些列的顺序不同。例如,在某些文件中,第一列的名称是“transaction_data”,而在其他文件中,“transaction_data”可能是第8列或第73列。我每天通过自动化过程收到几个这样的文件,我不能手动审核它们,也不能根据它们的列顺序将它们分组到逻辑组中。

从文件名或任何文件元数据中无法知道哪些文件的列处于“正确的顺序”(即列在我用作接收端的SQL数据库表中的顺序),它们的顺序是更或多是随机的。我知道可以通过预处理(比如使用Python脚本)来解决这个问题,强制所有文件的列都按照正确的顺序排列,但是这似乎需要很多额外的工作、时间和潜在的成本,而我认为ADF本应该可以原生处理这个问题,我可能是在做一些错误的操作。

目前我有一个数据流,其中:

  • 源设置中启用了“允许模式漂移”。
  • 源选项使用通配符路径查找所有*.csv文件。
  • 接收端是Azure SQL数据库表。
  • 接收端选项中启用了“允许模式漂移”。
  • 接收端映射中启用了“自动映射”。

但是发生的情况是,它会选择第一个文件,然后将该文件的列顺序应用于所有处理的其他文件。结果,接收端表中可能存在某些行的数据(例如“transaction_data”)位于错误的列中。肯定有一种方法可以强制它单独评估每个文件的吧?据我从ADF的文档和信息弹出窗口中了解,它应该使用我的CSV文件的字段名称进行映射。事实上,当我在接收端的映射选项卡上启用“自动映射”选项时,文本显示“所有输入按名称映射,包括漂移列”。这正是我想要的,但却不是我得到的行为。

英文:

I have a set of 200 CSV files in blob storage in a single folder that each contain the exact same set of 90 columns with the exact same set of column headers, but in some of the files, these columns appear in a different order. For example, in some of the files, the first column is named "transaction_data" while in others, "transaction_data" may be the 8th or 73rd column. I receive several of these files every day via an automated process, and I am not able to manually review them or bucket them into logical groups based on their column orders.

There is no way to know from the filename or any file metadata which of the files have the columns in the "right order" (the order in which the columns appear in my SQL Database table I'm using as the sink), it's more-or-less random. I know this problem is solvable by using some preprocessing like a Python script to force the files to all have columns in the correct order, but that feels like a lot of extra work, time, and potential cost when this is something it seems ADF should handle natively and I'm just doing something wrong.

What I have right now is a data flow where:

  • Source settings "Allow Schema Drift" is checked
  • Source options are using wildcard paths to find all *.csv files
  • Sink is an Azure SQL Database table
  • Sink options "Allow schema drift" is checked
  • Sink mapping "Auto mapping" is checked

But what's happening is that whatever file it picks up first, it's using that file's column order and applying it to all the rest of the files that are processed. As a result, the Sink table will have some rows where, e.g., the data for "transaction_data" is in the wrong column. There must be some way to force it to evaluate each file on its own? As far as I can tell from ADF's documentation and the info popups, it should be using the field names from my CSVs to do the mapping, in fact when I check the "Auto mapping" option on the Mapping tab of my sink, the text reads "All inputs mapped by name including drifted columns". This is exactly what I want, but isn't the behavior I'm getting.

答案1

得分: 0

我无法使数据流与此配合使用,所以最终我在我的管道中使用了一个foreach循环,这样可以正常工作。很遗憾,因为设置更加繁琐,但至少它可以正常工作。

英文:

I was never able to make this work with a data flow, so I ended up using a foreach loop in my pipeline instead, which worked just fine. Bummer because it's much more of a pain to set up but at least it works.

huangapple
  • 本文由 发表于 2023年3月10日 01:19:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/75687980.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定