Error in Databricks using "COPY INTO" SQL command to populate a table from a CSV file (string is not converted to integer)

Question

I need to insert data from a CSV file into a Databricks table.

I've uploaded the file to Databricks using the REST API, but when I try to use COPY INTO to insert the data into the table, I get an error: the command interprets the column in the CSV file as a string rather than an integer, and it does not cast that string to an int.

How do I get Databricks to write these data to the table?

This is the code I'm currently using:

COPY INTO concept
  FROM '/FileStore/tables/prod//ohdsi/demo_cdm/concept/concept.csv'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('mergeSchema' = 'true',
                  'inferSchema' ='true',
                  'delimiter' = ',',
                  'header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');

This also fails:

COPY INTO concept
  FROM '/FileStore/tables/prod//ohdsi/demo_cdm/concept/concept.csv'
  FILEFORMAT = CSV
  FORMAT_OPTIONS (
    'delimiter' = ',',
    'header' = 'true')
;

This is the error message:

Error in SQL statement: AnalysisException: Failed to merge fields 'concept_id' and 'concept_id'. Failed to merge incompatible data types IntegerType and StringType

--- EDIT ----------------------------------------------

I've posted a solution that uses Python.

However, it would be great if I had a SQL solution so I could call this programmatically (or if there is a way I can call the Python programmatically from my local machine).


Answer 1

Score: 2


Spark doesn't do a good job of inferring the schema of CSV data on its own, so a workaround is to first create the table with a proper, well-defined schema and then use the COPY INTO command; a sketch of one way to do this in pure SQL follows.
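
For illustration, COPY INTO also accepts a SELECT subquery over the source files, so the columns can be cast explicitly to match the target table's types instead of relying on schema inference. A minimal, untested sketch based on the path and the concept_id column from the question's error message; the other column names here are hypothetical placeholders, and the full column list from the table's actual DDL would need to be written out:

COPY INTO concept
  FROM (
    SELECT CAST(concept_id AS INT) AS concept_id,
           concept_name,      -- hypothetical remaining columns;
           vocabulary_id      -- replace with the table's actual DDL
    FROM '/FileStore/tables/prod//ohdsi/demo_cdm/concept/concept.csv'
  )
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('delimiter' = ',', 'header' = 'true');

Because the casts are applied in the subquery, there is no need for 'inferSchema' or 'mergeSchema' here, which is what triggered the IntegerType/StringType merge failure.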

Answer 2

Score: 1


We were able to resolve this using a Python script (tables are truncated before running this script):

%python
# Tables to load; each has a matching <name>/<name>.csv under data_path.
tables = [
  'care_site',
  'cdm_source',
  'vocabulary'
]

target_db = 'demo_cdm'
data_path = '/FileStore/tables/prod/ohdsi/demo_cdm/'

for t in tables:
  tgt_t = t.lower()
  # Read the target table so the CSV can be parsed with its exact schema.
  df = spark.sql('SELECT * FROM {db}.{table}'.format(db=target_db, table=tgt_t))
  # Read the CSV with that schema and overwrite the (already truncated) table.
  spark.read.options(delimiter=",", header="True", dateFormat="yyyy-MM-dd")\
          .schema(df.schema)\
          .csv(data_path + t + '/' + t + '.csv')\
          .write.format('delta')\
          .insertInto(target_db + '.' + tgt_t, overwrite=True)
  spark.sql('REFRESH TABLE {db}.{table}'.format(db=target_db, table=tgt_t))
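
Regarding the follow-up in the question's edit: one way to run a script like this from a local machine (a sketch, not part of the original answer) is to save it as a notebook, attach it to a Databricks job, and trigger that job over the Jobs REST API. The host, token, and job ID below are placeholders for your own workspace values:

# Hedged sketch: trigger the notebook above from a local machine via the
# Databricks Jobs API 2.1. Assumes the script is saved as a notebook and
# wrapped in a job; DATABRICKS_HOST, DATABRICKS_TOKEN, and job_id are
# placeholders you must supply.
import os
import requests

host = os.environ['DATABRICKS_HOST']    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ['DATABRICKS_TOKEN']  # a personal access token
job_id = 123                            # hypothetical ID of the job wrapping the notebook

resp = requests.post(
    host + '/api/2.1/jobs/run-now',
    headers={'Authorization': 'Bearer ' + token},
    json={'job_id': job_id},
)
resp.raise_for_status()
print(resp.json())  # the response includes the run_id of the triggered run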
