Read a CSV and insert into a database: performance
Question
I have a task that requires reading a CSV file line by line and inserting the rows into a database.
The CSV file contains about 1.7 million lines.
I use Python with the SQLAlchemy ORM (its merge function) to do this, but it takes over five hours.
Is the slowness caused by Python, by SQLAlchemy, or by the database?
Would switching to Golang give noticeably better performance? (I have no experience with Go, and this job needs to run on a monthly schedule.)
Any suggestions would be appreciated, thanks!
Update: the database is MySQL.
Answer 1
Score: 2
For such a task you don't want to insert the data row by row. Basically, you have two options:
- Ensure that SQLAlchemy does not run queries one by one. Use a batch INSERT query (https://stackoverflow.com/questions/5526917/how-to-do-a-batch-insert-in-mysql) instead (see the sketch after this list).
- Massage your data the way you need, output it to some temporary CSV file, and then run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess the data, just feed the CSV to the database directly (I assume it's MySQL).
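As a rough sketch of the first option (this is not code from the answer), the following reads the CSV in chunks and lets SQLAlchemy send each chunk as a single executemany-style INSERT instead of issuing one merge per row. The connection string, the table name my_table, the file name data.csv, and the batch size are all placeholder assumptions.

import csv
from sqlalchemy import create_engine, MetaData, Table

# Placeholder connection string and table name -- adjust to your schema.
engine = create_engine("mysql+pymysql://user:password@localhost/dbname")
metadata = MetaData()
my_table = Table("my_table", metadata, autoload_with=engine)  # reflect the existing table

BATCH_SIZE = 10_000  # rows per INSERT batch; tune as needed

with open("data.csv", newline="") as f, engine.begin() as conn:
    reader = csv.DictReader(f)  # header names must match the table's column names
    batch = []
    for row in reader:
        batch.append(row)  # values arrive as strings; MySQL coerces numeric columns
        if len(batch) >= BATCH_SIZE:
            conn.execute(my_table.insert(), batch)  # one executemany round trip per batch
            batch = []
    if batch:
        conn.execute(my_table.insert(), batch)

Note that, unlike the ORM's merge, a plain INSERT does not update rows that already exist; if upsert behaviour is required, MySQL's INSERT ... ON DUPLICATE KEY UPDATE would be needed instead.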
Answer 2
Score: 0
Follow these three steps:
- Save the CSV file under the name of the table you want to load it into.
- Run the Python script below to create the table dynamically (update the CSV file name and the database parameters).
- Run "mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv".
PYTHON CODE:
import numpy as np
import pandas as pd
from mysql.connector import connect

csv_file = 'dbtable_name.csv'
df = pd.read_csv(csv_file)

# The table is named after the CSV file (without its extension).
table_name = csv_file.split('.')

# Build a CREATE TABLE statement from the CSV header and the pandas dtypes.
query = "CREATE TABLE " + table_name[0] + "( \n"
for count in np.arange(df.columns.values.size):
    query += df.columns.values[count]
    if df.dtypes.iloc[count] == 'int64':
        query += "\t\t int(11) NOT NULL"
    elif df.dtypes.iloc[count] == 'object':
        query += "\t\t varchar(64) NOT NULL"
    elif df.dtypes.iloc[count] == 'float64':
        query += "\t\t float(10,2) NOT NULL"
    if count == 0:
        query += " PRIMARY KEY"   # the first column becomes the primary key
    if count < df.columns.values.size - 1:
        query += ",\n"
query += " );"
# print(query)

database = connect(host='localhost',   # your host
                   user='username',    # username
                   passwd='password',  # password
                   db='dbname')        # dbname
curs = database.cursor(dictionary=True)
curs.execute(query)
# print(query)
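For illustration, assuming a hypothetical dbtable_name.csv whose header is id,name,price with int64, object and float64 columns respectively, the script above would build roughly the following statement (the first column becomes the primary key):

CREATE TABLE dbtable_name(
id      int(11) NOT NULL PRIMARY KEY,
name    varchar(64) NOT NULL,
price   float(10,2) NOT NULL );

Once the table exists, the mysqlimport command from step 3 bulk-loads the CSV into it, which is much faster than inserting row by row.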