Read a CSV and insert into a database: performance

Question

I need to read a CSV file line by line and insert each row into a database.

The CSV file contains about 1.7 million lines.

I use Python with the SQLAlchemy ORM (the merge function) to do this, but it takes over five hours.

Is this caused by Python's slow performance, by SQLAlchemy, or by the database?

Would using Golang give obviously better performance? (I have no experience with Go. Also, this job needs to be scheduled to run every month.)

Any suggestions would be appreciated, thanks!

Update: the database is MySQL.

Answer 1

Score: 2

For a task like this, you don't want to insert data line by line. Basically, you have two options:

  1. Ensure that SQLAlchemy does not run the queries one by one. Use a batch INSERT query (https://stackoverflow.com/questions/5526917/how-to-do-a-batch-insert-in-mysql) instead.
  2. Massage your data the way you need, write it to a temporary CSV file, and then run LOAD DATA [LOCAL] INFILE. If you don't need to preprocess your data, just feed the CSV to the database directly (I assume it's MySQL).
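
As a minimal sketch of option 1, assuming a DB-API driver that uses %s placeholders (as the common MySQL drivers do), the CSV rows can be grouped into batches and each batch sent as one multi-row INSERT. The table name items, the sample data, and the helper names batched and build_batch_insert are hypothetical, and the batch size of 2 is only for illustration.

```python
import csv
import io
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def build_batch_insert(table, columns, batch):
    """Return (sql, params) for one multi-row INSERT using %s placeholders."""
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row_placeholder] * len(batch))
    )
    # Flatten the rows so the values line up with the placeholders.
    params = [value for row in batch for value in row]
    return sql, params


# Hypothetical three-row CSV standing in for the real 1.7-million-line file.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
reader = csv.reader(sample)
header = next(reader)
statements = [build_batch_insert("items", header, b) for b in batched(reader, 2)]
```

Each (sql, params) pair would then be passed to cursor.execute(sql, params), with a commit after each batch or once at the end. Table and column names are interpolated directly into the SQL text, so they must come from a trusted source. The best batch size depends on row width and on server settings such as max_allowed_packet.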

Answer 2

Score: 0

Follow the three steps below:

  1. Save the CSV file under the name of the table you want to load it into.
  2. Run the Python script below to create the table dynamically (update the CSV file name and the database parameters first).
  3. Run "mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv".

Python code:

import os

import pandas as pd
from mysql.connector import connect

csv_file = 'dbtable_name.csv'
df = pd.read_csv(csv_file)
# The table name is the CSV file name without its extension.
table_name = os.path.splitext(os.path.basename(csv_file))[0]

# Map pandas dtypes to MySQL column types.
sql_types = {
    'int64': 'int(11)',
    'object': 'varchar(64)',
    'float64': 'float(10,2)',
}

columns = []
for position, (name, dtype) in enumerate(zip(df.columns, df.dtypes)):
    # Any dtype not listed above falls back to varchar(64), so every
    # column still gets a type and the statement stays valid.
    column = name + " " + sql_types.get(str(dtype), 'varchar(64)') + " NOT NULL"
    # The first CSV column is used as the primary key.
    if position == 0:
        column += " PRIMARY KEY"
    columns.append(column)

query = "CREATE TABLE " + table_name + " (\n" + ",\n".join(columns) + "\n);"
# print(query)

database = connect(host='localhost',     # your host
                   user='username',      # username
                   password='password',  # password
                   database='dbname')    # database name
curs = database.cursor()
curs.execute(query)
curs.close()
database.close()
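
Because the job has to run every month, it may help to drive step 3 from Python as well. The following is only a sketch: the helper names mysqlimport_command and target_table are hypothetical, and the flags are exactly those from the command in step 3. Note that -p without a value makes mysqlimport prompt for the password, so an unattended job would need the credentials supplied some other way.

```python
import os
import shlex


def mysqlimport_command(csv_path, db_name, db_user):
    """Build the mysqlimport argument list from step 3 for one CSV file."""
    return [
        "mysqlimport",
        "--ignore-lines=1",          # skip the CSV header row
        "--fields-terminated-by=,",  # comma-separated fields
        "--local",                   # read the file from the client machine
        "-u", db_user,
        "-p",                        # prompts for the password interactively
        db_name,
        csv_path,
    ]


def target_table(csv_path):
    """Return the table mysqlimport loads into: the file name minus extension."""
    return os.path.splitext(os.path.basename(csv_path))[0]


command = mysqlimport_command("dbtable_name.csv", "db_name", "dbuser")
print(shlex.join(command))
# To actually run the import: subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if the file or database name contains unusual characters.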

Posted by huangapple on March 22, 2016 at 14:08:28
When reposting, please keep the link to this article: https://go.coder-hub.com/36147293.html