Read a CSV and insert into a database: performance
Question
I have a task that requires reading a CSV file line by line and inserting the rows into a database.
The CSV file contains about 1.7 million lines.
I use Python with the SQLAlchemy ORM (its merge function) to do this, but it takes over five hours.
Is the slowness caused by Python, by SQLAlchemy, or by the database?
Would switching to Golang give noticeably better performance? (I have no experience with Go, and this job needs to run on a monthly schedule.)
Any suggestions would be appreciated, thanks!
Update: the database is MySQL.
Answer 1
Score: 2
For such a task you don't want to insert the data row by row. Basically, you have two options:
- Ensure that SQLAlchemy does not run queries one by one. Use a batch INSERT query (https://stackoverflow.com/questions/5526917/how-to-do-a-batch-insert-in-mysql) instead (see the sketch after this list).
- Massage your data the way you need, output it to some temporary CSV file, and then run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess the data, just feed the CSV to the database directly (I assume it's MySQL).
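As a rough sketch of the first option (this is not code from the answer), the following reads the CSV in chunks and lets SQLAlchemy send each chunk as a single executemany-style INSERT instead of issuing one merge per row. The connection string, the table name my_table, the file name data.csv, and the batch size are all placeholder assumptions.

import csv
from sqlalchemy import create_engine, MetaData, Table

# Placeholder connection string and table name -- adjust to your schema.
engine = create_engine("mysql+pymysql://user:password@localhost/dbname")
metadata = MetaData()
my_table = Table("my_table", metadata, autoload_with=engine)  # reflect the existing table

BATCH_SIZE = 10_000  # rows per INSERT batch; tune as needed

with open("data.csv", newline="") as f, engine.begin() as conn:
    reader = csv.DictReader(f)  # header names must match the table's column names
    batch = []
    for row in reader:
        batch.append(row)  # values arrive as strings; MySQL coerces numeric columns
        if len(batch) >= BATCH_SIZE:
            conn.execute(my_table.insert(), batch)  # one executemany round trip per batch
            batch = []
    if batch:
        conn.execute(my_table.insert(), batch)

Note that, unlike the ORM's merge, a plain INSERT does not update rows that already exist; if upsert behaviour is required, MySQL's INSERT ... ON DUPLICATE KEY UPDATE would be needed instead.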
Answer 2
Score: 0
Follow these three steps:
- Save the CSV file under the name of the table you want to load it into.
- Run the Python script below to create the table dynamically (update the CSV file name and the database parameters).
- Run "mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv".
PYTHON CODE:
import numpy as np
import pandas as pd
from mysql.connector import connect

csv_file = 'dbtable_name.csv'
df = pd.read_csv(csv_file)

# The table is named after the CSV file (without its extension).
table_name = csv_file.split('.')

# Build a CREATE TABLE statement from the CSV header and the pandas dtypes.
query = "CREATE TABLE " + table_name[0] + "( \n"
for count in np.arange(df.columns.values.size):
    query += df.columns.values[count]
    if df.dtypes.iloc[count] == 'int64':
        query += "\t\t int(11) NOT NULL"
    elif df.dtypes.iloc[count] == 'object':
        query += "\t\t varchar(64) NOT NULL"
    elif df.dtypes.iloc[count] == 'float64':
        query += "\t\t float(10,2) NOT NULL"
    if count == 0:
        query += " PRIMARY KEY"   # the first column becomes the primary key
    if count < df.columns.values.size - 1:
        query += ",\n"
query += " );"
# print(query)

database = connect(host='localhost',   # your host
                   user='username',    # username
                   passwd='password',  # password
                   db='dbname')        # dbname
curs = database.cursor(dictionary=True)
curs.execute(query)
# print(query)
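For illustration, assuming a hypothetical dbtable_name.csv whose header is id,name,price with int64, object and float64 columns respectively, the script above would build roughly the following statement (the first column becomes the primary key):

CREATE TABLE dbtable_name(
id      int(11) NOT NULL PRIMARY KEY,
name    varchar(64) NOT NULL,
price   float(10,2) NOT NULL );

Once the table exists, the mysqlimport command from step 3 bulk-loads the CSV into it, which is much faster than inserting row by row.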