FileNotFoundError: [Errno 2] No such file or directory while exporting a Parquet file from a pandas DataFrame


Question


I am trying to export a Parquet file into a GCS bucket from a GCP Cloud Function, as shown in the code below. The line "chunk.to_parquet(parquet_file_path, engine='fastparquet', compression='snappy')" fails with: "No such file or directory: 'new_folder_20230206_065500/table1-20230206_065638.parquet'". The folder is created successfully inside the bucket, but I am not sure why the Parquet file is not being generated inside it.
    import mysql.connector
    import pandas as pd
    from google.cloud import storage
    from datetime import datetime, timedelta
    import os

    def extract_data_to_gcs(request):
        connection = mysql.connector.connect(
            host=os.getenv('..'),
            user=os.getenv('...'),
            password=os.getenv('...'),
            database='....'
        )
        cursor = connection.cursor(buffered=True)

        tables = ["table1", "table2", "table3"]

        client = storage.Client()
        bucket = client.bucket('data-lake-archive')

        # Create a timestamp-based folder name
        now = datetime.now()
        folder_name = now.strftime("new_folder_%Y%m%d_%H%M%S")
        folder_path = f"{folder_name}/"

        # Create the folder in the GCS bucket
        blob = bucket.blob(folder_path)
        blob.upload_from_string("", content_type="application/octet-stream")

        for table in tables:
            cursor.execute("SELECT * FROM {}".format(table))
            chunks = pd.read_sql_query("SELECT * FROM {}".format(table), connection, chunksize=5000000)
            for i, chunk in enumerate(chunks):
                chunk.columns = [str(col) for col in chunk.columns]
                ingestion_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                parquet_file_path = folder_path + f"{table}-{i}.parquet"
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                # parquet_file_path = folder_path + f'abc.parquet'
                print(f'folder path is {folder_path}')
                print(f'parquet file path is {parquet_file_path}')
                chunk.to_parquet(parquet_file_path, engine='fastparquet', compression='snappy')
                # blob = bucket.blob(folder_path + f'{table}-{i}.parquet')
                # blob.upload_from_filename(folder_path + f'{table}-{i}.parquet')

            cursor.execute("SELECT table_name, column_name FROM information_schema.key_column_usage WHERE referenced_table_name = '{}'".format(table))
            referenced_tables = cursor.fetchall()
            for referenced_table in referenced_tables:
                chunks = pd.read_sql_query("SELECT * FROM {}".format(referenced_table[0]), connection, chunksize=5000000)
                for i, chunk in enumerate(chunks):
                    chunk.columns = [str(col) for col in chunk.columns]
                    ingestion_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                    chunk.to_parquet(f"{folder_path}{referenced_table[0]}-{ingestion_timestamp}-{i}.parquet", engine='fastparquet', compression='snappy')
                    blob = bucket.blob(folder_path + f'{referenced_table[0]}-{ingestion_timestamp}-{i}.parquet')
                    blob.upload_from_filename(folder_path + f'{referenced_table[0]}-{ingestion_timestamp}-{i}.parquet')

        return 'Data extracted and uploaded to GCS'


Answer 1

Score: 1

Do you need to create the folder first? I'm not familiar with Google Cloud, but that might be a cause of the issue. folder_path = f"{folder_name}/" only builds the path string; create that folder before calling chunk.to_parquet(...).
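To illustrate that suggestion, here is a minimal sketch (an assumed way to apply it, not a confirmed fix): create a local directory first, write each chunk there, then upload the resulting file to the bucket. It reuses bucket, folder_path, table, i, and chunk from the question's code and assumes the file is written under /tmp, which is typically the only writable location inside a Cloud Function.

    import os

    # Hypothetical illustration: create the local folder before calling to_parquet(),
    # then upload the written file into the GCS "folder".
    local_dir = os.path.join("/tmp", folder_path)   # assumes /tmp is writable in the function
    os.makedirs(local_dir, exist_ok=True)           # create the folder first

    local_file = os.path.join(local_dir, f"{table}-{i}.parquet")
    chunk.to_parquet(local_file, engine="fastparquet", compression="snappy")

    # Upload the local file under the timestamped prefix in the bucket.
    blob = bucket.blob(folder_path + f"{table}-{i}.parquet")
    blob.upload_from_filename(local_file)

If the gcsfs package happens to be installed, pandas can also write straight to a gs://... path (e.g. f"gs://data-lake-archive/{parquet_file_path}"), which would skip the local file entirely.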

Where exactly is the error thrown? There are two lines with chunk.to_parquet(); can you narrow the error down to a specific line?


huangapple · Posted on 2023-02-06 16:40:40 · Please retain this link when reposting: https://go.coder-hub.com/75359017.html