How can I optimize my code so my Google Colab doesn't crash

Question

I ran into an issue where Google Colab runs out of RAM. I use the free version, and I'm not sure whether it simply can't handle the workload or whether my code is badly optimized. As I'm new to the field, I suspect my code is slow and poorly optimized. I'd appreciate some help, as I'm still learning.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    from xgboost import XGBRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import Lasso
    from sklearn.linear_model import Ridge
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    from sklearn.preprocessing import LabelEncoder
    from google.colab import drive
    drive.mount('/content/drive')
    df = pd.read_csv('path/beforeNeural.csv')
    df.shape
    df.head()
    df.isnull().sum()
    encoder = LabelEncoder()
    df['Property Type'] = encoder.fit_transform(df['Property Type'])
    df['Old/New'] = encoder.fit_transform(df['Old/New'])
    df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
    df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
    df['County'] = encoder.fit_transform(df['County'])
    df['District'] = encoder.fit_transform(df['District'])
    df['Town/City'] = encoder.fit_transform(df['Town/City'])
    df['Duration'] = encoder.fit_transform(df['Duration'])
    df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
    df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])
    X = df.drop(columns='Price', axis=1)
    Y = df['Price']
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
    df.shape
    boostenc = XGBRegressor()
    boostenc.fit(X_train, Y_train)

Answer 1

Score: 1

I'll give it a try. Here is one possible way to optimize your code:
## Code:
```python
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive

drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')

# All categorical columns in one place, so they are easy to change later
categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[categorical_columns])  # SciPy sparse matrix by default
feature_names = encoder.get_feature_names_out(categorical_columns)

# The two approaches are alternatives; use only one of them.
# Approach 1: dense DataFrame (can need a lot of RAM)
# X_concat = pd.DataFrame(encoded.toarray(), columns=feature_names)
# Approach 2: sparse DataFrame (stores only the non-zero entries)
X_concat = pd.DataFrame.sparse.from_spmatrix(encoded, columns=feature_names)

X_numerical = df.drop(columns=categorical_columns + ['Price'])
X = pd.concat([X_numerical, X_concat], axis=1)
Y = df['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
```

> Note: I removed the unused imports and the bare calls such as
> df.head() in the middle of the code. A notebook cell only displays
> the value of its last expression, so a call like that in the middle
> of the code does nothing and prints nothing.
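
If you do want to inspect the data partway through a cell, one option is to print explicitly. A small sketch, using a made-up three-row frame in place of the real CSV:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
df = pd.DataFrame({"Price": [100, 200, 300]})

df.head()         # evaluated, but the result is discarded because it is not the last expression
print(df.head())  # an explicit print shows the rows wherever it appears in the cell
print(df.shape)   # same idea for the shape
```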

Code Explanation:

  1. Instead of LabelEncoder, I used OneHotEncoder to one-hot encode all of the categorical features. This creates a new binary column for each unique value in each categorical feature. In general, one-hot encoding is usually a better way to handle categorical features in machine learning than assigning integer values with LabelEncoder.
  2. I extracted the names of all the categorical columns into a list, which makes them easier to modify when needed.

huangapple
  • Published on February 19, 2023, 22:12:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75500721.html