How can I optimize my code so my Google Colab doesn't crash?

Question

I ran into an issue where Google Colab runs out of RAM. I use the free version, and I'm not sure whether it simply can't handle the workload or whether my code is badly optimized. As I'm new to the field, I suspect my code is slow and poorly optimized. I'd appreciate some help, as I'm still learning.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('path/beforeNeural.csv')
df.shape
df.head()

df.isnull().sum()

encoder = LabelEncoder()

df['Property Type'] = encoder.fit_transform(df['Property Type'])
df['Old/New'] = encoder.fit_transform(df['Old/New'])
df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
df['County'] = encoder.fit_transform(df['County'])
df['District'] = encoder.fit_transform(df['District'])
df['Town/City'] = encoder.fit_transform(df['Town/City'])
df['Duration'] = encoder.fit_transform(df['Duration'])
df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])

X = df.drop(columns='Price', axis=1)
Y = df['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

df.shape

boostenc = XGBRegressor()

boostenc.fit(X_train, Y_train)

Answer 1

Score: 1

I'll give it a try. Here is one possible way to optimize your code:

## Code:
```python
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('path/beforeNeural.csv')

# All categorical columns in one list, so they are easy to change later
categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[categorical_columns])  # SciPy sparse matrix
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = encoder.get_feature_names_out(categorical_columns)

# Approach 1 (dense; can use a lot of RAM when the columns have many unique values):
# X_concat = pd.DataFrame(encoded.toarray(), columns=feature_names, index=df.index)
# Approach 2 (sparse; pd.SparseDataFrame was removed in pandas 1.0, so use the .sparse accessor):
X_concat = pd.DataFrame.sparse.from_spmatrix(encoded, columns=feature_names, index=df.index)

# Everything that is neither categorical nor the target
X_numerical = df.drop(columns=categorical_columns + ['Price'])
X = pd.concat([X_numerical, X_concat], axis=1)
Y = df['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
```

> Note: I removed the unused imports and the bare calls such as
> df.head() in the middle of the code. A notebook only displays the value
> of the last expression in a cell, so used like that, in the middle of a
> cell, they do nothing and print nothing.

Code Explanation:

  1. Instead of LabelEncoder, I used OneHotEncoder to one-hot-encode all of the categorical features.
    This creates a new binary column for each unique value of each categorical feature.
    One-hot encoding is usually a better way to handle categorical features in machine learning than assigning integer values with LabelEncoder, because the integer codes imply an ordering that the categories do not have.
  2. I collected the names of all the categorical columns in a list, which makes them easier to change when needed.

huangapple
  • Posted on February 19, 2023, 22:12:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75500721.html