2023-02-19 22:12:21

How can I optimize my code so my Google Colab doesn't crash?

Question

I ran into an issue where Google Colab runs out of RAM. I use the free version, and I'm not sure whether it simply can't handle the workload or whether my code is badly optimized. As I'm new to the field, I suspect my code is slow and poorly optimized. I'd appreciate some help, as I'm still learning.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')
df.shape
df.head()
df.isnull().sum()
encoder = LabelEncoder()
df['Property Type'] = encoder.fit_transform(df['Property Type'])
df['Old/New'] = encoder.fit_transform(df['Old/New'])
df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
df['County'] = encoder.fit_transform(df['County'])
df['District'] = encoder.fit_transform(df['District'])
df['Town/City'] = encoder.fit_transform(df['Town/City'])
df['Duration'] = encoder.fit_transform(df['Duration'])
df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])
X = df.drop(columns='Price', axis=1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
df.shape
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
```

Answer 1

Score: 1

I'll give it a try; here is one possible way to optimize your code.
## Code:
```python
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive

drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')

# All categorical columns in one list, so they are easy to change later
categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']

# OneHotEncoder returns a SciPy sparse matrix by default
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(df[categorical_columns])
# get_feature_names_out needs scikit-learn >= 1.0 (older releases: get_feature_names)
feature_names = encoder.get_feature_names_out(categorical_columns)

# Use one approach, not both.
# Approach 1 (dense; needs far more RAM):
# X_concat = pd.DataFrame(X_encoded.toarray(), columns=feature_names, index=df.index)
# Approach 2 (sparse; pd.SparseDataFrame was removed in pandas 1.0):
X_concat = pd.DataFrame.sparse.from_spmatrix(X_encoded, index=df.index, columns=feature_names)

X_numerical = df.drop(columns=categorical_columns + ['Price'])
X = pd.concat([X_numerical, X_concat], axis=1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
```

> Note: I removed the unused imports and the bare calls such as
> df.head() in the middle of the code. In a notebook cell only the last
> expression is displayed, so a call like that in the middle of the code
> does nothing and prints nothing.

Code Explanation:

Instead of LabelEncoder, I used OneHotEncoder to one-hot-encode all of the categorical features.
This creates a new binary column for each unique value in each categorical feature.
In general, one-hot encoding is usually a better way to handle categorical features in machine learning than just assigning integer values with LabelEncoder, because integer codes imply an order between categories that usually does not exist.
I extracted the names of all of the categorical columns into a list, so they are easier to modify when needed.
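
To make the difference between the two encodings concrete, here is a minimal dependency-free sketch. It is only an illustration of the idea, not scikit-learn's implementation; the helper names `label_encode` and `one_hot_encode` and the toy `property_type` column are hypothetical and not taken from the question's CSV.

```python
def label_encode(values):
    """Map each distinct value to an integer code (a single output column)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    position = {v: i for i, v in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1  # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows


# Hypothetical toy column, not taken from the question's data
property_type = ["D", "S", "T", "D", "F"]

labels = label_encode(property_type)
categories, one_hot = one_hot_encode(property_type)

print(labels)      # one integer per row
print(categories)  # one output column per distinct value
for row in one_hot:
    print(row)
```

The width of the one-hot output equals the number of distinct values, so a nearly unique column such as 'Transaction unique identifier' turns into roughly one column per row. That is why keeping the encoded matrix sparse matters when RAM is limited.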
