Color Regions in a Scatter Plot
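# Question
I recently found out that you can create color regions for scatter plots in Orange. I know Orange sits on top of Python, so I figured I'd be able to recreate this, but I'm having a hard time. I haven't figured out how to convert a pandas DataFrame for Orange. More importantly, I'm working in a Spark environment, so it would be better if I could go from PySpark to Orange.
I've set up a basic scatter plot with seaborn and matplotlib to see if I could figure it out.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset from Seaborn
iris = sns.load_dataset("iris")

# Create a scatter plot
sns.scatterplot(x="sepal_length", y="petal_width", hue="species", data=iris)

# Add labels and title
plt.xlabel("Sepal Length")
plt.ylabel("Petal Width")
plt.title("Scatter Plot of Sepal Length vs. Petal Width")

# Show the plot
plt.legend()
plt.show()
```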
[![enter image description here][1]][1]

[1]: https://i.stack.imgur.com/om4pt.png
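# Answer 1
**Score**: 1
According to the [Orange documentation](https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/visualize/scatterplot.html#intelligent-data-visualization):
> If a categorical variable is selected in the Color section, the score is computed as follows. For each data instance, the method finds 10 nearest neighbors in the projected 2D space, that is, on the combination of attribute pairs. It then checks how many of them have the same color. The total score of the projection is then the average number of same-colored neighbors.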
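The scoring rule in that quote can be sketched in a few lines of NumPy. This is only an illustration of the description, not Orange's actual implementation: the function name `projection_score` is made up here, and excluding each point from its own neighbor list is an assumption the quote does not settle.

```python
import numpy as np


def projection_score(points, labels, k=10):
    """Average number of same-label points among each point's k nearest neighbors.

    Hypothetical helper that follows the quoted description; assumes a point
    does not count as its own neighbor.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances in the projected 2D space.
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # never pick the point itself
    nearest = np.argsort(dist, axis=1)[:, :k]
    # For every point, count neighbors that share its label, then average.
    same = labels[nearest] == labels[:, None]
    return same.sum(axis=1).mean()


# Two well-separated groups of 11 points each: every neighbor shares the label.
pts = [(x, 0.0) for x in range(11)] + [(x, 0.0) for x in range(100, 111)]
labs = [0] * 11 + [1] * 11
print(projection_score(pts, labs, k=10))  # 10.0
```

A score close to `k` means the classes separate cleanly in that pair of attributes, which is also when colored regions are easiest to read.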
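You can get similar results using scikit-learn's k-nearest neighbors classifier. There is an example in [their docs](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html) that also uses the iris dataset.
I've modified this example to be more similar to the screenshot you shared: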
```python
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import datasets, neighbors
from sklearn.inspection import DecisionBoundaryDisplay

n_neighbors = 10

# Import the iris dataset
iris = datasets.load_iris()

# Select features (petal length and petal width)
features = [2, 3]
X = iris.data[:, features]
y = iris.target

# Create color maps: one for the regions, one for the points
cmap_light = ListedColormap(["blue", "red", "green"])
cmap_bold = ["blue", "red", "green"]

# Create an instance of the nearest neighbors classifier and fit the data
clf = neighbors.KNeighborsClassifier(n_neighbors, weights="distance")
clf.fit(X, y)

# Plot the decision boundaries as shaded regions
_, ax = plt.subplots()
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    cmap=cmap_light,
    ax=ax,
    response_method="predict",
    plot_method="pcolormesh",
    xlabel=iris.feature_names[features[0]],
    ylabel=iris.feature_names[features[1]],
    shading="auto",
    alpha=0.3,
)

# Plot the training points on the same axes
sns.scatterplot(
    x=X[:, 0],
    y=X[:, 1],
    hue=iris.target_names[y],
    palette=cmap_bold,
    alpha=1.0,
    edgecolor="black",
    ax=ax,
)
plt.show()
```
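The example above starts from scikit-learn's array-based copy of iris, whereas the question starts from a DataFrame. A minimal sketch of bridging the two, assuming a pandas DataFrame with two numeric columns and a string label column: the tiny frame below holds made-up illustrative values standing in for something like `sns.load_dataset("iris")`. For data held in Spark, `toPandas()` on a PySpark DataFrame is the usual way to get a pandas frame first, provided the data fits in driver memory.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for a real DataFrame; the values are illustrative only.
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petal_width": [0.2, 0.2, 0.1, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Feature matrix as a plain NumPy array, labels as integer codes.
X = df[["petal_length", "petal_width"]].to_numpy()
codes, class_names = pd.factorize(df["species"])

# k is kept small here only because the toy frame has nine rows.
clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, codes)

# Map predicted integer codes back to the original label strings.
predicted = class_names[clf.predict([[1.4, 0.2], [6.0, 2.4]])]
print(list(predicted))
```

With `X` and `codes` in hand, the plotting part of the answer works unchanged if `codes` is used as `y` and `class_names` in place of `iris.target_names`.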
# Answer 2
**Score**: 1
The code below produces a similar-looking plot to the one you posted. It uses matplotlib directly for plotting.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

#
# Load data
#
iris = load_iris(as_frame=True)
iris_x = iris.data
iris_y = iris.target
# 'petal length (cm)' -> 'Petal length'
iris_x.columns = [col.capitalize()[:-5] for col in iris_x.columns]

#
# Choose a color for each class
#
# Choose automatically across all classes
np.random.seed(2)
class_colors = np.random.choice(
    list(matplotlib.colors.CSS4_COLORS),
    size=len(iris_y.unique()),
    replace=False
)
# Alternatively, specify per class:
class_colors = ['tab:red', 'tab:green', 'tab:blue']

print('Class colors are:', class_colors)
# display() is provided by Jupyter/IPython; drop this line in a plain script
display(matplotlib.colors.ListedColormap(class_colors))

# Create a white-to-color colormap out of each class color
class_cmaps = [
    matplotlib.colors.LinearSegmentedColormap.from_list('Custom', ['w', color])
    for color in class_colors
]
# View the colormaps
# for cmap in class_cmaps: display(cmap)

#
# Select features and fit KNN classifier
#
feat0 = 'Petal length'
feat1 = 'Petal width'
iris_x = iris_x[[feat0, feat1]]

n_neighbors = 10
knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights='distance').fit(iris_x.values, iris_y)

#
# Define a feature space and get a prediction over the entire area
#
x_grid, y_grid = np.meshgrid(
    np.linspace(iris_x[feat0].min(), iris_x[feat0].max(), 100),
    np.linspace(iris_x[feat1].min(), iris_x[feat1].max(), 100)
)
grid_flat = np.hstack([x_grid.reshape(-1, 1), y_grid.reshape(-1, 1)])

# At each point in the feature space, get the
# predicted class and the nearest neighbors
classes = knn.predict(grid_flat)
neighbors = knn.kneighbors(grid_flat, return_distance=False)

# For each point, what proportion of neighbors match the predicted class
prop_per_gridpt = [sum(iris_y[row_neighbors] == clas) / n_neighbors
                   for row_neighbors, clas
                   in zip(neighbors, classes)]

# Convert proportions to colors. Each class has its own colormap.
rgb_per_gridpt = [
    class_cmaps[clas](prop)
    for clas, prop in zip(classes, prop_per_gridpt)
]
rgb_per_gridpt = np.array(rgb_per_gridpt).reshape(x_grid.shape + (4,))

#
# Plot
#
f, ax = plt.subplots(figsize=(8, 8))
ax.scatter(iris_x[feat0], iris_x[feat1], c=np.choose(iris_y.values, class_colors), s=60,
           alpha=0.7, linewidth=2)
ax.set_xlabel(feat0)
ax.set_ylabel(feat1)
ax.set_title(f'Scatter plot of {feat0} vs. {feat1}')

ax.imshow(rgb_per_gridpt, extent=ax.axis(), alpha=0.5,
          interpolation='bicubic', origin='lower')
```
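The shading in the code above relies on one white-to-color colormap per class, evaluated at the proportion of agreeing neighbors. As a small isolated sketch of just that step (the colormap name `white_to_class` and the three proportions are illustrative choices, not part of the answer):

```python
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, to_rgba

color = "tab:red"
# Colormap running from white (proportion 0) to the class color (proportion 1).
cmap = LinearSegmentedColormap.from_list("white_to_class", ["w", color])

# Illustrative proportions of neighbors that agree with the predicted class.
proportions = np.array([0.0, 0.5, 1.0])
rgba = cmap(proportions)  # shape (3, 4): one RGBA row per proportion

print(rgba)
```

A grid point whose neighbors all agree gets the full class color, while a point near a class boundary fades towards white, which is what produces the soft edges between regions.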