Python: How can I make KMeans clustering more meaningful for the Titanic data?

Posted by zpf6vheq on 2023-03-11 in Python

I am running this code.

import pandas as pd
titanic = pd.read_csv('titanic.csv')
titanic.head()

# Import required modules
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = titanic['Name']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)
# fit the model
kmeans.fit(X)
# store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters
titanic.tail()

Finally...

from sklearn.decomposition import PCA

# initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)
# pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())

# save our two dimensions into x0 and x1
x0 = pca_vecs[:, 0]
x1 = pca_vecs[:, 1]

# assign clusters and pca vectors to our dataframe 
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1

titanic.head()

import plotly.express as px

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', text='Name')
fig.show()

This is the plot I see.

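I think this works, but my question is: how can we spread the text out more and/or remove outliers so that the chart is easier to read? I assume the clustering itself is correct, since I am not doing anything special here, but is there any way to make the clusters more distinct or more meaningful?
The data comes from here: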
https://www.kaggle.com/competitions/titanic/data?select=test.csv

Answer #1 (oug3syen)

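You can make a passenger's name appear only when the mouse hovers over that data point. Right now you are drawing every passenger's name next to its point. Because many points sit close together, the names end up drawn on top of one another. You can fix this by changing the plotting code to the following: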

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()

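The only change to the code above is which parameter carries the 'Name' information: 'hover_name' instead of 'text'. Here is how it looks after the change:

Now a name is shown only when you hover over its data point.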
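
A further tweak that may help readability: the 'kmeans' column holds integers, so Plotly Express treats it as a continuous value and draws a color bar. Casting the labels to strings makes it use a discrete palette with one legend entry per cluster. The following is only a minimal sketch; the small DataFrame is made-up data standing in for titanic, and the column name cluster_label and the helper plot_clusters are illustrative, not part of the original code.

```python
import pandas as pd

# Toy stand-in for the `titanic` frame after clustering and PCA (values are made up)
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "x0": [0.10, -0.25, 0.12],
    "x1": [0.05, 0.30, 0.02],
    "kmeans": [3, 7, 3],
})

# String labels are treated as categories, so each cluster gets its own color
df["cluster_label"] = df["kmeans"].astype(str)


def plot_clusters(frame):
    """Scatter plot with one discrete color per cluster and names on hover."""
    import plotly.express as px  # imported here so the data prep runs without plotly

    return px.scatter(
        frame, x="x0", y="x1", color="cluster_label", hover_name="Name"
    )


# Uncomment to display the figure:
# plot_clusters(df).show()
```

With 20 clusters the legend gets long, but clicking a legend entry toggles that cluster on and off, which can make dense regions easier to inspect.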

Full code

Here is the full code with the change above applied:

# Import required module
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Where our data is located in our machine
train_data_filepath = '/Users/erikingwersen/Downloads/train.csv'
test_data_filepath = '/Users/erikingwersen/Downloads/test.csv'

# Read the train data from downloaded file
titanic = pd.read_csv(train_data_filepath)

documents = titanic['Name']

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

# Initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)

# Fit the model
kmeans.fit(X)

# Store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters

# Initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)

# Pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())

# Save our two dimensions into x0 and x1
x0, x1 = pca_vecs[:, 0], pca_vecs[:, 1]

# Assign clusters and pca vectors to our dataframe
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1

titanic.head()

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()
