Comparison of K-means Clustering and K-medoids

  • Centroid vs. Medoid:
    • K-means: Uses centroids (mean points) as cluster representatives.
    • K-medoids: Uses medoids (actual data points that minimize the total distance to the other points in the cluster).
  • Robustness to Outliers:
    • K-means: Sensitive to outliers.
    • K-medoids: More robust to outliers.
  • Initialization:
    • K-means: Sensitive to initial centroids.
    • K-medoids: Less sensitive to initial medoids.
  • Cluster Shape:
    • K-means: Assumes roughly spherical clusters.
    • K-medoids: Copes better with irregular shapes, since it can work with arbitrary dissimilarity measures.
  • Computational Complexity (Cost):
    • K-means: Less computationally expensive.
    • K-medoids: Can be more computationally expensive.
  • Cluster Connectivity:
    • K-means: Assigns each point to its nearest centroid and does not naturally connect non-contiguous regions.
    • K-medoids: Also partitions points around representatives, so it does not model connectivity either; density-based connectivity is what DBSCAN (discussed below) provides.
  • Use Cases:
    • K-means: Well-defined, spherical clusters, computational efficiency.
    • K-medoids: Irregular clusters, robustness to outliers.

Finally, k-means clustering is faster, suited to spherical clusters, and sensitive to outliers.

K-medoids is more robust to outliers and handles irregular shapes better, but can be more computationally expensive.

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points according to how densely they are arranged in the feature space. Unlike k-means, which assumes roughly spherical clusters, DBSCAN can find clusters of any shape. It treats outliers as noise and is especially good at detecting clusters separated by regions of lower density. The steps are listed below, followed by a short code sketch.

Steps in DBSCAN:

  1. Parameter Selection:
    • Choose two parameters: eps (epsilon) and min_samples.
    • eps: Radius around a data point to define its neighborhood.
    • min_samples: Minimum number of points required to form a dense region.
  2. Core Point Identification:
    • Identify core points by counting the points in the epsilon neighborhood of each data point: a point is a core point if its neighborhood contains at least min_samples points (counting itself).
  3. Cluster Expansion:
    • Form clusters by connecting core points that are within each other’s epsilon neighborhood.
  4. Label Border Points:
    • Label border points that are in the epsilon neighborhood of a core point but are not core points themselves.
  5. Noise Identification:
    • Label as noise any points that are neither core points nor border points.
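
Below is a minimal sketch of these steps using scikit-learn's DBSCAN. The make_moons data and the eps/min_samples values are illustrative assumptions, not settings taken from class.

```python
# Minimal DBSCAN sketch; eps and min_samples are illustrative, not tuned.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # -1 marks points labeled as noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```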

October 25

In today’s class, we delved into the age distribution of individuals involved in police shootings, specifically focusing on disparities between Black and White individuals. Our analysis utilized Mathematica, with plans to develop a Python equivalent for future sessions.

Key Findings:

1. Overall Age Distribution:
– Range: 6 to 91 years
– Average: 37.1 years, Median: 35 years
– Standard Deviation: 13 years
– Slight right skewness (0.73) and kurtosis close to 3, indicating a distribution resembling the normal curve.

2. Age Distribution for Black Individuals:
– Range: 13 to 88 years
– Average: 32.7 years, Median: 31 years
– Standard Deviation: 11.4 years
– Right skewness (1) and kurtosis of 3.9, suggesting a slightly more pronounced tail.

3. Age Distribution for White Individuals:
– Range: 6 to 91 years
– Average: 40 years, Median: 38 years
– Standard Deviation: 13.3 years
– Slightly right-skewed (0.53) with kurtosis of 2.86, indicating a distribution with a less pronounced peak and tails.

4. Comparison Between Black and White Individuals:
– Average age difference of approximately 7.3 years, with White individuals being older on average.
– A Monte Carlo (permutation) simulation confirmed that the difference is statistically significant, implying a low likelihood of this result occurring by chance.
– Cohen’s d effect size of 0.58 indicates a medium magnitude for the observed age difference. A sketch of how such a check can be coded follows this list.
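
The following Python sketch shows one way to run the kind of permutation test and Cohen's d computation described above. The file name and the race codes ("B", "W") are assumptions about the Washington Post dataset, and this is not the exact Mathematica analysis used in class.

```python
# Permutation test and Cohen's d for the Black/White age difference;
# file name and column values are assumptions about the dataset.
import numpy as np
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed file name
black_ages = df.loc[df["race"] == "B", "age"].dropna().to_numpy()
white_ages = df.loc[df["race"] == "W", "age"].dropna().to_numpy()

observed_diff = white_ages.mean() - black_ages.mean()

# Monte Carlo (permutation) test: shuffle the combined ages many times and
# count how often a difference at least this large arises by chance.
rng = np.random.default_rng(0)
combined = np.concatenate([black_ages, white_ages])
n_black = len(black_ages)
n_sims = 10_000
count = 0
for _ in range(n_sims):
    rng.shuffle(combined)
    diff = combined[n_black:].mean() - combined[:n_black].mean()
    if abs(diff) >= abs(observed_diff):
        count += 1
p_value = count / n_sims

# Cohen's d with a pooled standard deviation.
pooled_sd = np.sqrt(
    ((len(black_ages) - 1) * black_ages.var(ddof=1)
     + (len(white_ages) - 1) * white_ages.var(ddof=1))
    / (len(black_ages) + len(white_ages) - 2)
)
cohens_d = observed_diff / pooled_sd
print(f"observed difference: {observed_diff:.1f} years, "
      f"p ≈ {p_value:.4f}, Cohen's d ≈ {cohens_d:.2f}")
```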

This session provided a thorough exploration of age distributions in the context of police shootings, shedding light on distinct patterns between racial groups and general age tendencies in the dataset.

K-medoids

K-medoids is a variant of the k-means clustering technique that addresses some of its drawbacks, most notably its sensitivity to noise and outliers. Instead of using the mean (centroid) of the points in a cluster, k-medoids uses actual data points (medoids) as cluster representatives. A cluster’s medoid is the data point that minimizes the total distance to all other points in the cluster. The steps are outlined below, with a small from-scratch sketch after the list.

Steps in K-medoids:

  1. Initialization:
    • Select K initial data points as the initial medoids.
  2. Assignment:
    • Assign each data point to the cluster represented by the closest medoid.
  3. Update Medoids:
    • For each cluster, choose the data point that minimizes the sum of distances to all other points as the new medoid.
  4. Repeat:
    • Iterate the assignment and medoid update steps until convergence.
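
Here is a minimal from-scratch sketch of these steps (a plain alternating loop, not the optimized PAM algorithm). The synthetic data and K = 3 are illustrative assumptions.

```python
# From-scratch k-medoids loop following the four steps above.
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)        # step 1: initialization
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)      # step 2: assignment
        new_medoids = medoids.copy()
        for j in range(k):                                 # step 3: update medoids
            members = np.where(labels == j)[0]
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):           # step 4: convergence
            break
        medoids = new_medoids
    return labels, medoids

# Three synthetic blobs as toy data.
rng_data = np.random.default_rng(1)
X = np.vstack([rng_data.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
labels, medoids = k_medoids(X, k=3)
print("medoid indices:", medoids)
```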

K-means Clustering

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping clusters. It assigns data points to clusters based on their similarity, with the goal of minimizing the intra-cluster variance (sum of squared distances between data points within the same cluster) and maximizing the inter-cluster variance.

Key Steps in K-means Clustering:

  1. Initialization:
    • Randomly choose K initial cluster centroids.
  2. Assignment:
    • Assign each data point to the cluster whose centroid is closest (Euclidean distance).
  3. Update Centroids:
    • Recalculate the centroids of each cluster based on the mean of the data points assigned to that cluster.
  4. Repeat:
    • Iterate the assignment and centroid update steps until convergence (the centroids no longer change significantly or a set number of iterations is reached); a minimal scikit-learn sketch follows this list.
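
Below is a minimal scikit-learn sketch of these steps. The synthetic blobs and n_clusters=3 are illustrative assumptions.

```python
# Minimal k-means sketch on synthetic spherical clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("inertia (intra-cluster sum of squares):", round(km.inertia_, 2))
print("cluster sizes:", [int((km.labels_ == i).sum()) for i in range(3)])
```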

Washington Post Police Data

The Washington Post police data has entries dating back to January 2, 2015, and it is updated each week with the latest information. Today’s session covered a variety of questions, including how to handle missing values in string-valued columns such as armed, flee, and race, and how those missing values can be filled in. There are several approaches for doing this:

Mode imputation: Replace the missing values with the most frequent entry (mode) in the column. I think this method would be suitable for the armed column; although it has many distinct entries, gun, knife, and replica are the most frequent ones.

Forward fill (ffill) or backward fill (bfill): Fill each missing value with the value immediately above or below it. This method might be suitable for flee, since most of its entries are ‘not’.

Constant imputation: Replace missing values with a specified constant. This method would be appropriate for body camera and signs of mental illness, since they take only True or False values.

Alternatively, if we are unsure how to fill certain columns, such as the state column in this dataset, we can train a machine learning model to predict the missing values from the other entries. The effect of each of the methods discussed above on model accuracy can then be assessed to identify the most effective approach. I would like to verify my testing set with each of these methods and see which one works best. A small pandas sketch of these options follows.
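
The sketch below shows the imputation options with pandas. The file name and exact column names are assumptions about the dataset, not verified field names.

```python
# Pandas sketch of mode, ffill/bfill, and constant imputation;
# file and column names are assumed.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed file name

# Mode imputation for 'armed': fill with the most frequent entry.
df["armed"] = df["armed"].fillna(df["armed"].mode()[0])

# Forward fill for 'flee', then backward fill to catch any leading gap.
df["flee"] = df["flee"].ffill().bfill()

# Constant imputation for a boolean-like column.
df["body_camera"] = df["body_camera"].fillna(False)

print(df[["armed", "flee", "body_camera"]].isna().sum())
```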

Chi-square Distribution

The chi-square (χ²) distribution is a probability distribution that arises in statistics and is commonly used in hypothesis testing and confidence interval construction.

Uses of the Chi-Square Distribution:

  1. Hypothesis Testing for Variances: It is used in the chi-square test for variance to compare a sample variance with a known or hypothesized population variance.
  2. Model Fit Assessment: The chi-square statistic is used to assess how well observed data fit an expected model.
  3. Test of Independence: In a contingency table, the chi-square test for independence is used to determine whether there is a significant association between two categorical variables (a short scipy sketch follows this list).
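
Here is a small scipy sketch of the test for independence; the 2x2 contingency table is made up purely for illustration and is not from the class data.

```python
# Chi-square test for independence on an illustrative 2x2 table.
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: outcome yes / no (made-up counts).
table = [[30, 70],
         [45, 55]]

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi² = {chi2_stat:.2f}, p = {p_value:.4f}, dof = {dof}")
```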

The results of the chi-square analysis on the CDC diabetes 2018 dataframe are shown below:

When the number of degrees of freedom is one or two, the chi-square distribution is a curve shaped like a backwards “J”: it starts out high and then drops off, meaning there is a high probability that χ² is close to zero. The sketch below illustrates this shape.
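
This small sketch plots the chi-square density for low degrees of freedom; the values are generic and not tied to the CDC dataframe.

```python
# Shape of the chi-square density for a few degrees of freedom.
import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

x = np.linspace(0.01, 10, 400)
for df in (1, 2, 5):
    plt.plot(x, chi2.pdf(x, df), label=f"df = {df}")
plt.xlabel("χ²")
plt.ylabel("density")
plt.legend()
plt.title("Chi-square density: backwards-J shape for df = 1 or 2")
plt.show()
```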

 

Plot of residuals vs. predicted values

Creating a residual vs. predicted values plot is a common diagnostic tool in regression analysis. It helps you visually assess the relationship between the predicted values from your regression model and the corresponding residuals (the differences between the observed and predicted values).

Plot for residual vs predicted values:

With the above graph I am able to analyse the presence of outliers in my dataframe and how far they deviate from the actual values. The y = 0 line is the reference against which deviation is measured: larger vertical distances from it indicate more substantial deviations, suggesting that the model made a larger prediction error for those specific cases.

Points above the line represent positive residuals (model underpredicted), while points below the line represent negative residuals (model overpredicted).

Positive outliers suggest that the model underestimated the actual values, while negative outliers suggest overestimation.

From the above, we can see evidence of heteroscedasticity: the spread of the residuals changes across the range of predicted values. A generic sketch of this kind of plot follows.
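
The sketch below builds a residual-vs-predicted plot on synthetic data; the linear model and the data are placeholders for whatever regression was actually fit, and the noise is made to grow with x so the heteroscedastic pattern is visible.

```python
# Generic residuals-vs-predicted diagnostic plot on synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(0, X.ravel() * 0.3)  # noise grows with x

model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted

plt.scatter(predicted, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")  # zero-residual reference line
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. predicted values")
plt.show()
```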

Residual plot of %Diabetes vs %Inactive

A residual plot is a graphical representation of the residuals (the differences between observed and predicted values) in a regression analysis. Examining the residual plot can provide insights into the validity of assumptions and the overall performance of the regression model.

Residual plot of %Diabetes vs %Inactive:

The red dotted horizontal line indicates the zero-residual line, and the blue dots around it are the residuals. With the above residual plot I am able to check several things. Homoscedasticity: homoscedasticity means that the spread of residuals is constant across all levels of the independent variable.

Linearity in the data: A good regression model assumes that the relationship between the independent and dependent variables is linear. In a residual plot, you ideally want to see a random, patternless spread of points around the horizontal axis. If there is a clear pattern, it suggests that the relationship might not be entirely linear and the model might need adjustment.

Presence of outliers: Residual plots can help identify outliers, which are data points that deviate significantly from the overall pattern. Outliers can have a substantial impact on regression results.

And model fit assessment, among other checks. An illustrative sketch of this residual plot follows.
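
The sketch below recreates this kind of residual plot for %Diabetes vs. %Inactive. The file name and column names ("% INACTIVE", "% DIABETIC") are assumptions about the CDC diabetes 2018 dataframe used in class.

```python
# Residual plot of %Diabetes vs %Inactive; file and column names assumed.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

cdc = pd.read_csv("cdc_diabetes_2018.csv")  # assumed file name
X = cdc[["% INACTIVE"]].to_numpy()
y = cdc["% DIABETIC"].to_numpy()

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(X, residuals, color="blue", alpha=0.6)
plt.axhline(0, color="red", linestyle=":")  # zero-residual reference line
plt.xlabel("% Inactive")
plt.ylabel("Residuals of % Diabetic")
plt.title("Residual plot of %Diabetes vs %Inactive")
plt.show()
```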