Comparison of k-means clustering and DBSCAN

K-means:
– Assumes spherical clusters, sensitive to outliers.
– Not density-sensitive, requires specifying K.
– Suitable for well-defined, spherical clusters.
– Generally computationally less expensive.

DBSCAN:
– Handles arbitrary cluster shapes, robust to outliers.
– Density-sensitive; does not need the number of clusters up front, but requires eps and min_samples.
– Effective for irregularly shaped clusters and for data containing noise.
– Can be more computationally expensive.

Summary:

  • Cluster Shape:
    • K-means assumes spherical clusters; DBSCAN can handle arbitrary shapes.
  • Outlier Handling:
    • K-means is sensitive to outliers; DBSCAN is robust to outliers.
  • Density Sensitivity:
    • K-means is not sensitive to density variations; DBSCAN is density-sensitive.
  • Parameter Dependency:
    • K-means requires specifying the number of clusters (K); DBSCAN is less dependent on this.

Choose k-means for well-separated, roughly spherical clusters with minimal noise, and DBSCAN for irregularly shaped clusters and noisy data where robustness to outliers matters.
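
A minimal sketch of this contrast on synthetic two-moons data; the eps and min_samples values below are illustrative guesses, not tuned settings.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # label -1 marks noise

# K-means cuts the two crescents with a straight boundary, while DBSCAN
# recovers each crescent as its own cluster.
```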

The Bayesian approach

The Bayesian approach is a statistical framework, grounded in Bayesian probability theory, that provides a systematic way to update beliefs under uncertainty as new evidence is observed.

Key Concepts:

  1. Bayesian Probability:
    • Represents degrees of belief rather than frequencies.
    • Probability is assigned to hypotheses, reflecting the degree of belief in their truth.
  2. Prior Probability:
    • Represents the initial belief or probability before considering new evidence.
    • Based on prior knowledge, experience, or assumptions.
  3. Bayesian Inference:
    • Process of updating beliefs based on observed data.
    • Uses Bayes’ theorem to calculate the posterior probability.
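
As a concrete illustration of that update step, here is a toy Bayes' theorem calculation in Python; every number in it is invented for the example.

```python
# Toy Bayes' theorem update: P(H|E) = P(E|H) * P(H) / P(E)
prior = 0.01           # P(H): initial belief that the hypothesis is true
likelihood = 0.95      # P(E|H): probability of the evidence if H is true
false_positive = 0.05  # P(E|not H): probability of the evidence if H is false

# P(E) via the law of total probability
evidence = likelihood * prior + false_positive * (1 - prior)
posterior = likelihood * prior / evidence  # P(H|E)
print(round(posterior, 3))  # 0.161: the belief rises from 1% to about 16%
```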

Applications:

  1. Machine Learning:
    • Bayesian methods in machine learning include Bayesian networks, Bayesian regression, and Bayesian optimization.
  2. Statistics:
    • Bayesian statistics is used in parameter estimation, hypothesis testing, and model comparison.
  3. Natural Language Processing:
    • Bayesian models are applied in language modeling, text classification, and information retrieval.

The Bayesian approach provides a coherent framework for updating beliefs and making decisions under uncertainty. It has diverse applications across various fields, with advantages in handling uncertainty and incorporating prior knowledge. However, its adoption may require overcoming challenges related to computational complexity and potential subjectivity in the choice of priors.


Comparison of k-means clustering and k-medoids

  • Centroid vs. Medoid:
    • K-means: Uses centroids (mean points), which makes it sensitive to outliers.
    • K-medoids: Uses medoids (actual data points that minimize total distance within the cluster), which makes it less sensitive to outliers.
  • Robustness to Outliers:
    • K-means: Sensitive to outliers.
    • K-medoids: More robust to outliers.
  • Initialization:
    • K-means: Sensitive to initial centroids.
    • K-medoids: Less sensitive to initial medoids.
  • Cluster Shape:
    • K-means: Assumes spherical clusters.
    • K-medoids: More flexible; with a suitable dissimilarity measure it can capture less regular cluster shapes.
  • Computational Cost:
    • K-means: Less computationally expensive.
    • K-medoids: Can be more computationally expensive.
  • Cluster Connectivity:
    • Neither method connects points based on density; both partition the data around K central points (density-based connectivity is what distinguishes DBSCAN).
  • Use Cases:
    • K-means: Well-defined, spherical clusters, computational efficiency.
    • K-medoids: Irregular clusters, robustness to outliers.

In summary, k-means clustering is faster, suits well-defined spherical clusters, and is sensitive to outliers.

K-medoids is more robust to outliers and more flexible in its choice of distance measure, but can be computationally more expensive.
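
A tiny numeric illustration of why the medoid resists outliers better than the mean (the values are arbitrary):

```python
import numpy as np

pts = np.array([1.0, 2.0, 3.0, 100.0])  # one extreme outlier

mean = pts.mean()  # 26.5: dragged far toward the outlier
# the medoid is the actual point that minimizes total distance to the others
medoid = pts[np.argmin([np.abs(pts - p).sum() for p in pts])]  # 2.0
print(mean, medoid)
```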

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points according to how densely they are packed in the feature space. Unlike k-means, which assumes roughly spherical clusters, DBSCAN can find clusters of any shape. It treats outliers as noise and is especially good at separating dense clusters from the sparse, low-density regions between them.

Steps in DBSCAN:

  1. Parameter Selection:
    • Choose two parameters: eps (epsilon) and min_samples.
    • eps: Radius around a data point to define its neighborhood.
    • min_samples: Minimum number of points required to form a dense region.
  2. Core Point Identification:
    • Identify core points: a point is a core point if its epsilon neighborhood contains at least min_samples points.
  3. Cluster Expansion:
    • Form clusters by connecting core points that are within each other’s epsilon neighborhood.
  4. Label Border Points:
    • Label border points that are in the epsilon neighborhood of a core point but are not core points themselves.
  5. Noise Identification:
    • Label as noise any points that are neither core nor border points.
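
A short sketch mapping these steps onto scikit-learn's DBSCAN, using synthetic blob data and illustrative parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # step 1: parameter selection

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # step 2: core points
noise_mask = db.labels_ == -1               # step 5: noise is labeled -1
border_mask = ~core_mask & ~noise_mask      # step 4: in a cluster but not core
print(core_mask.sum(), border_mask.sum(), noise_mask.sum())
```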

October 25

In today’s class, we delved into the age distribution of individuals involved in police shootings, specifically focusing on disparities between Black and White individuals. Our analysis utilized Mathematica, with plans to develop a Python equivalent for future sessions.

Key Findings:

1. Overall Age Distribution:
– Range: 6 to 91 years
– Average: 37.1 years, Median: 35 years
– Standard Deviation: 13 years
– Slight right skewness (0.73) and kurtosis close to 3, indicating a distribution resembling the normal curve.

2. Age Distribution for Black Individuals:
– Range: 13 to 88 years
– Average: 32.7 years, Median: 31 years
– Standard Deviation: 11.4 years
– Right skewness (1) and kurtosis of 3.9, suggesting a slightly more pronounced tail.

3. Age Distribution for White Individuals:
– Range: 6 to 91 years
– Average: 40 years, Median: 38 years
– Standard Deviation: 13.3 years
– Slightly right-skewed (0.53) with kurtosis of 2.86, indicating a distribution with a less pronounced peak and tails.

4. Comparison Between Black and White Individuals:
– Average age difference of approximately 7.3 years, with White individuals being older on average.
– Monte Carlo simulation confirmed statistical significance, implying a low likelihood of this result occurring by chance.
– Cohen’s d effect size of 0.58 indicates a medium magnitude for the observed age difference.

This session provided a thorough exploration of age distributions in the context of police shootings, shedding light on distinct patterns between racial groups and general age tendencies in the dataset.
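
As a first step toward the planned Python equivalent, here is a hedged sketch of the two calculations mentioned above; `black_ages` and `white_ages` are assumed to be NumPy arrays of ages extracted from the dataset.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def permutation_p_value(a, b, n_iter=10_000, seed=0):
    """Monte Carlo test: shuffle group labels and count how often the
    shuffled mean gap is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    combined = np.concatenate([a, b])
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(combined)
        gap = abs(combined[:len(a)].mean() - combined[len(a):].mean())
        hits += gap >= observed
    return hits / n_iter
```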

K-medoids

K-medoids is a variant of the k-means clustering technique that addresses some of its drawbacks, most notably its handling of noise and outliers. Instead of using the mean (centroid) of the data points in a cluster, k-medoids uses actual data points (medoids) as the cluster representatives. A cluster's medoid is the data point that minimizes the total distance to all other points in the cluster.

Steps in K-medoids:

  1. Initialization:
    • Select K initial data points as the initial medoids.
  2. Assignment:
    • Assign each data point to the cluster represented by the closest medoid.
  3. Update Medoids:
    • For each cluster, choose the data point that minimizes the sum of distances to all other points as the new medoid.
  4. Repeat:
    • Iterate the assignment and medoid update steps until convergence.
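
A minimal NumPy sketch of these four steps (a simplified Voronoi-style update rather than the full PAM algorithm):

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # 1. Initialization: pick K data points as the initial medoids
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # 2. Assignment: each point joins its closest medoid's cluster
        labels = np.argmin(dist[:, medoids], axis=1)
        # 3. Update: the new medoid minimizes total distance within the cluster
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        # 4. Repeat until the medoids stop changing
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return labels, X[medoids]
```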

k-means clustering

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping clusters. It assigns data points to clusters based on their similarity, with the goal of minimizing the intra-cluster variance (sum of squared distances between data points within the same cluster) and maximizing the inter-cluster variance.

Key Steps in K-means Clustering:

  1. Initialization:
    • Randomly choose K initial cluster centroids.
  2. Assignment:
    • Assign each data point to the cluster whose centroid is closest (Euclidean distance).
  3. Update Centroids:
    • Recalculate the centroids of each cluster based on the mean of the data points assigned to that cluster.
  4. Repeat:
    • Iterate the assignment and centroid update steps until convergence (when centroids no longer change significantly or a set number of iterations is reached).
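
A minimal scikit-learn sketch of these steps on synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# n_init=10 reruns the random initialization and keeps the best result
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)  # final centroids after convergence
print(km.inertia_)          # intra-cluster sum of squared distances
```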

Washington Post Police Data

The Washington Post Police Data has entries dating back to January 2, 2015, and it is updated each week with the latest information. Today's session covered a variety of questions, including missing values in string-valued columns like armed, flee, and race. How can we fill in these missing values? There are several approaches:

Mode imputation: Replace missing values with the most frequent entry (the mode) of the column. I think this method would be suitable for the armed column: although it has many distinct entries, gun, knife, and replica are by far the most frequent.

Forward fill (ffill) or backward fill (bfill): Fills each missing value with the value immediately above (ffill) or below (bfill) it. This method might be suitable for flee, since most of its entries are ‘not’.

Constant imputation: Replaces missing values with a specified constant. This method would be appropriate for body camera and signs of mental illness, since they take only the values True or False.

Alternatively, if we are unsure how to fill certain columns, such as the one named “state” in this dataset, we can train a machine learning model to predict the missing values from the other entries in the dataset. The methods discussed above can each be applied, and their effect on model accuracy can be assessed to identify the most effective approach. I would like to evaluate my test set under each of these methods and see which one is most efficient.
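
A hedged pandas sketch of the three simpler approaches; the file name and the column names (armed, flee, body_camera) are assumed to match the public Washington Post CSV.

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed local copy

# Mode imputation: fill with the most frequent entry in the column
df["armed"] = df["armed"].fillna(df["armed"].mode()[0])

# Forward fill: copy the value from the row above (bfill would copy from below)
df["flee"] = df["flee"].ffill()

# Constant imputation: a fixed placeholder for a boolean-style column
df["body_camera"] = df["body_camera"].fillna(False)
```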

Chi square distribution

The chi-square (χ²) distribution is a probability distribution that arises in statistics and is commonly used in hypothesis testing and confidence interval construction.

Uses of the Chi-Square Distribution:

  1. Hypothesis Testing for Variances: it is used in the chi-square test for variance to compare sample variances with known or hypothesized population variances.
  2. Model Fit Assessment: the chi-square statistic is used to assess how well the observed data fit the expected model.
  3. Tests of Independence: in a contingency table, the chi-square test for independence is used to determine whether there is a significant association between two categorical variables.
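
For the independence test, a short scipy sketch on a small, entirely hypothetical 2×2 contingency table:

```python
from scipy.stats import chi2_contingency

table = [[30, 10],  # hypothetical counts: group A, outcome yes / no
         [20, 40]]  # hypothetical counts: group B, outcome yes / no

stat, p_value, dof, expected = chi2_contingency(table)
print(stat, p_value, dof)  # a small p-value suggests the variables are associated
```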

The results of the chi-square distribution on the CDC diabetes 2018 dataframe are shown below:

When the number of degrees of freedom is one or two, the chi-square distribution is a curve shaped like a backwards “J”: the curve starts out high and then drops off, meaning there is a high probability that χ² is close to zero.
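
A sketch that reproduces this shape with scipy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0.05, 10, 400)
for df in (1, 2, 5):
    plt.plot(x, chi2.pdf(x, df), label=f"df = {df}")

plt.xlabel("x")
plt.ylabel("density")
plt.legend()
plt.show()  # df = 1 or 2 gives the backwards-J curve; larger df moves the peak right
```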


Plot for residual vs predicted values

Creating a residual vs. predicted values plot is a common diagnostic tool in regression analysis. It helps you visually assess the relationship between the predicted values from your regression model and the corresponding residuals (the differences between the observed and predicted values).

(Figure: residuals plotted against the model’s predicted values.)

With the above plot I can check for outliers in my dataframe and see how far they deviate from the actual values. The y = 0 line is the zero-error reference: the vertical distance of a point from this line is the magnitude of its deviation, and larger vertical distances indicate more substantial prediction errors for those specific cases.

Points above the line represent positive residuals (model underpredicted), while points below the line represent negative residuals (model overpredicted).

Positive outliers suggest that the model underestimated the actual values, while negative outliers suggest overestimation.

From the above plot we can also see evidence of heteroscedasticity: the spread of the residuals changes with the predicted value instead of staying roughly constant.
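
Since the figure itself is not embedded here, the following sketch shows how such a plot can be produced; the synthetic data and LinearRegression model stand in for the actual class model, with noise deliberately constructed to produce heteroscedasticity.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
# noise that grows with X deliberately builds in heteroscedasticity
y = 3 * X.ravel() + rng.normal(scale=X.ravel(), size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")  # the y = 0 reference line
plt.xlabel("Predicted values")
plt.ylabel("Residuals (observed - predicted)")
plt.show()  # a funnel-shaped spread is the visual signature of heteroscedasticity
```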