K-medoids

K-medoids is a variant of the k-means clustering technique that addresses some of its drawbacks, most notably its sensitivity to noise and outliers. In k-medoids, actual data points (medoids) serve as cluster representatives instead of the mean (centroid) of the points within a cluster. A cluster’s medoid is the data point that minimizes the total distance to all other points in the cluster.

Steps in K-medoids (a code sketch follows the list):

  1. Initialization:
    • Select K data points as the initial medoids.
  2. Assignment:
    • Assign each data point to the cluster represented by the closest medoid.
  3. Update Medoids:
    • For each cluster, choose the data point that minimizes the sum of distances to all other points as the new medoid.
  4. Repeat:
    • Iterate the assignment and medoid update steps until convergence.
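
Below is a minimal NumPy sketch of these steps, assuming a 2-D array X of data points and Euclidean distance. It is meant to illustrate the algorithm, not to be an optimized implementation (libraries such as scikit-learn-extra provide a ready-made KMedoids class).

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k distinct data points as the initial medoids
    medoids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the cluster of its closest medoid
        dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new medoid of each cluster is the member point that
        # minimizes the sum of distances to all other members
        new_medoids = medoids.copy()
        for j in range(k):
            cluster = X[labels == j]
            if len(cluster):
                pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
                new_medoids[j] = cluster[pairwise.sum(axis=1).argmin()]
        # Step 4: repeat until the medoids stop changing
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels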

K-means clustering

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping clusters. It assigns data points to clusters based on their similarity, with the goal of minimizing the intra-cluster variance (sum of squared distances between data points within the same cluster) and maximizing the inter-cluster variance.

Key Steps in K-means Clustering (a short example follows the list):

  1. Initialization:
    • Randomly choose K initial cluster centroids.
  2. Assignment:
    • Assign each data point to the cluster whose centroid is closest (Euclidean distance).
  3. Update Centroids:
    • Recalculate the centroids of each cluster based on the mean of the data points assigned to that cluster.
  4. Repeat:
    • Iterate the assignment and centroid update steps until convergence (when centroids no longer change significantly or a set number of iterations is reached).
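
As a quick illustration, scikit-learn’s KMeans class implements this procedure (using a smarter k-means++ initialization by default rather than purely random starting centroids). The small array X below is a toy stand-in for a real feature matrix.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])  # toy data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids (cluster means)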

Washington Post Police Data

The Washington Post police data has entries dating back to January 2, 2015, and it is updated each week with the latest information. Today’s session covered a variety of questions, including the issue of missing values in string columns such as armed, flee, and race. How can we fill in these missing values? There are several approaches:

Mode imputation: replace the missing values with the most frequent entry (mode) in the column. I think this method would be suitable for the armed column; although it has many distinct entries, gun, knife, and replica are the most frequent ones.

Forward fill (ffill) or backward fill (bfill): fills in a missing value with the value above or below it. This method might be suitable for flee, as most of its entries are ‘not’.

Constant imputation: replaces missing values with a specified constant. This method would be appropriate for body camera and signs of mental illness, as they are either True or False.

Alternatively, if we are unsure how to fill certain columns, such as the one named “state” in this dataset, we can train a machine learning model to predict the missing values from the other entries in the dataset. The various methods discussed above for filling missing entries can each be applied, and their effect on model accuracy can be assessed to identify the most effective approach. I would like to verify my testing set with each of the above-mentioned methods and test which one is most effective.
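
As a sketch of the imputation approaches above in pandas (the file name and exact column names here assume a local copy of the Washington Post dataset):

import pandas as pd

df = pd.read_csv('fatal-police-shootings-data.csv')  # assumed local copy

# Mode imputation: fill 'armed' with its most frequent entry
df['armed'] = df['armed'].fillna(df['armed'].mode()[0])

# Forward fill: fill 'flee' with the value above it (bfill would use the value below)
df['flee'] = df['flee'].ffill()

# Constant imputation: fill the boolean columns with a fixed value
df['body_camera'] = df['body_camera'].fillna(False)
df['signs_of_mental_illness'] = df['signs_of_mental_illness'].fillna(False)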

Chi-square distribution

The chi-square (χ²) distribution is a probability distribution that arises in statistics and is commonly used in hypothesis testing and confidence interval construction.

Uses of the Chi-Square Distribution:

  1. Hypothesis Testing for Variances: it is used in the chi-square test for variance to compare sample variances with known or hypothesized population variances.
  2. Model Fit Assessment: the chi-square statistic is used to assess how well the observed data fit the expected model.
  3. Test for Independence: in a contingency table, the chi-square test for independence is used to determine whether there is a significant association between two categorical variables (a small example follows the list).
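
As a small illustration of the independence test, scipy’s chi2_contingency computes the chi-square statistic and p-value from a contingency table; the table below is made-up data, used only to show the call.

from scipy.stats import chi2_contingency

# made-up 2x2 contingency table: rows = groups, columns = outcomes
table = [[30, 10],
         [20, 40]]
stat, p_value, dof, expected = chi2_contingency(table)
print(stat, p_value, dof)  # test statistic, p-value, degrees of freedom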

The results of the chi-square distribution for the CDC diabetes 2018 dataframe are shown below:

When the number of degrees of freedom is one or two, the chi-square distribution is a curve shaped like a backwards “J.” The curve starts out high and then drops off, meaning that there is a high probability that χ² is close to zero.
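
One quick way to see this shape is to plot the chi-square density with scipy; the degrees-of-freedom values below are chosen only to contrast the backwards-J curves (df = 1 and 2) with the humped curve at a higher df.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0.05, 10, 200)  # start slightly above 0: the df=1 density blows up at 0
for df_ in (1, 2, 5):
    plt.plot(x, chi2.pdf(x, df_), label=f'df = {df_}')
plt.xlabel('chi-square value')
plt.ylabel('density')
plt.legend()
plt.show()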


Plot for residual vs predicted values

Creating a residual vs. predicted values plot is a common diagnostic tool in regression analysis. It helps you visually assess the relationship between the predicted values from your regression model and the corresponding residuals (the differences between the observed and predicted values).
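
A minimal sketch of how such a plot can be produced is below. The data is synthetic and generated only to illustrate the idea; the noise is made to grow with x so the plot also shows what heteroscedasticity looks like.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))               # synthetic predictor
y = 2 * X[:, 0] + rng.normal(0, 1 + 0.3 * X[:, 0])  # noise grows with X (heteroscedastic)

model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted                           # observed minus predicted

plt.scatter(predicted, residuals, s=10)
plt.axhline(y=0, color='red', linestyle='--')       # zero-residual reference line
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()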

Plot for residual vs predicted values:

With the above graph I am able to analyse the presence of outliers in my dataframe and how far the predictions deviate from the actual values. The y = 0 line represents zero deviation; larger vertical distances from it indicate more substantial deviations, suggesting that the model made a more significant prediction error for those specific cases.

Points above the line represent positive residuals (model underpredicted), while points below the line represent negative residuals (model overpredicted).

Positive outliers suggest that the model underestimated the actual values, while negative outliers suggest overestimation.

From the above plot we can see the presence of heteroscedasticity: the spread of the residuals is not constant across the range of predicted values.

Residual plot of %Diabetes vs %Inactive

A residual plot is a graphical representation of the residuals (the differences between observed and predicted values) in a regression analysis. Examining the residual plot can provide insights into the validity of assumptions and the overall performance of the regression model.

Residual plot of %Diabetes vs %Inactive:

The red dotted horizontal line indicates the zero-residual line, and the blue dots around it indicate the residuals. With the above residual plot I am able to check:

Homoscedasticity: homoscedasticity means that the spread of residuals is constant across all levels of the independent variable.

Linearity in the data: a good regression model assumes that the relationship between the independent and dependent variables is linear. In a residual plot, you ideally want to see a random, patternless spread of points around the horizontal axis. If there is a clear pattern, it suggests that the relationship might not be entirely linear and the model might need adjustment.

Presence of outliers: residual plots can help identify outliers, which are data points that deviate significantly from the overall pattern. Outliers can have a substantial impact on regression results.

And overall model fit assessment.
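
A brief sketch of how this particular plot can be drawn, assuming diainac_df is loaded and that slope and intercept come from the linregress fit shown in the linear-regression post below:

import matplotlib.pyplot as plt

residuals = diainac_df['% DIABETIC'] - (intercept + slope * diainac_df['% INACTIVE'])
plt.scatter(diainac_df['% INACTIVE'], residuals, s=10)  # blue dots: residuals
plt.axhline(y=0, color='red', linestyle=':')            # red dotted zero-residual line
plt.xlabel('% INACTIVE')
plt.ylabel('Residuals')
plt.show()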

Cross-validation

Cross-validation is a technique used to assess how well a predictive model will perform on new, unseen data (test data).

We’re testing different ways of predicting diabetes based on obesity and inactivity. We try simple and more complex models (polynomials of degree 1 to 4).

In the dataframe (combined_data) we have common data for all three variables: %DIABETIC, %INACTIVE, and %OBESE. Each of the three variables (obesity, inactivity, and diabetes) has records for 354 data points.

In cross-validation we basically perform the following (a sketch in code follows the list):

  1. Divide and Check:
    • We split our data into 5 parts (folds).
    • Train models on 4 parts and check how well they predict on the remaining part.
  2. What We Look For:
    • We’re trying to find the best balance: not too simple (underfit) and not too complex (overfit).
  3. Results:
    • After doing this 5 times (each fold serving as the test set once), we compare which model works best.
    • This helps us choose the best way to predict diabetes from obesity and inactivity.
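
A sketch of this procedure with scikit-learn is below; it assumes combined_data has already been loaded, and the exact column names are assumptions.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = combined_data[['% OBESE', '% INACTIVE']].values  # assumed column names
y = combined_data['% DIABETIC'].values

for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(degree, -scores.mean())  # average test MSE over the 5 folds

The degree with the lowest average test error is the one that best balances underfitting and overfitting.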

Linear Regression for %diabetic and %inactive (2nd post)

From the last blog, the values obtained after performing linear regression are shown below:

The above values indicate the following:

Slope (coefficient for %inactive): for every one-unit increase in %inactive, %diabetic is expected to increase by approximately 0.23.

Intercept: When %inactive is zero, the predicted %diabetic is around 3.77.

R-squared value: The model explains about 19.51% of the variation in %diabetic.

P-value: The very low p-value (1.63e-66) suggests a statistically significant relationship.

Standard Error: The standard error (0.0128) reflects the precision of the model’s predictions.
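
For example, at %inactive = 20 the model predicts %diabetic ≈ 3.77 + 0.23 × 20 ≈ 8.4.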

In conclusion, the model indicates that %inactive and %diabetic have a statistically significant association. But the R-squared value shows that the model only partially explains the variance in %diabetic, indicating that %diabetic may be influenced by other factors not included in the model.

Linear Regression of % Diabetes vs % Inactivity

Performing linear regression in Python is commonly done with libraries such as statsmodels or scikit-learn; here I use the linregress function from scipy.stats.

Performing the linear regression:

from scipy.stats import linregress

slope, intercept, r_value, p_value, std_err = linregress(diainac_df['% INACTIVE'], diainac_df['% DIABETIC'])

where % INACTIVE and % DIABETIC are columns from the diainac_df dataframe.

Plotting the data and the regression line:

import matplotlib.pyplot as plt

regression_line = intercept + slope * diainac_df['% INACTIVE']  # line implied by the fitted slope and intercept
plt.scatter(diainac_df['% INACTIVE'], diainac_df['% DIABETIC'], s=10, label='Data')
plt.plot(diainac_df['% INACTIVE'], regression_line, color='red', label='Linear Regression')

The plot is shown below:

The values obtained are:

Correlation between %Diabetic and %Inactive

Correlation is a statistical measure that indicates the extent to which two variables vary together. It quantifies the strength and direction of a linear relationship between two variables. The correlation coefficient is a numerical value that ranges from -1 to 1.

Retrieve the dataframe using the variable diainac_df:

To obtain the correlation between %Diabetic and %Inactive, perform:

correlation_coefficient = diainac_df['% DIABETIC'].corr(diainac_df['% INACTIVE'])

The ‘corr’ method is a pandas function used to compute correlation coefficients. The resulting value is stored in the variable correlation_coefficient.

A correlation coefficient of 0.441706 indicates a moderate positive linear relationship between the two variables. As one variable increases, the other tends to increase, and vice versa.