Cross validation

Cross-validation is a technique used to assess how well a predictive model will perform on new, unseen data (test data).

We’re testing different ways of predicting diabetes based on obesity and inactivity. We try simple and more complex models (polynomials of degree 1 to 4).

In the dataframe (combined_data) we have common data for all three variables: %DIABETIC, %INACTIVE, and %OBESE. All three variables (obesity, inactivity, and diabetes) have records for 354 data points.

In cross-validation we essentially do the following (a rough code sketch follows the list):

1. Divide and Check:

   – We split our data into 5 parts (folds).

   – Train models on 4 parts and check how well they predict on the remaining part.

2. What We Look For:

   – We’re trying to find the best balance – not too simple (underfit) and not too complex (overfit).

3. Results:

   – After doing this 5 times (each fold serving as the test set once), we compare which model works best.

   – This helps us choose the best way to predict diabetes with obesity and inactivity.
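
A minimal sketch of this procedure, assuming combined_data has columns named '% OBESE', '% INACTIVE', and '% DIABETIC' (the exact column names are an assumption), using scikit-learn to run 5-fold cross-validation over polynomial degrees 1 to 4:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Predictors (% OBESE, % INACTIVE) and target (% DIABETIC); column names assumed
X = combined_data[['% OBESE', '% INACTIVE']].values
y = combined_data['% DIABETIC'].values

# Split the 354 rows into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for degree in range(1, 5):
    # Polynomial features of the given degree followed by ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # cross_val_score returns negative MSE, so flip the sign (lower is better)
    mse = -cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    print(f"degree {degree}: mean cross-validated MSE = {mse.mean():.4f}")

The degree with the lowest average test error across the five folds strikes the balance between underfitting and overfitting.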

Linear Regression for %diabetic and %inactive (2nd post)

From the last blog, the values obtained after performing linear regression are shown below:

These values indicate the following:

The linear regression output can be read as follows:

Slope (coefficient for %inactive): for every one-unit increase in %inactive, %diabetic is expected to increase by approximately 0.23.

Intercept: When %inactive is zero, the predicted %diabetic is around 3.77.

R-squared value: The model explains about 19.51% of the variation in %diabetic.

P-value: The very low p-value (1.63e-66) suggests a statistically significant relationship.

Standard error: The standard error (0.0128) reflects the precision of the estimated slope.

In conclusion, the model indicates that %inactive and %diabetic have a statistically significant association. However, the R-squared value shows that the model only partially explains the variance in %diabetic, indicating that %diabetic may be influenced by other factors not included in the model.

Linear Regression of % Diabetes vs % Inactivity

In Python, linear regression is commonly performed using the scipy, statsmodels, or scikit-learn libraries; here we use scipy.stats.linregress.

Performing the linear regression:

from scipy.stats import linregress

slope, intercept, r_value, p_value, std_err = linregress(diainac_df['% INACTIVE'], diainac_df['% DIABETIC'])

where % DIABETIC and % INACTIVE are columns of the diainac_df dataframe.

Plotting the data and the regression line:

import matplotlib.pyplot as plt

regression_line = slope * diainac_df['% INACTIVE'] + intercept

plt.scatter(diainac_df['% INACTIVE'], diainac_df['% DIABETIC'], s=10, label='Data')
plt.plot(diainac_df['% INACTIVE'], regression_line, color='red', label='Linear Regression')

The plot is shown below:

The values obtained are:

Correlation between %Diabetic and %Inactive

Correlation is a statistical measure that indicates the extent to which two variables vary together. It quantifies the strength and direction of a linear relationship between two variables. The correlation coefficient is a numerical value that ranges from -1 to 1.

The data is held in the dataframe diainac_df.

To obtain the correlation between %Diabetic and %Inactive, perform:

diainac_df['% DIABETIC'].corr(diainac_df['% INACTIVE'])

The corr method is a pandas function used to compute correlation coefficients. The resulting value is stored in the variable correlation_coefficient.

A correlation coefficient of 0.441706 indicates a moderate positive linear relationship between the two variables. As one variable increases, the other tends to increase, and vice versa.

 

The Breusch-Pagan test for heteroscedasticity

The Breusch-Pagan test is a statistical test used to detect the presence of heteroskedasticity in regression models. Heteroskedasticity refers to the situation where the variability of the errors (residuals) in a regression model is not constant across all levels of the independent variables.

The statsmodels library can be used to perform the Breusch-Pagan test for heteroscedasticity in Python. This test helps ensure the validity of regression assumptions, improve model accuracy, and enable robust statistical inferences.

H0: The residuals are uniformly scattered (no heteroskedasticity).

HA: The residuals are not uniformly scattered (heteroskedasticity detected). If the p-value is less than 0.05, we reject the null hypothesis; otherwise, we fail to reject it.
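
A minimal sketch of running this test with statsmodels, assuming the regression of % DIABETIC on % INACTIVE from diainac_df (column names as used above):

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit an OLS model of % DIABETIC on % INACTIVE
X = sm.add_constant(diainac_df['% INACTIVE'])
model = sm.OLS(diainac_df['% DIABETIC'], X).fit()

# Breusch-Pagan test on the residuals against the model's regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("LM statistic:", lm_stat, "p-value:", lm_pvalue)

# A p-value below 0.05 leads us to reject H0 and conclude heteroskedasticity is present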

CDC Diabetes 2018 project (2nd post)

Q-Q plot for skewness and kurtosis

A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another; it tells us whether a dataset is normally distributed.
Q-Q plot skewness for %Diabetic and %Inactive.
The graph above shows over-dispersed data, i.e. positive excess kurtosis: the data appear as a flipped S shape. The smallest observations are smaller, and the largest observations are larger, than you would expect from a normal distribution.
Q-Q plot kurtosis for %Diabetic and %Inactive.
This graph shows that the data follow a left-skewed distribution. On a Q-Q plot, left-skewed data appear as a concave curve: the upper tail of the data's distribution is reduced relative to a normal distribution.
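
A minimal sketch of how such a Q-Q plot can be drawn against a normal distribution with scipy and matplotlib, assuming the % DIABETIC column of diainac_df:

import matplotlib.pyplot as plt
from scipy import stats

# Q-Q plot of % DIABETIC quantiles against normal-distribution quantiles
fig, ax = plt.subplots()
stats.probplot(diainac_df['% DIABETIC'], dist='norm', plot=ax)
ax.set_title('Q-Q plot for % DIABETIC')
plt.show()

The same call with diainac_df['% INACTIVE'] produces the corresponding plot for %Inactive.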

CDC Diabetes 2018 project 1 (1st post)

I have created a variable diainac_df, which contains the combined data for %diabetic and %inactive along with FIPS, COUNTY, YEAR, and STATE; this dataset has 1370 rows.

I also generated a smooth histogram for %Diabetic to analyse the data points and calculated the basic statistics: mean: 7.628832, median: 7.45, standard deviation: 1.016278, skewness: 0.658616. The positive skewness shows that the distribution is right-skewed (positively skewed), with most of the data concentrated on the left side of the distribution and a longer right tail.

A kurtosis of 4.130265 shows that the distribution has more data points clustered around the mean than a normal distribution.

These statistics show where most of the data lies and describe the shape of the distribution, which helps in picking the right analysis methods and exploring the data efficiently. A sketch of how they can be computed is given below.
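
A minimal sketch of computing these statistics and the smooth (kernel density) histogram with pandas, assuming the % DIABETIC column of diainac_df; note that skewness and kurtosis conventions (excess vs. raw) vary between libraries, so the exact numbers depend on the convention used:

import matplotlib.pyplot as plt

diabetic = diainac_df['% DIABETIC']

# Basic descriptive statistics
print("Mean:", diabetic.mean())
print("Median:", diabetic.median())
print("Standard deviation:", diabetic.std())
print("Skewness:", diabetic.skew())
print("Kurtosis:", diabetic.kurt())  # pandas reports excess kurtosis

# Smooth histogram (kernel density estimate) of % DIABETIC
diabetic.plot(kind='kde')
plt.xlabel('% DIABETIC')
plt.show()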

 

Analysis of CDC Diabetes 2018

The CDC Diabetes 2018 dataset contains Federal Information Processing Standards (FIPS) county-level data for the variables %obesity, %diabetic, and %inactivity. There are 1370 common data points between %inactivity and %diabetic, and 354 common data points across %diabetic, %inactivity, and %obesity.

So for the analysis between %diabetic and %inactivity we use the 1370 data points, and the 354 data points when all three variables are analysed together.

As there is a relatively large number of data points for %inactive and %diabetic, we first analyse the relationship between them: their correlation, mean, median, kurtosis, skewness, etc. A sketch of how the common data points can be assembled is given below.
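
One way these subsets can be assembled is an inner merge on the FIPS code. This is only a sketch: the frame names diabetes_df, inactivity_df, and obesity_df (and their exact columns) are assumptions, not the original code.

import pandas as pd

# Hypothetical source frames, each assumed to carry a FIPS column:
# diabetes_df:   FIPS, COUNTY, STATE, YEAR, % DIABETIC
# inactivity_df: FIPS, % INACTIVE
# obesity_df:    FIPS, % OBESE

# Counties present in both the diabetes and inactivity tables (1370 rows)
diainac_df = pd.merge(diabetes_df, inactivity_df, on='FIPS', how='inner')

# Counties present in all three tables (354 rows)
combined_data = pd.merge(diainac_df, obesity_df, on='FIPS', how='inner')

print(len(diainac_df), len(combined_data))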

 

Linear regression and its application.

Linear regression is used to model and quantify the relationship between a dependent variable and one or more independent variables. It’s employed when:

Modeling Relationships: Assumes a linear relationship between variables.

Predictive Modeling: Predicts values based on historical data.

Understanding Variable Influence: Identifies and quantifies the impact of predictors on the response.

Hypothesis testing:

1. The hypothesis test for linear regression determines whether variables are related.
2. Null hypothesis (\(H_0\)): no link (\(\beta = 0\)); alternative hypothesis (\(H_1\)): a link exists.
3. Estimate the regression model's coefficients from the data.
4. Examine the significance of each coefficient using a test statistic (such as the t-statistic).
5. Make a decision based on the p-value: if it indicates a significant association, reject \(H_0\); otherwise, fail to reject it (a code sketch of this check follows the list).
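
A minimal sketch of steps 3 to 5 with statsmodels, assuming the diainac_df columns used earlier; the OLS summary reports each coefficient together with its t-statistic and p-value:

import statsmodels.api as sm

# Estimate the coefficients of % DIABETIC regressed on % INACTIVE
X = sm.add_constant(diainac_df['% INACTIVE'])
results = sm.OLS(diainac_df['% DIABETIC'], X).fit()

# The summary table lists the slope, its t-statistic, and its p-value
print(results.summary())

# Decision rule: reject H0 (beta = 0) when the p-value is below 0.05
print("p-value for % INACTIVE:", results.pvalues['% INACTIVE'])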

In the CDC 2018 dataset, we will use linear regression to model and predict the relationship between diabetes and both inactivity and obesity.