The Breusch-Pagan test for heteroscedasticity

The Breusch-Pagan test is a statistical test used to detect the presence of heteroskedasticity in regression models. Heteroskedasticity refers to the situation where the variability of the errors (residuals) in a regression model is not constant across all levels of the independent variables.

The ‘statsmodels’ library can be used to perform the Breusch-Pagan test for heteroscedasticity in Python. The test is used to check the validity of the regression assumptions, which in turn helps improve model accuracy and enables robust statistical inference.

H0: The residuals are uniformly scattered, i.e. their variance is constant. (No heteroskedasticity)

HA: The residuals are not uniformly scattered, i.e. their variance is not constant. (Heteroskedasticity detected)

As a general rule, if the p-value is less than 0.05 we reject the null hypothesis; otherwise we fail to reject the null hypothesis.

CDC Diabetes 2018 project (2nd post)

Q-Q plot for skewness and kurtosis

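A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. When the sample quantiles are plotted against the quantiles of a normal distribution, it tells us whether a data set is approximately normally distributed.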
Q-Q plot skewness for %Diabetic and %Inactive.
The graph above shows over-dispersed data, i.e. positive excess kurtosis, which appears on a Q-Q plot as a flipped S shape. The smallest observations are smaller, and the largest observations are larger, than you would expect from a normal distribution.
Q-Q plot kurtosis for %Diabetic and %Inactive.
The graph shows that the data has a left-skewed distribution. On a Q-Q plot, left-skewed data appears as a concave curve: the upper tail of the data’s distribution is shortened relative to a normal distribution.
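As a rough illustration of the two shapes described above, the sketch below computes Q-Q points with `scipy.stats.probplot` for two hypothetical samples (not the CDC data): a heavy-tailed one and a left-skewed one. The sample choices and the helper name `qq_points` are assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical samples, not the CDC data
heavy_tailed = rng.standard_t(df=3, size=2000)        # positive excess kurtosis
left_skewed = -rng.exponential(scale=1.0, size=2000)  # long left tail

def qq_points(sample):
    """Return theoretical normal quantiles, ordered sample values, and the
    correlation r of the least-squares line fitted through the Q-Q points."""
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    return theoretical, ordered, r

for name, sample in [("heavy-tailed", heavy_tailed), ("left-skewed", left_skewed)]:
    theoretical, ordered, r = qq_points(sample)
    print(f"{name}: skewness = {stats.skew(sample):.2f}, "
          f"excess kurtosis = {stats.kurtosis(sample):.2f}, "
          f"Q-Q correlation r = {r:.3f}")
```

To draw the plot itself, `stats.probplot(sample, dist="norm", plot=plt)` can be used after `import matplotlib.pyplot as plt`. Plotting is left out here so the sketch runs without a display.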

CDC Diabetes 2018 project (1st post)

I have created a variable, diainac_df, which holds the combined data for %diabetic and %inactive together with the FIPS, COUNTY, YEAR and STATE columns. This dataset has 1370 rows.

I then generated a smooth histogram for %Diabetic to analyse the data points.
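One common way to get a smooth histogram is a Gaussian kernel density estimate (KDE). The sketch below assumes that this is what is meant here and uses `scipy.stats.gaussian_kde` on a hypothetical stand-in for the %Diabetic column. The values are simulated and are not the real CDC numbers.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Hypothetical stand-in for the %Diabetic column (simulated, right-skewed)
pct_diabetic = 4.6 + rng.gamma(shape=9.0, scale=1.0 / 3.0, size=1370)

kde = gaussian_kde(pct_diabetic)  # Gaussian kernel density estimate
grid = np.linspace(pct_diabetic.min(), pct_diabetic.max(), 200)
density = kde(grid)               # smoothed density evaluated on the grid

# Location of the highest point of the smoothed curve
mode_estimate = grid[np.argmax(density)]
print(f"Density peaks near {mode_estimate:.2f}")
```

Plotting `grid` against `density` (for example with matplotlib) gives the smooth curve. That step is omitted so the example stays free of plotting dependencies.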

I also calculated the basic statistics: mean: 7.628832, median: 7.45, standard deviation: 1.016278, skewness: 0.658616. The positive skewness shows that the distribution is right-skewed (positively skewed): most of the data lies on the left side of the distribution, with a longer tail to the right.

Kurtosis: 4.130265. This shows that the distribution has heavier tails and a sharper peak than a normal distribution, which has a kurtosis of 3 (an excess kurtosis of 0).

I computed these statistics to see where most of the data lies and, through skewness and kurtosis, to describe the shape of the distribution. This helps in picking the right analysis methods and exploring the data efficiently.
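The statistics above can be computed along the following lines with pandas. This is a sketch on simulated, hypothetical data (a shifted gamma sample chosen only because it is right-skewed), so the numbers it prints will not match the ones reported in this post. Note that pandas reports excess kurtosis, where a normal distribution scores 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical stand-in for the %Diabetic column (simulated, right-skewed)
pct_diabetic = pd.Series(4.6 + rng.gamma(shape=9.0, scale=1.0 / 3.0, size=1370),
                         name="pct_diabetic")

summary = {
    "mean": pct_diabetic.mean(),
    "median": pct_diabetic.median(),
    "std": pct_diabetic.std(),               # sample standard deviation (ddof=1)
    "skewness": pct_diabetic.skew(),         # > 0 means a longer right tail
    "excess_kurtosis": pct_diabetic.kurt(),  # Fisher definition: normal = 0
}
# Roughly on the Pearson scale, where a normal distribution scores 3
summary["kurtosis"] = summary["excess_kurtosis"] + 3

for key, value in summary.items():
    print(f"{key}: {value:.6f}")
```

Which kurtosis convention a tool uses matters when comparing a reported value against the normal-distribution benchmark of 3 or 0.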

 

Analysis of CDC Diabetes 2018

The CDC Diabetes 2018 data set contains county-level data, identified by Federal Information Processing Standards (FIPS) codes, for the variables %obesity, %diabetic and %inactivity. There are 1370 common data points between %inactivity and %diabetic, and 354 common data points for %diabetic, %inactivity and %obesity.

So for the analysis between %diabetic and %inactivity we use 1370 data points, and for the analysis of all three variables together we use 354 data points.
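The "common data points" can be found with an inner join on the FIPS code. The sketch below uses three tiny hypothetical tables. The column names `pct_diabetic`, `pct_inactive` and `pct_obese` and all of the values are made up for illustration and are not taken from the CDC files.

```python
import pandas as pd

# Tiny hypothetical tables; column names and values are illustrative only
diabetic = pd.DataFrame({"FIPS": [1001, 1003, 1005, 1007],
                         "pct_diabetic": [7.1, 8.4, 9.0, 6.8]})
inactive = pd.DataFrame({"FIPS": [1003, 1005, 1007, 1009],
                         "pct_inactive": [15.2, 17.9, 14.1, 16.3]})
obesity = pd.DataFrame({"FIPS": [1005, 1009],
                        "pct_obese": [18.7, 19.2]})

# Inner joins keep only the FIPS codes present in every table involved
diainac_df = diabetic.merge(inactive, on="FIPS", how="inner")
all_three = diainac_df.merge(obesity, on="FIPS", how="inner")

print(len(diainac_df))  # rows common to the diabetic and inactive tables
print(len(all_three))   # rows common to all three tables
```

In the toy example, three FIPS codes appear in both of the first two tables and only one appears in all three. This mirrors how the real data shrinks from 1370 to 354 rows.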

As there is a relatively large number of data points for %inactive and %diabetic, the analysis starts with these two variables, looking at their correlation, mean, median, kurtosis, skewness, etc.

 

Linear regression and its application.

Linear regression is used to model and quantify the relationship between a dependent variable and one or more independent variables. It’s employed for:

Modeling Relationships: Assumes a linear relationship between variables.

Predictive Modeling: Predicts values based on historical data.

Understanding Variable Influence: Identifies and quantifies the impact of predictors on the response.

Hypothesis testing:

1. The hypothesis test for linear regression determines whether variables are related.
2. Null hypothesis (\(H_0\)): No link (\(\beta = 0\)); Alternative hypothesis (\(H_1\)): A link exists (\(\beta \neq 0\)).
3. Estimate the regression model’s coefficients from the data.
4. Examine the significance of each coefficient using a test statistic (such as the t-statistic).
5. Make a decision based on p-values: if the p-value is below the chosen significance level (for example 0.05), reject \(H_0\) and conclude there is a significant association; if not, fail to reject it.

In the upcoming analysis of the CDC 2018 data set, we will use linear regression to model the relationship between diabetes and inactivity and obesity.