SARIMA

SARIMA, which stands for Seasonal AutoRegressive Integrated Moving Average, is a time series forecasting model that extends the ARIMA model to handle seasonal patterns. The SARIMA model is particularly useful when a time series exhibits both non-seasonal and seasonal trends.

1. Seasonal Component (S): This represents the repeating pattern within each season. For example, if you have monthly data and observe a consistent pattern each year, the seasonal component captures this.

2. AutoRegressive Component (AR): Similar to ARIMA, this accounts for the relationship between a current observation and its past values. It models the linear dependence of the current value on its previous values.

3. Integrated Component (I): Denotes the number of differences needed to make the time series stationary. Stationarity is a key assumption in time series analysis.

4. Moving Average Component (MA): Like in ARIMA, this models the relationship between the current observation and past forecast errors (residuals).

The SARIMA model is denoted as SARIMA(p, d, q)(P, D, Q)s, where:

– p, d, q: Parameters for the non-seasonal component (ARIMA).
– P, D, Q: Parameters for the seasonal component.
– s: The length of the seasonal cycle (e.g., 12 for monthly data).
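
To make the notation concrete, here is a minimal sketch of fitting a SARIMA(1, 1, 1)(1, 1, 1)12 model with statsmodels; the file name and the choice of column are placeholders for whatever monthly series you are working with, not code from this post.

```python
# Minimal SARIMA sketch with statsmodels; file and series are hypothetical.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical monthly series; replace with your own data.
series = pd.read_csv("monthly_data.csv", index_col=0, parse_dates=True).iloc[:, 0]

# SARIMA(p, d, q)(P, D, Q)s with p = d = q = P = D = Q = 1 and a yearly cycle (s = 12).
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

print(result.summary())
print(result.forecast(steps=12))  # forecast the next 12 months
```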

 

Time series forecasting

Time Series Analysis:

Definition: Time series analysis involves studying and modeling data collected over time to identify patterns and trends and to make predictions.

Methods:
1. Moving Averages: Smooths out short-term fluctuations over time.
2. Exponential Smoothing: Assigns exponentially decreasing weights to older observations.
3. ARIMA (AutoRegressive Integrated Moving Average): Models temporal dependencies and trends.
4. Seasonal Decomposition of Time Series (STL): Breaks down data into trend, seasonal, and remainder components.
5. Prophet: Developed by Facebook for forecasting series with daily observations and multiple seasonalities.
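
As a quick illustration of the first two methods, here is a small pandas sketch; the sample values and dates are made up for illustration.

```python
import pandas as pd

# Hypothetical daily series for illustration.
s = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
              index=pd.date_range("2023-01-01", periods=10, freq="D"))

moving_avg = s.rolling(window=3).mean()             # 1. moving average smooths fluctuations
exp_smooth = s.ewm(alpha=0.3, adjust=False).mean()  # 2. exponential smoothing weights recent points more

print(pd.DataFrame({"raw": s, "ma_3": moving_avg, "ewm_0.3": exp_smooth}))
```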

Uses:
– Forecasting: Predict future values based on historical patterns.
– Anomaly Detection: Identify unusual patterns or events.
– Trend Analysis: Understand long-term developments.
– Financial Market Analysis: Predict stock prices and market trends.
– Demand Planning: Optimize inventory based on future demand.
– Economic Indicators: Analyze and predict economic trends over time.

Washington Post police data (2nd post)

I have created a dataframe called ‘gl’ which contains the latitude and longitude values from the Washington Post police shootings data. Both columns have a count of 7162. The mean longitude is -97.040644 and the mean latitude is 36.675719.

I have also created a geographic histogram of the gl dataframe over a map of the USA.
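
For reference, here is a rough Python sketch of how a similar geographic histogram could be produced; the file name, column names, and the use of plotly are assumptions for illustration, not the exact code behind the plot in this post.

```python
import pandas as pd
import plotly.express as px

# Hypothetical file and column names for the Washington Post data.
df = pd.read_csv("police_shootings.csv")
gl = df[["latitude", "longitude"]].dropna()
print(gl.describe())  # count and mean for both coordinates

# A density map over the USA as a stand-in for a geographic histogram.
fig = px.density_mapbox(gl, lat="latitude", lon="longitude", radius=5,
                        center={"lat": 37.0, "lon": -97.0}, zoom=3,
                        mapbox_style="open-street-map")
fig.show()
```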

Decision trees (Nov 8)

Decision trees are a popular and intuitive machine learning algorithm used for both classification and regression tasks. They are a tree-like model in which each internal node represents a decision based on the value of a particular feature and each leaf node represents the prediction or outcome.

Advantages of using decision trees:

– They are simple to understand and interpret, as the trees can be visualized.

– They are able to handle multi-output problems.

– It is possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model.

How decision trees are used:

  1. Classification:
    • Training: Given a dataset with labels, the decision tree algorithm recursively splits the data based on feature values to create a tree.
    • Prediction: For a new data point, it traverses the tree, making decisions at each node based on feature values until it reaches a leaf node, providing the predicted class.
  2. Regression:
    • Training: Similar to classification but applied to tasks where the output is a continuous value.
    • Prediction: The tree predicts a continuous value by averaging the target values of the instances in the leaf node.
  3. Handling Missing Values:
    • Decision trees can handle missing values in the data by selecting the best available split based on the available features.
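
A minimal classification sketch with scikit-learn's DecisionTreeClassifier, using the built-in iris dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)                           # training: recursive splits on feature values
print("test accuracy:", tree.score(X_test, y_test))  # prediction on unseen data
print(export_text(tree))                             # the fitted tree can be printed or visualized
```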

Comparison of k-medoids and DB scan

K-medoids:

  1. Cluster Shape:
    • Like k-means, it tends to form compact, roughly convex clusters, although it can use arbitrary dissimilarity measures.
  2. Outlier Handling:
    • Less sensitive to outliers, because medoids are actual data points rather than means.
  3. Density Sensitivity:
    • Not explicitly density-sensitive.
  4. Parameter Dependency:
    • Sensitive to the number of clusters (K) and the choice of initial medoids.
  5. Use Cases:
    • Effective when robustness to outliers or a non-Euclidean dissimilarity measure is needed.
  6. Computational Efficiency:
    • Can be computationally expensive, especially for large datasets.

DBSCAN:

  1. Cluster Shape:
    • Can find clusters of arbitrary shapes.
  2. Outlier Handling:
    • Robust to outliers due to density-based approach.
  3. Density Sensitivity:
    • Sensitive to varying cluster densities.
  4. Parameter Dependency:
    • Less dependent on specifying the number of clusters.
  5. Use Cases:
    • Effective for datasets with irregularly shaped clusters and noisy data.
  6. Computational Efficiency:
    • Roughly O(n log n) with a spatial index, although the worst case is O(n²); it generally scales better than PAM-style k-medoids.

Choose between K-medoids and DBSCAN based on the characteristics of your data and the objectives of your clustering task. DBSCAN is useful for density-based clustering and noise detection, while K-medoids may be favored when you need a fixed number of clusters and some resistance to outliers.
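
A quick side-by-side sketch of the outlier behaviour described above, assuming the scikit-learn-extra package is available for KMedoids (DBSCAN ships with scikit-learn itself); the data are synthetic blobs with two artificial outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids  # from the scikit-learn-extra package

# Three tight blobs plus two obvious outliers.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 6]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[20.0, 20.0], [-20.0, 20.0]]])

kmed = KMedoids(n_clusters=3, random_state=0).fit(X)
db = DBSCAN(eps=0.8, min_samples=5).fit(X)

print("k-medoids labels for the two outliers:", kmed.labels_[-2:])  # forced into a cluster
print("DBSCAN labels for the two outliers:  ", db.labels_[-2:])     # -1 marks noise
```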

Comparison of k-means clustering and DBSCAN

K-means:
– Assumes spherical clusters, sensitive to outliers.
– Not density-sensitive, requires specifying K.
– Suitable for well-defined, spherical clusters.
– Generally computationally less expensive.

DBSCAN:
– Handles arbitrary cluster shapes, robust to outliers.
– Density-based, less dependent on specifying the number of clusters.
– Effective for irregularly shaped clusters and noisy data.
– Can be more computationally expensive.

Summary:

  • Cluster Shape:
    • K-means assumes spherical clusters; DBSCAN can handle arbitrary shapes.
  • Outlier Handling:
    • K-means is sensitive to outliers; DBSCAN is robust to outliers.
  • Density Sensitivity:
    • K-means is not sensitive to density variations; DBSCAN is density-sensitive.
  • Parameter Dependency:
    • K-means requires specifying the number of clusters (K); DBSCAN is less dependent on this.

Choose K-means for well-separated, roughly spherical clusters with minimal noise, and DBSCAN for irregularly shaped clusters and noisy data where robustness to outliers matters.
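
A hedged sketch of this contrast on non-spherical data (two interleaving half-moons from scikit-learn); the eps value of 0.2 is just a reasonable guess for this synthetic set.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# K-means tends to cut each moon in half; DBSCAN typically recovers both shapes.
print("K-means ARI:", adjusted_rand_score(y_true, km.labels_))
print("DBSCAN  ARI:", adjusted_rand_score(y_true, db.labels_))
```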

The Bayesian approach

The Bayesian approach is a statistical framework based on Bayesian probability theory, which provides a systematic way to update beliefs about uncertain quantities in light of new evidence.

Key Concepts:

  1. Bayesian Probability:
    • Represents degrees of belief rather than frequencies.
    • Probability is assigned to hypotheses, reflecting the degree of belief in their truth.
  2. Prior Probability:
    • Represents the initial belief or probability before considering new evidence.
    • Based on prior knowledge, experience, or assumptions.
  3. Bayesian Inference:
    • Process of updating beliefs based on observed data.
    • Uses Bayes’ theorem to calculate the posterior probability.
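
A tiny worked example of Bayes' theorem, P(H|E) = P(E|H) · P(H) / P(E), with made-up numbers for a diagnostic test:

```python
# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E); the rates below are invented.
prior = 0.01            # P(disease): belief before seeing the test result
sensitivity = 0.95      # P(positive | disease)
false_positive = 0.10   # P(positive | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)  # P(positive)
posterior = sensitivity * prior / evidence                     # P(disease | positive)

print(f"posterior probability of disease given a positive test: {posterior:.3f}")  # ~0.088
```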

Applications:

  1. Machine Learning:
    • Bayesian methods in machine learning include Bayesian networks, Bayesian regression, and Bayesian optimization.
  2. Statistics:
    • Bayesian statistics is used in parameter estimation, hypothesis testing, and model comparison.
  3. Natural Language Processing:
    • Bayesian models are applied in language modeling, text classification, and information retrieval.

The Bayesian approach provides a coherent framework for updating beliefs and making decisions under uncertainty. It has diverse applications across various fields, with advantages in handling uncertainty and incorporating prior knowledge. However, its adoption may require overcoming challenges related to computational complexity and potential subjectivity in the choice of priors.

 

Comparison of k-means clustering and k-medoids

  • Centroid vs. Medoid:
    • K-means: Uses centroids (mean points), which makes it sensitive to outliers.
    • K-medoids: Uses medoids (actual data points that minimize total dissimilarity), which makes it less sensitive to outliers.
  • Robustness to Outliers:
    • K-means: Sensitive to outliers.
    • K-medoids: More robust to outliers.
  • Initialization:
    • K-means: Sensitive to initial centroids.
    • K-medoids: Less sensitive to initial medoids.
  • Cluster Shape:
    • K-means: Assumes roughly spherical clusters.
    • K-medoids: Also favors compact clusters, but can use arbitrary distance metrics.
  • Computational Complexity(cost) :
    • K-means: Less computationally expensive.
    • K-medoids: Can be more computationally expensive.
  • Cluster Connectivity:
    • K-means: Does not naturally connect non-contiguous regions.
    • K-medoids: Likewise partitions points by their nearest medoid; it does not connect points based on density.
  • Use Cases:
    • K-means: Well-defined, spherical clusters, computational efficiency.
    • K-medoids: Irregular clusters, robustness to outliers.

In summary, k-means clustering is faster, suitable for spherical clusters, and sensitive to outliers.

K-medoids is more robust to outliers and can work with arbitrary distance measures, but it can be computationally more expensive.
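
A small numpy-only sketch of the centroid-versus-medoid difference under a single extreme outlier; the points are invented for illustration.

```python
import numpy as np

# One tight group of points plus a single extreme outlier.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9], [100.0, 100.0]])

centroid = X.mean(axis=0)  # the mean is dragged far toward the outlier

# The medoid is the actual data point that minimizes total distance to all others.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
medoid = X[dists.sum(axis=1).argmin()]

print("centroid (k-means style):  ", centroid)  # roughly [20.8, 20.78]
print("medoid   (k-medoids style):", medoid)    # stays at [1.0, 1.0]
```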

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points according to how densely they are arranged in the feature space. Unlike k-means, which assumes roughly spherical clusters, DBSCAN can find clusters of any shape. It treats outliers as noise and is especially good at detecting clusters separated by regions of low density.

Steps in DBSCAN:

  1. Parameter Selection:
    • Choose two parameters: eps (epsilon) and min_samples.
    • eps: Radius around a data point to define its neighborhood.
    • min_samples: Minimum number of points required to form a dense region.
  2. Core Point Identification:
    • Identify core points: data points that have at least min_samples neighbors within their eps neighborhood.
  3. Cluster Expansion:
    • Form clusters by connecting core points that are within each other’s epsilon neighborhood.
  4. Label Border Points:
    • Label border points that are in the epsilon neighborhood of a core point but are not core points themselves.
  5. Noise Identification:
    • Label as noise any points that are neither core points nor border points.
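
A brief sketch of these steps with scikit-learn's DBSCAN on synthetic blobs; the eps and min_samples values are arbitrary choices for this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

db = DBSCAN(eps=0.5, min_samples=5).fit(X)    # step 1: choose eps and min_samples

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # step 2: core points found by the algorithm

labels = db.labels_                           # steps 3-5: cluster ids; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters)
print("core:", int(core_mask.sum()),
      "| border:", int(((labels != -1) & ~core_mask).sum()),
      "| noise:", int((labels == -1).sum()))
```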

October 25

In today’s class, we delved into the age distribution of individuals involved in police shootings, specifically focusing on disparities between Black and White individuals. Our analysis utilized Mathematica, with plans to develop a Python equivalent for future sessions.

Key Findings:

1. Overall Age Distribution:
– Range: 6 to 91 years
– Average: 37.1 years, Median: 35 years
– Standard Deviation: 13 years
– Slight right skewness (0.73) and kurtosis close to 3, indicating a distribution resembling the normal curve.

2. Age Distribution for Black Individuals:
– Range: 13 to 88 years
– Average: 32.7 years, Median: 31 years
– Standard Deviation: 11.4 years
– Right skewness (1) and kurtosis of 3.9, suggesting a slightly more pronounced tail.

3. Age Distribution for White Individuals:
– Range: 6 to 91 years
– Average: 40 years, Median: 38 years
– Standard Deviation: 13.3 years
– Slightly right-skewed (0.53) with kurtosis of 2.86, indicating a distribution with a less pronounced peak and tails.

4. Comparison Between Black and White Individuals:
– Average age difference of approximately 7.3 years, with White individuals being older on average.
– Monte Carlo simulation confirmed statistical significance, implying a low likelihood of this result occurring by chance.
– Cohen’s d effect size of 0.58 indicates a medium magnitude for the observed age difference.

This session provided a thorough exploration of age distributions in the context of police shootings, shedding light on distinct patterns between racial groups and general age tendencies in the dataset.
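
Since a Python equivalent is planned, here is a rough sketch of how the Monte Carlo significance check and Cohen's d could be reproduced. The file name and column names ('age', 'race') are assumptions for illustration, not the exact Mathematica workflow used in class.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names for the Washington Post data.
df = pd.read_csv("fatal-police-shootings-data.csv")
black = df.loc[df["race"] == "B", "age"].dropna().to_numpy()
white = df.loc[df["race"] == "W", "age"].dropna().to_numpy()

observed = white.mean() - black.mean()

# Permutation-style Monte Carlo: shuffle the group labels and see how often a
# difference at least as large as the observed one appears by chance.
rng = np.random.default_rng(0)
pooled = np.concatenate([black, white])
count = 0
for _ in range(10_000):
    rng.shuffle(pooled)
    diff = pooled[len(black):].mean() - pooled[:len(black)].mean()
    if abs(diff) >= abs(observed):
        count += 1
print("observed difference:", observed, "| Monte Carlo p-value ~", count / 10_000)

# Cohen's d with a pooled standard deviation.
pooled_sd = np.sqrt(((len(black) - 1) * black.var(ddof=1) + (len(white) - 1) * white.var(ddof=1))
                    / (len(black) + len(white) - 2))
print("Cohen's d:", observed / pooled_sd)
```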