Resubmitted: MTH Project 1
Project 3
MTH522 Project 2
Signs of Mental Illness and Fatal Police Shootings: A Complex Relationship
The bar plot indicates that signs of mental illness may play a role in some fatal police shootings. However, the majority of individuals shot by police showed no signs of mental illness: shootings of people without documented mental health conditions outnumber those of people with them.
Factors Influencing Police-Involved Shootings: Fleeing and Fatal Outcomes
The plot clearly shows how fatal outcomes relate to whether the person was trying to escape from the police when the shooting occurred. Those who were not fleeing account for a larger share of fatal shootings than those who were attempting to flee.
California Tops List of States in Fatal Police Shootings
California has the highest number of fatal police shootings of any state, as the graph above indicates. The graph also ranks the top 10 states by number of fatal police shootings.
Gender Disparity in Fatal Police Shootings
The data visualization vividly illustrates a gender gap in fatal police shootings, with a higher likelihood of men being shot by law enforcement compared to women.
Age Disparities in Fatal Police Shootings: The Impact on Those Aged 25 to 35
Our analysis revealed that individuals across a wide range of age groups, spanning from children to older adults, were tragically affected by fatal police shootings. However, a striking concentration of these incidents occurred within the age bracket of 25 to 35 years. This pattern indicates that individuals within this specific age group bear a disproportionate burden of the violence resulting from interactions with law enforcement.
Methodology
For our analysis, we relied on the Washington Post Police Shootings Database, which offers a comprehensive dataset covering police shootings in the United States from 2015 to 2023. To preserve the integrity of our analysis, we addressed missing age values by imputing them from the available age data, keeping the dataset complete.
Additionally, we handled NaN (Not-a-Number) and null values across all other columns in the dataset, treating these missing values appropriately so that the data used for plotting and subsequent analysis remained reliable.
These measures ensured that our analysis rests on a robust and complete dataset, allowing us to draw meaningful insights and generate accurate visualizations from the Washington Post Police Shootings Database.
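The cleaning steps described above can be sketched in pandas. This is a minimal illustration on a small synthetic frame, not the actual project code: the column names (age, signs_of_mental_illness, flee) mirror the Washington Post dataset, but all values here are made up, and median imputation is just one reasonable choice for filling missing ages.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the real dataset (values assumed).
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 34.0, np.nan, 28.0],
    "signs_of_mental_illness": [False, True, None, False, False, True],
    "flee": ["Not fleeing", "Car", None, "Foot", "Not fleeing", None],
})

# Impute missing ages from the available age values (median here).
df["age"] = df["age"].fillna(df["age"].median())

# Treat NaN/null values in the remaining columns before plotting.
df["signs_of_mental_illness"] = df["signs_of_mental_illness"].fillna(False)
df["flee"] = df["flee"].fillna("Unknown")

assert df.isna().sum().sum() == 0  # dataset is now complete
print(df["flee"].value_counts())
```

Counts from `value_counts()` on the cleaned columns are what feed the bar plots discussed above.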
MTH522 Project 2
This report provides a comprehensive analysis of demographic factors in police shootings using data from the Washington Post Police Shootings Database. Our main goal is to better understand age distribution, race, mental illness, gender, and other relevant factors related to police violence.
Age Distribution: We examine age distribution to identify trends and disparities, shedding light on police violence prevalence across different age groups.
Mental Illness as a Condition: We explore the prevalence of mental illness, emphasizing the role of mental health and the need for appropriate interventions and support.
Gender: We analyze gender disparities to contribute to discussions about gender-related aspects of police violence.
Other Factors: We consider additional dataset factors like location and time of day to gain insights into the circumstances of these incidents.
This report is a valuable resource for policymakers, law enforcement, and advocacy groups working toward criminal justice reform. Our data-driven insights aim to inform evidence-based decision-making and address issues related to police shootings.
MTH522 Project 1
K-Fold Cross Validation and Bootstrapping
K-Fold Cross-Validation is a widely used method for estimating test error in machine learning and statistics. The fundamental idea is to randomly divide the dataset into K roughly equal-sized parts, or folds. One fold, say fold k, is set aside as a validation set while the model is trained on the remaining K-1 folds combined; predictions are then generated for the held-out fold k. This process is repeated for each fold, k = 1, 2, …, K, and the results from the iterations are combined to obtain an overall estimate of the model’s performance.
It’s important to note that since each training set consists of only (K-1)/K of the original data, the estimates of prediction error may be biased upward. This bias tends to be minimized when K is equal to n, the total number of data points, but such an estimate can have high variance.
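The procedure above can be sketched with NumPy alone. The data and model here are assumptions for illustration: a synthetic linear relationship fitted with a degree-1 polynomial, with the mean squared error averaged over K = 5 folds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumption: y is roughly linear in x).
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, n)

K = 5
indices = rng.permutation(n)       # shuffle before splitting
folds = np.array_split(indices, K)  # K roughly equal-sized folds

fold_errors = []
for k in range(K):
    val_idx = folds[k]  # fold k is held out as the validation set
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # Fit on the K-1 remaining folds, predict on the held-out fold.
    coeffs = np.polyfit(x[train_idx], y[train_idx], 1)
    preds = np.polyval(coeffs, x[val_idx])
    fold_errors.append(np.mean((y[val_idx] - preds) ** 2))

cv_estimate = np.mean(fold_errors)  # overall test-error estimate
print(f"5-fold CV estimate of test MSE: {cv_estimate:.3f}")
```

Averaging the per-fold errors is the aggregation step described in the text; each observation is used for validation exactly once.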
Bootstrapping, on the other hand, is a resampling technique used in statistics to estimate the sampling distribution of a statistic. It achieves this by repeatedly drawing random samples, with replacement, from the original dataset. This method is particularly useful when the population distribution is unknown or difficult to model. For example, consider a dataset of 60 exam scores. The bootstrapping process involves creating multiple bootstrap samples, each the same size as the original dataset. The following steps are taken:
1) Randomly select data points from the original dataset, with replacement, to form each bootstrap sample.
2) Calculate the desired statistic for each bootstrap sample; in our example, we compute the mean score for each sample.
3) Repeat the resampling and calculation a large number of times to generate a distribution of the statistic, which represents its variability across different random samples.
The resulting bootstrap distribution can then be used to estimate confidence intervals or assess the variability of the statistic. For instance, one might compute a 95% confidence interval for the mean exam score based on the bootstrap distribution.
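The steps above can be sketched as follows. The 60 exam scores are synthetic stand-ins (drawn from an assumed normal distribution), and the percentile method is one common way to turn the bootstrap distribution into a 95% confidence interval.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed example data: 60 exam scores (synthetic, not real data).
scores = rng.normal(loc=75, scale=10, size=60)

B = 10_000  # number of bootstrap samples
boot_means = np.empty(B)
for b in range(B):
    # Resample with replacement, same size as the original dataset.
    sample = rng.choice(scores, size=scores.size, replace=True)
    boot_means[b] = sample.mean()

# 95% percentile confidence interval for the mean score.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {scores.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The spread of `boot_means` is the variability of the sample mean under resampling; taking its 2.5th and 97.5th percentiles gives the interval.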
METHODS IN RESAMPLING
Resampling methods involve retraining a selected model on subsets extracted from the training dataset, with the objective of gaining further insight into the model’s performance. This provides additional information about the fitted model, such as estimates of the prediction error on a test dataset and of the standard deviation and bias of parameter estimates. Five key ideas follow.
1) Distinguishing Between Test and Training Errors:
Test error signifies the average error incurred when applying a statistical learning method to predict responses for new observations not included in the model’s training process. Conversely, training error can be readily calculated by applying the same method to the data used during the model’s training phase. It’s noteworthy that the training error rate often differs significantly from the test error rate, with the former frequently underestimating the latter.
2) Balancing Bias and Variance:
Bias and variance both contribute to prediction error, and the trade-off between them must be balanced to keep the total error small. Together, these two factors determine the test error.
3) Validation Set Approach:
In this approach, the available dataset is randomly partitioned into two subsets: a training set and a validation (or hold-out) set. The model is trained on the training set, and the resultant model is utilized to predict responses for the observations in the validation set. The validation set error furnishes an estimate of the test error. This estimation typically employs mean squared error (MSE) for quantitative responses and the misclassification rate for categorical responses.
4) Limitations of the Validation Approach:
The validation estimate of the test error may exhibit significant variability, contingent on the specific observations included in the training and validation sets. In this approach, only a subset of the observations—those in the training set rather than the validation set—is employed to fit the model. Consequently, the validation set error may tend to overestimate the test error for the model fitted to the entire dataset. This discrepancy arises because a larger dataset generally leads to lower error rates.
5) K-Fold Cross-Validation:
K-Fold Cross-Validation is a widely adopted technique for estimating test error. The fundamental idea is to randomly divide the data into K equal-sized segments, or folds. One segment (fold k) is held out, and the model is trained on the remaining K-1 segments. Predictions are then generated for the omitted fold k. This process is repeated for each fold (k = 1, 2, …, K), and the results are aggregated. Since each training set contains only (K-1)/K of the original data, prediction error estimates tend to be biased upward. This bias is minimized when K equals the total number of data points (K = n), but that estimate has high variance.
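The validation-set approach described above can be sketched on synthetic data (all values here are assumptions). Fitting on one random half and scoring on the held-out half gives the validation estimate of test error; computing the training MSE alongside it illustrates why training error tends to understate test error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data for the validation-set approach (values assumed).
n = 200
x = rng.uniform(0, 5, n)
y = 3.0 * x - 2.0 + rng.normal(0, 1.0, n)

# Randomly partition into a training set and a hold-out validation set.
idx = rng.permutation(n)
train_idx, val_idx = idx[: n // 2], idx[n // 2:]

# Fit on the training half only.
coeffs = np.polyfit(x[train_idx], y[train_idx], 1)

# Validation MSE estimates the test error for a quantitative response.
val_preds = np.polyval(coeffs, x[val_idx])
val_mse = np.mean((y[val_idx] - val_preds) ** 2)

# Training MSE, computed on the data the model was fitted to,
# typically underestimates the test error.
train_preds = np.polyval(coeffs, x[train_idx])
train_mse = np.mean((y[train_idx] - train_preds) ** 2)
print(f"train MSE = {train_mse:.3f}, validation MSE = {val_mse:.3f}")
```

Rerunning with a different random split changes `val_mse`, which is exactly the variability limitation noted in point 4.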
Pre-Molt and Post-Molt Crab Sizes and Their Relation
In our examination of crab sizes before and after molting, we employ statistical analysis. Comparing the histograms of pre-molt and post-molt crab sizes, we find that the two distributions are quite similar apart from a mean difference of 14.6858 units. To assess the statistical significance of this difference, we use a t-test, which yields an estimated p-value of 0.0341998. Since this p-value is less than 0.05, we reject the null hypothesis that there is no real difference in crab sizes.
The t-test, while a valuable tool, does not always make clear how software computes the p-value when comparing means. To address this, we employ a Monte Carlo procedure. We begin with 472 data points for post-molt crab sizes and another 472 for pre-molt sizes. We combine these into a single dataset of 944 points and randomly divide it into two buckets, Bucket A with 472 data points and Bucket B with the remaining 472, then calculate the difference in means between the buckets. We repeat this process N times, recording how often the difference in means equals or exceeds 14.6858, and compute the probability P as the ratio of these occurrences (n) to the total number of repetitions (N).
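This Monte Carlo (permutation) procedure can be sketched as below. The real crab measurements are not reproduced here, so two synthetic samples with a mean difference near 14.7 stand in for the pre-molt and post-molt sizes; everything else follows the bucket scheme described above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-ins for the 472 pre-molt and 472 post-molt sizes
# (values assumed; the real dataset is not reproduced here).
pre = rng.normal(130.0, 15.0, 472)
post = rng.normal(144.7, 15.0, 472)
observed = post.mean() - pre.mean()

combined = np.concatenate([pre, post])  # one dataset of 944 points
N = 10_000
count = 0
for _ in range(N):
    rng.shuffle(combined)
    bucket_a, bucket_b = combined[:472], combined[472:]
    # Count how often a random split produces a mean difference
    # at least as large as the observed one.
    if abs(bucket_a.mean() - bucket_b.mean()) >= abs(observed):
        count += 1

p_value = count / N  # P = n / N
print(f"observed diff = {observed:.3f}, Monte Carlo p = {p_value:.4f}")
```

With a mean shift this large relative to the spread, essentially no random split matches the observed difference, so the Monte Carlo p-value comes out near zero.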
Linear Regression models and Quadratic models
At the outset, we have a response variable denoted as Y and a fundamental mean function in linear regression:
Y = β0 + β1 x1 + ϵ
Now, let’s introduce a second variable, X2, with the aim of understanding how Y depends on both X1 and X2 simultaneously. By incorporating X2 into our analysis, we formulate a mean function that considers the values of both X1 and X2:
Y = β0 + β1 x1 + β2 x2 + ϵ
The primary objective of introducing X2 is to account for the part of Y that hasn’t yet been elucidated by X1.
In the context of diabetes prediction, we establish a relationship between the percentages of inactivity and obesity, serving as predictors or factors, and the percentage of diabetes using the Generalized Linear Model. This model extends the fundamental concepts of linear regression by introducing a link function that connects the linear model to the response variable and by permitting the measurement variance to be influenced by the predicted value of each measurement.
Breusch-Pagan Test
The p-value, short for probability value, is a fundamental concept in statistics: it quantifies the strength of evidence against a null hypothesis in hypothesis testing. To illustrate, consider a COVID-19 vaccine efficacy experiment whose null hypothesis states that the vaccine has no effect. The p-value measures the probability of obtaining results at least as extreme as those observed if the vaccine truly has no effect.
The Breusch-Pagan test detects heteroscedasticity, a common issue in regression analysis where the variability of the errors changes across levels of the independent variables. Heteroscedasticity means the residual variance is not constant as we move along the x-axis. When it exists, there is a systematic link between the independent variables and the squared residuals, which is exactly what the test examines. Detecting and addressing heteroscedasticity is crucial because it can make parameter estimates inefficient and their standard errors unreliable, undermining the model’s usefulness for prediction and inference.
R-squared measures how much of the dependent variable’s variance is explained by the independent variables. In a housing-price prediction model, it indicates the proportion of variation in prices attributed to the chosen variables: a high R-squared suggests a strong model fit, while a low R-squared implies the model does not capture much of the price variation.
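The Breusch-Pagan statistic can be computed by hand to make the mechanics concrete. This sketch uses synthetic data whose error spread grows with x (an assumption for illustration): fit the main regression, regress the squared residuals on the predictors, and form the LM statistic n·R², which follows a chi-square distribution (here 1 degree of freedom, 5% critical value 3.84) under the null of homoscedasticity.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data with heteroscedastic errors: the error spread grows
# with x (all numbers here are assumptions for illustration).
n = 500
x = rng.uniform(1, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x, n)

# Step 1: fit the main regression and compute residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Step 2: regress the squared residuals on the predictors.
u2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
fitted = X @ gamma
r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)

# Step 3: LM statistic = n * R^2, chi-square with 1 df under the null.
lm = n * r2
print(f"LM = {lm:.2f}; 5% critical value (1 df) = 3.84")
```

Because the data were generated with variance that rises with x, the LM statistic lands far above 3.84 and homoscedasticity is rejected, as expected.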
What Are Linear and Multilinear Regression, and How Can We Use Them?
Greetings, fellow data enthusiasts! I’m thrilled to share my journey into the interesting area of linear regression with you today.
The simplicity of linear regression is what initially drew me to it. We all have an innate understanding of the process, which is essentially about finding the best-fit line between data points. Finding hidden connections between variables is similar to putting puzzle pieces together.
The linear regression equation is as follows:
Y = α + β⋅X + ε
Here Y is the dependent variable that we are going to predict, and X is our independent variable.
If we apply this equation to our data, where the dependent variable Y is %diabetes and the independent variable X is %inactivity, we get “%diabetes = α + β %inactivity + ε”. Using this equation, we can predict %diabetes from %inactivity.
Multilinear regression extends this simple setup by allowing several independent variables. In other words, we can account for the combined impact of multiple factors on our dependent variable rather than being restricted to just one predictor.
This applies whenever a dataset has multiple variables. In our data we observe three variables: %obesity, %inactivity, and %diabetes.
Here we will model %diabetes (the dependent variable) using two independent variables, %obesity and %inactivity.
We also observe 1,370 data points for %inactivity, and each one has a corresponding %diabetes value, so the two series share common data points.
In the coming days, we will explore the relationship among all three variables: %diabetes, %obesity, and %inactivity.
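Both models can be sketched with NumPy least squares. The 1,370 observations below are synthetic stand-ins for the real county-level data (the coefficients and noise level are assumptions), so the point is the mechanics, not the estimates themselves.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-ins for the 1,370 observations; the coefficients
# and noise here are assumptions, not the real dataset.
n = 1370
inactivity = rng.uniform(10, 35, n)   # %inactivity
obesity = rng.uniform(20, 45, n)      # %obesity
diabetes = 1.0 + 0.2 * inactivity + 0.1 * obesity + rng.normal(0, 0.5, n)

# Simple regression: %diabetes = alpha + beta * %inactivity + eps
slope, intercept = np.polyfit(inactivity, diabetes, 1)
print(f"simple fit: slope = {slope:.3f}, intercept = {intercept:.3f}")

# Multilinear regression: %diabetes on both predictors at once.
X = np.column_stack([np.ones(n), inactivity, obesity])
beta, *_ = np.linalg.lstsq(X, diabetes, rcond=None)
alpha, b_inact, b_obes = beta
print(f"multilinear fit: alpha = {alpha:.2f}, "
      f"beta_inactivity = {b_inact:.3f}, beta_obesity = {b_obes:.3f}")
```

With both predictors in the design matrix, each coefficient measures the effect of one variable while holding the other fixed, which is exactly the “combined impact” idea above.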