Business Analytics

Executive Summary of Breast Cancer Evaluation

This report evaluates a breast cancer dataset of 116 patients. Most of the recorded parameters are indicators measured from blood samples, and almost all of them are numeric. The response variable, however, is a factor. Two changes were made to the data file: missing values were imputed with the variable mean as the best available substitute, and the response, originally coded as a number, was converted to a factor so that the model output can be interpreted sensibly. A multinomial logistic regression was then fitted; the results of an initial model were recorded and its classification accuracy was assessed.

Next, all available explanatory variables were included in a second model that represents the data better and classifies more accurately. The meaning of each statistical quantity is also explained in non-statistical language. Over time this should support the construction of similar measures and clarify the value of classifying patients from such inputs. It may also raise awareness among individuals of the benefit of monitoring a few blood parameters through diet and relevant medicines over a period of time.

Introduction

Genetic variation within the human genome is a growing resource for studying cancer, a complex set of diseases shaped by both environmental and hereditary contributions. The global effort to combat cancer needs to focus on awareness, early detection, diagnosis, and the accessibility and affordability of treatment for all cancers.

Today, cancer is a household word: almost all of us know at least one person close to us, a family member, friend, neighbour or colleague, who has been diagnosed with it. There is a perception that the incidence of cancer is increasing. There is also hope that, with advances in technology, cancer is being diagnosed more often, that our attitudes are changing, that the myths around cancer are fading, and that we are becoming more open to accepting a diagnosis and talking about it.

Let us take a close look at the Breast Cancer dataset taken from https://archive.ics.uci.edu/ml/datasets.php and conduct an exploratory analysis of its indicators, which carry the medical information we need. These parameters are used to model the data and to predict whether an individual is healthy or a potential patient.

The dataset has ten columns: nine quantitative predictors and a binary dependent variable indicating the presence or absence of breast cancer. The predictors are anthropometric data and parameters that can be gathered in routine blood analysis. Prediction models based on these predictors, if accurate, could potentially be used as a biomarker of breast cancer.

Dataset and Variables description:

Quantitative Attributes:
Age (years)
BMI (kg/m2)
Glucose (mg/dL)
Insulin (µU/mL)
HOMA
Leptin (ng/mL)
Adiponectin (µg/mL)
Resistin (ng/mL)
MCP-1 (pg/dL)
Labels:
1=Healthy controls
2=Patients

Data Cleaning Methodologies implemented

All the variables under study are numeric. Because these are medical parameters obtained from blood samples, there is very little scope for altering the data: changing values would change both the statistical and the medical interpretation. Missing data therefore need careful handling, and imputed values must not distort the relationship with the response variable. Three parameters, Glucose, HOMA and Resistin, contained missing values. Since these variables are numeric, they could not be left blank, so each missing value was imputed with the overall mean of its variable before the analysis was carried out.

Moreover, the response variable in this project is a classification variable: each person is to be classified as healthy or as a patient with breast cancer on the basis of these medical parameters. This is therefore a classic classification problem. The response is treated as a factor that takes only two values:

1: the individual is healthy.

2: the individual is classified as a patient.

The numeric codes of the response variable therefore had to be converted into a factor, so that the regression model we fit treats it as a factor with two levels. We take “1” (healthy individual) as the reference level against which classification is carried out. Note that the dataset consists of female patients only.
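The mean imputation described above can be sketched as follows. This is a minimal illustration on a small made-up data frame, not the study data; the object `toy` and the helper `impute_mean` are hypothetical names introduced only for this example.

```r
# Toy data frame with a few missing values (made-up numbers, not the study data)
toy <- data.frame(
  Glucose  = c(70, NA, 91, 77),
  HOMA     = c(0.47, 0.71, NA, 0.61),
  Resistin = c(8.0, 4.1, 9.3, NA)
)

# Hypothetical helper: replace each NA with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper column by column, keeping the data frame structure
toy[] <- lapply(toy, impute_mean)
toy
```

Mean imputation keeps every row in the analysis but shrinks the variance of the imputed columns, which is worth keeping in mind when reading the standard errors later.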

Model Implemented

We fit a multinomial logistic regression model that takes the response as a factor and the measured medical parameters as explanatory variables, in order to identify which of them contribute significantly to the response and to obtain the best possible classification. With only two response levels, this is equivalent to a binary logistic regression.
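Since the response has only two levels, the same kind of model can also be fitted with base R's `glm()` and a binomial family. The sketch below uses simulated data, not the study data; the variable names, the sample size and the assumed true coefficients are all illustrative assumptions.

```r
set.seed(1)

# Simulated predictor and response (illustrative only, not the study data)
n <- 200
glucose <- rnorm(n, mean = 95, sd = 15)
eta <- -9.5 + 0.1 * glucose                      # assumed true log-odds of class 2
status <- factor(ifelse(runif(n) < plogis(eta), 2, 1), levels = c(1, 2))

# With a two-level factor response, glm() treats the first level ("1")
# as the reference and models the probability of the second level ("2")
fit <- glm(status ~ glucose, family = binomial)
coef(fit)
```

The coefficients are on the log-odds scale, which is the same scale that `multinom()` reports for a two-level response.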

Data Analysis

A descriptive summary of all the variables under study is shown below:

> summary(Cancer)
      Age            BMI           Glucose          Insulin            HOMA
 Min.   :24.0   Min.   :18.37   Min.   : 27.98   Min.   : 2.432   Min.   : 0.4674
 1st Qu.:45.0   1st Qu.:22.97   1st Qu.: 83.75   1st Qu.: 4.359   1st Qu.: 0.9180
 Median :56.0   Median :27.66   Median : 92.00   Median : 5.925   Median : 1.3809
 Mean   :57.3   Mean   :27.58   Mean   : 93.48   Mean   :10.012   Mean   : 2.7812
 3rd Qu.:71.0   3rd Qu.:31.24   3rd Qu.:101.00   3rd Qu.:11.189   3rd Qu.: 2.8578
 Max.   :89.0   Max.   :38.58   Max.   :201.00   Max.   :58.460   Max.   :25.0503
     Leptin        Adiponectin        Resistin          MCP.1         Classification
 Min.   : 4.311   Min.   : 1.656   Min.   : 3.210   Min.   :  45.84   Min.   :1.000
 1st Qu.:12.314   1st Qu.: 5.474   1st Qu.: 7.048   1st Qu.: 269.98   1st Qu.:1.000
 Median :20.271   Median : 8.353   Median :10.828   Median : 471.32   Median :2.000
 Mean   :26.615   Mean   :10.181   Mean   :14.843   Mean   : 534.65   Mean   :1.552
 3rd Qu.:37.378   3rd Qu.:11.816   3rd Qu.:17.755   3rd Qu.: 700.09   3rd Qu.:2.000
 Max.   :90.280   Max.   :38.040   Max.   :82.100   Max.   :1698.44   Max.   :2.000

The summary shows that the individuals in the study are between 24 and 89 years old. The study therefore covers a broad age range, which helps in identifying the age interval in which onset of the disease is most likely and in directing further research to that interval. The median age, 56, is the point below which half of the observations fall. Corresponding summary statistics are reported for all the other medical parameters in the study.

Note that Classification, our response variable, is still treated as numeric here. We now convert it into a factor.

> mydata<-Cancer
> str(mydata)
> mydata$Classification<-factor(mydata$Classification)
> str(mydata)
tibble [116 x 10] (S3: tbl_df/tbl/data.frame)
 $ Age           : num [1:116] 48 83 82 68 86 49 89 76 73 75 ...
 $ BMI           : num [1:116] 23.5 20.7 23.1 21.4 21.1 ...
 $ Glucose       : num [1:116] 70 92 91 77 92 92 77 118 97 83 ...
 $ Insulin       : num [1:116] 2.71 3.12 4.5 3.23 3.55 ...
 $ HOMA          : num [1:116] 0.467 0.707 1.01 0.613 0.805 ...
 $ Leptin        : num [1:116] 8.81 8.84 17.94 9.88 6.7 ...
 $ Adiponectin   : num [1:116] 9.7 5.43 22.43 7.17 4.82 ...
 $ Resistin      : num [1:116] 8 4.06 9.28 12.77 10.58 ...
 $ MCP.1         : num [1:116] 417 469 555 928 774 ...
 $ Classification: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...

This confirms that the classification variable has been converted into a factor.

Model Development

We first build an initial model to classify the individuals on a few factors. Level 1 is kept as the reference level of the response throughout the study.

Initial Model:

> mydata$out<-relevel(mydata$Classification,ref="1")

We first model the response with a multinomial regression using Age, BMI and Glucose as explanatory variables. The fitted model is stored in the object mymodel.

> library(nnet)
> mymodel<-multinom(out~Age+BMI+Glucose,data=mydata)
# weights: 5 (4 variable)
initial value 80.405073
final value 74.992281
converged
> summary(mymodel)
Call:
multinom(formula = out ~ Age + BMI + Glucose, data = mydata)

Coefficients:
                 Values   Std. Err.
(Intercept)  1.57415346 1.331703257
Age         -0.01478909 0.012748967
BMI         -0.09239280 0.041942128
Glucose      0.02202289 0.008896601

Residual Deviance: 149.9846
AIC: 157.9846

The model summary reports the coefficient estimates and their standard errors.

The intercept is positive. The coefficients of Age and BMI are negative, so higher age and higher BMI are associated with lower odds of being classified as a patient, while the coefficient of Glucose is positive, so higher glucose is associated with higher odds. Unlike ordinary least squares regression, there is no residual variance to interpret here: the model is linear in the log-odds of the response, so the coefficients are read as changes in log-odds and the fit is summarised by the residual deviance. The AIC is a useful indicator for comparing models; as the model is refined we expect both the AIC and the residual deviance to decrease, indicating a better fit to the data.
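Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below simply re-uses the estimates printed in the summary of the initial model; the object names are introduced here for illustration.

```r
# Slope estimates copied from summary(mymodel) above
coefs <- c(Age = -0.01478909, BMI = -0.09239280, Glucose = 0.02202289)

# exp(coefficient) = multiplicative change in the odds of class 2 (patient)
# for a one-unit increase in the predictor, holding the others fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

An odds ratio below 1 (Age, BMI) means the odds of being classified as a patient fall as the predictor rises; the Glucose odds ratio is slightly above 1, roughly a 2% increase in the odds per additional mg/dL.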

> predict(mymodel,mydata,type="prob")

This statement returns the fitted probabilities under the model, that is, the probability with which each individual is classified.
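To make the probability calculation concrete, it can be reproduced by hand for one individual. The sketch below uses the reported coefficients of the initial model and the first row as displayed by str(mydata); since that display rounds the values, the result is only an approximation of what predict() would return.

```r
# Coefficients as reported in summary(mymodel)
beta <- c(intercept = 1.57415346, Age = -0.01478909,
          BMI = -0.09239280, Glucose = 0.02202289)

# First individual as displayed by str(mydata) (values shown rounded there)
x <- c(Age = 48, BMI = 23.5, Glucose = 70)

# Linear predictor: log-odds of class 2 relative to the reference class 1
eta <- beta[["intercept"]] + sum(beta[names(x)] * x)

# The inverse logit turns log-odds into a probability
p_patient <- plogis(eta)
p_patient
```

With these inputs the probability comes out a little above 0.5, so the initial model would assign this individual to class 2, although the recorded class is 1. This is one of the misclassifications counted in the table that follows.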

> #predictions by the model
> #misclassifications
> cm<-table(predict(mymodel),mydata$Classification)
> print(cm)
                            Actual data
                              1    2
Classification         1     27   21
based on the model     2     25   43

This misclassification table shows how many classifications the model got right against the actual data. The diagonal elements are correct assignments, while the off-diagonal elements are misclassifications.

> 1-sum(diag(cm))/sum(cm)
[1] 0.3965517
> #about 40 % of the times there has been a misclassification

This is the proportion of misclassifications made by the model. The initial model performs poorly: approximately 40% of individuals are misclassified.
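The table printed by R carries no row or column labels, which makes it easy to confuse predicted and actual classes. A small sketch of how named arguments to table() label the two dimensions is given below; the vectors `actual` and `predicted` are made-up toy values, not the study data.

```r
# Toy vectors (made up for illustration)
actual    <- factor(c(1, 1, 2, 2, 2, 1), levels = c(1, 2))
predicted <- factor(c(1, 2, 2, 2, 1, 1), levels = c(1, 2))

# Named arguments become dimension labels in the printed table
cm_toy <- table(Predicted = predicted, Actual = actual)
print(cm_toy)

# Misclassification rate: share of observations off the diagonal
error_rate <- 1 - sum(diag(cm_toy)) / sum(cm_toy)
error_rate
```

Rows are predictions and columns are actual classes, matching the order of the arguments used in the document's own table() call.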

> #two tailed Z test
> z<-summary(mymodel)$coefficients/summary(mymodel)$standard.errors
> p<-(1-pnorm(abs(z),0,1))*2
> p
(Intercept)         Age         BMI     Glucose
 0.23718180  0.24603972  0.02760435  0.01330770

The p-values test, for each coefficient, the null hypothesis that it is zero. BMI and Glucose have p-values below 0.05, indicating that these two variables contribute significantly to classifying an individual as healthy or patient, whereas Age does not.
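The z test can be checked directly from the printed estimates and standard errors, without access to the fitted object. The sketch below copies the numbers from the summary of the initial model; it uses pnorm() with lower.tail = FALSE, which is equivalent to the 1 - pnorm() form above but numerically more stable for very small p-values.

```r
# Estimates and standard errors copied from summary(mymodel)
est <- c("(Intercept)" = 1.57415346, Age = -0.01478909,
         BMI = -0.09239280, Glucose = 0.02202289)
se  <- c("(Intercept)" = 1.331703257, Age = 0.012748967,
         BMI = 0.041942128, Glucose = 0.008896601)

# Wald z statistic and two-sided p-value for each coefficient
z <- est / se
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
round(cbind(z = z, p = p), 4)
```

The p-values obtained this way agree with those printed in the document.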

Final Model:

> mymodel_1<-multinom(out~Age+BMI+Glucose+Insulin+HOMA+Leptin+Adiponectin+Resistin+MCP.1,data=mydata)
# weights: 11 (10 variable)
initial value 80.405073
iter 10 value 69.828701
iter 20 value 58.856741
final value 58.855715
converged
> summary(mymodel_1)
Call:
multinom(formula = out ~ Age + BMI + Glucose + Insulin + HOMA +
    Leptin + Adiponectin + Resistin + MCP.1, data = mydata)

Coefficients:
                   Values    Std. Err.
(Intercept)  4.5076743519 2.3139935204
Age         -0.0202298798 0.0148970862
BMI         -0.1087858008 0.0628997873
Glucose     -0.0152764281 0.0176457078
Insulin     -0.8807611160 0.3288478310
HOMA         4.1452225369 1.4368271188
Leptin      -0.0147885316 0.0175261309
Adiponectin -0.0204376574 0.0356204767
Resistin     0.0586020336 0.0274720474
MCP.1        0.0003542605 0.0007726222

Residual Deviance: 117.7114
AIC: 137.7114

This is the full model, which uses all available medical parameters to classify individuals as healthy or patients. The coefficients of Age, BMI, Glucose, Insulin, Leptin and Adiponectin are negative, so higher values of these variables are associated with lower odds of being a patient, while the coefficients of MCP.1, HOMA and Resistin are positive. With the additional variables, the residual deviance falls from 149.98 to 117.71 and the AIC from 157.98 to 137.71. The lower AIC indicates that the full model fits the data better than the initial model even after accounting for its extra parameters, and it should therefore classify the data more effectively.
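The relation between the reported deviance and AIC values, and a formal comparison of the two nested models, can be sketched from the printed output alone. The likelihood ratio test below is an addition for illustration and is not part of the original analysis; the parameter counts (4 and 10) are taken from the "# weights" lines of the two fits.

```r
# Values copied from the two model summaries
dev_initial <- 149.9846; k_initial <- 4    # intercept + 3 predictors
dev_full    <- 117.7114; k_full    <- 10   # intercept + 9 predictors

# AIC = residual deviance + 2 * number of estimated parameters
aic_initial <- dev_initial + 2 * k_initial
aic_full    <- dev_full + 2 * k_full
c(initial = aic_initial, full = aic_full)

# Likelihood ratio test for the six added predictors:
# the drop in deviance is compared with a chi-square distribution
lr_stat <- dev_initial - dev_full
lr_df   <- k_full - k_initial
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = p_value)
```

The computed AIC values reproduce the ones printed by summary(), and the small p-value of the likelihood ratio test supports the statement that the added variables improve the fit.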

> predict(mymodel_1,mydata)
> predict(mymodel_1,mydata,type="prob")
> #predictions by the model_1
> #misclassifications by model_1
> cm_1<-table(predict(mymodel_1),mydata$Classification)
> print(cm_1)
                         Actual data classification
                              1    2
Classification based   1     43   14
on the final model     2      9   50

This misclassification table shows how many classifications the final model got right against the actual data. The diagonal elements are correct assignments, while the off-diagonal elements are misclassifications. Compared with the table from the initial model, the diagonal counts have increased substantially (from 27 and 43 to 43 and 50), showing a better fit and improved classification.

> 1-sum(diag(cm_1))/sum(cm_1)
[1] 0.1982759
> #about 20 % of the times there has been a misclassification

The final model makes fewer misclassifications. In the initial model about 40% of individuals were misclassified; in the final model this falls to about 20%. A lower misclassification rate indicates that the model captures the structure of the data better. In that sense the final model represents the data it was fitted to considerably better than the initial model.
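The overall misclassification rate hides how each class is handled. As a rough sketch, the two printed tables can be rebuilt from their counts and split into sensitivity (patients correctly identified) and specificity (healthy individuals correctly identified). The helpers `make_cm` and `rates` are hypothetical names introduced for this example.

```r
# Hypothetical helper: rebuild a 2 x 2 table from counts given column by column
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(predicted = c("1", "2"), actual = c("1", "2")))
}

# Counts copied from the two printed tables
cm_initial <- make_cm(c(27, 25, 21, 43))
cm_final   <- make_cm(c(43, 9, 14, 50))

# Hypothetical helper: error rate, sensitivity and specificity of a table
rates <- function(cm) {
  c(error       = 1 - sum(diag(cm)) / sum(cm),
    sensitivity = cm["2", "2"] / sum(cm[, "2"]),   # patients predicted as 2
    specificity = cm["1", "1"] / sum(cm[, "1"]))   # healthy predicted as 1
}

round(rbind(initial = rates(cm_initial), final = rates(cm_final)), 3)
```

Both sensitivity and specificity are higher for the final model, so the improvement is not confined to one of the two classes.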

> #two tailed Z test under Model_1
> z1<-summary(mymodel_1)$coefficients/summary(mymodel_1)$standard.errors
> p1<-(1-pnorm(abs(z1),0,1))*2
> p1
(Intercept)         Age         BMI     Glucose     Insulin        HOMA      Leptin
0.051414189 0.174471422 0.083717888 0.386637950 0.007399155 0.003914342 0.398781739
Adiponectin    Resistin       MCP.1
0.566129219 0.032912317 0.646580986

The output gives the p-value associated with each variable in the full model. At the 5% significance level, Insulin, HOMA and Resistin have p-values below 0.05 and contribute significantly to the response variable. These three variables have significantly helped in classifying individuals as healthy or patients.

Conclusion on Breast Cancer Evaluation

This dataset is a classic example of how a multinomial logistic regression model can be used to classify individuals as healthy or patients from a set of biomedical parameters. The fitted model can give insight into how likely an individual is to be a potential patient or to remain healthy.
