Data Analytics - Part 1


  • This document contains unedited notes and has not been formally proofread.
  • The information provided in this document is intended to provide a basic understanding of certain technologies.
  • Please exercise caution when visiting or downloading from websites mentioned in this document and verify the safety of the website and software.
  • Some websites and software may be flagged as malware by antivirus programs.
  • The document is not intended to be a comprehensive guide and should not be relied upon as the sole source of information.
  • The document is not a substitute for professional advice or expert analysis and should not be used as such.
  • The document does not constitute an endorsement or recommendation of any particular technology, product, or service.
  • The reader assumes all responsibility for their use of the information contained in this document and any consequences that may arise.
  • The author disclaims any liability for any damages or losses that may result from the use of this document or the information contained therein.
  • The author and publisher reserve the right to update or change the information contained in this document at any time without prior notice.


Linear Regression

Statistics is about identifying the most likely event.

35. Two Types of Problems

1.     Regression
2.     Classification
Regression: the target is a continuous variable, which is influenced by the independent variables.

36. Conditional Mean:  

Example: separating the heights of boys and girls.
·      Conditional models have lower prediction error than the overall mean representation

37. Conditional predictive analysis

1.     Dependent variables
2.     Independent variables

38. Simple linear regression

            Linear means a straight line
Y = m X + c
m = slope
c = constant (the intercept, where the line cuts the Y axis)

Y1 = a + b x1 + error1
Actual y = structural part (a + b x1) + error1

39. Correlation: always lies between -1 and +1
·      r = 0.3 – low correlation
·      r = 0.6 – medium correlation
·      r = 0.8 – high correlation
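As a sketch, the correlation coefficient can be computed with NumPy (the paired observations below are made up purely for illustration):

```python
import numpy as np

# Hypothetical paired observations (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])

# Pearson correlation: r = cov(x, y) / (sd(x) * sd(y)),
# always between -1 and +1
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # → 0.866, i.e. high correlation by the scale above
```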

40. Intercept: - where the line cuts the Y axis

41. Std. Error: -

·      The standard error is the estimated standard deviation of a coefficient estimate

42. How far is far?

·      Here the P-value plays the important role

·      A small P-value indicates the estimate is significantly far from zero.
·      P-value is small = confidence is high = Std. Error is small

43. F-statistic:  

            Combines the collective information of all predictors to give an overall model P-value.
If the overall (collective) model is good, it is significantly different from zero.
If it is equal to zero, the model carries no information.

Eg. Code in R-studio
> # Predict children’s height  # ~73% of the variance in y is explained by x (see R-squared below)
> x <- c(-4,-2,2,4,10)
> y <- c(-2,4,2,6,8)
> summary(lm(y~x))

Output: -
Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4    5
-2.0  2.8 -1.6  1.2 -0.4

Coefficients:
            Estimate Std. Error t value Pr(>|t|) 
(Intercept)   2.4000     1.1155   2.151   0.1205 
x             0.6000     0.2108   2.846   0.0653 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.309 on 3 degrees of freedom
Multiple R-squared:  0.7297,  Adjusted R-squared:  0.6396
F-statistic:   8.1 on 1 and 3 DF,  p-value: 0.06532
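The notes use R; as a cross-check, the same fit can be reproduced in Python with SciPy (a sketch, using the same x and y values as the R example above):

```python
from scipy.stats import linregress

x = [-4, -2, 2, 4, 10]
y = [-2, 4, 2, 6, 8]

# Ordinary least squares fit of y ~ x
fit = linregress(x, y)
print(round(fit.slope, 4))      # matches the R estimate for x (0.6)
print(round(fit.intercept, 4))  # matches the R (Intercept) estimate (2.4)
print(round(fit.rvalue**2, 4))  # Multiple R-squared
print(round(fit.pvalue, 4))     # p-value for the slope
```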

44.  Adjusted R-Square Value
When more variables (x) are added to the model, the “Multiple R-squared” no longer provides the right information;
for that, the “Adjusted R-squared” value is used.
45. Prediction equation
Predict the height of a child, given the parents’ heights and the child’s gender.
Eg. Y = 34.65 + 0.42 * father_height + 5.17 * (male)
Intercept          34.65
Father              0.42
(Gender) Male 5.17

Here mother’s and father’s heights are both “X” variables, but only the father’s coefficient and the factor level (gender) male are displayed

The left-out “X” values – including the baseline level of the factor (female) – are absorbed into the intercept.
So, the left-out “X” values are always represented in the intercept.
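A minimal sketch of how the left-out factor level ends up in the intercept, using synthetic data generated exactly from the prediction equation above (all heights and genders here are hypothetical; NumPy only):

```python
import numpy as np

# Hypothetical father heights and genders (male = 1, female = 0)
father = np.array([60.0, 62.0, 65.0, 68.0, 70.0, 72.0])
male   = np.array([0, 1, 0, 1, 0, 1])

# Generate child heights exactly from the equation in the notes
y = 34.65 + 0.42 * father + 5.17 * male

# Design matrix: intercept column, father height, male dummy.
# "Female" gets no column: it is the baseline absorbed into the intercept.
X = np.column_stack([np.ones_like(father), father, male])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # recovers [34.65, 0.42, 5.17]
```

A female child’s prediction uses only the intercept and father terms; the male dummy adds 5.17 on top of that baseline.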

45. Error Sum of Squares

Residual Sum of Squares (SS_res): the sum of the squared residuals e in
Y = X b + e

46. Matrix multiplication

Hat matrix – (refer to more internet content)

y      =  X      b      +  E
(n×1)    (n×p)  (p×1)     (n×1)

Fitted or predicted values come from y_hat = X b_hat

b_hat = (X'X)^(-1) X' y
y_hat = X (X'X)^(-1) X' y
H = X (X'X)^(-1) X'

y_hat = H y

H (n×n) = Hat matrix = X (X'X)^(-1) X'

X b_hat is the predicted vector
Predicted vector = X (X'X)^(-1) X' y

Residuals:
e = y - y_hat
  = Iy - Hy
  = (I - H) y

Standardized residuals have sqrt(1 - h_ii) in the denominator.
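The hat-matrix identities above can be verified numerically; a sketch with NumPy, reusing the data from the R example:

```python
import numpy as np

x = np.array([-4.0, -2.0, 2.0, 4.0, 10.0])
y = np.array([-2.0, 4.0, 2.0, 6.0, 8.0])

# Design matrix with an intercept column: n x p, here p = 2
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^(-1) X'  (n x n)
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y                          # fitted values: y_hat = H y
residuals = (np.eye(len(y)) - H) @ y   # e = (I - H) y
leverages = np.diag(H)                 # the h_ii used in standardized residuals

print(np.round(y_hat, 1))      # fitted values
print(np.round(residuals, 1))  # matches the residuals in the R output
print(round(np.trace(H), 1))   # trace(H) equals p, the number of columns of X
```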


47. Logistic regression

Logistic regression is called a classification method.
The dependent variable is a class like – Yes/No, Zero/One, etc.

The confusion matrix has four cells:

                 Predicted “+ve”    Predicted “-ve”
Actual “+ve”     True “+ve” %       False “-ve” %
Actual “-ve”     False “+ve” %      True “-ve” %

The confusion matrix is used to determine the goodness of the model.

How to determine whether the model built is good?
The McFadden goodness-of-fit measure is used for this.
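McFadden’s measure compares the model’s log-likelihood with that of a null (intercept-only) model. A sketch of the computation (the outcomes and fitted probabilities below are hypothetical):

```python
import numpy as np

def mcfadden_r2(y, p_model):
    """McFadden pseudo-R^2 = 1 - logL(model) / logL(null).

    y: 0/1 outcomes; p_model: the model's predicted probabilities.
    The null model predicts the overall mean of y for every case.
    """
    y = np.asarray(y, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    ll_model = np.sum(y * np.log(p_model) + (1 - y) * np.log(1 - p_model))
    p_null = y.mean()
    ll_null = np.sum(y * np.log(p_null) + (1 - y) * np.log(1 - p_null))
    return 1 - ll_model / ll_null

# Hypothetical outcomes and fitted probabilities (illustration only)
y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1]
print(round(mcfadden_r2(y, p), 3))
```

Values closer to 1 mean the model explains much more than the null model; 0 means it adds nothing.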

47.1 ROC – Receiver Operating Characteristic curve

ROC plots the True “+ve” % against the False “+ve” %.

The question is how much error you are willing to allow – e.g. a 10% False “+ve” rate – to attain, say, 80% True “+ve” %.



F1 Measure = harmonic mean of Precision and Recall

Precision = (True “+ve”) / (True “+ve” + False “+ve”)
Recall = (True “+ve”) / (False “-ve” + True “+ve”)
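The two formulas above can be computed directly from raw confusion-matrix counts; a sketch (the counts are made up for illustration):

```python
# Hypothetical confusion-matrix counts (illustration only)
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)  # among the selected, how many are targets
recall    = tp / (tp + fn)  # among the targets, how many are selected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3))  # → 0.889
print(round(recall, 3))     # → 0.8
print(round(f1, 3))         # → 0.842
```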

·      Credit card fraud
·      Health problem detection
·      Insurance Buying

48. Review of Regression and Logistic regression

1.     With conditioning: the prediction becomes closer to the actual value
2.     Why prediction works: the variance around the conditional expectation is less than around the unconditional expectation
3.     What is prediction: it is the conditional expectation.
4.     Residuals: difference b/w Actual – Predicted values
5.     Residual Analysis: 4 types of graphs – Residuals vs Fitted, Normal Q-Q, Scale-Location & Residuals vs Leverage  
a.     par(mfrow = c(2, 2)); plot(model)
6.     What are Leverage Points: a single point that pulls the intercept or slope away from the fit given by the true points
7.     True points: the whole data set points.
8.     Predict equation: child height = (39.110) + (0.399 * father’s height)
9.     “P”-value: when P is small, “t” is big and the estimate is significant.
a.     The P-value tells whether the estimate / Std. Error / t-value is significantly different from Zero.
10.  Std. Error: is the Std. Deviation of the estimate
11.  R = “-1” = “-ve” perfectly correlated
a.     R = 0 = there is no linear relationship; the variables don’t increase together in linear form.
12.  Adjusted R2: as the number of variables increases, the normal R2 increases, giving a wrong picture, whereas the Adjusted R2 indicates whether the increase is significant or not
13.  BOX-COX: a transformation that brings non-normal data closer to normal
14.  HAT Matrix: is used to calculate
a.     Cook’s Distance
b.     Hat values
c.     Covratio
15.  5-Fold Cross-Validation is a must for complex models
a.     In each round: Train = 80%; Test = 20%
b.     The data are split into 5 distinct folds
c.     Each fold in turn is held out as the test data and the model is verified against it
16.  Fitting criteria
a.     Logistic: - maximizing the likelihood of occurrence
b.     Regression: - minimizing the Sum of Squares
17.  ROC: - where the cut-off has to be made, i.e. how many False “+ve” to allow to attain the highest True “+ve”
18.  F1 Measure = Harmonic Mean = 2(PR / (P + R))
19.  Precision: - among the selected, how many are the targets (true)
20.  Recall: - among the targets, how many are selected
21.  Specificity: - True “-ve” rate
22.  Sensitivity: - True “+ve” rate


