PART 1 : DA

Data Analytics- Part 1

Disclaimer:

  • This document contains unedited notes and has not been formally proofread.
  • The information provided in this document is intended to provide a basic understanding of certain technologies.
  • Please exercise caution when visiting or downloading from websites mentioned in this document and verify the safety of the website and software.
  • Some websites and software may be flagged as malware by antivirus programs.
  • The document is not intended to be a comprehensive guide and should not be relied upon as the sole source of information.
  • The document is not a substitute for professional advice or expert analysis and should not be used as such.
  • The document does not constitute an endorsement or recommendation of any particular technology, product, or service.
  • The reader assumes all responsibility for their use of the information contained in this document and any consequences that may arise.
  • The author disclaims any liability for any damages or losses that may result from the use of this document or the information contained therein.
  • The author and publisher reserve the right to update or change the information contained in this document at any time without prior notice.

*********************************************************************************

Linear Regression

Statistics is concerned with the most likely event.

35. Two Types of problems

1.     Regression
2.     Classification
Regression: the dependent variable is continuous and is influenced by the independent variables.

36. Conditional Mean:  

Example: separating heights by boys and girls and using the mean of each group.
·      Conditional models have lower prediction error than a single overall mean.

37. Conditional predictive analysis

1.     Dependent variables
2.     Independent variables

38. Simple linear regression

            Linear means a straight line
Y = m X + c
m = slope
c = constant (the intercept, where the line cuts the Y axis)
Y = dependent variable, X = independent variable

y_i = a + b x_i + error_i   (for each observation i)
Actual y = structural part (a + b x_i) + error_i

39. Correlation (r): always lies between -1 and +1
·      r = 0.3 – low correlation
·      r = 0.6 – medium correlation
·      r = 0.8 – high correlation
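
A minimal R sketch to check the correlation coefficient, using the same toy vectors as the lm() example further below:

x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)

cor(x, y)        # Pearson correlation, about 0.85 -> high correlation
cor(x, y)^2      # squared correlation equals the Multiple R-squared (~0.73)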

40. Intercept: - where the line cuts the Y axis


41. Std Error: -

 The standard deviation of an estimate: it shows how much the estimated coefficient would vary from sample to sample.
·      Each estimate in the regression output is reported together with its standard error.

42. How far is far?

·      Here the P-value plays the important role

Diagram


·      A small P-value indicates the estimate is significantly far from zero.
·      P-value is small = confidence is high = Std Error is small (relative to the estimate)

43. F-statistic:  

            Collective information to give the overall model P-value.
If the overall (collective) model is good, it is significantly different from zero.
If it is equal to zero, then the model carries no information.

Eg. Code in R-studio
> # Toy example: predict children's height from father's height
> # (here about 73% of the variance is explained; see Multiple R-squared below)
> x<- c(-4,-2,2,4,10)
> y<- c(-2,4,2,6,8)
> summary(lm(y~x))

Output: -
Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4    5
-2.0  2.8 -1.6  1.2 -0.4

Coefficients:
            Estimate Std. Error t value Pr(>|t|) 
(Intercept)   2.4000     1.1155   2.151   0.1205 
x             0.6000     0.2108   2.846   0.0653 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.309 on 3 degrees of freedom
Multiple R-squared:  0.7297,  Adjusted R-squared:  0.6396

F-statistic:   8.1 on 1 and 3 DF,  p-value: 0.06532

44.  Adjusted R-Square Value
When more variables (x) are added to the model, the "Multiple R-squared" keeps increasing and no longer gives the right picture;
for that, the "Adjusted R-squared" value is used (see the R sketch below).
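
A minimal R sketch of this effect, reusing the toy data from the lm() example above (renamed x1) and adding a purely random variable x2; the seed and x2 are only for illustration:

set.seed(1)
x1 <- c(-4, -2, 2, 4, 10)
y  <- c(-2, 4, 2, 6, 8)
x2 <- rnorm(5)                           # pure noise, unrelated to y

summary(lm(y ~ x1))$r.squared            # ~0.73
summary(lm(y ~ x1 + x2))$r.squared       # never smaller, usually larger
summary(lm(y ~ x1))$adj.r.squared        # compare these two to judge
summary(lm(y ~ x1 + x2))$adj.r.squared   # whether x2 really adds anything
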
45. Prediction equation
Predict the height of a child, given the father's height, the mother's height and the child's gender.
Eg. Y = 34.65 + 0.42 * father_height + 5.17 * (male)

                          Estimate
Intercept                 34.65
Father                    0.42
Factor (Gender) Male      5.17

Here mother's and father's heights are candidate "X" variables, but only father's height and the factor level Male are displayed with their own coefficients.
The left-out "X" (mother's height), like the baseline level Female of the factor, has no coefficient of its own;
so the left-out "X" values are always represented in the intercept. (A hedged R sketch of such a model follows below.)
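
A hedged R sketch of how such an equation is produced; the data values, variable names (child, father, gender) and the resulting coefficients are made up for illustration and will not match 34.65 / 0.42 / 5.17:

# made-up heights (in inches) for a few children
child  <- c(64, 69, 71, 62, 66, 70, 63, 68)
father <- c(65, 70, 72, 64, 67, 71, 63, 69)
gender <- factor(c("F", "M", "M", "F", "F", "M", "F", "M"))

fit <- lm(child ~ father + gender)
coef(fit)   # (Intercept), father, genderM -- the baseline level "F"
            # has no coefficient; it is absorbed into the intercept

# predicted height of a male child whose father is 68 inches tall
predict(fit, newdata = data.frame(father = 68, gender = "M"))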

46. Error Sum of Squares

Residual Sum of Squares (SS_res)
 Y = X b + e

47. Matrix multiplication


Hat matrix – (refer to more internet content)

y = X b + e
    (y is n×1, X is n×p, b is p×1, e is n×1)

The fitted (predicted) values come from y_hat = X b_hat, where

b_hat = (X'X)^(-1) X' y
y_hat = X (X'X)^(-1) X' y
H     = X (X'X)^(-1) X'        (the n×n Hat matrix)
y_hat = H y

X b_hat is the predicted vector:
Predicted vector = X (X'X)^(-1) X' y

Error (residual)
= y - y_hat
= I y - H y
= (I - H) y

Standardized residuals have sqrt(1 - h_ii) in the denominator, where h_ii is the i-th diagonal element of H. (An R sketch of the Hat matrix follows below.)
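
A minimal R sketch of these formulas, using the same toy x and y as before, verifying that H y reproduces the fitted values and that diag(H) matches R's hatvalues():

x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)

X <- cbind(1, x)                         # n x p design matrix (intercept + x)
H <- X %*% solve(t(X) %*% X) %*% t(X)    # n x n Hat matrix

fit <- lm(y ~ x)
all.equal(as.vector(H %*% y), unname(fitted(fit)))   # TRUE: y_hat = H y
all.equal(unname(diag(H)), unname(hatvalues(fit)))   # TRUE: leverages h_ii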

Diagram






48. Logistic regression


Logistic regression is a classification method.
The dependent variable is a class, e.g. Yes/No, Zero/One, etc.

The confusion matrix is built from these four quantities (see the diagram):
False "-ve" %
True "+ve" %
True "-ve" %
False "+ve" %
Diagram


The confusion matrix is used to determine the goodness of the model.

How do we determine whether the model that was built is good?
McFadden's goodness-of-fit measure (pseudo R-squared) is used for this. (A small R sketch of a logistic model and its confusion matrix follows below.)
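
A hedged R sketch, with made-up data, of fitting a logistic regression with glm() and building a confusion matrix at a 0.5 cut-off:

# made-up data: 1 = event of interest (e.g. fraud), 0 = no event
x      <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
yclass <- c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1)

fit  <- glm(yclass ~ x, family = binomial)
prob <- predict(fit, type = "response")    # predicted probabilities
pred <- ifelse(prob > 0.5, 1, 0)           # classify at a 0.5 cut-off

table(Actual = yclass, Predicted = pred)   # confusion matrix: true/false
                                           # positives and negatives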

48.1 ROC – Receiver Operating Characteristic curve

ROC = True “+ve” %  vs   False “+ve” %

How much error are you willing to allow, e.g. accept a 10% false "+ve" rate in order to attain an 80% true "+ve" rate? (See the sketch below the diagram.)

Diagram
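
A small base-R sketch of this trade-off, continuing the glm() example above (prob and yclass come from that sketch): sweep the cut-off and compute the true "+ve" and false "+ve" rates at each value.

cutoffs <- seq(0.1, 0.9, by = 0.1)
tpr <- sapply(cutoffs, function(thr) mean(prob[yclass == 1] > thr))  # true "+ve" %
fpr <- sapply(cutoffs, function(thr) mean(prob[yclass == 0] > thr))  # false "+ve" %
round(cbind(cutoff = cutoffs, tpr, fpr), 2)
# plotting fpr (x-axis) against tpr (y-axis) over all cut-offs gives the ROC curve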


48.2 F1 Measure

F1 Measure = harmonic mean of Precision and Recall

Precision = (True “+ve”) / (True “+ve” + False “+ve”)
Recall = (True “+ve”) / (False “-ve” + True “+ve”)
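
A minimal R sketch of these formulas, again continuing the confusion-matrix example above (pred and yclass come from that sketch):

tp <- sum(pred == 1 & yclass == 1)   # true  "+ve"
fp <- sum(pred == 1 & yclass == 0)   # false "+ve"
fn <- sum(pred == 0 & yclass == 1)   # false "-ve"

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)   # harmonic mean
c(precision = precision, recall = recall, F1 = f1)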


Examples
·      Credit card fraud
·      Health problem detection
·      Insurance Buying


49. Review of Regression and Logistic regression


1.     With conditioning: the prediction becomes closer to the actual values
2.     Why prediction works: the variance around the conditional expectation is less than around the unconditional expectation
3.     What is a prediction: it is the conditional expectation
4.     Residuals: difference between actual and predicted values
5.     Residual analysis: 4 types of graphs – Residuals vs Fitted, Normal Q-Q, Scale-Location & Residuals vs Leverage
a.     par(mfrow = c(2, 2)); plot(model)
6.     What are leverage points: single points that change the intercept or slope away from what the true points suggest
7.     True points: the points of the whole data set
8.     Predict equation: child height = (39.110) + (0.399 * father’s height)
9.     "P"-value: when the p-value is small, the t-value is big and the estimate is significant
a.     The p-value tells whether the estimate (estimate / Std. Error = t-value) is significantly different from zero
10.  Std. Error: is the standard deviation of the estimate
11.  R = "-1" = perfectly negatively correlated
a.     R = 0 = there is no linear relationship; the variables do not increase together in a linear way
12.  Adjusted R2: as the number of variables increases, the ordinary R2 keeps increasing, giving a wrong picture, whereas Adjusted R2 indicates whether the increase is significant or not
13.  BOX-COX: a transformation that brings non-normal data closer to normal
14.  HAT matrix: is used to calculate
a.     Cook's distance – cooks.distance()
b.     Hat values (leverages) – hatvalues()
c.     Covariance ratio – covratio()
15.  5-fold cross-validation is a must for complex models (see the R sketch after this list)
a.     In each round, training data = 80%; test data = 20%
b.     The data is split into 5 distinct folds
c.     Each fold is used in turn as the test data and verified against the model built on the other folds
16.  Model fitting (objective)
a.     Logistic: - likelihood of occurrence (maximum likelihood)
b.     Regression: - minimizing the sum of squares
17.  ROC: - where the cut-off has to be made, i.e. how much false "+ve" to allow in order to attain the highest true "+ve"
18.  F1 Measure = harmonic mean = 2PR / (P + R)
19.  Precision: - among the selected, how many are the targets (true)
20.  Recall: - among the targets, how many are selected
21.  Specificity: - true "-ve" rate (1 - false "+ve" rate)
22.  Sensitivity: - true "+ve" rate
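
A hedged base-R sketch of item 15, 5-fold cross-validation, for a simple linear model; the built-in mtcars data and the mpg ~ wt + hp formula are used only for illustration:

set.seed(3)
data  <- mtcars                                      # built-in example data
folds <- sample(rep(1:5, length.out = nrow(data)))   # assign each row to a fold

cv_rmse <- sapply(1:5, function(k) {
  train <- data[folds != k, ]                        # ~80% used for fitting
  test  <- data[folds == k, ]                        # ~20% held out as test data
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))   # test RMSE
})
cv_rmse          # error on each of the 5 held-out folds
mean(cv_rmse)    # overall cross-validated error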

 ************************************************************************************************************
