Statistic | Nominal | Ordinal | Interval | Ratio |
Equality | x | x | x | x |
Counts/Mode | x | x | x | x |
Rank/Order | | x | x | x |
Median | | ~ | x | x |
Add/Subtract | | | x | x |
Mean | | | x | x |
Multiply/Divide | | | | x |
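As a rough illustration of the table, the sketch below applies one permissible summary to each measurement scale. The sample values and variable names are made up for the example.

```python
from statistics import mean, median, mode

# Nominal: categories with no order, so only equality, counts and the mode apply.
land_use = ["forest", "urban", "forest", "water", "forest"]
nominal_mode = mode(land_use)

# Ordinal: ordered categories, encoded here as ranks (1 = low ... 3 = high).
# The median rank is meaningful; the distance between ranks is not.
risk_rank = [1, 3, 2, 2, 3]
ordinal_median = median(risk_rank)

# Interval: differences are meaningful, so the mean applies,
# but zero is arbitrary (20 °C is not "twice as warm" as 10 °C).
temp_c = [10.0, 20.0, 15.0]
interval_mean = mean(temp_c)

# Ratio: a true zero makes multiplication and division meaningful.
dist_km = [5.0, 10.0]
ratio = dist_km[1] / dist_km[0]

print(nominal_mode, ordinal_median, interval_mean, ratio)
```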
Highlight the "central" feature in a dataset.
These statistics give context to a measure of central tendency.
Frequency of occurrence in qualitative data.
Probability of occurrence in a quantitative dataset.
±1 $\sigma$ | 68% of observations |
±2 $\sigma$ | 95% of observations |
±3 $\sigma$ | 99.7% of observations |
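The percentages above can be checked against the normal distribution's cumulative distribution function. A minimal sketch using Python's standard library; the helper name `share_within` is introduced here for illustration only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within ±k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"±{k}σ: {share_within(k):.1%}")
```

The printed shares round to the 68%, 95% and 99.7% figures in the table. They hold only when the data are actually normally distributed.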
Useful for quantitative data.
Data rarely fits a normal distribution perfectly:
Near Normal
Skewed Normal
Highly Skewed
If you are working with nominal or ordinal data, and you want to see how many observations you have for each category, you would use:
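One way to get per-category counts is a simple frequency tally, which could then feed a bar chart. A minimal sketch with made-up observations:

```python
from collections import Counter

# Hypothetical nominal observations (e.g. land-cover class per sample site).
observations = ["forest", "urban", "forest", "water", "urban", "forest"]

counts = Counter(observations)

# most_common() lists categories from most to least frequent.
for category, n in counts.most_common():
    print(f"{category}: {n}")
```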
Allows us to account for confounding variables that mask or hide patterns in our data.
Highly Correlated
No Correlation
It isn't always straightforward to account for multiple variables.
Also allows us to compare two or more variables measured in different units or scales.
Which of these countries has the highest population density?
* Populations and areas are approximate, given to two significant figures for convenience.
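Population density is a normalized quantity: population divided by area. The sketch below uses placeholder regions and figures (not real statistics) to show how the ranking by density can differ from the ranking by raw population.

```python
# Placeholder figures, not real statistics: (population, area in km^2).
regions = {
    "Region A": (1_000_000, 50_000),
    "Region B": (250_000, 1_000),
    "Region C": (5_000_000, 2_000_000),
}

# Normalize population by area so regions of different sizes are comparable.
density = {name: pop / area for name, (pop, area) in regions.items()}

most_populous = max(regions, key=lambda name: regions[name][0])
densest = max(density, key=density.get)

print(most_populous, densest)
```

Here the region with the largest population is not the densest one, which is the kind of pattern raw totals can hide.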
Unsupervised:
Supervised:
Vancouver dissemination area populations
One of the simplest classification schemes.
Another of the simplest classification schemes.
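Two schemes that are often introduced first are equal interval and quantile classification. Assuming those are the two meant here, a minimal sketch of how their class breaks could be computed follows. The function names and sample values are made up for the example.

```python
from statistics import quantiles

def equal_interval_breaks(values, k):
    """Upper bounds of k classes that each span the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]

def quantile_breaks(values, k):
    """Upper bounds of k classes that each hold roughly the same number of values."""
    # statistics.quantiles returns the k - 1 interior cut points.
    return quantiles(values, n=k, method="inclusive") + [max(values)]

# Made-up, right-skewed values standing in for area populations.
populations = [100, 120, 150, 200, 250, 300, 400, 600, 1000, 2100]

print(equal_interval_breaks(populations, 4))
print(quantile_breaks(populations, 4))
```

With skewed data like this, equal intervals put most observations in the lowest class, while quantiles spread them evenly at the cost of uneven class widths.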
Slightly more complex classification scheme.
More complex: data is split using the Jenks algorithm.
Informative to "experts", but not accessible for all.
Supervised: User defines break values.
This classification method seeks to maximize the similarity between data values within groups and maximize the dissimilarity in data values between groups. It tries to find the "optimal" splits within a dataset.
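The idea of "optimal" splits can be sketched by brute force: try every way of cutting the sorted values into contiguous classes and keep the one with the smallest within-class sum of squared deviations. Because the total variation is fixed, minimizing within-class variation also maximizes between-class variation. This is an illustrative sketch of the objective, not the Jenks algorithm itself, and the function names are made up.

```python
from itertools import combinations

def sum_squared_dev(group):
    """Sum of squared deviations of a group from its own mean."""
    m = sum(group) / len(group)
    return sum((v - m) ** 2 for v in group)

def natural_breaks(values, k):
    """Split sorted values into k contiguous classes with the least within-class variation.

    Brute force over every possible set of cut positions, so it is only
    practical for small inputs.
    """
    data = sorted(values)
    best_cost, best_groups = float("inf"), None
    for cuts in combinations(range(1, len(data)), k - 1):
        bounds = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(sum_squared_dev(g) for g in groups)
        if cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups

values = [1, 2, 3, 10, 11, 12, 30, 31]
classes = natural_breaks(values, 3)
print(classes)
```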
There are many classification methods that are a bit too complex to actually perform in this course.
Algorithm uses random steps to group data into clusters.
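Assuming the algorithm meant here is k-means (a common clustering method with a random starting step), a minimal one-dimensional sketch looks like this. The function name, parameters, and data are illustrative.

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Cluster 1-D values into k groups; returns the sorted cluster centres."""
    rng = random.Random(seed)
    # Random step: pick k distinct observations as the starting centres.
    centres = rng.sample(values, k)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its members
        # (an empty cluster keeps its previous centre).
        new_centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return sorted(centres)

centres = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
print(centres)
```

On less cleanly separated data the result can depend on the random starting centres, so the algorithm is usually run several times.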
Used for automated detection of outliers.
Fit training data to user defined categories.
Multiple trees (>100) can be averaged to increase performance and generalization.
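The idea of combining many trees can be sketched with a toy ensemble: each "tree" is a single-split decision stump trained on a bootstrap sample, and the ensemble predicts by majority vote (the classification analogue of averaging). This is a simplified stand-in, not a full random forest, which would also grow deeper trees and randomize the features tried at each split. All names and data are made up.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a depth-1 tree on (x, label) pairs: one threshold, one label per side."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    # Baseline: always predict the majority label (no split).
    best = (sum(y != majority for _, y in sample), None, majority, majority)
    xs = sorted({x for x, _ in sample})
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x < t else right) != y for x, y in sample)
            if errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: majority if t is None else (left if x < t else right)

def fit_forest(data, n_trees=101, seed=0):
    rng = random.Random(seed)
    # Each tree sees a bootstrap sample: len(data) draws with replacement.
    return [fit_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]

def predict(forest, x):
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Training data: label 0 for small x, label 1 for large x.
data = [(x, 0) for x in (1, 2, 3, 4, 5)] + [(x, 1) for x in (7, 8, 9, 10, 11)]
forest = fit_forest(data)
print(predict(forest, 0), predict(forest, 12))
```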
One of the most complex methods.
Unsupervised classification methods typically require more user input than supervised classification methods.