Finding patterns in complex datasets, emphasizing aspects we think are important.
Operation | Nominal | Ordinal | Interval | Ratio |
Equality | x | x | x | x |
Counts/Mode | x | x | x | x |
Rank/Order | | x | x | x |
Median | | ~ | x | x |
Add/Subtract | | | x | x |
Mean | | | x | x |
Multiply/Divide | | | | x |
Highlight the “central” feature in a dataset.
Data is ranked to find the center point
These statistics give context to a measure of central tendency.
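As a minimal sketch, the common measures of central tendency can be computed with Python's standard `statistics` module. The observations are hypothetical and were chosen so that one large value pulls the mean away from the median.

```python
import statistics

# Hypothetical observations (not from the lecture data).
values = [2, 3, 3, 4, 5, 6, 40]

mean_value = statistics.mean(values)      # arithmetic average, sensitive to outliers
median_value = statistics.median(values)  # middle value once the data is ranked
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)
```

The single outlier (40) moves the mean well above the median, which is one reason the median is often preferred for skewed data.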
Frequency of occurrence, typically in a qualitative dataset.
Probability of occurrence in a quantitative dataset.
±1 \(\sigma\) | 68% obs. |
±2 \(\sigma\) | 95% obs. |
±3 \(\sigma\) | 99.7% obs. |
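These percentages follow from the normal distribution itself. As a short check, the share of observations within \(k\) standard deviations of the mean equals \(\mathrm{erf}(k/\sqrt{2})\); the helper name `share_within` is only illustrative.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.1%}")
```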
Sorts and groups data into bins of consistent width.
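A minimal sketch of equal-width binning, assuming the bins span the minimum to the maximum of the data and that the data contains at least two distinct values. The function name and the sample values are hypothetical.

```python
def histogram_counts(values, n_bins):
    """Count observations in n_bins bins of equal width spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    counts = [0] * n_bins
    for v in values:
        # The maximum would land one past the last bin, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]  # hypothetical observations
counts = histogram_counts(values, 3)
print(counts)
```

Changing `n_bins` changes the shape of the histogram, so it is worth trying a few widths before interpreting the result.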
Data rarely fits a normal distribution perfectly:
Skew: asymmetry relative to a normal distribution
One tail is longer, often with outliers
Kurtosis: tail weight and peakedness relative to a normal distribution
Values are more dispersed or more clustered than expected
Near Normal
Skewed Normal
Highly Skewed
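Skew and kurtosis can be quantified from the central moments of the data. The sketch below uses the simple moment-based (population) definitions; statistical packages often apply small-sample corrections, so their values may differ slightly. The sample lists are hypothetical.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Positive for a long right tail, negative for a long left tail, 0 if symmetric."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Kurtosis minus 3, so that a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical, no skew
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, one large outlier

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```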
If you are working with nominal or ordinal data, and you want to see how many observations you have for each category, you would use:
Scaling a variable by another can reveal hidden patterns in our data.
Highly Correlated
No Correlation
Not always straightforward to account for multiple variables.
Allow us to compare between two or more variables in different units / scales.
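The slide does not name a method here. One common choice, assumed for this sketch, is the z-score: subtract the mean and divide by the standard deviation so each variable is expressed in standard deviations from its own mean. The variable names and values are hypothetical.

```python
import statistics

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

income_dollars = [30000, 45000, 60000]  # hypothetical
commute_minutes = [10, 25, 40]          # hypothetical

print(z_scores(income_dollars))
print(z_scores(commute_minutes))
```

Although the two variables use different units and magnitudes, their z-scores are directly comparable.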
Which of these countries has the highest population density? Populations and Areas are approximate.
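Population density is population divided by area. The figures from the question are not reproduced here, so the sketch below uses placeholder regions with made-up numbers; substitute the real populations and areas to answer the question.

```python
# Placeholder regions with made-up figures (not the countries in the question).
regions = {
    "Region A": {"population": 5_000_000, "area_km2": 50_000},
    "Region B": {"population": 40_000_000, "area_km2": 10_000_000},
    "Region C": {"population": 1_000_000, "area_km2": 2_000},
}

# Normalize population by area to get people per square kilometre.
density = {name: r["population"] / r["area_km2"] for name, r in regions.items()}
densest = max(density, key=density.get)

print(density)
print(densest)
```

In this made-up example the region with the largest population is the least dense, which is the kind of pattern that only appears after normalizing.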
Unsupervised:
Supervised:
Vancouver dissemination areas by population total:
Not classified
One of the simplest classification schemes.
Another of the simplest classification schemes.
Slightly more complex classification scheme.
More complex: data is split using the Jenks natural breaks algorithm.
Informative to “experts”, but not accessible for all audiences.
Supervised: User defines break values.
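A minimal sketch of classification with user-defined breaks, using the standard `bisect` module. It assumes each break is the inclusive upper bound of a class, with a final open-ended class above the last break. The break values and populations are hypothetical.

```python
import bisect

def classify(value, breaks):
    """Return the 0-based class index for value, given ascending upper-bound breaks."""
    return bisect.bisect_left(breaks, value)

breaks = [500, 1000, 2000]                # hypothetical user-defined break values
populations = [250, 500, 501, 1500, 5000]  # hypothetical observations

classes = [classify(p, breaks) for p in populations]
print(classes)
```

Because the user picks the breaks, they can be set to meaningful thresholds (for example, a policy cutoff) rather than to values derived from the data.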
This classification method seeks to maximize the similarity between data values within groups and maximize the dissimilarity in data values between groups. It tries to find the “optimal” splits within a dataset.
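The idea can be sketched by brute force: try every way of cutting the sorted data into contiguous groups and keep the split with the smallest total squared deviation from the group means. This is an illustration for small datasets only, not the optimized procedure used in GIS software, and the function names and sample values are hypothetical.

```python
from itertools import combinations

def sum_squared_deviations(group):
    """Total squared distance of the values in a group from the group mean."""
    mean = sum(group) / len(group)
    return sum((v - mean) ** 2 for v in group)

def natural_breaks(values, n_classes):
    """Split sorted values into n_classes contiguous groups with minimal within-group variation."""
    data = sorted(values)
    best_groups, best_cost = None, float("inf")
    # Each combination picks the positions where one class ends and the next begins.
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        edges = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(edges, edges[1:])]
        cost = sum(sum_squared_deviations(g) for g in groups)
        if cost < best_cost:
            best_groups, best_cost = groups, cost
    return best_groups

values = [1, 2, 3, 10, 11, 12, 30, 31]  # hypothetical observations
groups = natural_breaks(values, 3)
print(groups)
```

Minimizing the variation within groups is equivalent to maximizing the variation between them, since the total variation of the dataset is fixed.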
There are many classification methods that are a bit too complex to actually perform in this course.
Algorithm uses random steps to group data into clusters.
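The slide does not name the algorithm. Assuming it refers to k-means clustering, a one-dimensional sketch looks like this: start from randomly chosen observations, assign each value to its nearest center, move each center to the mean of its cluster, and repeat until nothing changes. The function name, seed, and data are hypothetical.

```python
import random

def k_means_1d(values, k, seed=0, max_iter=100):
    """Group 1-D values into k clusters; returns (centers, clusters)."""
    rng = random.Random(seed)
    # Random step: start from k randomly chosen observations.
    centers = rng.sample(values, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to its cluster mean; keep the old center if a cluster is empty.
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

values = [1, 2, 3, 100, 101, 102]  # hypothetical observations
centers, clusters = k_means_1d(values, 2)
print(sorted(centers))
```

Because the starting centers are random, different seeds can give different clusters on less clearly separated data, so results are often compared across several runs.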
Frequently used for automated outlier detection.
Fit training data to user defined categories.
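Assuming this slide refers to decision trees, the simplest case is a tree with a single split (a "stump"): find the threshold on one variable that misclassifies the fewest labelled training points. The function names, feature, and labels are hypothetical, and the sketch assumes at least two distinct feature values.

```python
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(xs, labels):
    """Return (threshold, left_label, right_label) with the fewest training errors."""
    best = None
    ordered = sorted(set(xs))
    # Candidate thresholds sit halfway between neighbouring feature values.
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [lab for x, lab in zip(xs, labels) if x <= threshold]
        right = [lab for x, lab in zip(xs, labels) if x > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]

def predict(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label

heights_m = [2, 3, 4, 15, 18, 22]  # hypothetical training feature
labels = ["shrub", "shrub", "shrub", "tree", "tree", "tree"]  # user-defined categories

stump = fit_stump(heights_m, labels)
print(stump, predict(stump, 5), predict(stump, 20))
```

A full decision tree repeats this search within each branch, splitting again until the groups are pure enough or a depth limit is reached.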
Multiple trees (>100) can be averaged to increase performance and generalization.
One of the most complex methods: an algorithm learns complex relationships in a dataset.
Unsupervised classification methods typically require more user input than supervised classification methods.