Classifying Data

Finding patterns in complex datasets, emphasizing aspects we think are important.

Descriptive Statistics

Operation        Nominal  Ordinal  Interval  Ratio
Equality            x        x        x        x
Counts/Mode         x        x        x        x
Rank/Order                   x        x        x
Median                       ~        x        x
Add/Subtract                          x        x
Mean                                  x        x
Multiply/Divide                                x

Measures of Central Tendency

Highlight the “central” feature in a dataset.

  • Mode: The most frequent value in a set
  • Median: The middle value in a set
    • Data is ranked to find the center point
      • 50% above, 50% below
    • Not impacted by outliers
  • Mean: The sum of all values divided by the number of values
    • Impacted by outliers
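A minimal sketch (not from the slides): Python's built-in statistics module computes all three measures. The sample below is hypothetical, with 200 acting as an outlier.

```python
import statistics

values = [2, 3, 3, 5, 7, 11, 200]   # hypothetical sample with one outlier (200)

print(statistics.mode(values))      # 3  -> most frequent value
print(statistics.median(values))    # 5  -> middle value, unaffected by the outlier
print(statistics.mean(values))      # 33 -> pulled upward by the outlier
```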

Measures of Dispersion

These statistics give context to a measure of central tendency.

  • Range: Difference between the maximum and minimum
  • Inter-quartile Range: 75th to 25th percentile
    • Spread around the median, not influenced by outliers
  • Standard Deviation: \(\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}\)
    • Spread around the mean (\(\mu\)), influenced by outliers
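A matching sketch in NumPy on the same hypothetical sample; note that ndarray.std() defaults to the population form (dividing by N), which matches the formula above.

```python
import numpy as np

x = np.array([2, 3, 3, 5, 7, 11, 200])   # hypothetical sample with one outlier

data_range = x.max() - x.min()                      # range: 198
iqr = np.percentile(x, 75) - np.percentile(x, 25)   # inter-quartile range, robust to the outlier
sigma = x.std()                                     # standard deviation, inflated by the outlier
print(data_range, iqr, sigma)
```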

Frequency Distribution

Frequency of occurrence, typically in a qualitative dataset.

  • Bar charts help visualize frequency distributions
    • Counts per category

Probability Distribution

Probability of occurrence in a quantitative dataset.

  • Normal Distribution:
    • Idealized, based on distance from the mean in standard deviations.
    • Assumed in many statistical tests.

  • ±1 \(\sigma\): 68% of observations
  • ±2 \(\sigma\): 95% of observations
  • ±3 \(\sigma\): 99.7% of observations
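These percentages can be checked against the standard normal CDF. A minimal sketch using only Python's math module (the only assumption beyond the slides is the identity \(P(|Z| \le k) = \operatorname{erf}(k/\sqrt{2})\)):

```python
import math

# Probability that a standard normal value falls within ±k standard deviations
for k in (1, 2, 3):
    prob = math.erf(k / math.sqrt(2))
    print(f"±{k} sigma: {prob:.1%}")   # ~68.3%, ~95.4%, ~99.7%
```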

Histograms

Sorts and groups data into bins of consistent width.

  • Approximate a probability distribution
    • Grouping data into classes
    • Outlier detection
  • Not the same as bar charts
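As an illustration (not part of the slides), numpy.histogram groups a hypothetical sample into ten bins of consistent width and returns the count per bin.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1000)   # hypothetical, roughly normal sample

counts, edges = np.histogram(data, bins=10)        # ten bins of equal width spanning the data
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:6.2f}, {right:6.2f})  {count}")
```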

Deviating from the Norm

Data rarely fits a normal distribution perfectly:

  • Skew: asymmetry relative to a normal distribution
    • Tails with outliers
  • Kurtosis: how dispersed or clustered values are relative to a normal distribution
Figures: Near Normal, Skewed Normal, and Highly Skewed distributions.
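As a rough illustration, the sketch below uses SciPy (an assumed library, not named in the slides) to compare skewness and excess kurtosis for a hypothetical near-normal sample and a hypothetical right-skewed sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
near_normal = rng.normal(size=5000)           # symmetric: skew and excess kurtosis near 0
highly_skewed = rng.exponential(size=5000)    # long right tail: positive skew, heavy-tailed

for name, sample in [("near normal", near_normal), ("highly skewed", highly_skewed)]:
    print(name,
          "skew:", round(float(stats.skew(sample)), 2),
          "excess kurtosis:", round(float(stats.kurtosis(sample)), 2))
```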

TopHat Question 1

If you are working with nominal or ordinal data, and you want to see how many observations you have for each category, you would use:

  • A histogram to plot a probability distribution
  • A histogram to plot a frequency distribution
  • A bar chart to plot a probability distribution
  • A bar chart to plot a frequency distribution

Normalizing Data

Scaling a variable by another can reveal hidden patterns in our data.

  • Income vs. money spent on food
  • Population vs. shape area

Figures: Highly Correlated; No Correlation.
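A minimal sketch of the idea with hypothetical numbers: dividing population totals by polygon areas turns raw counts into densities that can be compared fairly across differently sized areas.

```python
import numpy as np

# Hypothetical dissemination areas: raw population totals and polygon areas (km^2)
population = np.array([1200.0, 4500.0, 800.0, 9800.0])
area_km2 = np.array([0.4, 2.5, 0.1, 12.0])

density = population / area_km2   # normalize population by area -> people per km^2
print(density)                    # [3000. 1800. 8000. ~817.]
```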

Multiple Confounders

Not always straightforward to account for multiple variables.

  • e.g., COVID by age groups

Standardizing

Allows us to compare two or more variables measured in different units or scales.

  • \(z = \frac{x - \mu}{\sigma}\)
  • Similar to normalizing:
    • Subtract the mean and divide by the standard deviation for each variable
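A short sketch of standardization with hypothetical income and food-spending values; after the transform both variables are unit-free and directly comparable.

```python
import numpy as np

income = np.array([32_000.0, 48_000.0, 61_000.0, 120_000.0])   # dollars (hypothetical)
food_spend = np.array([4_200.0, 5_100.0, 5_800.0, 9_000.0])    # dollars per year (hypothetical)

# Subtract the mean and divide by the standard deviation for each variable
z_income = (income - income.mean()) / income.std()
z_food = (food_spend - food_spend.mean()) / food_spend.std()
print(z_income)
print(z_food)
```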

TopHat Question 2

Which of these countries has the highest population density? Populations and Areas are approximate.

  • Monaco: Pop (37,000), Area (2 km²)
  • Singapore: Pop (5,500,000), Area (720 km²)
  • China: Pop (1,400,000,000), Area (9,600,000 km²)

Classification Methods

Unsupervised:

  • Data-defined classes: the user decides on the number of classes
  • The rest is left up to an algorithm

Supervised:

  • User-defined: the user explicitly defines classes
  • Or provides a set of classes as training data
  • The degree of user input varies, but is greater than for unsupervised methods

Common Examples in Arc Pro

Vancouver dissemination areas by population total:

Not classified

  • Color scheme is stretched between min/max
  • Difficult to see patterns

Equal Interval

One of the simplest classification schemes.

  • Data is split into classes of equal width based on the range.
  • Unsupervised: user defines number of bins.

Defined Interval

Another of the simplest classification schemes.

  • Data is split into classes of equal width based on the range.
  • Unsupervised: user defines bin width.
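A hedged sketch of both schemes in Python (the values and class counts are hypothetical): equal interval fixes the number of classes, defined interval fixes the class width, and both derive the breaks from the range alone.

```python
import numpy as np

values = np.array([3, 8, 12, 15, 22, 40, 41, 57, 90, 100], dtype=float)  # hypothetical

# Equal interval: a user-defined number of classes of equal width
n_classes = 5
equal_breaks = np.linspace(values.min(), values.max(), n_classes + 1)
print(equal_breaks)    # [3.  22.4  41.8  61.2  80.6  100.]

# Defined interval: classes of a user-defined width, starting at the minimum
width = 25
defined_breaks = np.arange(values.min(), values.max() + width, width)
print(defined_breaks)  # [3.  28.  53.  78.  103.]
```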

Quantiles

A slightly more complex classification scheme.

  • Data is split into classes by percentiles.
    • e.g. 0-20%, 20-40%, … 80-100%.
  • Unsupervised: user defines number of bins.
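A matching sketch for quantile breaks on the same hypothetical values; NumPy's quantile function places the breaks so each class holds roughly the same share of the observations.

```python
import numpy as np

values = np.array([3, 8, 12, 15, 22, 40, 41, 57, 90, 100], dtype=float)  # hypothetical

# Breaks at the 0th, 20th, 40th, 60th, 80th and 100th percentiles (five classes)
quantile_breaks = np.quantile(values, [0, 0.2, 0.4, 0.6, 0.8, 1.0])
print(quantile_breaks)   # roughly 20% of the observations fall in each class
```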

Natural Breaks

More complex: data is split using the Jenks algorithm.

  • Optimizes splits by maximizing within-group similarity and between-group dissimilarity.
    • “Natural” classes.
  • Unsupervised: user defines number of bins.

Standard Deviation

Informative to “experts”, but not accessible to all audiences.

  • “Distance” from mean in standard deviations.
    • Interval data: diverging color maps.
  • Unsupervised: user defines number of bins.
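A small sketch of standard-deviation breaks on the same hypothetical values: class boundaries sit at whole standard deviations above and below the mean, which pairs naturally with a diverging color map.

```python
import numpy as np

values = np.array([3, 8, 12, 15, 22, 40, 41, 57, 90, 100], dtype=float)  # hypothetical
mu, sigma = values.mean(), values.std()

# Break values at -2, -1, 0, +1 and +2 standard deviations from the mean
sd_breaks = mu + sigma * np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(sd_breaks)
```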

Manual Breaks

Supervised: User defines break values.

  • Allows us to choose more meaningful break values if necessary.
  • Incorporate multiple factors
  • Influence the way the data is perceived.

TopHat Question 3

This classification method seeks to maximize the similarity between data values within groups and maximize the dissimilarity in data values between groups. It tries to find the “optimal” splits within a dataset.

  • Manual Breaks
  • Quantiles
  • Natural Breaks
  • Equal Interval
  • Standard Deviation

More Complex Methods

There are many classification methods that are a bit too complex to actually perform in this course.

  • I’m introducing some because it is important to be aware of them.
    • You’ll encounter them if you continue with GIS.

K-means

The algorithm uses random starting points and iterative steps to group data into clusters.

  • Unsupervised: user defines the number of clusters & iterations.
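A minimal k-means sketch using scikit-learn (an assumed library choice, not named in the slides); the attribute values are hypothetical and arranged in the (n_samples, n_features) shape the library expects.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
values = rng.normal(size=(200, 1))   # hypothetical single attribute, one row per observation

# The user supplies the number of clusters and the iteration budget; starting points are random
km = KMeans(n_clusters=4, n_init=10, max_iter=300, random_state=0)
labels = km.fit_predict(values)      # a cluster (class) label for each observation
print(labels[:10])
```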

Median Absolute Deviation

Frequently used for automated outlier detection.

  • Unsupervised: user defines error threshold.
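A sketch of MAD-based outlier flagging on a hypothetical sample; the 0.6745 rescaling and the 3.5 cut-off are common conventions rather than values taken from the slides.

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])   # hypothetical sample; 25.0 is suspect

med = np.median(x)
mad = np.median(np.abs(x - med))          # median absolute deviation
modified_z = 0.6745 * (x - med) / mad     # rescaled so values behave like z-scores under normality

threshold = 3.5                           # user-defined error threshold
print(x[np.abs(modified_z) > threshold])  # [25.]
```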

Decision Trees

Fits training data to user-defined categories.

  • Supervised: user provides training classes.
  • Automated: algorithm determines break values.
  • Risk of over-fitting
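A hedged decision-tree sketch with scikit-learn (an assumed library choice); the two-attribute training data and the "water" / "urban" class labels are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: two attributes per observation, user-defined class labels
X_train = np.array([[0.2, 1.0], [0.4, 1.2], [3.1, 0.3], [2.9, 0.5]])
y_train = np.array(["water", "water", "urban", "urban"])

tree = DecisionTreeClassifier(max_depth=3)   # a shallow tree limits over-fitting
tree.fit(X_train, y_train)                   # the algorithm chooses the break values
print(tree.predict([[3.0, 0.4]]))            # -> ['urban']
```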

Random Forests

Multiple trees (>100) can be averaged to increase performance and generalization.

  • Supervised: user provides training classes and “hyperparameters”.
  • Automated: algorithm determines break values.
  • Low risk of over-fitting
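The same hypothetical training data fit with a random forest; n_estimators and max_depth stand in for the "hyperparameters" mentioned above, and scikit-learn is again an assumed choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.array([[0.2, 1.0], [0.4, 1.2], [3.1, 0.3], [2.9, 0.5]])  # hypothetical attributes
y_train = np.array(["water", "water", "urban", "urban"])              # hypothetical classes

# Many trees, each fit to a random resample, are combined by majority vote at prediction time
forest = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
forest.fit(X_train, y_train)
print(forest.predict([[0.3, 1.1]]))          # -> ['water']
```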

Landscape Classification

Figure: an example application of random forests to landscape classification.

Neural Networks

One of the most complex methods: an algorithm learns complex relationships in a dataset.

  • Supervised: user provides training classes and “hyperparameters”.
  • Risk of over-fitting
    • Requires careful inspection
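A small neural-network sketch using scikit-learn's MLPClassifier (an assumed choice) on the same hypothetical training data; the hidden-layer size, solver, and iteration limit are the kind of user-supplied hyperparameters referred to above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.array([[0.2, 1.0], [0.4, 1.2], [3.1, 0.3], [2.9, 0.5]])  # hypothetical attributes
y_train = np.array(["water", "water", "urban", "urban"])              # hypothetical classes

# Hidden-layer size, solver and iteration limit are user-supplied hyperparameters
net = MLPClassifier(hidden_layer_sizes=(16,), solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X_train, y_train)                    # learns a (possibly non-linear) decision boundary
print(net.predict([[0.3, 1.1]]))             # predicted class label for a new observation
```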

TopHat Question 4

Unsupervised classification methods typically require more user input than supervised classification methods.

  • True
  • False