Uncertainty in GIS

Uncertainty = Accuracy + Precision + Ambiguity + Vagueness + Logical Fallacies

Arises from our inability to measure phenomena perfectly and flaws in our conceptual models.

Data quality: Instrument limitations, sampling costs, etc.
Generalizations when representing phenomena: i.e., Bonini's Paradox
Incomplete knowledge, misunderstandings, biases, etc.

There is no standardized measure of data quality in GIS.

Flaws may pass through many users before discovery.
Must trust the data was collected and processed correctly.
Risk of the users misinterpreting valid products.

When baking mistakes are obvious.
Often in GIS, they are not.

Data must be assessed on a case by case basis.

Is the resolution of the data sufficient for the scale of my analysis?

What is the smallest feature I need to be able to resolve?

Is this the most up to date version of this dataset?

When was the data published? Has it been updated since?

Can I trust the organization publishing this data?

Who published the data? Are they a private or public organization? Might they have any biases that could impact data quality?

The terms are related, but the distinction is very important.

Accuracy: The degree to which a set of measurements correctly matches the real world values.

How close is a measurement to the real value?

Precision: The degree of agreement between multiple measurements of the same real world phenomena.

How repeatable is a measurement?

Accuracy: A systematic (consistent) offset from the real world value.

Errors have a bias

Precision: A random offset from the real world value;

Errors are unbiased

Accuracy and precision can be quantified.

TopHat Question 1

Accuracy is biased, it is related to systematic errors in a measurement. Precision is unbiased, it is related to ___ errors in a measurement

Skewed
Random
Systematic
Weird

Statistical methods can be used to quantify error.

These measures won't tell us for sure that we are correct, but they can give us some insight.
We can quantify any offset (bias) and the spread of the measurements (unbiased)

Mean Absolute Error (MAE):

The absolute value of the error averaged across all samples.
How close the samples (observations or estimate) are to be to the true values.

$MAE = \frac{\sum_{i=1}^N \lvert{x_i-t_i}\rvert}{N}$

$x_i$ = the ith sample value
$t_i$ = the ith true value
$N$ = the total number of samples

Mean Squared Error (MSE):

The squared error averaged across all samples.
Similar to MAE, but:

More harshly penalizes large deviations
MSE is not in the same units as the original data.

$MSE = \frac{\sum_{i=1}^N \left({x_i-t_i}\right)^2}{N}$

$x_i$ = the ith sample value
$t_i$ = the ith true value
$N$ = the total number of samples

Root Mean Squared Error (RMSE):

The square root of the squared error averaged across all samples
Same as MSE, except:

With the square root, RMSE is in the same units as the original data

Frequently the preferred metric

No need to memorize equation, but do try to understand what it means

$RMSE = \sqrt{\frac{\sum_{i=1}^N \left({x_i-t_i}\right)^2}{N}}$

$x_i$ = the ith sample value
$t_i$ = the ith true value
$N$ = the total number of samples

Standard Deviation ($\sigma$):

The square root of squared deviation averaged across all samples.
Similar to RMSE, except:

Instead of characterizing error, it quantifies the dispersion of a dataset.

$\sigma=\sqrt{\frac{\sum_{i=1}^N \left({x_i-\overline{X}}\right)^2}{N}}$

$x_i$ = the ith sample value
$\overline{X}$ = the mean of all samples
$N$ = the total number of samples

Confidence Intervals (CI):

Used to convey our confidence in an estimated average value.

i.e. $\overline{X} = \frac{\sum_{i=1}^Nx_i}{N}$

If the $x_i$ values are close together, we have higher confidence in $\overline{X}$.
If the $x_i$ values are more dispersed, we have lower confidence in $\overline{X}$.

$CI = \frac{\sigma}{\sqrt{N}} z$

$\sigma$ = the standard deviation
$N$ = the total number of samples
$z$ = a z-score

Confidence Intervals (CI):

For Normally Distributed data:

Can be used to specify a confidence level (i.e., 90%, 95%, etc).
Typically presented as a range around the mean.

Inter Quartile Range (IQR):

For Non-normally Distributed data:

Cannot be used to specify a confidence level.
But it can give us some idea of the dispersion in a dataset.

TopHat Question 2

Which of the following metrics can be used to describe the accuracy of an estimate?

Confidence Interval
Root Mean Squared Error
Standard Deviation
Inter Quartile Range

The terms are also related. The distinction is subtle and can be confusing. They aren't synonymous, but you can essentially treat them as such. Often when something is vague, it is also ambiguous, and vice versa.

Vagueness: When a definition is not clearly stated or defined.

Arises when boundaries or labels are poorly defined.

Ambiguity: When something can reasonably be interpreted in multiple ways.

Arises when terms or labels can apply to multiple features.

These statements are vague because they lack detail. They are ambiguous because they have multiple interpretations.

"I'm going to London."

London, UK?
London, Ontario?
London Drugs?

"I'm at the bank".

A financial institution?
The edge of a river?

The position of objects are unclear or changeable.

Where does a forest end and a grassland begin?
Coastal boundary file

High tide?
Low tide?
Mean water level?

Ambiguity and vagueness are difficult to quantify numerically. But they still must be addressed whenever possible.

The key with these issues:

Be explicit and transparent when conducting and communicating your work.

Present things clearly and thoroughly.
Try to think through possible misconceptions.

Where does uncertainty come from and what can we do to minimize it?

Uncertainty = Accuracy + Precision + Ambiguity + Vagueness + Logical Fallacies

Accuracy: biases in measurements
Precision: random errors in measurements
Ambiguity: multiple interpretations
Vagueness: unclear definition
Logical Fallacies: any issue not included elsewhere

Some sources of error are out of our control. The instruments we use to collect data can only so precise.

It is actually impossible to know all the physical quantities of an object.

Heisenberg Uncertainty Principle: you cannot concurrently measure a particles exact position and momentum!

The concentration of samples in space and time dictates the level of accuracy & precision you can attain.

Lower resolution data is by definition less precise, but not necessarily less accurate.
High resolution data can be very precise, but have a significant bias.
What do you do if your data are collected at different resolutions?
Low resolution result in vagueness & ambiguity.

Things we do have some control over:

Input datasets & sample sizes.
Typos during tabular data entry.
Digitizing errors when creating vector features.

Gaps: there should be a feature, but there is not.
Overlaps: one polygon sits over another polygon.
Slivers: a feature is created between two features when it should not be.

Errors that arise when creating vector features:

Slivers: a feature is created between two features when it should not be.
Gaps: there should be a feature, but there is not.
Overlaps: one polygon sits over another polygon.

Errors that arise when creating vector features:

Under/overshoots: vertex misses a connection.
Extra nodes: unnecessary vertices.
Missing features: features or attributes missing.

TopHat Question 3

Digitization errors arise when we manually create features.

True
False

Since geographic phenomena often don’t have clear, natural units, we are often forced to assign zones and labels in our work (i.e. Census Data).

A convenient way of simplifying complex processes.
Often vague and/or ambiguous.
May be difficult to defend because they are arbitrary.
Where to draw a boundary and what to call a zone are likely to vary significantly between different people/groups.

Much of the data we use to learn about society is collected in aggregate. We take average values for many individuals within a group or area (i.e. Census Data).

Lets us explore the general attributes (e.g. mean, median, etc.) for each group/area.
Compare the attributes between different groups or regions
Determine the allocation of resources.

Even with "perfect" data; GIS operations can add uncertainty:

Re-projecting to different coordinate systems.
Converting between data types.
Perform generalizations (i.e. data classification)

missing

A flaw in our reasoning that undermines the logic of our argument.

Can be made both by accident and on purpose.
Often lacking evidence to support the claims/decisions made, or evidence is presented in a misleading way.
“Hasty generalizations” are an example of logical fallacies:

‘I saw a violent protester on TV … Protesters are inciting violence.’

Applying data collected/presented in aggregate for a group/region and applying it to an individual or specific place.

You cannot make an assumption about individuals within a group based on the aggregate data for that group.
Census data is averaged for an area

The information about individual values is lost

Basically: don't make assumptions

When you assume ... you make an ass of u and me

When we take aggregated data and aggregate it again at a higher level. You can't take the average of averages.

The US Electoral College is an example of this in practice:

Totaling votes per state ... then totaling "delegates" by state.

Imagine a population of turtles living on logs in a pond. You work for the turtle census and visit each log and ask the turtles their age.

Log 1: (10, 15, 15, 15, 20)

Mean age = 15

Log 2: (5, 15)

Mean age = 10

Mean age of all turtles

(10 + 15 + 15 + 15 + 20 + 15 + 10)/7= 13.571

Mean age by log

(15 + 10)/2 = 12.5

Modifiable, arbitrary boundaries can have a significant impact on descriptive statistics for areas.

When areas are grouped together, the way you choose to group them can change the values of the groups.
See this video for more context.

Data collected at a finer level of detail is being combined into larger areas of lower detail that can be manipulated.

Used to imply things that are not necessarily true.
Serious ethical implications.
Gerrymandering is a prime example.

TopHat Question 4

Gerrymandering exploits the atomistic fallacy to skew election results.

True
False

Errors are cumulative:

Uncertainty will propagate through the model/system.
Errors made early on will be compounded as the move through a project!
Some errors are unavoidable

So we need to be careful to avoid the ones we can!

Uncertainty in GIS

Uncertainty

Data Quality

Data Quality

Error: Accuracy & Precision

Error: Accuracy & Precision

TopHat Question 1

Quantifying Error

Quantifying Accuracy

Quantifying Accuracy

Quantifying Accuracy

Quantifying Precision

Quantifying Precision

Quantifying Precision

Quantifying Precision

TopHat Question 2

Vagueness and Ambiguity

Vagueness and Ambiguity

Vagueness and Ambiguity

Qualifying Vagueness and Ambiguity

Sources of Uncertainty

Measurements

Data Resolution

Data Entry

Digitization Errors

Digitization Errors

TopHat Question 3

Labels and Boundaries

Data Aggregation

Conversion and Processing

Logical Fallacies

Ecological Fallacy

Atomistic Fallacy

Atomistic Fallacy

Modifiable Areal Unit Problem

Modifiable Areal Unit Problem

Modifiable Areal Unit Problem

Modifiable Areal Unit Problem

Modifiable Areal Unit Problem

TopHat Question 4

Error Propagation