Uncertainty = Accuracy + Precision + Ambiguity + Vagueness + Logical Fallacies
Arises from our inability to measure phenomena perfectly and flaws in our conceptual models.
There is no standardized measure of data quality in GIS.
In baking, mistakes are obvious. In GIS, they often are not.
Data must be assessed on a case-by-case basis.
Accuracy and precision are related terms, but the distinction between them is important.
Accuracy: The degree to which a set of measurements correctly matches the real world values.
Precision: The degree of agreement between multiple measurements of the same real world phenomenon.
Accuracy relates to bias: the systematic errors in a measurement. Precision relates to the random (unbiased) errors in a measurement.
Statistical methods can be used to describe uncertainty.
These quantify the offset (bias) and dispersion (random error) of data points.
They won’t tell us whether we are correct with 100% certainty, but they can give us some insight.
Mean Absolute Error (MAE):
\(MAE = \frac{\sum_{i=1}^N \lvert{x_i-t_i}\rvert}{N}\)
\(x_i\) = ith sample value
\(t_i\) = ith true value
\(N\) = number of samples
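As a minimal illustration, the MAE formula can be computed directly with NumPy; the elevation values below are made up for the example:

```python
import numpy as np

# Hypothetical measured elevations (x) and surveyed "true" elevations (t), in metres
x = np.array([101.2, 99.8, 100.5, 102.1, 98.9])
t = np.array([100.0, 100.0, 100.0, 100.0, 100.0])

# MAE: the average magnitude of the deviations from the true values
mae = np.mean(np.abs(x - t))
print(f"MAE = {mae:.2f} m")
```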
Root Mean Squared Error (RMSE):
Penalizes large deviations more harshly, because the error for each sample is squared.
\(RMSE = \sqrt{\frac{\sum_{i=1}^N \left({x_i-t_i}\right)^2}{N}}\)
\(x_i\) = ith sample value
\(t_i\) = ith true value
\(N\) = number of samples
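Using the same hypothetical values as the MAE sketch above, RMSE can be computed as:

```python
import numpy as np

# Same hypothetical measured (x) and true (t) elevations as in the MAE example
x = np.array([101.2, 99.8, 100.5, 102.1, 98.9])
t = np.array([100.0, 100.0, 100.0, 100.0, 100.0])

# RMSE: squaring the deviations weights large errors more heavily than MAE does
rmse = np.sqrt(np.mean((x - t) ** 2))
print(f"RMSE = {rmse:.2f} m")
```

For any set of deviations, RMSE is greater than or equal to MAE, with the gap growing as the deviations become more uneven.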
Standard Deviation (\(\sigma\)):
\(\sigma=\sqrt{\frac{\sum_{i=1}^N \left({x_i-\overline{X}}\right)^2}{N}}\)
\(x_i\) = ith sample value
\(\overline{X}\) = the mean of all samples
\(N\) = number of samples
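A corresponding sketch for the (population) standard deviation, again using made-up values:

```python
import numpy as np

x = np.array([101.2, 99.8, 100.5, 102.1, 98.9])

# Population standard deviation (divide by N), matching the formula above
sigma = np.sqrt(np.mean((x - x.mean()) ** 2))
print(f"sigma = {sigma:.2f} m")

# NumPy's built-in np.std uses the same N denominator by default
print(f"np.std(x) = {np.std(x):.2f} m")
```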
Confidence Intervals (CI) can be used to specify a confidence level (e.g., 90%, 95%):
\(CI = \frac{\sigma}{\sqrt{N}} z\)
\(\sigma\) = standard deviation
\(N\) = number of samples
\(z\) = a z-score
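A minimal sketch of the CI formula, assuming a 95% confidence level and using SciPy to look up the z-score (same made-up elevation values as above):

```python
import numpy as np
from scipy import stats

x = np.array([101.2, 99.8, 100.5, 102.1, 98.9])

sigma = np.std(x)          # standard deviation
n = len(x)                 # number of samples
z = stats.norm.ppf(0.975)  # two-tailed z-score for a 95% confidence level (~1.96)

ci = (sigma / np.sqrt(n)) * z
print(f"mean = {x.mean():.2f} ± {ci:.2f} m (95% CI)")
```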
Inter Quartile Range (IQR): the spread between the 25th percentile (Q1) and the 75th percentile (Q3) of the samples, \(IQR = Q_3 - Q_1\).
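A quick sketch of the IQR using NumPy's percentile function (same made-up values):

```python
import numpy as np

x = np.array([101.2, 99.8, 100.5, 102.1, 98.9])

# IQR = Q3 - Q1: the range covered by the middle 50% of the samples
q1, q3 = np.percentile(x, [25, 75])
print(f"IQR = {q3 - q1:.2f} m")
```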
Vagueness: When something is not clearly stated or defined.
Ambiguity: When something can reasonably be interpreted in multiple ways.
Vagueness and ambiguity are difficult to quantify numerically, but they must be addressed whenever possible.
The key with these issues is to understand where uncertainty comes from and what we can do to minimize it.
Some sources of error are out of our control. The instruments we use to collect data can only be so precise.
The concentration of samples in space and time dictates the level of accuracy & precision you can attain.
Others occur when we create data.
Errors that arise when creating vector features:
Digitization errors arise when we manually create features.
Geographic phenomena often don’t have clear, natural units. We are often forced to assign boundaries (e.g., Census data).
Much of the data we use to learn about society is released in aggregate: average values for many individuals within a group or area (e.g., Census data).
Even with “perfect” data, GIS operations add uncertainty.
Logical fallacy: A flaw in our reasoning that undermines the logic of our argument.
Ecological fallacy: Taking data collected or presented in aggregate for a group or region and applying it to an individual or specific place.
Occurs when we take aggregated data and aggregate it again at a higher level.
The US Electoral College is an example of this in practice:
Modifiable, arbitrary boundaries can have a significant impact on descriptive statistics for areas.
Data collected at a finer level of detail is combined into larger, lower-detail areas whose boundaries can be manipulated.
Gerrymandering exploits the modifiable areal unit problem to skew election results.
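A small hypothetical sketch of this effect: the same underlying values grouped under two different boundary schemes produce different zone averages (all numbers are invented for illustration):

```python
import numpy as np

# Hypothetical: eight household incomes (in $1000s) along a street, west to east
incomes = np.array([30, 35, 40, 45, 55, 60, 80, 95])

# Zoning scheme A: boundary down the middle
zone_a = [incomes[:4].mean(), incomes[4:].mean()]

# Zoning scheme B: boundary shifted two households east
zone_b = [incomes[:6].mean(), incomes[6:].mean()]

print("Scheme A zone means:", zone_a)  # [37.5, 72.5]
print("Scheme B zone means:", zone_b)  # [~44.2, 87.5]
# Same data, different areal units, different descriptive statistics
```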
Errors are cumulative: uncertainty introduced at each step of data collection and analysis carries through to the final result.