1 - Using Python for image analysis


Both R and Python have their pros and cons. Python is an elegant, all-purpose language better suited to handling geospatial data (and is sometimes used as the scripting language for proprietary geospatial software, e.g., ESRI's ArcMap), while R has a vast library of specialized packages for statistical analysis and data visualization. The open-source Python packages 'rasterio' and 'fiona' are pythonic interfaces to the 'Geospatial Data Abstraction Library (GDAL)', which is written in C++. Rasterio is designed to handle raster data, while fiona is meant for vector data. Similarly, the open-source Python package 'shapely' is a pythonic interface to the C++ library 'Geometry Engine Open Source (GEOS)' and is meant to handle vector data. 'Geopandas' is a Python package that "extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. Geopandas further depends on fiona for file access and descartes and matplotlib for plotting" (Geopandas developers, 2019). Although beyond the scope of this manual, fiona, shapely, and geopandas offer functionality to manipulate, query, transform, and visualize vector data sets.

Although Python boasts a number of packages that implement a wide variety of algorithms (e.g., scikit-learn), certain functionality available in R's caret package is not readily available in open-source Python packages. Scikit-learn offers a long list of supervised and unsupervised learning models. However, unlike caret, which offers functionality for automatically testing and tuning certain parameter values for each model, scikit-learn takes a more hands-on approach to parameter tuning, requiring in-depth knowledge of each and every model the researcher intends to run and compare. Scikit-learn offers two 'brute force' options for parameter tuning: exhaustive grid search and randomized parameter optimization. Both options require that the researcher specify which parameters should not accept default values and define the valid space for those parameters. Alternatives to brute-force parameter searches are available for only a short list of models.
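As an illustration, the following sketch uses scikit-learn's exhaustive grid search to tune a random forest classifier. The parameter grid and the synthetic data are purely illustrative; the researcher must choose parameters and value ranges appropriate to each model.

In [ ]:
# A minimal sketch of scikit-learn's exhaustive grid search
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with real predictors and class labels
X, y = make_classification(n_samples=100, random_state=0)

# The researcher must name the parameters to tune and define their valid space
param_grid = {'n_estimators': [50, 100],
              'max_features': ['sqrt', 'log2']}

# Try every combination in `param_grid` with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)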

In order to combine the strengths of both languages while catering to a Python-oriented coder, this manual makes use of the Python package 'rpy2', a "high-level interface designed to facilitate the use of R by Python programmers. R objects are exposed as instances of Python-implemented classes, with R functions as bound methods to those objects in a number of cases. Users of the Python signature numerical package 'numpy' can continue using the data structures they are familiar with and share objects seamlessly with R" (Gautier & rpy2 contributors, 2016). This extension package embeds R within the Python process and, as a result, might incur slower processing times than other non-embedded options, such as 'pyRserve'. Although other options exist, rpy2 was chosen for its relative ease of installation and setup, and for the recent development attention the project has received. This package is implemented in section 5 of this manual.
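For orientation, a minimal sketch of the rpy2 interface looks like the following (assuming rpy2 is installed as described in section 2.2):

In [ ]:
# Evaluate a snippet of R code from Python and read back the result
from rpy2 import robjects

r_mean = robjects.r('mean(c(1, 2, 3, 4))')

# R vectors are exposed as indexable Python objects
print(r_mean[0])  # 2.5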

2 - Python workspace setup


To run the code included with this documentation, the following steps must be taken:

  1. Download Anaconda
  2. Create a virtual environment
  3. Open the Spyder IDE and run the code

Once these steps have been completed, please refer to section 2.4 ('How to re-open this project') to re-open the environment for a project that has already been created.

2.1 - Download Anaconda

Either the full Anaconda Distribution or Miniconda can be downloaded for use in this project. When selecting an installation to download, be sure to use the most recent version of Python. Keep all the default settings during installation.

These two options are similar in that they include a command prompt, a package manager (conda), and a virtual environment manager. The full distribution is a much larger download and includes extra software, over 1,500 pre-installed packages (many of which will go unused), as well as a graphical user interface. Miniconda comes with the same functionality but without extra software and only a handful of necessary packages. Overall, Miniconda is easier to set up.

About the Anaconda Prompt

Once installed, Anaconda is used via the Anaconda Prompt. This can now be found in system applications. All commands included in this documentation are executed from this prompt. The Integrated Development Environment (IDE) must also be launched from this terminal.

After the initial configuration of the project, it is only necessary to remember a handful of commands in order to launch the IDE and start working. These are listed in section 2.4 of this manual.

About the package manager: Conda

Conda is a tool that manages packages and their dependencies automatically. It functions like Python's pip or the RStudio Package Manager. All commands related to packages are executed by calling conda <...> in the Anaconda Prompt. Packages are installed and maintained in a library within the currently activated virtual environment. When in doubt, conda help can be entered into the Anaconda Prompt to display a list of available commands.
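For example, the following commands install, list, and remove a package in the currently active environment (numpy here stands in for any package name):

conda install numpy
conda list
conda remove numpy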

Documentation for conda can be found at https://docs.conda.io.

2.2 - Create a virtual environment

A virtual environment is a self-contained folder that allows a project to run independently of other software installed on a system. An environment is typically created for each project and makes it easy to download and run different versions of Python and packages without interfering with other projects. This is not to be confused with the Python Manual folder that comes with this documentation. The environment points to all the packages needed to run the code and can be deleted, recreated, and shared without affecting any documents on the system.

Setting up a virtual environment is a three-step process:

Create a new environment

A new, empty environment will need to be created. There are many ways to create and share environments, the specifics of which are outside the scope of this manual. Detailed information can be found in the conda documentation on managing environments.

The following command will create a new virtual environment named python-manual:

conda create --name python-manual

To ensure that the environment was created, type the command conda info --envs into the prompt. A list of all virtual environments should appear, one of which should be named python-manual.

Activate the environment

The Python code included with this documentation will not work unless it is run from an active environment set up for this specific project. To activate the python-manual environment, enter the following command:

conda activate python-manual

Configure the environment & install packages

Packages must now be installed. The environment must load most of its packages from one of the more robust repositories, conda-forge, instead of the default. The only exceptions are the packages rpy2, mro-base, and r-caret; these must be installed from the R channel before any other packages are installed. Any attempt to download rpy2, mro-base, or r-caret from any other channel will cause an error in the scripts in Section 5.

The conda-forge channel seems to be the best option for the purposes of this manual. Downloading non-R packages from the conda-forge channel reduces compatibility issues between packages from different channels. See conda-forge's documentation (https://conda-forge.org) for an explanation.

If the terminal presents any errors that seem to stem from a dependency issue, it is recommended that the user create a new environment and run through all steps in this section.

  1. Download the R-related packages (N.B. For unknown reasons, issues seem to occur when r-caret is installed on the same line as the other two packages; as a result, it is installed on a separate line.):

     conda install --channel r rpy2 mro-base
     conda install --channel r r-caret
  2. Install all other required packages from the conda-forge repository:

     conda config --env --add channels conda-forge
     conda install --channel conda-forge spyder pandas fiona rasterio scikit-learn tzlocal

2.3 - Open the Spyder IDE and run the code

The Spyder IDE (Integrated Development Environment) is designed for data scientists using Python. It works within the virtual environment and must be launched from the Anaconda Prompt so that project libraries can be accessed when running code. Executing a program outside of the Anaconda Prompt in a normal code editor will result in errors due to missing libraries. This IDE operates much like RStudio - variables created in a script are saved in Spyder's global environment and are accessible through other scripts or blocks of code even if executed from a different file or directory.

To launch Spyder and start working on the project from the newly activated environment, enter the following command into the Anaconda Prompt:

spyder

Create a project

This project folder can be saved as a project in Spyder. This will cause the application to open at the last stopping point and the directory will be visible in the left-hand sidebar. To save this folder as a project:

  1. Select 'Projects' in the menu bar
  2. Click 'New Project'
  3. Create a project from an existing directory and select this entire folder.

Run 2.3_setup.py

Section 5 relies on the Python package rpy2 (described at the beginning of Section 5) in order to use functionality available in the R package caret. The caret package has many dependencies, which may change as the package is updated and more modeling functionality is added. Although all necessary dependencies were current at the time of writing, the researcher should refer to the list of available models and respective R libraries available through caret if any dependency- or library-related errors occur. Run this script first to download all necessary packages, which will then be permanently saved to the currently active conda environment.

In [ ]:
'''
    2.3 - Setup

    Please run this script before doing anything else with this project.
'''
# Import rpy2
from rpy2 import robjects

# Create a method that allows users to execute R code on the fly
r = robjects.r

r('''
    if(!require("caret")) install.packages("caret", dependencies=c("Depends", "Suggests"))
    
    # Dependencies for script 6.0
    if(!require("tidyverse")) install.packages("tidyverse")
    if(!require("raster")) install.packages("raster")
    if(!require("shapefiles")) install.packages("shapefiles")
    if(!require("rgdal")) install.packages("rgdal")
    if(!require("rlang")) install.packages("rlang")
    if(!require("rgeos")) install.packages("rgeos")
    if(!require("sf")) install.packages("sf")
    if(!require("Rtools")) install.packages("Rtools")
    if(!require("devtools")) install.packages("devtools")
    
    
    # Extra caret dependencies
    if(!require("lazyeval")) install.packages("lazyeval")
    if(!require("ModelMetrics")) install.packages("ModelMetrics")
    if(!require("RcppRoll")) install.packages("RcppRoll")
    if(!require("backports")) install.packages("backports")
    if(!require("ddalpha")) install.packages("ddalpha")
    if(!require("DEoptimR")) install.packages("DEoptimR")
    if(!require("dimRed")) install.packages("dimRed")
    if(!require("gower")) install.packages("gower")
    if(!require("spmod")) install.packages("spmod")
    
    
    # Load these additional libraries depending on methods used
    if(!require("C50")) install.packages("C50")
    if(!require("class")) install.packages("class")
    if(!require("Cubist")) install.packages("Cubist")
    if(!require("deepnet")) install.packages("deepnet")
    if(!require("earth")) install.packages("earth")
    if(!require("elasticnet")) install.packages("elasticnet")
    if(!require("foba")) install.packages("foba")
    if(!require("kernlab")) install.packages("kernlab")
    if(!require("kknn")) install.packages("kknn")
    if(!require("klaR")) install.packages("klaR")
    if(!require("MASS")) install.packages("MASS")
    if(!require("mboost")) install.packages("mboost")
    if(!require("nnet")) install.packages("nnet")
    if(!require("party")) install.packages("party")
    if(!require("pls")) install.packages("pls")
    if(!require("plyr")) install.packages("plyr")
    if(!require("randomForest")) install.packages("randomForest")
    if(!require("rpart")) install.packages("rpart")
    if(!require("xgboost")) install.packages("xgboost")
     
''')

2.4 - How to re-open this project

Open the Anaconda Prompt from the system’s applications folder and enter the following:

conda activate python-manual
spyder

Once Spyder loads, any code run in the IDE will access the libraries from the python-manual environment.

2.5 - Project folder organization

The project folder is organized to make it easier to sort through scripts, documentation, and data. The following is a representation of the downloaded directory's layout:

python-manual /
    |- Data /
        |- ProjectShapefilesAndRaster /
        |- *modelResults.txt
        |- *cmResults.txt
        |- *tTestResults.txt
        |- *outputImage.img
    |- 4.1.1_ImportClassificationTrainingData.py
    |- 4.1.2_ImportClassificationTrainingData.py
    |- ...
    |- ...
    |- Python-Manual.html

*The scripts in Sections 5 & 6 write their results and images to the Data folder.

Root folder: python-manual

All scripts are saved in this directory. When referencing a file path in the code, this folder is represented by './'.

Data folder

All data should be saved in this folder to make project management easier. If using this code for multiple projects or if testing with smaller datasets, it is advised to store all the files for that project in a unique folder. All output data (model results and confusion matrix results, see section 5) are saved to the Data folder.

Documentation

The documentation, 'Python-Manual.html', is provided in HTML format and can be opened using any browser.

2.6 - Troubleshooting the installation

The entire setup - from Anaconda installation to the loading of R packages - can be finicky. The build instructions (see section 2.2) must be strictly followed. Deviation from these instructions may cause issues that are difficult to diagnose. Beyond the initial installation, there are a few other things to consider.

Testing with a small data set to ensure proper environment setup

Python's rpy2 is still under development. As a result, issues occurring within section 5 or section 6 may not result in error messages. The program may freeze without any indication that something is wrong. The time required to complete the analyses detailed in this manual is dependent on many factors (amount of data, type of analysis, number of analyses chosen for comparison, etc.) and can be hard to anticipate. Therefore, it can be difficult to differentiate between a still-running program and one that simply needs more time. We recommend that researchers use a small subsample of data to test for functionality of all aspects, from data import to final image export, before proceeding with the full data set. This ensures that all packages were successfully installed. Package errors have been common in the development of this manual, and the following methods may assist in troubleshooting.

Systems with Anaconda already installed

The channel_priority must be set to flexible in conda's global configuration before creating this environment. This is the default setting for a first-time installation of Anaconda or Miniconda, but if this setting has been changed in the past, enter conda config --set channel_priority flexible into the Anaconda Prompt before running through the steps in section 2.2.

Package error resulting from scripts in section 4

If any errors occur while running the scripts in section 4, the error is most likely due to mistakes in the build of the current environment. This error can be avoided by copy/pasting all the commands from section 2.2 and by updating Anaconda before creating this environment. If this error does occur, delete the environment and create another using the instructions in section 2.2.

Delete a broken python-manual environment with the following conda command:

conda remove --name python-manual --all

Package errors resulting from any other script or 'frozen' script

If any errors occur in section 2.3, follow the directions in the paragraph above. This is also true if any error referring to 'spmod' is received. The error may be due to an improper installation of Python's rpy2 package. This was a common error in the development of this manual and was only solved by deleting the environment and creating a new one according to the instructions. Microsoft R Open (MRO) is the default R installation for Anaconda; using a CRAN R version seems to create complications in package installation. If MRO is installed outside of the conda environment, ensure that this version matches the version installed in the python-manual environment. Incompatible versions may result in erroneous file paths to packages or libraries. To ensure that all R-related packages are compatible with MRO, install all R-related packages from the 'r' channel (see section 2.2).

If errors occur in section 5 or section 6, run the script from section 2.3 a couple of times. For unknown reasons, rpy2 may not install all of the packages on the first run of the script. This has been the most frequent issue during the development of this manual. While the script from section 2.3 runs, messages appear naming the packages being installed. The simplest solution is to continue re-running the script until these messages stop appearing.

3 - Using Python for image analysis


3.1 - Image preparation

This manual makes use of functionality from the Python package 'rasterio' to read in raster images. Rasterio is a pythonic interface to the C/C++ package 'GDAL', which supports a vast number of raster drivers. Visit GDAL's documentation pages to see a full list of acceptable image types. As in the R version of this manual, it is assumed that the researcher has combined all input raster data into a single, multi-layer file (we commonly use the .img file format).

3.1.1 - Handling 'no data' values

Unlike R, rasterio reads raster images directly in their native data type (e.g., unsigned 8-bit integer (uint8), signed 16-bit integer (int16)). This necessitates that 'no data' values are carefully handled. If no 'no data' value is specified in the image's metadata, rasterio defaults to using 0 as the value to represent 'no data'. This can be problematic if 0 is a valid data value, as rasterio will ignore these cells, which may result in a processing error because output lengths by band will not match. The researcher must enter a valid 'no data' value for their data type (e.g., uint8 accepts values 0-255; uint16 accepts values 0-65,535). For non-rectangular image study areas requiring a background value to represent areas outside the study area, the researcher may set this value before bringing the image into Python. For example, in ESRI's ArcMap software, when dealing with file-based rasters that use the full bit range, a common practice is to set the 'no data' value to a value outside the data range. If the data type were uint8, the researcher might set the 'no data' value to 256; ArcMap promotes this file to the uint16 data type in order to accept a value of 256 (Environmental Systems Research Institute, Inc., 2016). We emphasize that this type of processing should occur before bringing data into Python. This manual allows the researcher to specify 'no data' values in the configuration sections (explained in detail throughout) and uses the default value of 255 if 'no data' values are left unspecified.
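As a quick sanity check before processing, the researcher can inspect an image's native data types and declared 'no data' value with rasterio. This is a sketch; replace [image] with the image name as elsewhere in this manual.

In [ ]:
# Inspect the native data types and declared `no data` value of a raster
import rasterio

img = rasterio.open('./Data/[image]')
print(img.dtypes)  # native data type of each band, e.g. ('uint8', 'uint8')
print(img.nodata)  # declared `no data` value, or None if unspecified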

3.1.2 - Controlling the cell extraction method

Rasterio's mask function is used to extract raster cell values for input shapefiles. This function allows the researcher to control whether a pixel is included in extracted values by specifying the Boolean parameter 'all_touched'. If 'True', a pixel is included if it touches any of the input shapes at all; if 'False', a pixel is included only if the cell centroid is covered by the input shapes or is selected by Bresenham's line algorithm. The default value is 'False'.

3.1.3 - Other considerations

Although beyond the scope of this manual, there are many functions within 'rasterio' that allow the researcher to query, manipulate, and visualize images. One useful attribute is 'imagename.meta', which returns any available metadata for the entire raster, including: affine, band count, coordinate reference system, driver (file type), data type, height and width, 'no data' value, and transformation.
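For example, this metadata can be printed for an open raster as follows (replace [image] with the image name):

In [ ]:
# Print the available metadata for the entire raster
import rasterio

img = rasterio.open('./Data/[image]')
print(img.meta)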

As stated in the R manual, it is necessary that the researcher ensure both training and validation data have all possible values represented for categorical data analyses. The example given is an aspect layer including categories for north, west, south, east and flat. All five categories should be represented in both training and validation data - failure to do so results in an inability to generate an accuracy assessment confusion matrix or an output predicted image.
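A quick way to confirm this for the class column is sketched below (assuming the 'train_full' and 'valid_full' DataFrames produced in section 4); the same comparison can be applied to any categorical column.

In [ ]:
# Check that the same set of categories appears in both data sets
# (a sketch; assumes `train_full` and `valid_full` from section 4)
train_levels = set(train_full['type'])
valid_levels = set(valid_full['type'])

# Report categories present in one data set but missing from the other
print('Missing from training:', valid_levels - train_levels)
print('Missing from validation:', train_levels - valid_levels)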

4 - Importing Data


4.1 - Importing Classification Training Data

4.1.1 - Shapefiles, one shapefile for each class

Each shapefile should be named with the name of the desired output class, followed by "train" (e.g., foresttrain.shp).

Using the code file "4.1.1_ImportClassificationTrainingData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'all_touched', 'no_data_value', and 'img'. The argument 'train_files' does not need to be edited, as long as the researcher has named each shapefile with the name of the desired output class, followed by "train" (e.g., watertrain.shp). File paths for 'img' and 'train_files' do not need to be altered as long as all data have been placed in the project folder named 'Data' as directed in section 2.5 of this manual, Project folder organization. Explanations for each input are notated below in lines beginning with “#”. These comments describe the functionality of the step following the code but do not call any code. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the shapefile and raster data, extracts all raster band data for shapefile areas, and returns a pandas DataFrame named "train_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import glob
import re
import rasterio
from rasterio.mask import mask
import fiona
import pandas as pd
import numpy as np


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reserved value indicating `no data`
# See section 3.1 of the manual, "Image preparation"
# The issue is here if the following error is received:
# `ValueError: Length of values does not match length of index`
no_data_value = 255

# Include a pixel if it is touched at all by polygon by entering "True"
# Include a pixel only if centroid is covered by polygon by entering "False"
all_touched = False

# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract training data from each shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Get the names of the training shapefiles with pattern matching 
# `os.path.normpath()` ensures path is written correctly for this OS
train_files = glob.glob(os.path.normpath('./Data/*train.shp'))

# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive training data
train_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))


# Loop through each training file, extract data, add sequentially to `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
for file in train_files:
    # Empty DataFrame to receive train predictors data
    train_by_class = pd.DataFrame()
    
    # Read in shapefile
    with fiona.open(file, 'r') as shapefile:
        # Get list of geometries. Includes type and coordinates
        train_locations = [feature['geometry'] for feature in shapefile]
    
    
    # If point shapefile
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    if train_locations[0]['type'] == 'Point':
        # Transform train_locations into a pandas DataFrame
        # This format required for use in `img.sample()`
        train_locations = pd.DataFrame(train_locations)
        
        # Empty list to receive point values
        point_values = []
        
        # Sample raster using point coordinates. Append values to `point_values`
        for val in img.sample(train_locations['coordinates']):
            point_values.append(val)
        
        # Add `point_values` to `train_by_class`
        train_by_class = pd.DataFrame(point_values, columns=column_names)
         
        
    # If polygon shapefiles
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    else:
        # Extract cell values from the image and mask cells outside polygons.
        # Saved as a 3d numpy array
        out_image, out_transform = mask(img, train_locations, nodata=no_data_value, all_touched=all_touched)

        # For each layer in the image
        for i in range(0, img.count):
            # Extract non-masked values from the relevant layer
            band_values = np.extract(out_image[i] != no_data_value, out_image[i])
            column_name = column_names[i]
            
            # Concat a single column for band `i`
            train_by_class[column_name] = band_values
    
    
    # Append to `train_full`
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Add a column for the class type and populate with the filename minus
    # the last nine characters ('train.shp' or 'valid.shp')
    class_type = os.path.basename(file).lower()
    train_by_class['type'] = class_type[:-9]
    
    # Append `train_by_class` to `train_full`. Loop to the next shapefile
    train_full = train_full.append(train_by_class, ignore_index=True)


# Set dtypes for `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(train_full):

    if column_name == 'type':
        train_full[column_name] = train_full[column_name].astype(str)
    elif pattern.search(column_name):
        train_full[column_name] = train_full[column_name].astype(float)

4.1.2 - One shapefile with multiple classes

This shapefile must have one data column that identifies the classes (in the code below, the field is called “type”). The shapefile can be of any feature type, but different code is used for points versus any other type.

Using the code file "4.1.2_ImportClassificationTrainingData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'no_data_value', 'img', and 'fp_shapefile'. File paths for 'img' and 'fp_shapefile' do not need to be altered as long as all data have been placed in the project folder named 'Data' as directed in section 2.5 of this manual, Project folder organization. Explanations for each input are notated below in lines beginning with “#”. These comments describe the functionality of the step following the code but do not call any code. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the shapefile and raster data, extracts all raster band data for shapefile areas, and returns a pandas DataFrame named "train_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import re
import rasterio
from rasterio.mask import mask
import fiona
import pandas as pd
import numpy as np


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reserved value indicating `no data`
# See section 3.1 of the manual, "Image preparation"
# The issue is here if the following error is received:
# `ValueError: Length of values does not match length of index`
no_data_value = 255

# Include a pixel if it is touched at all by polygon by entering "True"
# Include a pixel only if centroid is covered by polygon by entering "False"
all_touched = False

# Replace the [shapefile] with the name of training shapefile
fp_shapefile = './Data/[shapefile]'

# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract training data for shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in shapefile
with fiona.open(fp_shapefile, 'r') as shapefile:
    # Get list of geometries. Includes type and coordinates
    train_locations = [feature['geometry'] for feature in shapefile]
    # # Get list of class types
    class_types = [feature.get('properties').get('type') for feature in shapefile]
    
    
# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive training data
train_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))


# If point shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if train_locations[0]['type'] == 'Point':
    # Transform train_locations into a pandas DataFrame
    # This format required for use in `img.sample()`
    train_locations = pd.DataFrame(train_locations)
    
    # Empty list to receive point values
    point_values = []
    
    # Sample raster using point coordinates. Append values to `point_values`
    for val in img.sample(train_locations['coordinates']):
        point_values.append(val)
    
    # Add `point_values` to `train_full`
    train_full = pd.DataFrame(point_values, columns=column_names)
    
    # Add a column for the class types
    train_full['type'] = class_types


# If polygon shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
else:
    # Loop through each train location
    for j, location in enumerate(train_locations):
        # Empty DataFrame to receive train predictors data
        train_by_class = pd.DataFrame()
        
        # Extract cell values from the image for this polygon only and mask
        # cells outside it. Saved as a 3d numpy array
        out_image, out_transform = mask(img, [location], nodata=no_data_value, all_touched=all_touched)
    
        # For each layer in the image
        for i in range(0, img.count):
            # Extract non-masked values from the relevant layer
            band_values = np.extract(out_image[i] != no_data_value, out_image[i])
            column_name = column_names[i]
            
            # Concat a single column for band `i`
            train_by_class[column_name] = band_values
            
        # Get location type
        location_type = class_types[j]
        
        # Add class type to `train_by_class`
        train_by_class['type'] = location_type
        
        # Append `train_by_class` to `train_full`. Loop to the next polygon
        train_full = train_full.append(train_by_class, ignore_index=True)


# Set dtypes for `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(train_full):

    if column_name == 'type':
        train_full[column_name] = train_full[column_name].astype(str)
    elif pattern.search(column_name):
        train_full[column_name] = train_full[column_name].astype(float)

4.1.3 - Csv files, one .csv file for each class

Each .csv file should be named with the name of the desired output class, followed by "train" (e.g., foresttrain.csv) and should contain the coordinates for the training data, with columns headed x and y.

Using the code file "4.1.3_ImportClassificationTrainingData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'img'. The argument 'train_files' does not need to be edited, as long as the researcher has named each shapefile with the name of the desired output class, followed by "train" (e.g., watertrain.shp). File paths for 'img' and 'train_files' do not need to be altered as long as all data have been placed in the project folder named 'Data' as directed in section 2.5 of this manual, Project folder organization. Explanations for each input are notated below in lines beginning with “#”. These comments describe the functionality of the step following the code but do not call any code. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the .csv and raster data, extracts all raster band data for all coordinate points, and returns a pandas DataFrame named "train_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import re
import glob
import rasterio
import pandas as pd


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract training data from each csv file
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Get the names of the training csv files with pattern matching 
# `os.path.normpath()` ensures path is written correctly for this OS
train_files = glob.glob(os.path.normpath('./Data/*train.csv'))

# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive training data
train_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))


# Loop through each training file, extract data, add sequentially to `train_full`
for file in train_files:
    # Empty DataFrame to receive train predictors data
    train_by_class = pd.DataFrame()
    
    # Read in csv file
    train_locations = pd.read_csv(file)
    
    # Empty list to receive point values
    point_values = []
    
    # Sample raster using point coordinates. Append values to `point_values`
    for val in img.sample(list(zip(train_locations['x'],train_locations['y']))):
        point_values.append(val)

    # Populate `train_by_class` with point_values
    train_by_class = pd.DataFrame(point_values, columns=column_names)
    
    
    # Append to `train_full`
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Add a column for the class type and populate with the filename minus
    # the last nine characters ('train.csv' or 'valid.csv')
    class_type = os.path.basename(file).lower()
    train_by_class['type'] = class_type[:-9]
    
    # Append `train_by_class` to `train_full`. Loop to the next csv file
    train_full = train_full.append(train_by_class, ignore_index=True)
    

# Set dtypes for `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(train_full):

    if column_name == 'type':
        train_full[column_name] = train_full[column_name].astype(str)
    elif pattern.search(column_name):
        train_full[column_name] = train_full[column_name].astype(float)

4.1.4 - Csv file with training data already extracted and assembled

Data can, of course, be extracted using alternative methods and programs. Data assembled outside of Python that is to be used with the code in this manual should have a column for the data from each band and a column headed "type" containing class names for each observation. The pandas package has functionality for reading in many types of data, a list of which can be found in the pandas documentation. This manual only shows an example for reading text files (a.k.a. flat files) using the function 'read_csv()'. For other file types, please see the pandas documentation.

Using the code file "4.1.4_ImportClassificationTrainingData.py", the researcher should edit the argument "train_full". The file path name does not need to be altered as long as all data have been placed in the project folder named 'Data' as directed in section 2.5 of this manual, Project folder organization. Explanations for each input are notated below in lines beginning with “#”. These comments describe the functionality of the step following the code but do not call any code. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the .csv file and returns a pandas DataFrame named "train_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import re
import pandas as pd


# Extract training data from csv
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in training data
# Replace the [csv file] with the name of training csv file
train_full = pd.read_csv('./Data/[csv file]')


# Set dtypes for `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(train_full):

    if column_name == 'type':
        train_full[column_name] = train_full[column_name].astype(str)
    elif pattern.search(column_name):
        train_full[column_name] = train_full[column_name].astype(float)

4.1.5 - Object-oriented datasets

Data preparation for object-oriented analysis will depend on the software being used to create the objects. One approach is to convert the objects to a shapefile and then to use the resulting shapefiles in a training/validation role as described earlier. Another approach would be to export the data to a text file, which can then be imported using the directions for csv files. The object-oriented approach is most useful when the landscape can be partitioned into homogeneous groupings of pixels based on an auxiliary layer, such as agricultural landscapes delineated at the field level by using a cadastral layer (Long et al., 2013).

4.2 - Importing Classification Accuracy Assessment Data

The process of importing accuracy assessment data is almost identical to that of importing training data. Code is provided in the following sections, with file names altered for validation data.

4.2.1 - Shapefiles, one shapefile for each class

Each shapefile should be named with the name of the desired output class, followed by "valid" (e.g., watervalid.shp).

Using the code file "4.2.1_ImportClassificationAccuracyAssessmentData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'has_polygons', 'no_data_value', and 'img'. The argument 'valid_files' does not need to be edited, as long as the researcher has named each shapefile with the name of the desired output class, followed by "valid" (e.g., watervalid.shp). File paths for 'img' and 'valid_files' do not need to be altered as long as all data have been placed in the project folder named 'Data'. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the shapefile and raster data, extracts all raster band data for shapefile areas, and returns a pandas DataFrame named "valid_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import re
import glob
import rasterio
from rasterio.mask import mask
import fiona
import pandas as pd
import numpy as np


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reserved value indicating `no data`
# See section 3.1 of the manual, "Image preparation"
# The issue is here if the following error is received:
# `ValueError: Length of values does not match length of index`
no_data_value = 255

# Include a pixel if it is touched at all by polygon by entering "True"
# Include a pixel only if centroid is covered by polygon by entering "False"
all_touched = False

# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract validation data from each shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Get the names of the validation shapefiles with pattern matching 
# `os.path.normpath()` ensures path is written correctly for this OS
valid_files = glob.glob(os.path.normpath('./Data/*valid.shp'))

# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive validation data
valid_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))


# Loop through each validation file, extract data, add sequentially to `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
for file in valid_files:
    # Empty DataFrame to receive valid predictors data
    valid_by_class = pd.DataFrame()
    
    # Read in shapefile
    with fiona.open(file, 'r') as shapefile:
        # Get list of geometries. Includes type and coordinates
        valid_locations = [feature['geometry'] for feature in shapefile]
    
    
    # If point shapefile
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    if valid_locations[0]['type'] == 'Point':
        # Transform valid_locations into a pandas DataFrame
        # This format required for use in `img.sample()`
        valid_locations = pd.DataFrame(valid_locations)
        
        # Empty list to receive point values
        point_values = []
        
        # Sample raster using point coordinates. Append values to `point_values`
        for val in img.sample(valid_locations['coordinates']):
            point_values.append(val)
        
        # Add `point_values` to `valid_by_class`
        valid_by_class = pd.DataFrame(point_values, columns=column_names)
        
        
    # If polygon shapefiles
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    else:
        # Extract cell values from the image and mask cells outside polygons.
        # Saved as a 3d numpy array
        out_image, out_transform = mask(img, valid_locations, nodata=no_data_value, all_touched=all_touched)

        # For each layer in the image
        for i in range(0, img.count):
            # Extract non-masked values from the relevant layer
            band_values = np.extract(out_image[i] != no_data_value, out_image[i])
            column_name = column_names[i]
            
            # Concat a single column for band `i`
            valid_by_class[column_name] = band_values
    
    
    # Append to `valid_full`
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Add a column for the class type and populate it with the filename minus
    # the last nine characters ('train.shp' or 'valid.shp')
    class_type = os.path.basename(file).lower()
    valid_by_class['type'] = class_type[:-9]
    
    # Append `valid_by_class` to `valid_full`. Loop to the next shapefile
    valid_full = valid_full.append(valid_by_class, ignore_index=True)


# Set dtypes for `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(valid_full):

    if column_name == 'type':
        valid_full[column_name] = valid_full[column_name].astype(str)
    elif pattern.search(column_name):
        valid_full[column_name] = valid_full[column_name].astype(float)

4.2.2 - One shapefile with multiple classes

This shapefile must have one data column that identifies the classes (in the code below, the field is called “type”). The shapefile can be of any feature type, but different code is used for points versus any other type.

Using the code file "4.2.2_ImportClassificationAccuracyAssessmentData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'has_polygons', 'no_data_value', 'img', and 'fp_shapefile'. File paths for 'img' and 'fp_shapefile' do not need to be altered as long as all data have been placed in the project folder named 'Data'. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the shapefile and raster data, extracts all raster band data for shapefile areas, and returns a pandas DataFrame named "valid_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import re
import fiona
import rasterio
from rasterio.mask import mask
import pandas as pd
import numpy as np


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reserved value indicating `no data`
# See section 3.1 of the manual, "Image preparation"
# The issue is here if the following error is received:
# `ValueError: Length of values does not match length of index`
no_data_value = 255

# Include a pixel if it is touched at all by polygon by entering "True"
# Include a pixel only if centroid is covered by polygon by entering "False"
all_touched = False

# Replace the [shapefile] with the name of validation shapefile
fp_shapefile = './Data/[shapefile]'

# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract validation data for shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in shapefile
with fiona.open(fp_shapefile, 'r') as shapefile:
    # Get list of geometries. Includes type and coordinates
    valid_locations = [feature['geometry'] for feature in shapefile]
    # Get list of class types
    class_types = [feature.get('properties').get('type') for feature in shapefile]
    
# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive validation data
valid_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))


# If point shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if valid_locations[0]['type'] == 'Point':
    # Transform valid_locations into a pandas DataFrame
    # This format required for use in `img.sample()`
    valid_locations = pd.DataFrame(valid_locations)
    
    # Empty list to receive point values
    point_values = []
    
    # Sample raster using point coordinates. Append values to `point_values`
    for val in img.sample(valid_locations['coordinates']):
        point_values.append(val)
    
    # Add `point_values` to `valid_full`
    valid_full = pd.DataFrame(point_values, columns=column_names)
    
    # Add a column for the class types
    valid_full['type'] = class_types


# If polygon shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
else:
    # Loop through each valid location
    for j, location in enumerate(valid_locations):
        # Empty DataFrame to receive valid predictors data
        valid_by_class = pd.DataFrame()
        
        # Extract cell values from the image for this polygon only and mask
        # cells outside it. Saved as a 3d numpy array
        out_image, out_transform = mask(img, [location], nodata=no_data_value, all_touched=all_touched)
    
        # For each layer in the image
        for i in range(0, img.count):
            # Extract non-masked values from the relevant layer
            band_values = np.extract(out_image[i] != no_data_value, out_image[i])
            column_name = column_names[i]
            
            # Concat a single column for band `i`
            valid_by_class[column_name] = band_values
            
        # Get location type
        location_type = class_types[j]
        
        # Add class type to `valid_by_class`
        valid_by_class['type'] = location_type
        
        # Append `valid_by_class` to `valid_full`. Loop to the next polygon
        valid_full = valid_full.append(valid_by_class, ignore_index=True)


# Set dtypes for `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(valid_full):

    if column_name == 'type':
        valid_full[column_name] = valid_full[column_name].astype(str)
    elif pattern.search(column_name):
        valid_full[column_name] = valid_full[column_name].astype(float)

4.2.3 - Csv files, one .csv file for each class

Each .csv file should be named with the name of the desired output class, followed by "valid" (e.g., forestvalid.csv) and should contain the coordinates for the validation data, with columns headed x and y.

Using the code file "4.2.3_ImportClassificationAccuracyAssessmentData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'img'. The argument 'valid_files' does not need to be edited, as long as the researcher has named each shapefile with the name of the desired output class, followed by "valid" (e.g., watervalid.shp). File paths for 'img' and 'valid_files' do not need to be altered as long as all data have been placed in the project folder named 'Data'. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the .csv and raster data, extracts all raster band data for all coordinate points, and returns a pandas DataFrame named "valid_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import re
import glob
import rasterio
import pandas as pd


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract validation data from each csv file
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Get the names of the validation csv files with pattern matching 
# `os.path.normpath()` ensures path is written correctly for this OS
valid_files = glob.glob(os.path.normpath('./Data/*valid.csv'))

# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive validation data
valid_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))

# Loop through each validation file, extract data, add sequentially to `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
for file in valid_files:
    # Empty DataFrame to receive valid predictors data
    valid_by_class = pd.DataFrame()
    
    # Read in csv file
    valid_locations = pd.read_csv(file)
    
    # Empty list to receive point values
    point_values = []
    
    # Sample raster using point coordinates. Append values to `point_values`
    for val in img.sample(list(zip(valid_locations['x'],valid_locations['y']))):
        point_values.append(val)

    # Populate `valid_by_class` with point_values
    valid_by_class = pd.DataFrame(point_values, columns=column_names)
    
    
    # Append to `valid_full`
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Add a column for the class type and populate with the filename minus
    # the last nine characters ('train.csv' or 'valid.csv')
    class_type = os.path.basename(file).lower()
    valid_by_class['type'] = class_type[:-9]
    
    # Append `valid_by_class` to `valid_full`. Loop to the next csv file
    valid_full = valid_full.append(valid_by_class, ignore_index=True)
    

# Set dtypes for `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(valid_full):

    if column_name == 'type':
        valid_full[column_name] = valid_full[column_name].astype(str)
    elif pattern.search(column_name):
        valid_full[column_name] = valid_full[column_name].astype(float)

4.2.4 - Csv file with accuracy assessment data already extracted and assembled

Data assembled outside of Python that is to be used with the code in this manual should have a column for the data from each band and a column headed "type" containing class names for each observation. Using the code file "4.2.4_ImportClassificationAccuracyAssessmentData.py", the researcher should edit the file path passed to 'read_csv()', which defines "valid_full". The file path does not need to be altered further as long as all data have been placed in the project folder named 'Data'. After entering the correct file name, the researcher may simply press the 'run' button to initiate the function, which reads in the .csv file and returns a pandas DataFrame named "valid_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import re
import pandas as pd


# Extract validation data from csv
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in validation data
# Replace the [csv file] with the name of validation csv file
valid_full = pd.read_csv('./Data/[csv file]')


# Set dtypes for `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if column ends with a decimal followed by a number.
# All bands columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(valid_full):

    if column_name == 'type':
        valid_full[column_name] = valid_full[column_name].astype(str)
    elif pattern.search(column_name):
        valid_full[column_name] = valid_full[column_name].astype(float)

4.2.5 - Random extraction of accuracy assessment data from imported training data

If the researcher has a single dataset intended to be split into training and validation data, this can be conducted using the [train_test_split() function](https://scikit-learn.org/stable/modules/cross_validation.html) from the [scikit-learn package](https://scikit-learn.org/stable/index.html). The researcher should have used the appropriate subsection within section 4.1 'Importing classification training data' of this manual to import data, which should now be a pandas DataFrame named 'train_full'.

Using the code file "4.2.5_ImportClassificationTrainingData.py", under the heading 'Configuration' the researcher should enter the proportion (floating point between 0 and 1) of the data intended to be used for training under the variable name 'train_size'. The default value is 0.75. If the researcher is referring to this section of the manual from section 4.3.4, the researcher should edit 'response_variable_type' to be 'amount'. Otherwise, the researcher may simply press the 'run' button to initiate the function, which will return two pandas DataFrames named "train_full" and "valid_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

Randomly splitting a single dataset in this way is common practice. The statistical appropriateness of this approach is not addressed in this manual, although analysts should be aware of potential issues related to violation of independence assumptions (e.g., spatial autocorrelation between nearby training and validation samples).

In [ ]:
# Import libraries
from sklearn.model_selection import train_test_split


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Set proportion of data to be used for training purposes, floating point 0-1
train_size = 0.75

# Enter the name of the response variable column. Should be "type" or "amount"
response_variable_type = 'type'


# Randomly split data into `train_full` and `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Pandas DataFrame for `train_full` band values
X = train_full.loc[:, train_full.columns != response_variable_type]

# Pandas DataFrame for `train_full` response variables
resp = train_full[response_variable_type]

# Using train_test_split function, these are split into four dataframes
# Rows keep original index # from `train_full`
train_X, valid_X, train_resp, valid_resp = train_test_split(
    X, resp, train_size=train_size)

# Create train and valid _full objects for use in section 5
train_full = train_X.join(train_resp)
valid_full = valid_X.join(valid_resp)

4.3 Importing continuous training and validation data

4.3.1 Point shapefile with values as an attribute

These shapefiles must have at least one field that identifies the continuous values (in the code below, the field is called 'amount').
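
Before running the import code, the researcher may wish to confirm that this field is present. A minimal check, assuming the same './Data/[shapefile]' placeholder convention used below:

In [ ]:
# Import libraries
import fiona

# Replace [shapefile] with the name of the shapefile
with fiona.open('./Data/[shapefile]', 'r') as shapefile:
    # Print the attribute fields and their types; 'amount' should be listed
    print(shapefile.schema['properties'])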

Using the code file "4.3.1_ImportingContinuousTrainingandValidationData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'fp_train_shapefile', 'fp_valid_shapefile' and 'fp_img'. File paths do not need to be altered as long as all data have been placed in the project folder named 'Data' as directed in section 2.5 of this manual, Project folder organization. Explanations for each input are notated below in lines beginning with “#”. These comments describe the functionality of the step following the code but do not call any code. After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the .csv files and raster data, extracts all raster band data for all coordinate points, and returns two pandas DataFrames named "train_full" and "valid_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import re
import rasterio
import fiona
import pandas as pd


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Replace the [shapefile] with the name of shapefiles
fp_train_shapefile = './Data/[shapefile]'
fp_valid_shapefile = './Data/[shapefile]'

# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract training data for shapefile
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in shapefile
with fiona.open(fp_train_shapefile, 'r') as shapefile:
    # Get list of geometries. Includes type and coordinates
    train_locations = [feature['geometry'] for feature in shapefile]
    # Get list of amounts (continuous response values)
    train_amounts = [feature.get('properties').get('amount') for feature in shapefile]

# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive training data
train_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))


# Extract point data
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Transform train_locations into a pandas DataFrame
# This format required for use in `img.sample()`
train_locations = pd.DataFrame(train_locations)

# Empty list to receive point values
point_values = []

# Sample raster using point coordinates. Append values to `point_values`
for val in img.sample(train_locations['coordinates']):
    point_values.append(val)

# Add `point_values` to `train_full`
train_full = pd.DataFrame(point_values, columns=column_names)

# Add a column for the amounts
train_full['amount'] = train_amounts


# Set dtypes for `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if a column name ends with a period followed by a digit.
# All band columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(train_full):

    if column_name == 'amount':
        train_full[column_name] = train_full[column_name].astype(float)
    elif pattern.search(column_name):
        train_full[column_name] = train_full[column_name].astype(float)




# Extract Validation data for shapefile
# Skip if validation data will be randomly extracted from `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in shapefile
with fiona.open(fp_valid_shapefile, 'r') as shapefile:
    # Get list of geometries. Includes type and coordinates
    valid_locations = [feature['geometry'] for feature in shapefile]
    # Get list of amounts (continuous response values)
    valid_amounts = [feature.get('properties').get('amount') for feature in shapefile]
    
# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive Validation data
valid_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))


# Extract point data
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Transform valid_locations into a pandas DataFrame
# This format required for use in `img.sample()`
valid_locations = pd.DataFrame(valid_locations)

# Empty list to receive point values
point_values = []

# Sample raster using point coordinates. Append values to `point_values`
for val in img.sample(valid_locations['coordinates']):
    point_values.append(val)

# Add `point_values` to `valid_full`
valid_full = pd.DataFrame(point_values, columns=column_names)

# Add a column for the amounts
valid_full['amount'] = valid_amounts


# Set dtypes for `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if a column name ends with a period followed by a digit.
# All band columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(valid_full):

    if column_name == 'amount':
        valid_full[column_name] = valid_full[column_name].astype(float)
    elif pattern.search(column_name):
        valid_full[column_name] = valid_full[column_name].astype(float)

4.3.2 .csv files with location and value data

The .csv file should contain three columns: the response variable (labeled 'amount' in the code below), x-coordinate (labeled 'x' in the code below) and y-coordinate (labeled 'y' in the code below).
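
A quick way to confirm that a file follows this layout before extraction (a minimal sketch, using the same './Data/[csv file]' placeholder convention as below):

In [ ]:
# Import libraries
import pandas as pd

# Replace [csv file] with the name of the csv file
check = pd.read_csv('./Data/[csv file]')

# Raise an error if any of the three required columns is missing
assert {'amount', 'x', 'y'}.issubset(check.columns), 'missing required column(s)'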

Using the code file "4.3.2_ImportingContinuousTrainingandValidationData.py", the researcher should edit the following arguments under the code block headed 'Configuration': 'fp_train_csv', 'fp_valid_csv' and 'fp_img' with the file names of the training .csv file, the validation .csv file, and the image file, respectively. File paths do not need to be altered as long as all data have been placed in the project folder named 'Data' as directed in section 2.5 of this manual, Project folder organization. Explanations for each input are notated below in lines beginning with “#”. These comments describe the functionality of the step following the code but do not call any code.After defining the correct argument values under Configuration, the researcher may simply press the 'run' button to initiate the function, which reads in the .csv files and raster data, extracts all raster band data for all coordinate points, and returns two pandas DataFrames named "train_full" and "valid_full". Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
import os
import re
import rasterio
import pandas as pd


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Replace the [csv file] with the name of csv files
fp_train_csv = './Data/[csv file]'
fp_valid_csv = './Data/[csv file]'

# Replace [image] with name of image
fp_img = './Data/[image]'


# Extract training data from each csv file
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive training data
train_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))

# Read in csv file
train_location = pd.read_csv(fp_train_csv)

# Empty list to receive point values
point_values = []

# Sample raster using point coordinates. Append values to `point_values`
for val in img.sample(list(zip(train_location['x'],train_location['y']))):
    point_values.append(val)

# Populate `train_full` with `point_values`
train_full = pd.DataFrame(point_values, columns=column_names)

# Add amounts to `train_full`
train_full['amount'] = train_location['amount']
    

# Set dtypes for `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if a column name ends with a period followed by a digit.
# All band columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(train_full):

    if column_name == 'amount':
        train_full[column_name] = train_full[column_name].astype(float)
    elif pattern.search(column_name):
        train_full[column_name] = train_full[column_name].astype(float)



# Extract Validation data from each csv file
# Skip if validation data will be randomly extracted from `train_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read in raster img
img = rasterio.open(fp_img)

# Empty DataFrame to receive Validation data
valid_full = pd.DataFrame()


# Create list of column names. These must match the layer names in the image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create an object to receive the column names
column_names = []

# Get filename without path or extension
base = os.path.basename(img.name)
filename = os.path.splitext(base)[0]

# For each band in raster, append a name to `column_names`
for i in range(1, img.count+1):
    column_names.append('%s.%i' % (filename, i))

# Read in csv file
valid_location = pd.read_csv(fp_valid_csv)

# Empty list to receive point values
point_values = []

# Sample raster using point coordinates. Append values to `point_values`
for val in img.sample(list(zip(valid_location['x'],valid_location['y']))):
    point_values.append(val)

# Populate `valid_full` with `point_values`
valid_full = pd.DataFrame(point_values, columns=column_names)

# Add amounts to `valid_full`
valid_full['amount'] = valid_location['amount']
    
    
# Set dtypes for `valid_full`
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# See if a column name ends with a period followed by a digit.
# All band columns are named in this fashion
pattern = re.compile(r'\.\d+$')

for column_name in list(valid_full):

    if column_name == 'amount':
        valid_full[column_name] = valid_full[column_name].astype(float)
    elif pattern.search(column_name):
        valid_full[column_name] = valid_full[column_name].astype(float)

4.3.3 .csv files with data already extracted and assembled

Follow the instructions for importing classification data in this format. The response variable column should be labeled 'amount'.
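
Because the response variable here is continuous, the one difference from the classification version is that the 'amount' column should be cast to float rather than string, so that R reads it as 'numeric'. A minimal sketch, assuming the same './Data/[csv file]' placeholder convention used elsewhere in this manual:

In [ ]:
# Import libraries
import re
import pandas as pd

# Read in pre-assembled data. Replace [csv file] with the name of the csv file
train_full = pd.read_csv('./Data/[csv file]')

# Band columns end with a period followed by a digit
pattern = re.compile(r'\.\d+$')

# Cast the response and all band columns to float
for column_name in list(train_full):
    if column_name == 'amount' or pattern.search(column_name):
        train_full[column_name] = train_full[column_name].astype(float)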

4.3.4 Random extraction of validation data from imported training data

Follow the instructions for random extraction of accuracy assessment data from imported classification training data.

5 - Agnostic Classification and Regression Analyses


This section uses Python's [rpy2 package](https://rpy2.readthedocs.io/en/version_2.8.x/). After download, installation and set-up are complete, this package allows the researcher to make use of R code already written for this manual (see section 1.1 of the R portion of this manual).

The Python portion of this manual uses the rpy2 library in a very basic way. For a deeper look into rpy2, please read the documentation. Essentially, this package allows instances of R and Python to run in parallel, passing information from Python to R and vice versa. R code can also be run independently of Python. While the code is synchronous (one line cannot execute until the previous line has finished), it is important to note that neither instance is aware of variables defined in, or functions native to, the other language unless the user explicitly shares them.

Python can access variables from the R instance by accessing R's global environment: foo = robjects.globalenv['foo']. A Python object can be saved to R's environment by typing the reverse: robjects.globalenv['bar'] = "Hello from Python".

R code can be executed by inserting a string of R code into the function r(): r('foo <- "Hello from R"'). Anything wrapped in the r() function, or passed to a function stemming from robjects, should be written in the R language, not Python.

Entire R functions can be saved as Python objects and executed in a pythonic fashion, as can be seen through the use of caret.train(). R libraries are imported into Python as objects, and their functions accept the same arguments as in R. Some differences between the languages make this style of coding prone to errors, most notably R's use of '.' in naming conventions and arguments (e.g., the argument 'na.omit'). To get around this issue, rpy2 changes periods to underscores (e.g., 'na.omit' becomes 'na_omit'). This is not consistently effective, but is notable for the user who is exploring rpy2 in depth.
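
The cell below illustrates these basics in isolation (a minimal sketch; it assumes rpy2 and a working R installation are available, and the variable names are arbitrary):

In [ ]:
# Import libraries
from rpy2 import robjects
from rpy2.robjects.packages import importr

# Create a method that allows users to execute r code on the fly
r = robjects.r

# Execute R code from a string, then read the result back into Python
r('foo <- "Hello from R"')
foo = robjects.globalenv['foo']
print(foo[0])  # R character vectors index like Python sequences

# Save a Python object into R's global environment and use it from R
robjects.globalenv['bar'] = "Hello from Python"
r('print(bar)')

# Import an R library as a Python object and call its functions pythonically.
# Periods in R names become underscores (e.g., R's 'as.data.frame' is
# accessed as base.as_data_frame)
base = importr('base')
print(base.paste('joined', 'in', 'R', sep='-')[0])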

5.1 - Model Building

5.1.1 Classification analyses

Mirroring the R portion of this manual (see section 5.1.2), the code below shows model comparison for random forest, support vector machines (linear, radial and polynomial kernels), classification tree analysis, C5.0, k-nearest neighbors, extreme gradient boosting and linear discriminant analysis/maximum likelihood. All models available through the caret package, their respective libraries and model values, and the type of analysis conducted by each model are listed in section 6, 'Available models', of the caret package documentation on GitHub. See the PDF version of the caret package documentation for a detailed description of all functions available in caret.

To set up the model of interest and select the methods of analysis, the researcher should specify the following arguments in the configuration section of the code file "5.1.1_CompareClassificationModels.py": 'equation' and 'methods'. For the 'equation' argument: R is very particular regarding data types, and re-configuring data types after transferring data from Python to R can create complications. It is safest to specify the variables intended for use in the model in Python before beginning work in rpy2. Therefore, the code in section 4 automatically reads all values extracted from imagery as numpy 'float' values, which are read as 'numeric' by R. The code reads the 'type' column as a Python 'string' whether it contains only numeric values, only character values, or a mix of the two; R then reads this column as a 'factor'. If the researcher intends to use all extracted columns of data as explanatory variables, the equation would appear:

'type~.'

If the researcher would like to use only certain columns of data as explanatory variables, this needs to be specified. The researcher can either do this in Python after bringing in training and validation data (see Pandas documentation on altering data frames) and then use the above equation format, or these variables can be specified by column name in R language. For example, if the researcher intended to use only three explanatory variables (columns labeled Band1, Band2, Band3) the equation would appear:

'type~Band1+Band2+Band3'

If the researcher has imported already extracted and assembled data, data types can be converted in Python after import (see Pandas documentation on converting data types) in the manner implemented in this manual, ensuring that categorical and continuous data are read correctly by R. This is also the case for explanatory variables that are meant to be read as non-numeric values. Although forced conversion between data types can be completed using functions available in R, depending on the original Python data type this can sometimes yield unexpected results. We recommend pre-processing data types in Python as described above, for example as shown below. For a more advanced programmer there are, of course, many alternative methods of ensuring data types are read correctly.
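
For example, to restrict the model to three explanatory variables and pre-set the dtypes in Python before the rpy2 step (the column names 'Band1', 'Band2' and 'Band3' are hypothetical placeholders):

In [ ]:
# Keep only the three explanatory columns and the response column
# ('Band1', 'Band2' and 'Band3' are hypothetical placeholders)
keep = ['Band1', 'Band2', 'Band3', 'type']
train_full = train_full[keep].copy()
valid_full = valid_full[keep].copy()

# Ensure R reads the bands as 'numeric' and the response as a 'factor'
for df in (train_full, valid_full):
    for band in ('Band1', 'Band2', 'Band3'):
        df[band] = df[band].astype(float)
    df['type'] = df['type'].astype(str)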

The methods argument should list, in quotations, all models intended for comparison. These models should be listed by their shorthand representation, or 'model value'.

The researcher may now simply press the 'run' button to initiate the function, which will export model diagnostic summaries for individual models to the text file 'modelResults', and confusion matrices with overall and class accuracies for model comparison to the text file 'cmResults'. Both of these files will be automatically exported to the Data folder. Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
from rpy2 import robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

# Activate pandas2ri to allow Dataframes to be translated between
# languages on the fly
pandas2ri.activate()

# Create a method that allows users to execute r code on the fly
r = robjects.r


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Equation for use in model_train
equation = robjects.Formula('type~.')

# List of methods to compare
methods = ["rf", "rpart", "svmLinear", "svmRadial", "svmPoly", "C5.0", "kknn", "xgbTree", "lda"]


# Compare classification models
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reference r libraries in python
caret = importr('caret')
base = importr('base')
stats = importr('stats')
utils = importr('utils')

# DataFrame containing explanatory variables for use in stats.predict()
new_data = valid_full.loc[:, valid_full.columns != 'type']

# Format response variables for use in caret.confusionMatrix()
cm_ref = robjects.vectors.FactorVector(valid_full['type'])


# For each method
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
for m in methods:
    # Variable in R to make it easier to visually separate results appended below
    robjects.globalenv['m_id'] = m
    
    
    # Append model results to "modelResults.txt"
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    train_model = caret.train(equation, method=m, data=train_full)
    
    # Insert `train_model` into r's environment
    robjects.globalenv['r_train_model'] = train_model
    
    # Write output to modelResults.txt using `r()`
    r('utils::capture.output(m_id, r_train_model, file="./Data/modelResults.txt", append=TRUE)')
    
    
    # Append confusion matrix results to "cmResults.txt"
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    predict_model = stats.predict(train_model, newdata=new_data)
    results = caret.confusionMatrix(reference=cm_ref, data=predict_model)
    
    # Insert `results` into R's environment
    robjects.globalenv['r_results'] = results
    
    # Write output to cmResults.txt using `r()`
    r('utils::capture.output(m_id, r_results, file="./Data/cmResults.txt", append=TRUE)')

5.1.2 Regression analyses

Mirroring the R portion of this manual (see section 6.1.2), the code below shows model comparison for random forest, support vector machines (linear, radial and polynomial kernels), classification tree analysis, neural networks, k-nearest neighbors, partial least squares, ridge regression, cubist, conditional inference random forest, generalized linear model with stepwise AIC feature selection, and multivariate adaptive regression splines. All models available through the caret package, their respective libraries and model values, and the type of analysis conducted by each model are listed in section 6, 'Available models', of the caret package documentation on GitHub. See the PDF version of the caret package documentation for a detailed description of all functions available in caret.

To set up the model of interest and select the methods of analysis, the researcher should specify the following arguments in the configuration section of the code file "5.1.2_CompareRegressionModels.py": 'equation' and 'methods'. For the 'equation' argument: R is very particular regarding data types. Re-configuring data types after transferring data from Python to R language can create complications. It is the safest practice to specify variables intended for use in the model in Python before beginning work in rpy2. Therefore, code in section 4 automatically reads all values extracted from imagery and the 'amount' column as numpy 'float' values, which are read as 'numeric' by R. If the researcher intends to use all extracted columns of data as explanatory variables, the equation would appear:

'amount~.'

If the researcher would like to use only certain columns of data as explanatory variables, this needs to be specified. The researcher can either do this in Python after bringing in training and validation data (see Pandas documentation on altering data frames) and then use the above equation format, or these variables can be specified by column name in R language. For example, if the researcher intended to use only three explanatory variables (columns labeled Band1, Band2, Band3) the equation would appear:

'amount~Band1+Band2+Band3'

If the researcher has imported already extracted and assembled data, data types can be converted in Python after import (see Pandas documentation on converting data types) in the manner implemented in this manual, ensuring that categorical and continuous data are read correctly by R. This is also the case for explanatory variables that are meant to be read as non-numeric values. Although forced conversion between data types can be completed using functions available in R, depending on the original Python data type this can sometimes yield unexpected results. We recommend pre-processing data types in Python as described above. For a more advanced programmer there are, of course, many alternative methods of ensuring data types are read correctly.

The methods argument should list, in quotations, all models intended for comparison. These models should be listed by their shorthand representation, or 'model value'.

The researcher may now simply press the 'run' button to initiate the function, which will export model diagnostic summaries for individual models to the text file 'modelResults', and paired t-test results comparing observed and predicted values to the text file 'tTestResults'. Both of these files will be automatically exported to the Data folder. Although the function is explained step by step using comments, the researcher should not adjust this portion of the code.

In [ ]:
# Import libraries
from rpy2 import robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

# Activate pandas2ri to allow Dataframes to be translated between
# languages on the fly
pandas2ri.activate()

# Create a method that allows users to execute r code on the fly
r = robjects.r


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Equation for use in model_train
equation = robjects.Formula('amount~.')

# List of methods to compare
methods = ["rf", "svmLinear", "svmRadial", "svmPoly", "rpart", "nnet", "kknn", "pls", "foba", "cubist", "cforest", "glmStepAIC", "earth"]


# Compare regression models
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reference r libraries in python
caret = importr('caret')
base = importr('base')
stats = importr('stats')
utils = importr('utils')

# DataFrame containing explanatory variables for use in stats.predict()
new_data = valid_full.loc[:, valid_full.columns != 'amount']

# Format observed response values for use in stats.t_test()
observed = robjects.vectors.FloatVector(valid_full['amount'])


# For each method
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
for m in methods:
    # Variable in R to make it easier to visually separate results appended below
    robjects.globalenv['m_id'] = m
    
    
    # Append model results to "modelResults.txt"
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    train_model = caret.train(equation, method=m, data=train_full)
    
    # Insert `train_model` into r's environment
    robjects.globalenv['r_train_model'] = train_model
    
    # Write output to modelResults.txt using `r()`
    r('utils::capture.output(m_id, r_train_model, file="./Data/modelResults.txt", append=TRUE)')
    
    
    # Append paired t-test results to "tTestResults.txt"
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    predicted = stats.predict(train_model, newdata=new_data)
    results = stats.t_test(observed, predicted, paired=True)
    
    # Insert `results` into R's environment
    robjects.globalenv['r_results'] = results
    
    # Write output to tTestResults.txt using `r()`
    r('utils::capture.output(m_id, r_results, file="./Data/tTestResults.txt", append=TRUE)')

6 - Outputting a final image

The rpy2 package and the raster package's 'predict' function are used to create the final output image. Although functionality exists in rasterio to read, manipulate and write raster files, the caret package is relied upon for modeling functionality; thus, predicted values are created through R.

Using the code file '6_OutputFinalImage.py', the researcher should edit the following under the 'configuration' section: 'equation', 'method', the [image] placeholder in the raster::stack() call, 'r_format' and 'r_datatype'. In the 'equation' argument the researcher should enter the same equation used for model comparison. In the 'method' argument the researcher should insert the model value for the desired method of analysis in quotations. The default settings for 'r_format' and 'r_datatype' ('raster' and 'INT1U') output an unsigned 8-bit image in the raster package's native grid format; for an ERDAS Imagine file (.img), the format shorthand is 'HFA'. See section 5.2 of the R portion of this manual for a list of supported file formats, data types, and appropriate shorthand for coding arguments.

In [ ]:
# Import libraries
from rpy2 import robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

# Activate pandas2ri to allow Dataframes to be translated between
# languages on the fly
pandas2ri.activate()

# Create a method that allows users to execute r code on the fly
r = robjects.r


# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create your own equation for use in model_train
equation = robjects.Formula('amount~.')

# Select the method to use in caret.train
method = "rf"

# Replace [image] with the path to the image
# Import raster into R globalenv
r('img <- raster::stack("./Data/[image]")')

# Set output format
r_format = "raster"

# Set output datatype
r_datatype = "INT1U"


# Output the final image
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print('* 6 - Outputting final image')
# Reference r libraries in python
caret = importr('caret')

# Get the train model in Python and insert into R globalenv
robjects.globalenv['train_model'] = caret.train(equation, method=method, data=train_full)

# Within the R environment, create the raster and save to ./Data/outputImage
r('raster::predict(object=img, model=train_model, filename="./Data/outputImage", fun=predict, format="%s", datatype="%s", overwrite=TRUE, na.rm=TRUE)' % (r_format, r_datatype))

7 - Citations

Anaconda, Inc (2017). Conda package manager for the Anaconda Distribution. Conda version 4.7.11. URL https://conda.io/en/latest/

Anaconda, Inc (2017). Miniconda: Lightweight alternative to the Anaconda Distribution. Miniconda version <…>. URL https://conda.io/en/latest/miniconda.html

Anaconda, Inc. (2019). Open-source Anaconda Distribution enabling data scientists to use Python and R. Anaconda Distribution version 1.7.2. URL https://www.anaconda.com/distribution/

Gautier, L., Belopolsky, A., Chapman, B., Cock, P., Eddelbuettel, D., Kluyver, T., Moreira, W., Oget, L., Owens, J., Rapin, N., Slodkowicz, G., Smith, N., Warnes, G., JRI author(s), the R authors, R-help list responders, Numpy list responders, and other contributors (2016). rpy2 version 2.8.x. URL https://rpy2.readthedocs.io/en/version_2.8.x/

Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y. and Li, Y. (2019). xgboost: Extreme Gradient Boosting. R package version 0.82.1. URL https://CRAN.R-project.org/package=xgboost

Conda-Forge (2019). Conda-Forge: Package repository for conda. URL https://conda-forge.org/

Cordoba, C. and other contributors (2018). Spyder: The Scientific Python Development Environment. Conda package version 3.3.6. URL https://www.spyder-ide.org/

ESRI (2018). ArcGIS Desktop: Release 10.6. Redlands, CA: Environmental Systems Research Institute.

GDAL/OGR contributors (2019). GDAL/OGR Geospatial Data Abstraction software Library. Open Source Geospatial Foundation. URL https://gdal.org

GEOS contributors (2019). GEOS Geometry Engine Open Source. Open Source Geospatial Foundation. URL https://trac.osgeo.org/geos

Gillies, S. and other contributors. (2019). Fiona is OGR’s neat, nimble, no-nonsense API. Python package version 1.8.6. Toblerity. URL https://github.com/Toblerity/Fiona

Gillies, S. and other contributors (2019). Rasterio: geospatial raster I/O for {Python} programmers. Python package version 1.0.26. Mapbox. URL https://github.com/mapbox/rasterio

Gillies, S. and other contributors. (2019). Shapely: Manipulation and analysis of geometric objects. Python package version 1.6.4. toblerity.org. URL https://github.com/Toblerity/Shapely

Henkel, R. (2011). pyRserve: A Python client to remotely access the R statistic package via network. Release version 0.9.1. URL https://pythonhosted.org/pyRserve/

Hothorn, T., Buehlmann, P., Dudoit, S., Molinaro, A., and Van Der Laan, M. (2006). Survival Ensembles. Biostatistics, 7(3): 355--373.

Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9): 1-20. URL http://www.jstatsoft.org/v11/i09/

Kuhn, M. Contributions from Wing, J., Weston, S., Williams, A., Keefer, C., Engelhardt, A., Cooper, T., Mayer, Z., Kenkel, B., the R Core Team, Benesty, M., Lescarbeau, R., Ziem, A., Scrucca, L., Tang, Y., Candan, C., and Hunt, T. (2019). caret: Classification and Regression Training. R package version 6.0-84. URL https://CRAN.R-project.org/package=caret

Kuhn, M. and Quinlan, R. (2018). C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.2. URL https://CRAN.R-project.org/package=C50

Kuhn, M. and Quinlan, R. (2018). Cubist: Rule- And Instance-Based Regression Modeling. R package version 0.2.2. URL https://CRAN.R-project.org/package=Cubist

Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3): 18--22.

Long, J.A, Lawrence, R.L., Greenwood, M.C., Marshall, L. and Miller, P.R. (2013). Object-oriented crop classification using multitemporal ETM+ SLC-off Imagery and Random Forest. GIScience and Remote Sensing, 50(4), 418–436.

McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51-56. URL http://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf

Mevik, B., Wehrens, R., and Liland, K. (2019). pls: Partial Least Squares and Principal Component Regression. R package version 2.7-1. URL https://CRAN.R-project.org/package=pls

Milborrow, S. Derived from mda:mars by Hastie, T. and Tibshirani, R. Uses Fortran utilities by Miller, A. with leaps wrapper by Lumley, T. (2019). earth: Multivariate Adaptive Regression Splines. R package version 5.1.1. URL https://CRAN.R-project.org/package=earth

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825-2830.

Project Jupyter (2019). Jupyter notebook: A tool for creating interactive documentation capable of displaying mathematics, computations, and their rich media output. Conda package version 1.0.0. URL https://jupyter-notebook.readthedocs.io/en/stable/

Python Software Foundation (2019). Python Language Reference. Python Language version 3.7. URL https://www.python.org/

R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Schliep, K. and Hechenbichler, K. (2016). kknn: Weighted k-Nearest Neighbors. R package version 1.3.1. URL https://CRAN.R-project.org/package=kknn

Strobl, C., Boulesteix, A., Zeileis, A., and Hothorn, T. (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8(25). URL http://www.biomedcentral.com/1471-2105/8/25.

Strobl, C., Boulesteix, A., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9(307). URL http://www.biomedcentral.com/1471-2105/9/307.

Therneau, T. and Atkinson, B. (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. URL https://CRAN.R-project.org/package=rpart

van der Walt, S., Colbert, C. and Varoquaux, G. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13: 22-30. DOI:10.1109/MCSE.2011.37

Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1): 1-29. URL http://www.jstatsoft.org/v40/i01/.

Venables, W. and Ripley, B. (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

Zhang, T. (2008). foba: greedy variable selection. R package version 0.1.