CS 380 Lab 2: Species Distribution Modeling Using MaxEnt (Part I)

The distribution of each species is determined by a combination of factors, including climate, resources, and dependence on other species. This unique combination of factors determines where different species can live successfully. Even if a species could survive in a particular climate and habitat, it may not have the resources to survive or reproduce there. Consider the following example by Park Williams (UCSB):

Consider Joshua trees, which are confined to elevations between 400-1800 m (2,000-6000 ft.) in the Mojave Desert. To grow any lower than 400 m or any further south than the Mojave Desert would be suicide by drought. However, they grew lower in elevation and further south during the cooler and wetter climate of the last glacial period. This means that Joshua trees expand their range when they can. So why don’t they live in coastal southern California? If a Joshua tree could take a coastal vacation, it would likely find the climate to be ideal for growth. However, it would never reproduce. To reproduce, Joshua trees depend on a variety of yucca moth that is genetically programmed for stuffing a little ball of pollen into the cup-shaped stigma of Joshua tree flowers. This relationship is mutually vital for both plant and moth, and for a complexity of reasons that are not fully understood, the Mojave Desert is where these two species have been stuck with each other since the beginning.

Our first guest speaker touched on the challenges of species distribution modeling from both macro and micro perspectives. In that presentation, we also saw some of the problems with MaxEnt for prediction of species distributions. In this lab, you will examine the effects of climate and climate change on the distributions of several species of tree, and then use climate and species-range data to construct computational models of species distribution using MaxEnt. This lab will be split over two weeks, and will augment our in-class discussions on MaxEnt and species distribution modeling. You may (and should if at all possible) work with a partner on this lab. You should submit only one copy of the lab write-up for each group.

Examining Species Distributions

All files needed for this lab are available in ~eeaton/public/cs380/lab2/.

Examine the maps of California in ~eeaton/public/cs380/lab2/climate-maps.pdf, each of which depicts a single climate variable, such as mean annual temperature, mean diurnal temperature range, or mean precipitation during the coldest quarter of the year. Overlaid on each climate map are the ranges of six species: bigcone Douglas fir (Pseudotsuga macrocarpa), Bishop pine (Pinus muricata), blue oak (Quercus douglasii), Jeffrey pine (Pinus jeffreyi), coast redwood (Sequoia sempervirens), and giant sequoia (Sequoiadendron giganteum).

Since later you will be placing graphics into your write-up, it would be easiest to work in a format that supports this, such as LaTeX, MS Word, or OpenOffice. Briefly answer the following questions, typing up your answers:

Examine the first map (BIO1: Annual Average Temperature). Which species appears to survive best in cold temperatures?
The third map (BIO3: Isothermality) compares the day-to-night temperature oscillation with the summer-to-winter temperature oscillation. A value of 100 would represent a site where the diurnal temperature range is equal to the annual temperature range. A value of 50 would indicate a location where the diurnal temperature range is half of the annual temperature range. Which region has the highest isothermality (that is, the most similar diurnal and annual temperature ranges)? What is a species that appears to grow well in a highly isothermal environment? What is a species that grows across a range of isothermalities?
Examine all 20 maps and choose two species to focus on for the rest of the assignment. Note the two species you chose, and examine how the distribution of each relates to the various climate variables. Answer the remaining questions in this section based on your two chosen species.
Does either of your species appear to be confined to regions with cool summers?
Does a spatial pattern in annual rainfall appear to correspond with the boundaries of any species’ range? Or is rainfall only important during a specific time of year?
For each of your two species, which two climate variables do you hypothesize to be the most important in dictating that species’ distribution? Why have you chosen each climate variable?
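
The isothermality index described in the BIO3 question above can be written as a one-line calculation. The sketch below assumes the index is simply the diurnal range divided by the annual range, scaled to 100, which matches the two examples given (equal ranges give 100, half gives 50). The function name and the sample temperatures are illustrative and are not taken from the lab's data files.

```python
def isothermality(diurnal_range_c, annual_range_c):
    """Return diurnal range as a percentage of annual range (both in deg C)."""
    if annual_range_c <= 0:
        raise ValueError("annual temperature range must be positive")
    return 100.0 * diurnal_range_c / annual_range_c


# Hypothetical sites: a mild coastal site and a strongly seasonal inland site.
coastal = isothermality(diurnal_range_c=9.0, annual_range_c=15.0)
inland = isothermality(diurnal_range_c=14.0, annual_range_c=35.0)
print(f"coastal: {coastal:.0f}, inland: {inland:.0f}")
```

A site with small seasonal swings relative to its daily swings scores higher, which is why coastal locations tend to be more isothermal than inland ones.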

Learning a Computational Model of Species Distributions using MaxEnt

MaxEnt was developed in a collaboration between machine learning researchers and a biologist (emphasizing the interdisciplinary nature of computational sustainability) in 2004. It is a recent contribution from computer science / artificial intelligence that is now used widely (as we saw in the first guest lecture) by biologists and ecologists.

To learn the species distribution models, MaxEnt takes two inputs: (1) a file containing exact locations where a species of interest is known to grow and (2) a file containing climate data for each of those locations. By evaluating the climate data at each location where the species of interest is present, MaxEnt calculates a probability function that describes the chances of a tree location having any given climate setting. So if we were studying Joshua trees, MaxEnt would predict that if a Joshua tree is growing in a given location, there is a high probability that that location is hot rather than cold during summer. Next, MaxEnt flips this probability function around to predict the probability of species presence given a particular climate type. Therefore, MaxEnt would predict a high likelihood of Joshua tree presence in locations that are hot during summer and a low likelihood of presence in locations that are cold during summer. While this example focused on only one climate variable, MaxEnt generates the model and predicts the presence likelihood using multiple climate variables. We will examine precisely how MaxEnt learns the model next week in class.
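
The "flip" described above is an application of Bayes' rule. The following is a minimal sketch of that single step for one climate variable with two classes; it is not how MaxEnt estimates its model internally (that is covered in class), and every probability below is a made-up toy number rather than a value from the lab data.

```python
def presence_given_climate(p_climate_given_presence, p_climate, p_presence):
    """Apply Bayes' rule to get P(presence | climate) for each climate class.

    P(presence | c) = P(c | presence) * P(presence) / P(c)
    """
    return {
        c: p_climate_given_presence[c] * p_presence / p_climate[c]
        for c in p_climate
    }


# Toy numbers (assumptions): summer climate at sites where the tree grows,
# summer climate over the whole study region, and overall tree prevalence.
p_summer_given_tree = {"hot": 0.9, "cold": 0.1}
p_summer = {"hot": 0.3, "cold": 0.7}
p_tree = 0.1

posterior = presence_given_climate(p_summer_given_tree, p_summer, p_tree)
for climate, prob in posterior.items():
    print(f"P(tree | {climate} summer) = {prob:.3f}")
```

With these toy inputs, hot-summer locations come out far more likely to hold the species than cold-summer locations, mirroring the Joshua tree example.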

The data required by MaxEnt is included in the following three folders under ~eeaton/public/cs380/lab2/:

environmentBaseTemp: the 20 climate parameters depicted in climate-maps.pdf
environmentIncrTemp: the 20 climate parameters, but with a uniform increase of 4°C
speciesData: the species presence locations

The file ~eeaton/public/cs380/lab2/variables.pdf contains textual descriptions of each of the climate parameters.

To learn a computational model for the distribution of each of your two species: (Read these directions completely before you build your first model!)

1. Run the MaxEnt program: java -Xmx512M -jar ~eeaton/public/cs380/lab2/maxent/maxent.jar
2. Input the file containing the presence locations for your first species. Load this into the "Samples" section of the MaxEnt program.
3. Load the climate parameters into the "Environmental Layers" section of the MaxEnt program by selecting the environmentBaseTemp folder.
4. For this first species, you identified two climate parameters as potentially important in determining the species' distribution. Choose one of these climate parameters and select only the environmental layer corresponding to that parameter. (Hint: use the "Deselect All" button to make the process go more quickly.)
5. Select the options for "Create response curves" and "Make pictures of predictions."
6. Create a new folder for the MaxEnt output in your own directory space, named according to the species name and environmental variable you're testing (e.g., jeffpine_annualprecip). Select this folder for the "Output Directory."
7. Run the model.
8. Repeat steps 2-7 for each of your two species, testing only one climate variable each time. At the end of this step, you should have four output models (two for each of your two species).
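
Since the procedure above produces four separate runs, it can help to create the four output folders up front so that none of the runs overwrites another. The sketch below only builds folder names following the species_variable convention and creates the directories; it does not drive MaxEnt itself, which is still run through its graphical interface. The species and variable tags and the "maxent_output" base folder are placeholders, so substitute your own choices.

```python
from pathlib import Path

# Launch command quoted from the lab instructions; it opens the MaxEnt GUI.
MAXENT_CMD = "java -Xmx512M -jar ~eeaton/public/cs380/lab2/maxent/maxent.jar"


def plan_runs(choices):
    """Return one output-folder name per (species, variable) pair."""
    return [
        f"{species}_{variable}"
        for species, variables in choices.items()
        for variable in variables
    ]


def make_output_dirs(base, names):
    """Create each output folder under base and return the created paths."""
    paths = []
    for name in names:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


# Placeholder tags (assumptions): two species, two hypothesized variables each.
choices = {
    "jeffpine": ["annualprecip", "annualtemp"],
    "blueoak": ["annualprecip", "isothermality"],
}
run_names = plan_runs(choices)

if __name__ == "__main__":
    for path in make_output_dirs("maxent_output", run_names):
        print(f"Output Directory for one run: {path}")
    print(f"Launch MaxEnt with: {MAXENT_CMD}")
```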

Each output folder will contain a .html webpage that summarizes the model's information, including the predicted species distribution overlaid on a map and several performance curves. Cooler colors (blue/green) indicate areas where the model calculates a low probability of species presence, and warmer colors (red/yellow) indicate areas where the model calculates a higher probability of species presence. White squares indicate the locations specified in your species presence file. For the response curve (middle figure), the x-axis represents a range of climate values (in this case the annual precipitation in mm) and the y-axis indicates the probability of finding the species of interest in an area with any given annual precipitation. So, the response curve below indicates that Jeffrey pine trees are most likely present in areas with an annual precipitation greater than 600 mm.

Figures (left to right): Jeffrey pine annual precipitation prediction map, predicted response curve, and ROC curve.

The rightmost figure depicts the receiver operating characteristic (ROC) curve for the model, which is essential for evaluating the model's quality. You can summarize a model's performance by taking the area under the ROC curve (AUC). Notice that the ROC curve lies in the unit square, so a perfect model would have the red line pass through the upper-left corner (coordinates (0,1)) and would have an area of 1 (although this is seldom achieved). The black line indicates the performance of random guessing, which has an area of 0.5. If the red line falls below (to the right of) the black line, we could have done better simply by guessing at random. The AUC for this example is 0.86, as noted in the graph's legend, so, loosely speaking, this model is 86% accurate on the known presence data (this is called the "training accuracy" of the model). Note that this does NOT guarantee that the model is 86% accurate on unseen data; in fact, the training accuracy is often a poor indication of general model performance (called "test accuracy" or "generalization accuracy").
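
To make the area under the ROC curve concrete, the sketch below computes it directly from model scores. It relies on a standard equivalence: the AUC equals the fraction of (presence, background) pairs in which the presence location receives the higher score, with ties counted as half. The scores are invented for illustration and are not MaxEnt output; the function name is also an assumption.

```python
def auc(presence_scores, background_scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Counts the pairs where a presence site outscores a background site
    (ties count half) and divides by the total number of pairs.
    """
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Invented scores: predicted suitability at known presence sites
# and at randomly sampled background sites.
presence = [0.9, 0.8, 0.7, 0.4]
background = [0.6, 0.3, 0.2, 0.1]
print(f"AUC = {auc(presence, background):.4f}")
```

A model that scores every presence site above every background site gets an AUC of 1, while a model that assigns the same score everywhere gets 0.5, matching the black random-guessing line.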

To complete your analysis:

Paste and label all four maps, response curves, and ROC curves in your lab write-up. For each response curve, construct a sentence or two that describes the plotted relationship.
For each model, identify one area where it overpredicted the probability of species presence. Why do you think this occurred?
What do the ROC curves tell you about each of these models? Explain in a sentence or two.
For each species, what is at least one other climate variable that you think you could add to improve the model’s performance? What would a successful model look like?

Place a hard copy of your lab write-up (just one per group, please!) in the submission box outside my office (Park 249) by Thursday, Sept. 23rd at 2:30pm (class time), or submit it in class at that time. Be sure to keep electronic copies of your lab write-up for next week's continuation of this lab. Keeping your partners around would probably be best as well. :-)

This lab is based on the Species Distribution Modeling assignment developed by Park Williams, UCSB Geography.