Mildred Hado, Noah’s Ark Folk Art Mixed Media Watercolor, Acrylic Painting


In this capstone module, you’ll gets lots of practice using Tidyverse tools on a historical voting of the UN General Assembly via Datacamp. Once you’ve completed this work, you’ll wrap-up the bootcamp by applying your data science skills outside of Datacamp to analyze data from the US Fish and Wildlife Service (USFWS) Law Enforcement Management Information System (LEMIS) data on wildlife and wildlife product imports into the United States. This data was obtained via Freedom of Information Act (FOIA) requests by EcoHealth Alliance and contains over 5.5 million observations (rows) of 28 variables (columns), showing the flow of live wildlife and wildlife products into the USA at every major port city from 2000-2014.

An analysis of a portion of this large dataset was published by EcoHealth Alliance here.

What you need to do

  1. First, complete the Datacamp assignment: Exploratory Data Analysis in R
  2. You’ll need to download this cleaned up version of the LEMIS data
  3. Once the download is complete, unzip the file, and read the data into your R environment. Don’t try to open the file in Excel….you’ll regret it. Also, do not try to View() the object after importing into R….it’s simply too large.

To help you get started, here’s a bit of code to read in the data and explore the variables. With the data in your R environment, you’ll then want to use dplyr and ggplot to complete each of the tasks below. Tip: the magrittr pipe (%>%) is your friend.

lemis <- read_delim("lemis_cleaned.tsv")

#take a minute to explore this huge dataset

#with large datasets like this, it's useful to know how to see the 'levels' for any variable of interest

Using Tidyverse tools, do your best to come up with answers to the following questions related to the LEMIS dataset. Reach out to the TAs if you need help.

  1. Identify the most common (by ‘quantity’) live mammal taken from the wild for import into the US.
  2. Building on your analysis above, produce a plot showing live mammals (use ‘generic_name’) imported for the purposes of science/research. (tip: use geom_col() in ggplot for this). Feel free to play around with different themes to make your plot more exciting.
  3. Identify the countries from which we import the most macaques (again, a simple plot will suffice).
  4. Using the same approach as above, create a plot showing the countries from which we import live bats.
  5. For what purposes do we import bats?
  6. How does the type of bat (use ‘specific_name’) imported differ between countries (hint: use facet_wrap in your ggplot code)?
  7. Identify the most expensive (by ‘value’) shipment of live mammals to enter the US.
  8. How does the answer above compare with the most expensive shipment of any kind (live or not)?
  9. You are alerted to a concerning new viral disease of humans that is believed to originate from Fruit bats (though the exact type of fruit bat is not clear). Identify the US cities that would would be most likely to be exposed to such a virus from the import of live fruit bats.
  10. A recent case of Anthrax in NYC was traced back to a contaminated Wildebeest hide that was stretched and used to make a traditional drum. Through which port(s) did this animal product most likely enter the country? (note: this actually happens).