Corresponding lecture
There is no corresponding lecture.
Description
The three years from 2022 to 2025 marked a revolution in computer science, with the advent of OpenAI’s now-famous ChatGPT (November 30, 2022), Google’s introduction of NotebookLM (July 12, 2023), and Google’s release of the advanced thinking Large Language Model (LLM), Gemini 2.5 (July 7, 2025).
One clear place where AI and LLMs excel is in coding. The Codex LLM created by OpenAI is trained on GitHub code, and a production version of this tool is available as GitHub Copilot.
In the first half of today’s lab, I’ll demonstrate how to use Copilot as a ‘pair programmer’ directly within RStudio, allowing you to start new coding projects much more rapidly and seamlessly. You’ll then use Copilot to assist you in using Tidyverse tools to parse a very large dataset.
In the second half of lab, you’ll take the idea of a pair programmer to the extreme and use Gemini 2.5 to help you run entire analyses.
We’ll conclude lab by talking briefly about Google’s NotebookLM, and each student will be given a short assignment to complete by the start of lab next week.
What you’ll need to get started
- Google account - I’m assuming most people have one of these but if not, you’ll need one.
- Apply for free access to GitHub Copilot – During our Git and project management lab a month ago, I asked everyone to use their .edu email address to apply for the GitHub Student Developer Pack (following these step-by-step instructions), which includes free access to Copilot. This should work for everyone associated with an .edu organization, regardless of your role or position.
- Install Node.js - Node.js is an open-source JavaScript runtime environment that allows execution of JavaScript code outside of a web browser (e.g. on a cloud computer). We need this to run Gemini-CLI. Follow these instructions to install Node.js. Choose the command line option for installation. On a Mac, just choose macOS. On Windows, choose Linux and install via WSL2.
- Install and test Gemini-CLI - We’ll now use Node.js to install Gemini-CLI by running the following at the command line:
npm install -g @google/gemini-cli
You can test that the installation was successful by typing gemini at the command line. If this is your first time running Gemini, you will be asked to authenticate with your Google account. Go ahead and do this.
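If the install or first launch fails, an outdated Node.js is a common culprit. The sketch below checks the installed version against a minimum major version. The threshold of 20 is an assumption based on the Gemini-CLI README at the time of writing, and node_major and MIN_NODE_MAJOR are illustrative names, so confirm the current requirement before relying on it.

```shell
# Assumed minimum Node.js major version for Gemini-CLI; verify against the current README.
MIN_NODE_MAJOR=20

# Extract the major version number from a string such as "v20.11.0".
node_major() {
  printf '%s\n' "$1" | sed 's/^v//' | cut -d. -f1
}

if command -v node >/dev/null 2>&1; then
  major=$(node_major "$(node --version)")
  if [ "$major" -ge "$MIN_NODE_MAJOR" ]; then
    echo "Node.js $(node --version) looks new enough."
  else
    echo "Node.js $(node --version) is older than v$MIN_NODE_MAJOR; consider upgrading."
  fi
else
  echo "node was not found on PATH; install Node.js first."
fi
```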
Task 1 - using Copilot as a pair programmer
You’ve taken a job with the Centers for Disease Control and Prevention (CDC), and are asked to mine the LEMIS dataset produced by the US Fish and Wildlife Service to find answers to a range of questions related to the import of animals and animal products into US ports. To help you get started, here’s a bit of code to read in the data and explore the variables. With the data in your R environment, you’ll then want to use dplyr and ggplot to complete each of the tasks below. Tip: the magrittr pipe (%>%) is your friend.
Download this cleaned-up version of the LEMIS data. Once the download is complete, unzip the file and place it in your working directory. Do not try to open the file in Excel. The dataset is simply too large and will crash your computer.
Download this R script which has been stripped of most code, leaving only comments to serve as prompts for the Copilot AI.
Tips
- Open each one of the course scripts in RStudio, so that you can take full advantage of Copilot’s cross-tab awareness.
- With a pair programmer, you are forced to think about ‘prompt engineering’ – your code comments now become AI prompts, and the phrasing, clarity, and detail of these prompts will determine the accuracy of the code provided by the AI model.
Task 2 - using Gemini to execute complex workflows
With Node.js and Gemini installed (see above), you’re now ready to use a highly sophisticated LLM that has agentic ‘thinking/doing’ capabilities. Let’s put these abilities to the test by seeing if Gemini can carry out all the basic steps involved in RNA-seq analysis – from fetching raw fastq files, to read mapping, to differential gene expression and plotting.
Options to trick Gemini:
Before we start, we’ll come up with a few tricks to throw at Gemini to see how well it adapts to problems on the fly.
If you’re participating virtually, feel free to choose one of the 8 options below. If you’re attending this lab in person, the numbers below correspond to your table number and indicate your assigned ‘trick’ that you’re going to pull on Gemini.
1. Flying blind - Modify the study design file by removing the ‘group’ column so that Gemini won’t know which samples are from healthy controls and which are from patients with disease.
2. All a big mix up - Modify the study design file by swapping the assignments in the group column (relabel healthy as disease, and vice versa).
3. Not even in the right study - Modify the prompt to give Gemini the wrong accession number (e.g., try GSE159195).
4. Of mice and men - Build the Conda environment with the wrong species reference. Use mouse instead of human.
5. Wrong tool for the job - Omit a key piece of software from the Conda environment (e.g., Kallisto).
6. Out at homebase - Initiate Gemini while in your base Conda environment.
7. Walking the wrong path - Initiate Gemini while in the wrong directory (e.g. somewhere other than where your study design file lives).
8. Off the charts - Modify the end of the prompt to ask Gemini to produce a type of chart/graphic that is not possible given the data at hand.
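For the two tricks that edit the study design file, you can make the change in a text editor or from the command line. Here is a minimal sketch of the command-line route. It runs on a toy file, and it assumes a tab-delimited layout with a header row containing a column named group. The toy column names, sample IDs, and accession placeholders are made up, so check the real file’s delimiter and headers first. drop_column and swap_labels are illustrative helper names.

```shell
# Toy tab-delimited study design; column names and values are made up for illustration.
toy=$(mktemp)
printf 'sample\taccession\tgroup\n'           >  "$toy"
printf 'HS01\tSRR_PLACEHOLDER_1\thealthy\n'   >> "$toy"
printf 'CL01\tSRR_PLACEHOLDER_2\tdisease\n'   >> "$toy"

# Usage: drop_column COLUMN_NAME FILE
# Prints FILE without the column whose header matches COLUMN_NAME.
drop_column() {
  awk -F'\t' -v OFS='\t' -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) drop = i }
    {
      line = ""; n = 0
      for (i = 1; i <= NF; i++) {
        if (i == drop) continue
        line = (n++ ? line OFS : "") $i
      }
      print line
    }' "$2"
}

# Usage: swap_labels COLUMN_NAME LABEL_A LABEL_B FILE
# Prints FILE with LABEL_A and LABEL_B exchanged in the named column.
swap_labels() {
  awk -F'\t' -v OFS='\t' -v col="$1" -v a="$2" -v b="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; print; next }
    c && $c == a { $c = b; print; next }
    c && $c == b { $c = a; print; next }
    { print }' "$4"
}

echo "--- group column removed ---"
drop_column group "$toy"
echo "--- labels swapped ---"
swap_labels group healthy disease "$toy"
```

Both helpers write to standard output, so the original file is left untouched. Redirect the output to a new file and keep a copy of the original if you want to rerun the analysis without the trick.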
Create your working directory
Create a new working directory and navigate to it (I’m making this directory in my home (~), but you are free to set this up wherever you want)
mkdir ~/lab_06
cd ~/lab_06
Download the study design file
With your working directory set up, download this simple study design file and place it in that directory.
Create a Conda environment containing the software Gemini will need
Create a new Conda environment called diyAI with all the software that Gemini will need for this analysis.
conda create -n diyAI -c conda-forge -c bioconda \
r-base=4.4 \
bioconductor-deseq2 \
bioconductor-tximport \
bioconductor-edger \
bioconductor-limma \
bioconductor-rhdf5 \
bioconductor-ensdb.hsapiens.v86 \
sra-tools \
kallisto \
fastqc \
multiqc \
samtools \
r-tidyverse \
r-pheatmap \
r-plotly \
r-dt \
r-ggrepel \
r-cowplot
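Once the environment has been created and activated (conda activate diyAI), it can be worth confirming that the command-line tools actually resolve from PATH before handing control to Gemini. The sketch below is one way to do that. missing_tools is an illustrative helper name. The executable names are the ones these packages normally install (prefetch and fasterq-dump from sra-tools, Rscript from r-base), but treat the list as an assumption and adjust it if your table’s trick changes the environment.

```shell
# Print the name of every command in the argument list that is NOT on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Executables expected from the diyAI environment (assumed names; edit as needed).
missing=$(missing_tools prefetch fasterq-dump kallisto fastqc multiqc samtools Rscript)

if [ -z "$missing" ]; then
  echo "All expected tools were found on PATH."
else
  echo "Missing from PATH:"
  echo "$missing"
fi
```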
Launching the Gemini-CLI
Activate your newly made Conda environment and then start Gemini-CLI by running gemini at the command line.
Let’s take a minute to explore the Gemini menu using the / prompt. You should check the models available to you (/models) as well as your current usage stats (/stats).
Gemini prompt
We’re finally ready to get down to business. Paste in the following prompt and hit enter (note: your prompt may need to be modified based on instructions for your table). Also, if Gemini requests access to seemingly irrelevant directories on your computer (e.g. photos, music, etc), just choose ‘no’ and proceed.
I would like you to perform bulk RNAseq analysis. You can use the following system resources. Please apply accordingly during processing to speed up as needed:
OS: macOS, or Linux (WSL)
CPU: Apple CPU, or x86 (WSL)
CPU threads: 8 (most laptops since 2016 should have 8 threads)
I have 6 SRAs between 2 conditions I need to analyze. They are present in studyDesign.txt. Read the file, check the metadata to ensure accuracy. Use prefetch to get the sra files, then turn them into fastq files. Note, these are single-end data, so there is only one fastq file per sample. Use pigz to zip up the fastq to save space. Delete the .sra files after fastq.gz files are made.
This is just a test run, so please downsample each fastq file to 1 million reads to expedite downstream analyses.
The human kallisto reference needs to be built. You will use kallisto for alignment. When running Kallisto, be sure to save the STDOUT to a log.txt file for each sample, which will allow MultiQC to ingest this log file later. After read mapping with Kallisto is complete, perform basic analysis such as differential expression analysis using R. Make a few plots typical for RNAseq analysis. For each plot output PDF, JPG and PNG. Also output DESeq2 differential expression tables. I would also like you to perform multiqc for all data after analysis is finished.
You are operating inside an activated conda env where most of the tools and R packages needed for this analysis should already be available from path. Here you can find a list of tools available from PATH: sra-tools, kallisto, fastqc, multiqc, samtools. Also here are the available R-packages available to you: deseq2, tximport, edger, limma, tidyverse, pheatmap, plotly, dt, ggrepel, cowplot.
If you need additional tools, use "conda install -c conda-forge -c bioconda <CONDA_PKG_NAME>" to install them. The R packages are also installed via conda, so use the same formula when you need to install any additional R packages.
After you formulate a plan, check with me before you start running. Explain clearly at each step what you plan to do when you ask for permission to execute tool calls.
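The prompt above tells Gemini it has 8 CPU threads. If you would rather report what your machine actually has, a rough sketch like the one below can look it up. It assumes nproc is available on Linux/WSL and falls back to sysctl on macOS. count_threads is an illustrative helper name.

```shell
# Print the number of logical CPUs, trying Linux/WSL first and then macOS.
count_threads() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif command -v sysctl >/dev/null 2>&1; then
    sysctl -n hw.ncpu
  else
    # Conservative fallback when neither tool is available.
    echo 1
  fi
}

echo "CPU threads available: $(count_threads)"
```

You can then edit the "CPU threads" line of the prompt to match before pasting it in.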
Monitoring and reporting
Gemini is going to ask you questions frequently (because we told it to at the end of the prompt!). You’ll need to respond to each one in order for the analysis to move forward. Once Gemini has finished, please send me the log file of your chat history via Discord. To find and copy this file, run the code below. This is due before the start of lab next week.
# In terminal (Mac) or WSL (Windows), navigate to the hidden gemini directory located in your home directory
cd ~/.gemini/tmp/[YOURPROJECTFOLDERNAMEHERE]/chats
# list the files in this directory (there may only be one)
ls -l
# identify (based on date/time stamp) the .json file corresponding to your Gemini run (again, may only be one), and copy this to your desktop
cp [YOURFILENAMEHERE].json ~/Desktop
# Now use your file browser to go to your desktop, and drag and drop the file into a DM in Discord.
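If the chats directory holds several .json files, sorting by modification time saves you from reading timestamps by eye. The sketch below picks the newest one. newest_json and CHAT_DIR are illustrative names rather than anything Gemini-CLI defines, and the approach assumes file names without embedded newlines.

```shell
# Print the most recently modified .json file in a directory (prints nothing if there are none).
newest_json() {
  ls -t "$1"/*.json 2>/dev/null | head -n 1
}

# CHAT_DIR is a hypothetical variable; set it to your own chats directory.
# It defaults to the current directory.
chat_dir="${CHAT_DIR:-.}"
latest=$(newest_json "$chat_dir")

if [ -n "$latest" ]; then
  echo "Newest chat log: $latest"
else
  echo "No .json files found in $chat_dir"
fi
```

Run it from inside the chats directory, or set CHAT_DIR to that directory first, then copy the reported file to your desktop as described above.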
Tips
- After your analysis has finished, you can /exit out of Gemini. As long as you keep the project folder you were working in, you can always fire Gemini back up and pick up where you left off to expand on an analysis.
- To uninstall Gemini, simply use npm (which ships with Node.js) again:
npm uninstall -g @google/gemini-cli
Task 3 - Use AI to interact with course materials
I created a NotebookLM for the course that includes all reading material, lecture slides, videos, and code. NotebookLM is ‘source grounded’, which means it only considers the sources provided in the notebook when generating responses.
Access the course NotebookLM and explore the chat feature and the studio content I’ve created.
Each student is required to submit (to me or the TAs) FIVE questions and the responses received by the end of the course.
Tips
- Anytime you have questions about material we’ve covered in class, you can simply ask them in the NotebookLM chat (it’s private). If you want the responses to be focused on specific materials (e.g. a particular lecture or paper), simply select the relevant sources in the left-hand panel and then ask your question in the chat.
- Use the flashcard and quiz features of NotebookLM to test your knowledge.
- At the time this lab was created, the free tier of NotebookLM allowed users to have up to 100 notebooks, each with no more than 50 sources. There are also limits on the number of video and audio overviews you can create.
- Check out this detailed style guide for prompting NotebookLM to create slides.