Interactive Session 0-1: Getting ready to Python
General Plan
There are many ways to set up your local machine to run maintainable Python data science code. In addition to the usual need to manage your code with version control (e.g., Git and GitHub), any strategy for local computation requires three critical components:
- A system for managing computing environments to ensure that code runs in a consistent environment.
- A system for managing Python packages to ensure that code runs with consistent dependencies.
- A system for managing Python code, which is usually an integrated development environment (IDE) in which you edit and run code interactively.
In the past, these three components were often managed separately, but in recent years there has been a trend toward integrating them into a single system. For example, the RStudio IDE is a single system that manages all three components for R code. Similarly, the DataSpell IDE is a single system that manages all three components for Python code.
In this class, we will use a combination of tools to manage these three components. We will use conda to manage computing environments and Python packages, and we will use Visual Studio Code as our IDE. These are both very popular tools in the Python data science community, and they are both free and open source. However, there are many other options for managing computing environments, Python packages, and code execution. We will discuss some of these options below, or you can skip ahead to the instructions for setting up your local machine.
Managing Computing Environments, Libraries, and Dependencies
There are many options for managing computing environments. These days, a common method is to use containers, in which an entire computational system (including processes, memory, and disk space) is spun up as an isolated service on your local (or remote) machine. Tools such as Docker, or the Python-specific shiv, allow for isolated packaging and execution of Python programs. Generally, these are better suited for deploying code on remote servers, but they can be used locally too. In this class, we're not going to use containers. Instead, we will use a more traditional approach to managing computing environments and packages.
More traditional approaches to managing computing environments, packages, and dependencies include a suite of diverse tools. Some of these focus only on managing computing environments, while others focus on managing Python packages. Some are designed to work with Python only (e.g., venv, virtualenv), while others can manage environments and packages for any programming language (e.g., conda). A few of the most popular options in each category are listed below.
Package management tools for Python
pip is the standard package management tool for Python. It is included with the standard Python distribution and is the most widely used way to install Python packages. However, it does not manage computing environments on its own, so it is usually paired with one of the environment management tools described below.
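As a quick illustration, a typical pip workflow looks something like this (the package names are just examples):
# install a package from PyPI into the currently active environment
pip install pandas
# record the exact versions of everything currently installed
pip freeze > requirements.txt
# recreate those same dependencies on another machine
pip install -r requirements.txt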
pipx is a tool designed to install and run Python applications (command-line tools), each in its own isolated environment. It can install any application hosted on PyPI, as well as packages hosted on GitHub and even packages you've made on your local machine. However, it is not intended for managing a project's computing environment or its package dependencies.
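For example, a minimal pipx session might look like the sketch below (pycowsay is just a demo package used in the pipx documentation):
# install a command-line application into its own isolated environment
pipx install black
# run an application once without permanently installing it
pipx run pycowsay moo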
Environment management tools for Python
virtualenv is a tool to create isolated Python environments. Since Python 3.3, a subset of it has been integrated into the standard library as the venv module. While venv is sufficient to create virtual environments, it does not manage package dependencies (you still use pip inside the environment for that), so it is not as widely used as the tools below that do both.
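A minimal sketch of the venv workflow, assuming a macOS/Linux shell, looks like this:
# create an isolated environment in a folder named .venv
python -m venv .venv
# activate it (on Windows, use .venv\Scripts\activate instead)
source .venv/bin/activate
# packages installed now go into the environment, not the system python
pip install numpy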
Tools that manage both environments and packages
conda is a tool developed by Anaconda that is designed to manage computing environments. However, it also allows you to install packages, including pre-built binaries that do not require local compilation. It also manages package dependencies, ensuring that libraries are interoperable.
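For example, a typical conda workflow might look like the following sketch (the environment name and package list are just illustrations):
# create a new environment with a specific python version
conda create -n analysis python=3.10
# switch into the new environment
conda activate analysis
# install pre-built binary packages, resolving dependencies automatically
conda install numpy pandas matplotlib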
mamba is a newer tool, developed by the QuantStack team, that is designed to be a drop-in replacement for conda. It allows you to create new environments, install packages, and install pre-built binaries without local compilation, and it manages package dependencies to keep libraries interoperable. It is designed to be faster than conda and to use fewer resources. However, it is not as widely used as conda, and IDE support for it is less mature.
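Because mamba is a drop-in replacement, its commands mirror conda's; a minimal sketch:
# same syntax as conda, but with a faster dependency solver
mamba create -n analysis python=3.10
# environments created by mamba are activated with conda as usual
conda activate analysis
mamba install numpy pandas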
poetry is another newer tool designed to manage Python environments and packages. Advantages of poetry include a simplified approach to dependency management and to building and packaging your own code for distribution. Disadvantages include a lack of support for conda environments and, historically, weaker support for packages that are not hosted on PyPI (although this has improved over time).
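A minimal sketch of the poetry workflow (the project and package names are placeholders):
# create a new project skeleton with a pyproject.toml file
poetry new my-analysis
cd my-analysis
# add a dependency and record it in the lock file
poetry add pandas
# run a command inside the project's managed environment
poetry run python -c "import pandas; print(pandas.__version__)"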
0. Forking, Cloning, & Configuring the course repository
Before we get started, we need to fork and clone the course repository. This will create a local copy of the repository on your machine. We will use this local copy to work through the course materials, and you will be able to add your own code to it as you go.
Forking the course repository.
To fork the course repository, you must go to the course repository on GitHub and click the “Fork” button. This will add the repository to your own GitHub account, and allow you to make changes to the repository without affecting the original repository.
Cloning the course repository.
Once you have forked the repository (added it to your GitHub account), you need to clone it to create a copy on your local machine.
Get the clone URL by going to your fork of the repository on GitHub:
https://github.com/[your-github-username]/eds217_2023
Click on the green “Code” button and copy the URL that appears in the dropdown menu. Then use that URL to clone the repository onto your local machine:
git clone https://github.com/[your-github-username]/eds217_2023.git
Make sure you put the course repository in a location where you can find it again. For example, you might want to put it in a folder called dev, meds, code, or projects. Hopefully you already have an organizational system from your prior work, but if not, now is a good time to start.
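For example, one possible layout (assuming you want a folder named meds in your home directory) is:
# make a folder for course work in your home directory and move into it
mkdir -p ~/meds
cd ~/meds
# clone your fork of the course repository there
git clone https://github.com/[your-github-username]/eds217_2023.git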
Configuring the course repository.
Setting Up the Repository for Clean Jupyter Notebooks
To ensure the integrity and consistency of your Jupyter notebooks, it’s helpful to strip outputs before committing them. This keeps notebooks lightweight and avoids accidentally committing potentially sensitive information. It also makes it easier to collaborate on notebooks, because output changes won’t be flagged as changes in the Git history.
To set up the repository for clean Jupyter notebooks, you’ll need to add a filter to your local Git configuration.
Because filters are scripts that run code on your local machine, we do this installation manually, so you can be sure you aren’t accidentally running malicious code.
git config --local filter.strip-notebook-output.clean "jupyter nbconvert --stdin --stdout --to notebook --ClearOutputPreprocessor.enabled=True"
git config --local filter.strip-notebook-output.smudge cat
git config --local filter.strip-notebook-output.required true
These commands set up a filter for this repository only. They tell Git how to process .ipynb files before commits and checkouts.
With the filter set up, you can now add and commit Jupyter notebooks to the repository. The filter will automatically strip outputs from the notebooks during the commit process.
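If you want to confirm that the filter was registered correctly, you can ask Git to print back the settings you just added:
# list the filter settings stored in this repository's local configuration
git config --local --get-regexp filter.strip-notebook-output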
1. Installing and configuring conda
Throughout this course, we will use conda for our environment management. We can access conda through the terminal/command line or within a terminal inside an IDE.
Here are detailed instructions for getting conda and our class environment ready for use on your local machine.
Instructions: Installing & configuring conda
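Those instructions walk through the details, but as a rough sketch, building a class environment generally looks something like this (the environment.yml file name and the eds217 environment name are assumptions; use whatever the linked instructions specify):
# create an environment from a file of pinned dependencies, if the repository provides one
conda env create -f environment.yml
# or build one by hand with the core data science packages
conda create -n eds217 python=3.10 numpy pandas matplotlib jupyter
# switch into the environment before working
conda activate eds217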
Managing code and execution (IDEs)
Finally, we come to the tool you will use most when coding on your local computer: the Integrated Development Environment, or IDE. The range of options for IDEs is even broader than for either of the other tools. Common Python IDEs include VS Code, PyCharm, DataSpell, and JupyterLab, among others.
Any of these would work well for a Python data science workflow, but some have more features focused on data science than others. For example, PyCharm is more focused on software engineering, while DataSpell, a newer IDE by the same company (JetBrains), is squarely centered on data science workflows.
The best IDE is usually the one you are most familiar with. For that reason, RStudio isn’t a terrible choice for an IDE. Many data scientists use Jupyter in the browser (both Jupyter Notebooks and JupyterLab). However, using a non-browser IDE provides more opportunities for customization and removes a layer of complication when executing your code.
[Figure: Python Data Science Editor Usage (Primary Editor)]
[Figure: Python Data Science Editor Usage (Secondary Editor)]
2. Launching Jupyter Notebooks and Testing your Environment
Jupyter notebooks are the easiest way to get up and running with Python. They are also a great tool for keeping code and documentation in the same place. All of the class materials for this course were developed as notebook files. These files aren’t pure Python; instead, they are somewhat similar to .Rmd files. However, instead of being encoded in markdown, notebooks are encoded as json files.
The json format is much more extensible than markdown, but it is also much harder to edit by hand (it’s closer to raw html or xml). Most Python IDEs can render notebook files, but some support more features than others. To get the most functionality from a notebook (or .ipynb) file, it’s best to view, edit, and execute the notebook in a Jupyter server. Let’s start our first server and make sure everything is working well.
Instructions: Launching Jupyter and testing your environment
3. Installing and Configuring VSCode for Python
In this class, we will focus on using Visual Studio Code (VSCode) as our IDE.
This choice gives you exposure to the single most popular IDE in the data science world, and one that is well supported by the conda environment management system. It is also a very popular IDE for software engineering, so you will be able to use it for other courses and projects as well.
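Once VSCode is installed, a couple of optional command-line steps can speed up setup (these assume the code command has been added to your PATH, which VSCode can do for you from its command palette):
# open the course repository in VSCode from the terminal
code eds217_2023
# install the Python and Jupyter extensions without using the GUI
code --install-extension ms-python.python
code --install-extension ms-toolsai.jupyter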