Return to Course Home Page

Session 2-2: Structured Data in Python

⬅️ Previous Session | 🏠 Course Home | 🚦 EDS217 Vibes | ➡️ Next Session |

Probably the easiest mental model for thinking about structured data is a spreadsheet. You are all familiar with Excel spreadsheets, with their numbered rows and lettered columns. In the spreadsheet, data is often “structured” so that each row is an entry, and each column is perhaps a variable recorded for that entry.

Environmental data map really well to this model, especially data collected over time. A spreadsheet of weather observations would have a new row for each observation period, and then the columns would be the meteorological data that was collected at that time. As we move into our data science tools, we will see that a major aspect of these tools is creating efficient ways of working with structured data.

That being said, Python has a number of built-in data types that can be used to organize data in a structured manner. These basic data types containing structured data are all lumped into a single broad category called collections. Within these collection data types, some of the data types are sequences, which means that the items in the collection have a deterministic ordering. You’ve already used the two most common sequence data types: the string and the list. A string is simply an ordered collection of characters, while a list data type structures a collection of anything into an ordered series of elements that can be referenced by their position in the list.

In this lesson, we are going to learn about an additional sequence, called a tuple, as well as two collection data types called sets (set) and dictionaries (dict).

Session Topics

📌 Dictionaries

Accessing elements
keys() and values()
Looping through dictionaries using values()
Accessing un-assigned elements in Dictionaries

📌 Sets

Mutability
Mixed Data Types in Sets
Set Methods

📌 Tuples

Indexing
Immutability
Named tuples

Instructions

We will work through this notebook together. To run a cell, click on the cell and press “Shift” + “Enter” or click the “Run” button in the toolbar at the top.

🐍 This symbol designates an important note about Python structure, syntax, or another quirk.

▶️ This symbol designates a cell with code to be run.

✏️ This symbol designates a partially coded cell with an example.

Collection Summary

Note: Applying the sorted() function to any of these collections will produce a sorted list.

1. Dictionaries

TLDR: Dictionaries are a very common collection type that allows data to be organized using a key:value framework. Because of the similarity between key:value pairs and many data structures (e.g. “lookup tables”), you will see Dictionaries quite a bit when working in python

The first collection we will look at today is the dictionary, or dict. This is one of the most powerful data structures in python. It is a mutable, unordered collection, which means that it can be altered, but elements within the structure cannot be referenced by their position and they cannot be sorted.

You can create a dictionary using the {}, providing both a key and a value, which are separated by a :.

environmental_disciplines = {
    'ecology':'The relationships between organisms and their environments.',
    'hydrology':'The properties, distribution & effects of water on the surface, subsurface, & atmosphere.',
    'geology':'The origin, history, and structure of the earth.',
    'meteorology':'The phenomena of the atmosphere, especially weather and weather conditions.'
    }

🐍 <b>Note.</b> The use of whitespace and indentation is important in python. In the example above, the dictionary entries are indented relative to the brackets <code>{</code> and <code>}</code>. In addition, there is no space between the <code>'key'</code>, the <code>:</code>, and the <code>'value'</code> for each entry. Finally, notice that there is a <code>,</code> following each dictionary entry. This pattern is the same as all of the other <i>collection</i> data types we've seen so far, including <b>list</b>, <b>set</b>, and <b>tuple</b>.

▶️ <b> Run the cell below. </b>

Code

environmental_disciplines = {
    'ecology':'The relationships between organisms and their environments.',
    'hydrology':'The properties, distribution & effects of water on the surface, subsurface, & atmosphere.',
    'geology':'The origin, history, and structure of the earth.',
    'meteorology':'The phenomena of the atmosphere, especially weather and weather conditions.'
}

Accessing elements in Dictionaries

Access an element in a dictionary is easy if you know what you are looking for. For example, if I want to know the definition of ecology, I can simply retireve the value of this defition using the key as my index into the dictionary:

environmental_disciplines['ecology']
>>> 'The relationships between organisms and their environments.'

✏️ Try it.

Try accessing the some of the definitions in the environmental_disciplines dictionary.

Code

environmental_disciplines['ecology']

'The relationships between organisms and their environments.'

Because dictionaries are mutable, it is easy to add additional entries. This is done using the following notation:

    environmental_discplines['geomorphology'] =  'The evolution and configuration of landforms.'

✏️ Try it.

Biogeochemistry is defined as “the chemical, physical, geological, and biological processes and reactions that govern the composition of the natural environment.” Add this discpline to the dictionary environmental_disciplines.

Code

environmental_disciplines['biogeochemistry'] = 'the chemical, physical, geological, and biological processes and reactions that govern the composition of the natural environment.'
print(environmental_disciplines['biogeochemistry'])

the chemical, physical, geological, and biological processes and reactions that govern the composition of the natural environment.

Accessing dictionary keys and values

Every dictionary has builtin methods to retrieve its keys and values. These functions are called, appropriately, keys() and values()


disciplines = environmental_disciplines.keys()
print(disciplines)
>>> dict_keys(['ecology', 'hydrology', 'geology', 'meteorology', 'biogeochemistry'])


definitions = environmental_disciplines.values()
print(definitions)
>>> dict_values(
    ['The relationships between organisms and their environments.', 
     'The properties, distribution & effects of water on the surface, subsurface, & atmosphere.', 
     'The origin, history, and structure of the earth.', 
     'The phenomena of the atmosphere, especially weather and weather conditions.', 
     'The chemical, physical, geological, and biological processes and reactions that govern the composition of the natural environment.'])

🐍 Note. The keys() and values() functions return a dict_key object and dict_values object, respectively. Each of these objects contains a list of either the keys or values. You can force the result of the keys() or values() function into a list by wrapping either one in a list() command.

For example: key_list = list(environmental_disciplines.keys()) will return a list of the keys in environmental_disciplines

Code

for key, value in environmental_disciplines.items():
    print(key, value)
    environmental_disciplines[key] = value.capitalize()

ecology The relationships between organisms and their environments.
hydrology The properties, distribution & effects of water on the surface, subsurface, & atmosphere.
geology The origin, history, and structure of the earth.
meteorology The phenomena of the atmosphere, especially weather and weather conditions.
biogeochemistry the chemical, physical, geological, and biological processes and reactions that govern the composition of the natural environment.

Looping through Dictionaries

Python has an efficient way to loop through all the keys and values of a dictionary at the same time. The items() method returns a tuple containing a (key, value) for each element in a dictionary. In practice this means that we can loop through a dictionary in the following way:

my_dict = {'name': 'Homer Simpson',
           'occupation': 'Nuclear Engineer',
           'address': '742 Evergreen Terrace',
           'city': 'Springfield',
           'state': ' ? '
          }

for key, value in my_dict.items():
    print(f"{key.capitalize()}: {value}.")


>>> Name: Homer Simpson.
    Occupation: Nuclear Engineer.
    Address: 742 Evergreen Terrace.
    City: Springfield.
    State:  ? .

✏️ Try it.

Loop through the environmental_disciplines dictionary and print out a sentence providing the definition of each subject (e.g. “Ecology is the study of….”).

Accessing un-assigned elements in Dictionaries

Attempting to retrieve an element of a dictionary that doesn’t exist is the same as requesting an index of a list that doesn’t exist - Python will raise an Exception. For example, if I attempt to retrieve the definition of a field that hasn’t been defined, then I get an error.

environmental_disciplines['xenohydrology']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-46-d4d91bf18209> in <module>
----> 1 environmental_disciplines['xenohydrology']

KeyError: 'xenohydrology'

While it’s easy to determine what indicies are in a list using the len() command, it’s sometimes hard to know what elements are in a dict (but we’ll learn how soon!). Regardless, to avoid getting an error when requesting an element from a dict, you can use the get() function. The get() function will return None if the element doesn’t exist:

unknown_definition = environmental_disciplines.get('xenohydrology')
print(unknown_definition)
>>> None

The get() function will also allow you to pass an additional argument. This additional argument specifies a “default” value which will be returned for any undefined elements:

environmental_disciplines.get('xenohydrology', 'Discipline not defined.')
>>> 'Discipline not defined.'

list_of_disciplines = ['climatology', 'ecology', 'meteorology', 'geology', 'biogeochemistry']

✏️ Try it.

Using the list of discplines given above, write a for loop that either prints the definition of the discipline, or prints ‘Discipline not defined.’

2. Sets

TLDR: Sets are useful for comparing groups of items to determine their overlap or their differences. Sometimes used in data science, but rarely when working with large datasets.

As opposed to a list or tuple, a set is not a sequence. Although a set is iterable (like the sequences you’ve already met), a set is an unordered collection data type, which means it is not a sequence. However, a set is mutable, which means - like a list - it can be modified after being created. Finally - and most uniquely - a set has no duplicate elements. In this sense, a set in python is very much like a mathematical set.

We’ve seen that a list is implemented using [], while a tuple is implemented using ().

A set is implemented using {}:

num_set = {1, 3, 6, 10, 15, 21, 28}
str_set = {'hydrology', 'ecology', 'geology', 'climatology'}

▶️ <b> Run the cell below. </b>

Code

# Define set variables
num_set = {1, 3, 6, 10, 15, 21, 28}
str_set = {'hydrology', 'ecology', 'geology', 'climatology'}

As with all other collections, you can also create a set using the set() function:

num_set = set([1, 3, 6, 10, 15, 21, 28, 45, 45, 45, 45, 45])

✏️ Try it.

Use the set() function to create a set containing the first four prime numbers.

Code

num_set = set([1, 3, 6, 10, 15, 3, 10, 21, 28])
print(num_set)

{1, 3, 6, 10, 15, 21, 28}

Mutability

Sets are mutable.

To remove an element from a set, use the discard() method:


str_set.discard('ecology')

To add an element from to set, use the add() method:


str_set.add('oceanography')

To add multiple elements to a set at the same time, use the update() method. The items to add should be contained in a list.


str_set.update(['oceanography', 'microbiology'])

✏️ Try it.

Add ‘biogeochemistry’ and ‘meteorology’ to str_set and then remove ‘ecology’.

Many of the same functions that worked on list and tuple also work for a set.

len(str_set)
>>> 4

The min() and max() commands can also be used to find the minimum and maximum values in a tuple. For a tuple of strings, this corresponds to the alphabetically first and last elements.

min(str_set)
>>> 'climatology'

max(str_tuple)
>>> 'oceanography'

✏️ Try it.

Use the len(), min(), and max() commands to find the length, minimum, and maximum of num_set.

Mixed Data Types in Collections and Sequences

As a reminder, it’s usually a good idea to make sure your sets are all of the same basic data type. The reason is because Python doesn’t know how to compare the magnitude of different data types.

Which is larger: ecology, or the number 3? Python doesn’t know the answer, and neither do I. If you try to use functions like max or min on a mixed data type set you will get a TypeError exception.


mixed_set = {3, 4, 'ecology', 'biology'}
max(mixed_set)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-49-a4da84ba3cd4> in <module>
      1 mixed_set = {3, 4, 'ecology', 'biology'}
----> 2 max(mixed_set)

TypeError: '>' not supported between instances of 'int' and 'str'

Set Methods

Given their similarity to mathematical sets, there are some specific functions that allow us to compare and combine the contents of different sets.

Union

A union of sets contains all the items that are in any of the sets.

The union of sets A and B is defined as $ A B $.

odds = {1, 3, 5, 7, 9, 11, 13, 15}
evens = {2, 4, 6, 8, 10, 12, 14, 16}

integers = odds.union(evens)
print(integers)

>>> {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}

▶️ <b> Run the cell below. </b>

Code

knows_python = {'Bart', 'Maggie', 'Homer', 'Lisa', 'Professor Frink', 'Nelson'}
knows_R = {'Homer', 'Nelson', 'Lisa', 'Marge', 'Ralph', 'Milhouse', 'Ms. Krabappel'}

Intersection

An intersection of sets contains all the items that are found in all of the sets.

The intersection of sets A and B is defined as $ A B $.


squares = {4, 9, 16, 25, 36, 49}
multiples_of_nine = {9, 18, 27, 36, 45}

squares_divisible_by_nine = squares.intersection(multiples_of_nine)
print(squares_divisible_by_nine)

>>> {9, 36}

✏️ Try it.

Use the intersection() method to determine who knows both python and R.

Code

# knows_both =

Difference

An difference of two sets contains all the items that are in A but not in B.

The difference (or relative complement) of set A and B is defined as $ A B $.


squares = {4, 9, 16, 25, 36, 49}
multiples_of_nine = {9, 18, 27, 36, 45}

squares_not_divisible_by_nine = squares.difference(multiples_of_nine)
print(squares_not_divisible_by_nine)

>>> {16, 49, 4, 25}

🐍 <b>Note.</b> Because a <b>set</b> is an <i>unordered</i> collection, the result of a set function will return elements in an unpredictable order. In the example above, the intersection returned `{16, 49, 4, 25}` rather than `{4, 16, 25, 49}`, which you may have expected.

✏️ Try it.

Use the difference() method to determine who knows R, but does not know Python.

Symmetric Difference

This method returns all the items that are unique to each set.

The symmetric difference (or disjunctive union) of sets A and B is A \triangle B (also sometimes written as A \oplus B)


squares = {4, 9, 16, 25, 36, 49}
multiples_of_nine = {9, 18, 27, 36, 45}

squares_not_divisible_by_nine = squares.symmetric_difference(multiples_of_nine)
print(squares_not_divisible_by_nine)

>>> {16, 49, 18, 4, 25, 27, 45}

✏️ Try it.

Use the symmetric_difference() method to determine who only knows either R or Python.

Additional Set Methods: isdisjoint(), issubset(), issuperset()

There are three additional set functions that allow you to determine the relationships between two sets. Each of these functions returns either True or False, which means they are Boolean operators.

isdisjoint() determines if two sets are disjoint. It returns True if the contents of two sets are completely distinct, and False if they have any overlap

odds.isdisjoint(evens)
>>> True

🐍 <b>Note.</b> Set <i>A</i> is <b>disjoint</b> from set <i>B</i> if, and only if, the <b>intersection</b> of <i>A</i> and <i>B</i> is <code>None</code>.

issubset() returns True if the content of set A is a subset of set B, and False if it is not a subset.


primes = {1, 3, 5, 7, 11}
primes.issubset(odds)
>>> True

🐍 <b>Note.</b> Set <i>A</i> is a <b>subset</b> of set <i>B</i> if, and only if, the <b>intersection</b> of <i>A</i> and <i>B</i> is <i>A</i>.

issupserset() returns True if the content of set A is a superset of set B, and False if it is not a superset.


odds.issuperset(primes)
>>> True

🐍 <b>Note.</b> Set <i>A</i> is a <b>superset</b> of set <i>B</i> if, and only if, set <i>B</i> is a subset of <i>A</i>.

There is a lot more to learn about dictionaries, including methods for deleting elements, merging dictionaries, and learning about additional collection types like OrderedDict that allow you to preserve the arrangement of dictionary elements (essentially making them sequences). We will keep coming back to them throughout the class. If you want to learn more, check out the great material in our reading: Dictionaries

3. Tuples

TLDR: Tuples are a kind of list that can’t be altered. They are not very common in data science applications, but you might run across them from time to time. “Named Tuples” allow for the creation of simple structured data “objects” that don’t require much coding overhead.

Tuples are a type of sequence, similar to list, which you’ve already seen. They primary difference between a tuple and a list is that a tuple is immutable, which means that it’s value cannot be changed once it is defined. A tuple is implemented using ():

num_tuple = (4, 23, 654, 2, 0, -12, 4391)
str_tuple = ('energy', 'water', 'carbon')

▶️ <b> Run the cell below. </b>

Code

# Define tuple variables
num_tuple = (4, 23, 654, 2, 0, -12, 4391)
str_tuple = ('energy', 'water', 'carbon')

As with a list, a tuple may contain mixed data types, but this is not usually recommended.

Because they are both sequences, tuples and lists share many of the same methods. For example, just like lists, the len() command returns the length of the tuple.

len(str_tuple)
>>> 3

The min() and max() commands can also be used to find the minimum and maximum values in a tuple. For a tuple of strings, this corresponds to the alphabetically first and last elements.

min(str_tuple)
>>> 'carbon'

max(str_tuple)
>>> 'water'

✏️ Try it.

Use the len(), min(), and max() commands to find the length, minimum, and maximum of num_tuple.

Code

# Find the length of num_tuple

# Minimum value of num_tuple

# Maximum value of num_tuple

Other ways to create tuples

Tuples can also be constructed by:

Using a pair of parentheses to indicate an empty tuple: ()
Using a trailing comma for a tuple with a single element: a, or (a,)
Separating items with commas: a, b, c or (a, b, c)
using the tuple() built-in function: tuple(iterable).

🐍 <b>Note.</b> An <i>iterable</i> is any object that is capable of returning its contents one at a time. Strings are iterable objects, so <code>tuple('abc')</code> returns <code>('a', 'b', 'c')</code>.

tuple('earth')
>>> ('e', 'a', 'r', 't', 'h')

✏️ Try it.

Create three separate tuples containing the latitude and longitudes of the following cities:

Los Angeles, CA (34.05, -118.25)
Johannesburg, South Africa (-26.20, 28.05)
Cairo, Egypt (30.03, 31.23)
Create a fourth tuple that is made up of the three tuples (i.e. a “tuple of tuples”).

Code

# Define a three new tuples, one for each city.

# los_angeles = 

# johannesburg = 

# singapore = 

# Create a new tuple that is a tuple made up of the three city location tuples:
# tuple_of_tuples =

Indexing

As you learned with lists, any element of a sequence data type can be referenced by its position within the sequence. To access an element in a sequence by its index, use square brackets [].

Individual elements of tuples are accessed in the exact same manner as lists:

num_tuple[0]
>>> 4

num_tuple[-2]
>>> -12

word_tuple = tuple('antidisestablishmentarianism')

word_tuple[14]
>>> 's'

word_tuple[::3]
>>> 'aistlhnrnm'

✏️ Try it.

Use indexing to create a new tuple from the 2nd element in str_tuple. Find the 3rd element of this new tuple.

Code

# new_tuple = 

# 3rd element of new_tuple:

Immutability

All objects in python are either mutable or immutable. A mutable object is one whose value can change. In contrast, an immutable object has a fixed value. You’ve already been introduced to a few immutable objects including numbers, strings and now, tuples. These objects cannot be altered once created.

🐍 <b>Note.</b>  If you attempt to modify the value of an existing tuple, you will get a <code>TypeError</code> exception from the Python interpreter.

num_tuple[0] = 3

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 num_tuple[0] = 1

TypeError: 'tuple' object does not support item assignment

Tuple Operations

Because they are immutable, tuples do not have the same robust set of functions that lists have. Attempting to change a tuple (for example, by trying to append elements) will raise an AttributeError, because the append method isn’t available to tuple objects.


tuple_of_colors = ('red', 'blue', 'green', 'black', 'white')
tuple_of_colors.append('pink') # <- UH-OH!

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-857308f688f6> in <module>
----> 1 tuple_of_colors.append('pink')

AttributeError: 'tuple' object has no attribute 'append'

Instead of appending data to an existing tuple, when you want to change the contents of a tuple, you need to either create a new one, or modify the variable by re-defining it.

▶️ <b> Run the cell below. </b>

Code

tuple_of_colors = ('red', 'blue', 'green', 'black', 'white')
tuple_of_colors = tuple_of_colors + ('pink',)
print(tuple_of_colors)

('red', 'blue', 'green', 'black', 'white', 'pink')

DIVING DEEPER: Named Tuples

Tuples are convenient for storing information that you do not want to change. However, remembering which index should be used for each value can lead to errors, especially if the tuple has a lot of fields and is constructed far from where it is used.

As an example, we created the coordinate location of Cairo, Egypt as:

cairo_location = (30.03, 31.23)

But wait… Are those coordinates stored (latitude, longitude) or (longitude, latitude)? You might think it is easy to sort this out for most cities, but for Cairo it’s really difficult!

Python has an additional immutable collection data type called a namedtuple which assigns names, as well as the numerical index, to each member. The namedtuple is part of the standard python library but it is not immediately available. In order to use the namedtuple data type, you first need to import it to your working environment. We will be using the import command quite a bit in order to extend what python can do and take advantage of all the tools that people have developed for environmental data science. For now, we need to import namedtuple from the collections library within python. The code for that looks like this:

from collections import namedtuple

Once we import the namedtuple, we can create a new kind of custom data type that we can use to store our locations:

Location = namedtuple('Location', ['latitude', 'longitude'])

In the code above, the first argument to the namedtuple function is the name of the new tuple object type you want to create. We called this new object type a Location. The second argument is a list of the field names that the Location objects will have. In general, Location objects on Earth are defined by two pieces of information: the latitude and the longitude.

Now that we’ve defined this new Location object type, we can create a new Location object using this code:

cairo_location = Location(latitude=30.03, longitude=31.23)

Note that this code isn’t that different than the code we used to make a tuple:

cairo_location = tuple(30.03, 31.23)

The difference is that we are using our custom namedtuple type called Location, and we are able to specify exactly which values correspond to the latitude and longitude fields. We can retrieve any field in our Location tuple by specifying the field:

cairo_location.latitude
>>> 30.03

▶️ <b> Run the cell below. </b>

Code

from collections import namedtuple

Location = namedtuple('Location', ['latitude', 'longitude'])
cairo_location = Location(latitude=30.03, longitude=31.23)
cairo_location.latitude, cairo_location.longitude

(30.03, 31.23)