Driving a Visualization with SQLAlchemy

This tutorial demonstrates how you might set up a SQL database to serve data to a visualization application using Tangelo, and shows how an application-specific approach to building and using a database can provide flexibility and adaptability for your Tangelo application.

In this tutorial we will

  • obtain Star Trek: The Next Generation episode data
  • create an SQLite database from it
  • establish some object-relational mapping (ORM) classes using SQLAlchemy
  • visualize the data using Vega

To begin this tutorial, create a fresh directory in which to build the new project:

mkdir tng
cd tng

Here, we will create a database, along with appropriate ORM infrastructure; write some web services to act as runtime data sources, pulling requested data from the database; and build a simple web frontend from HTML and JavaScript, using the Vega visualization library.

For convenience, you can download and unpack a ZIP archive of the entire web application as well: tng.zip. However, downloading and inspecting the files as we go, or writing them by hand from the listings below, may encourage a deeper understanding of what’s going on.

Getting the Data

The episode data, gleaned from Memory Alpha by hand, is in two CSV files: episodes.csv and people.csv.

If you take a look at these files, you’ll see some basic data about episodes of Star Trek: The Next Generation. episodes.csv contains one row per episode, indicating its overall episode number, season and episode numbers, title, airdate, stardate, a link to the associated Memory Alpha article, and numeric indices into people.csv indicating who wrote each teleplay, who developed each story, and who directed each episode.

Creating the Database

SQLAlchemy is a Python library that provides a programmatic API for creating, updating, and accessing SQL databases. It includes an object-relational mapping (ORM) component, meaning it provides facilities for writing Python classes that transparently maintain a connection to the database, changing it as the object is updated, etc.

To install SQLAlchemy, you can use the Python package manager pip as follows:

pip install sqlalchemy==0.9.8

(The version specifier may not be necessary, but this tutorial was designed using SQLAlchemy 0.9.8.)
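
To confirm which version you have installed, you can ask Python to report it (this one-liner works under both Python 2 and 3):

python -c "import sqlalchemy; print(sqlalchemy.__version__)"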

Establishing ORM Classes

The first step in this visualization project is to establish our data model by writing some ORM classes, then to use those classes to read in the CSV files from above and load their contents into an SQLite database. The file startrek.py has what we need. Let’s analyze it, section by section.

First, we need some support from SQLAlchemy:

from sqlalchemy import create_engine
engine = create_engine("sqlite:///tngeps.db", echo=True, convert_unicode=True)

from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()

from sqlalchemy.orm import sessionmaker
DBSession = sessionmaker(bind=engine)

create_engine simply gives us a handle to a database: this one will exist on disk, in the file tngeps.db. The echo keyword argument controls whether the behind-the-scenes translation to SQL commands appears on the output as changes occur. For now, it is a good idea to leave it set to True so you can see what SQLAlchemy is doing.

declarative_base constructs a base class that provides the foundation for creating ORM classes to model our data, while sessionmaker creates a session factory: calling DBSession() will establish a connection to the database.

Next we need to import some types for the columns appearing in our data:

from sqlalchemy import Column
from sqlalchemy import Date
from sqlalchemy import Integer
from sqlalchemy import String

Date and String will be used to store actual data values, such as airdates, and names of people and episodes. Integer is useful as a unique ID field for the various objects we will be storing.

Now, we finally get to the data modeling:

from sqlalchemy import Table
from sqlalchemy import ForeignKey
from sqlalchemy.orm import relationship

episode_teleplays = Table("episode_teleplays", Base.metadata,
                          Column("episode_id", Integer, ForeignKey("episodes.id")),
                          Column("teleplay_id", Integer, ForeignKey("people.id")))

episode_stories = Table("episode_stories", Base.metadata,
                        Column("episode_id", Integer, ForeignKey("episodes.id")),
                        Column("story_id", Integer, ForeignKey("people.id")))

episode_directors = Table("episode_directors", Base.metadata,
                          Column("episode_id", Integer, ForeignKey("episodes.id")),
                          Column("director_id", Integer, ForeignKey("people.id")))


class Episode(Base):
    __tablename__ = "episodes"

    id = Column(Integer, primary_key=True)
    season = Column(Integer)
    episode = Column(Integer)
    title = Column(String)
    airdate = Column(Date)
    teleplay = relationship("Person", secondary=episode_teleplays, backref="teleplays")
    story = relationship("Person", secondary=episode_stories, backref="stories")
    director = relationship("Person", secondary=episode_directors, backref="directors")
    stardate = Column(String)
    url = Column(String)

    def __repr__(self):
        return (u"Episode('%s')" % (self.title)).encode("utf-8")


class Person(Base):
    __tablename__ = "people"

    id = Column(Integer, primary_key=True)
    name = Column(String)

    def __repr__(self):
        return (u"Person('%s')" % (self.name)).encode("utf-8")

First take a look at the classes Episode and Person. These make use of SQLAlchemy’s declarative_base to establish classes whose structure reflects the semantics of a table in the database. The __tablename__ attribute gives the name of the associated table, while the other named attributes give the names of the columns appearing in it, along with their data types.
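
If you are curious what SQL these declarations correspond to, SQLAlchemy can render the DDL for you. Here is a quick sketch, run from the same directory as startrek.py:

from sqlalchemy.schema import CreateTable

from startrek import engine
from startrek import Episode

# Print the CREATE TABLE statement that would be emitted for the episodes table.
print CreateTable(Episode.__table__).compile(engine)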

Note that the teleplay, story, and director attributes of Episode are a bit more complex than the others. These are fields that cross-reference into the “people” table: each Episode may have multiple writers and story developers [1], each of which is a Person. Of course, a particular Person may also be associated with multiple Episodes, so a special “many-to-many” relationship exists between Episode and Person when it comes to the teleplay, story, and director columns. These are expressed in the “association table” declarations at the top of the listing (episode_teleplays, episode_stories, and episode_directors). Such tables simply contain one row for each unique episode-person connection; each is referenced, via the secondary keyword argument, in the corresponding relationship() declaration to implement the many-to-many relation.

The effect of these ORM classes is that when, e.g., an Episode object is queried from the database, its teleplay property will contain a list of the correct Person objects, having been reconstructed by examining the “episode_teleplays” table in the database.
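
For example, once the database has been built (see the next section), a quick sanity check from the Python prompt might look like this sketch (the episode title here is just an assumption about what episodes.csv contains; any title from the data will do):

from startrek import DBSession
from startrek import Episode

session = DBSession()

# Fetch one episode by title and inspect its cross-referenced writers.
ep = session.query(Episode).filter_by(title=u"Encounter at Farpoint").first()
print ep.teleplay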

Building the Database

Now it just remains to use the ORM infrastructure to drive the parsing of the raw data and creation of the actual database. The file build-db.py contains a Python script to do just this. Let’s examine this script, section by section. First, as always, we need to import some standard modules:

import csv
import datetime
import sys

Next, we need some stuff from startrek.py:

from startrek import Base
from startrek import engine
from startrek import DBSession
from startrek import Episode
from startrek import Person

Now, let’s go ahead and slurp in the raw data from the CSV files:

try:
    with open("episodes.csv") as episodes_f:
        with open("people.csv") as people_f:
            episodes = list(csv.reader(episodes_f))
            people = list(csv.reader(people_f))
except IOError as e:
    print >>sys.stderr, "error: %s" % (e)
    sys.exit(1)

We now have two lists, episodes and people, containing the row data. Before we continue, we need to “activate” the ORM classes, and connect to the database:

Base.metadata.create_all(engine)
session = DBSession()

Now let’s add the rows from people.csv to the database:

people_rec = {}
for i, name in people[1:]:
    i = int(i)

    p = Person(id=i, name=name.decode("utf-8"))

    people_rec[i] = p
    session.add(p)

This loop simply runs through the rows of the people data (excluding the first row, which is just the header), saving each person in a dictionary indexed by ID: the Person(...) call creates a Person object, while session.add(p) causes a row in the appropriate table to be created in the database (using ORM magic). We keep the people_rec dictionary so we can look up Person objects by ID later.

Now that we have all the people loaded up, we can put the episode data itself into the database:

for i, season, ep, title, airdate, teleplay, story, director, stardate, url in episodes[1:]:
    # Extract the fields that exist more or less as is.
    i = int(i)
    season = int(season)
    ep = int(ep)

    # Parse the (American-style) dates from the airdate field, creating a Python
    # datetime object.
    month, day, year = airdate.split("/")
    month = "%02d" % (int(month))
    day = "%02d" % (int(day))
    airdate = datetime.datetime.strptime("%s/%s/%s" % (month, day, year), "%m/%d/%Y")

    # Create lists of writers, story developers, and directors from the
    # comma-separated people ids in these fields.
    teleplay = map(lambda writer: people_rec[int(writer)], teleplay.split(","))
    story = map(lambda writer: people_rec[int(writer)], story.split(","))
    director = map(lambda writer: people_rec[int(writer)], director.split(","))

    # Construct an Episode object, and add it to the live session.
    ep = Episode(id=int(i), season=season, episode=ep, title=title.decode("utf-8"), airdate=airdate, stardate=stardate, teleplay=teleplay, story=story, director=director, url=url)
    session.add(ep)

This is a loop running through the rows of the episode data, pulling out the various fields and converting them to appropriate Python types where necessary. The teleplay, story, and director columns in particular are converted to lists of Person objects (by looking up the appropriate Person in the dictionary we created earlier). The call to session.add(ep) adds the newly created Episode to the database. The many-to-many relationships we established earlier are invoked here, updating the association tables according to the Person objects present in each Episode object’s fields.

Finally, we must commit the accumulated database operations:

session.commit()
sys.exit(0)

If you have downloaded the data files, startrek.py, and build-db.py all to the same directory, you should be able to build the database with this command:

python build-db.py

Because we directed the database engine to echo its activity to the output, you should see SQL commands fly by as they are generated by the various calls to session.add(). This should result in a new file being created, tngeps.db. This is an SQLite database file, and should contain all of the data and relationships established in the raw data files.
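
If you have the sqlite3 command-line tool available, you can verify the contents of the new database directly; for example:

sqlite3 tngeps.db "SELECT count(*) FROM episodes;"
sqlite3 tngeps.db "SELECT count(*) FROM people;"
sqlite3 tngeps.db "SELECT count(*) FROM episode_teleplays;"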

Writing Data Services

Now we have a database and some ORM classes to query it. The next step is to write a web service that can pull out the data we need. We are going to use Vega to create some basic charts of the episode data, and Vega visualizations rely on data presented as a list of JSON objects, one per data point. As a starting point, let’s tally up the number of episodes written or developed by each person in the people table, and use Vega to render a bar chart. To do so, we need to query the database and count how many episodes each person is associated with; the ORM classes make this straightforward. Let’s analyze the file writers.py to see how. First, the module imports:

import json
import tangelo

from startrek import DBSession
from startrek import Episode

Now, the meat of the service, the run() function:

@tangelo.types(sort=json.loads)
def run(sort=False):

The decorator and function signature say that the sort query parameter, if present, should be parsed as JSON (so the text “true” in the URL becomes the Python value True), and that it defaults to False. We will use this parameter to sort the list of episode writers by the number of episodes worked on, since this may be an interesting thing to look into. Next we need a connection to the database:

    session = DBSession()

and some logic to aggregate writers’ episode counts:

    count = {}
    episodes = session.query(Episode)
    for ep in episodes:
        seen = set()
        for writer in ep.teleplay:
            count[writer] = count.get(writer, 0) + 1
            seen.add(writer)

        for writer in ep.story:
            if writer not in seen:
                count[writer] = count.get(writer, 0) + 1

This retrieves the Episode objects from the database via session.query(Episode), then loops through them, incrementing a per-writer count in a dictionary (being careful not to double count writers listed under both teleplay and story for a given episode).
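
As an aside, this kind of tally can also be pushed down into the database as an aggregate query. Here is a rough sketch; note that it counts teleplay credits only, so unlike the loop above it does not merge teleplay and story credits:

from sqlalchemy import func
from startrek import DBSession, Person, episode_teleplays

session = DBSession()

# One row per person: (name, number of teleplay credits).
teleplay_counts = (session.query(Person.name, func.count(episode_teleplays.c.episode_id))
                   .join(episode_teleplays, Person.id == episode_teleplays.c.teleplay_id)
                   .group_by(Person.id)
                   .all())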

Now we convert the dictionary of collected counts into a list of objects suitable for a Vega visualization:

    results = [{"name": r.name, "count": count[r]} for r in sorted(count.keys(), key=lambda x: x.id)]
    if sort:
        results.sort(key=lambda x: x["count"], reverse=True)

    return results

The list comprehension converts each Person object into a Python dictionary, after sorting the writers by their numeric IDs (which, because of how the data was collected, roughly correspond to the order of first involvement in writing for Star Trek: The Next Generation). If the sort parameter was True, the results are then re-sorted by descending episode count (so that the most frequent writers appear first). And finally, of course, the function returns this list of results, which reaches the browser as JSON.

With this file written, we have the start of a web application. To see how things stand, you can launch Tangelo to serve this directory to the web,

tangelo --root .

and then visit http://localhost:8080/writers?sort=false to see the list of JSON objects that results.
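
The response is a JSON array of objects, one per writer, shaped roughly like this (the names and counts here are placeholders, not actual values from the data):

[
    {"name": "Writer One", "count": 12},
    {"name": "Writer Two", "count": 9}
]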

Designing a Web Frontend

The final piece of the application is a web frontend. Ours will be relatively simple. Here is the webpage itself, in index.html:

<!doctype html>
<title>Star Trek: The Next Generation Episode Writers</title>

<script src="http://trifacta.github.io/vega/lib/d3.v3.min.js"></script>
<script src="http://trifacta.github.io/vega/vega.js"></script>
<script src="index.js"></script>

<div id="chart"></div>

This is a very simple HTML file containing a div element, in which we will place a Vega visualization.

Next, we have some simple JavaScript to go along with this HTML file, in index.js:

window.onload = function () {
    // Note: d3 v3 xhr callbacks receive the error (if any) as their first argument.
    d3.json("barchart.json", function (error, spec) {
        vg.parse.spec(spec, function (chart) {
            chart({
                el: "#chart"
            }).update();
        });
    });
};

This fetches the Vega visualization specification and parses it into a JavaScript object, which is passed to vg.parse.spec(); its callback receives a chart constructor that renders the visualization into the #chart element of the web page [2].

The final piece of the puzzle is the Vega specification itself, in barchart.json:

{
    "name": "barchart",
    "width": 4000,
    "height": 500,
    "data": [
        {
            "name": "table",
            "url": "writers?sort=true"
        }
    ],
    "scales": [
        {
            "name": "y",
            "type": "linear",
            "range": "height",
            "domain": {
                "data": "table",
                "field": "data.count"
            }
        },
        {
            "name": "x",
            "type": "ordinal",
            "range": "width",
            "domain": {
                "data": "table",
                "field": "data.name"
            }
        }
    ],
    "axes": [
        {
            "type": "x",
            "scale": "x",
            "values": []
        },
        {
            "type": "y",
            "scale": "y",
            "grid": false
        }
    ],
    "marks": [
        {
            "type": "rect",
            "from": {
                "data": "table"
            },
            "properties": {
                "enter": {
                    "x": {
                        "scale": "x",
                        "field": "data.name",
                        "offset": -1
                    },
                    "width": {
                        "scale": "x",
                        "band": true,
                        "offset": -1
                    },
                    "y": {
                        "scale": "y",
                        "field": "data.count"
                    },
                    "y2": {
                        "scale": "y", 
                        "value": 0},
                    "fill": {
                        "value": "steelblue"
                    }
                },
                "update": {
                    "fill": {
                        "value": "steelblue"
                    }
                },
                "hover": {
                    "fill": {
                        "value": "firebrick"
                    }
                }
            }
        },
        {
            "type": "text",
            "from": {
                "data": "table"
            },
            "properties": {
                "enter": {
                    "x": {
                        "scale": "x",
                        "field": "data.name"
                    },
                    "dx": {
                        "value": 5
                    },
                    "y": {
                        "value": 505
                    },
                    "angle": {
                        "value": 45
                    },
                    "fill": {
                        "value": "black"
                    },
                    "text": {
                        "field": "data.name"
                    },
                    "font": {
                        "value": "Helvetica Neue"
                    },
                    "fontSize": {
                        "value": 15
                    }
                }
            }
        }
    ]
}

This specification describes a data-driven bar chart. You may wish to experiment with this file (for example, changing the colors used, or the width and height of the visualization, or by setting the sort parameter in the url property to false), but as-is, the specification will deliver a bar chart of Star Trek: The Next Generation writers, ordered by most episodes worked on.

Putting It All Together

Your web application is complete! If Tangelo is not running, start it with

tangelo --root .

and then visit http://localhost:8080. You should see a bar chart appear, in which the Trekkies out there will surely recognize some of the names.

In summary, we performed the following actions to write a Tangelo application driven by a database:

  1. Got some data we wanted to visualize.
  2. Developed some ORM infrastructure to model the data, using SQLAlchemy.
  3. Imported the data into a new database, using the data and the ORM models.
  4. Developed a web service using SQLAlchemy to retrieve some of the data and then shape it into a form we needed for Vega.
  5. Developed a Vega specification that can take the web service results and render it as a bar chart.
  6. Developed a simple web application to give Vega a place to work and display its results.

Of course, this is just a simple example of what you can do. With Python’s power, flexibility, and interfaces to many kinds of databases and visualization systems, you can develop a Tangelo application that is suited to whatever problem you happen to be working on.

Footnotes

[1] And even directors, though this only happened in one episode, when the original director was fired and a replacement brought on.
[2] For more information on how Vega works, and what you can do with it, see the Vega website at http://trifacta.github.io/vega.