Relational Database Technology

John "Scooter" Morris

April 6, 2017

Overview

Data modeling review
Relational algebra
SQL
From model to schema

Limitations

What am I not telling you about?
- database normalization
- object-based approaches to database design
- object-relational mapping
- .... too much more to mention ....
Ask questions!

Example Problem

A system to automate the tracking and documentation of plasmid construction

Terminology:
- fragment: a length of double-stranded DNA
- plasmid: a circular fragment
- recipe: a series of manipulations of the DNA to produce a new plasmid with cDNA of interest inserted
- ACL: access control list
Needs:
- Data processing -- convert raw data into results
- Visualization -- a way to visualize the results
- Data storage -- store the results (and perhaps the raw data)

Example problem

Data Modeling

The FIRST Step

Structured way to understand the data semantics

Independent of underlying platform

Way to communicate with team members (including users)

Excellent (minimal?) documentation

Example: ER Diagrams

ER Diagrams

Types of Databases

Flat-file
Hierarchical
Network
Relational
Object
Object-Relational

Flat-File Databases

No database-enforced (or provided) linkage between records
Excellent for small or special-purpose databases
Might include support for single or multiple indexes
Major feature: ease of use (Filemaker, Access)
Major drawback: scalability & flexibility
e.g.:
- ndbm
- Berkeley DB (Sleepycat DB)
- vi, grep, sed
- FileMaker
- Access

Hierarchical Databases

Relationship between Recipe and Fragment is one-to-many (master-detail)
Assume two recipes: r1.cr and r2.cr
- r1.cr produces 2 plasmids and 1 fragment:
  - r1.p1, r1.p2, and r1.f1
- r2.cr produces 2 fragments:
  - r2.f1, and r2.f2

Hierarchical Databases

Database provides explicit master-detail support
Ideal for many business applications
Restricted to a strict hierarchy
Queries down the hierarchy are very efficient
Any other queries are very expensive
e.g.
- IMS
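
The cost asymmetry can be sketched in a few lines of Python. This is only an illustration using the two recipes from the previous slide; the nested dictionary stands in for the master-detail storage and `recipe_for` is a hypothetical helper, not how IMS is implemented.

```python
# Master records (recipes) own their detail records (products).
recipes = {
    "r1.cr": ["r1.p1", "r1.p2", "r1.f1"],
    "r2.cr": ["r2.f1", "r2.f2"],
}

# Query down the hierarchy: a single lookup on the master key.
products = recipes["r1.cr"]

def recipe_for(product):
    """Query against the hierarchy: scan every master record."""
    for recipe, outputs in recipes.items():
        if product in outputs:
            return recipe
    return None

print(products)
print(recipe_for("r2.f2"))
```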

What about many-to-many relationships?

Networked Databases

Fragment and Gene have a many-to-many relationship
Not represented well by hierarchical databases

Networked Databases

Based on set theory
Database provides explicit linkage support
Very significant design costs
Queries along the connection path are very efficient
Any other queries are very expensive
e.g.
- CODASYL

What if I want to "discover" other relationships?

Relational Databases

Foundation of most production databases
Based on relational calculus and relational algebra
Allows ad-hoc query capability across record types
Supports a standard query language (SQL)
Can support either hierarchical or network models
Attributes are limited to basic types
e.g.
- SQLite
- MySQL
- Derby
- Oracle
- DB2

Relational Databases

Based on relational views (tables)
Associations are based on data values, not expressed linkages
All data is expressed in tables
Terminology:
- Rows are called tuples
- Columns (attributes) are of a common domain (type)

ER → Relational Schema

First, combine any entities with a one-to-one relationship
Next define tables for our entities:

Note that we've added a new attribute to serve as the primary key for each entity

ER → Relational Schema

Now define tables for relationships, adding attributes for the associations:
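
The table diagrams for these two steps are not reproduced here, so the following is a minimal sketch of what the resulting schema might look like. It assumes the table and column names used by the SQL examples later in these slides (GENE, CONTAINS, PRODUCES, RECIPE, and a FRAG table keyed by ID); the column types and the in-memory database are choices made for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Entity tables, each with an added primary key attribute.
    CREATE TABLE GENE   (ID TEXT PRIMARY KEY, NAME TEXT, PROTEIN TEXT, START INTEGER);
    CREATE TABLE FRAG   (ID TEXT PRIMARY KEY);
    CREATE TABLE RECIPE (RCP TEXT PRIMARY KEY, NAME TEXT, FILE TEXT, OWNER TEXT);

    -- Relationship tables: one attribute per associated entity.
    CREATE TABLE CONTAINS (
        FRAG TEXT REFERENCES FRAG(ID),
        GENE TEXT REFERENCES GENE(ID)
    );
    CREATE TABLE PRODUCES (
        RCP  TEXT REFERENCES RECIPE(RCP),
        FRAG TEXT REFERENCES FRAG(ID),
        DATE TEXT
    );
""")

# List the tables that now exist in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```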

Relational Algebra: Selection

Selection
- Selection of tuples based on Boolean criteria
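
As a rough sketch, selection can be written in Python by treating a relation as a list of dictionaries. The `select` helper is a name made up for this illustration; the GENE rows are the ones used in the later examples.

```python
# A relation modelled as a list of tuples, each tuple a dict of attributes.
GENE = [
    {"ID": "G1", "NAME": "AMP", "START": -5},
    {"ID": "G2", "NAME": "TET", "START": -10},
    {"ID": "G3", "NAME": "NGF", "START": -1},
]

def select(relation, predicate):
    """Return the tuples of `relation` for which `predicate` is true."""
    return [row for row in relation if predicate(row)]

amp = select(GENE, lambda row: row["NAME"] == "AMP")
print(amp)
```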

Relational Algebra: Projection

Projection
- Selection of attributes
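
A comparable sketch for projection, again with a relation as a list of dictionaries. The `project` helper is hypothetical; it drops duplicate tuples because a relation is a set.

```python
GENE = [
    {"ID": "G1", "NAME": "AMP", "PROTEIN": "MAKK..."},
    {"ID": "G2", "NAME": "TET", "PROTEIN": "MYAK..."},
    {"ID": "G3", "NAME": "NGF", "PROTEIN": "MYAK..."},
]

def project(relation, attributes):
    """Keep only the named attributes, dropping duplicate tuples."""
    result = []
    for row in relation:
        projected = {name: row[name] for name in attributes}
        if projected not in result:
            result.append(projected)
    return result

proteins = project(GENE, ["PROTEIN"])
print(proteins)
```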

Relational Algebra: Inner Join

Inner Join
- Cartesian product of two relations, restricted to the pairs of records that satisfy a given join predicate; only records with a matching record in the other table appear in the result.
EquiJoin
- Inner join where the join predicate is based on an equality.
Natural Join
- Inner join where the join predicate is implicitly based on attributes with the same name in each of the join tables
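
The three definitions can be sketched in Python as a filtered Cartesian product. The helper names are made up for this illustration, and the PRODUCES and RECIPE rows are taken from the SQL join examples later in these slides. Because tuples are merged as dictionaries, an attribute name shared by both relations is stored only once.

```python
from itertools import product

PRODUCES = [
    {"RCP": "R1", "FRAG": "F1", "DATE": "1985-09-09"},
    {"RCP": "R1", "FRAG": "F2", "DATE": "1985-09-09"},
    {"RCP": "R2", "FRAG": "F3", "DATE": "1985-10-05"},
]
RECIPE = [
    {"RCP": "R1", "NAME": "r1", "FILE": "r1.cr", "OWNER": "scooter"},
    {"RCP": "R2", "NAME": "r2", "FILE": "r2.cr", "OWNER": "ckw"},
]

def inner_join(left, right, predicate):
    """Cartesian product, keeping only pairs that satisfy the predicate."""
    return [{**l, **r} for l, r in product(left, right) if predicate(l, r)]

def natural_join(left, right):
    """Inner join on equality of every attribute name the relations share."""
    if not left or not right:
        return []
    shared = set(left[0]) & set(right[0])
    return inner_join(left, right,
                      lambda l, r: all(l[a] == r[a] for a in shared))

# Equijoin: the predicate is an equality test.
equi = inner_join(PRODUCES, RECIPE, lambda p, r: p["RCP"] == r["RCP"])
natural = natural_join(PRODUCES, RECIPE)
for row in natural:
    print(row["RCP"], row["FRAG"], row["OWNER"])
```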

Relational Algebra: Outer Join

Left Outer Join
- Join where the result contains every record from the left table, whether or not it has a matching record in the right table; unmatched left records are padded with NULLs.
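
A minimal Python sketch of a left outer join, with `None` playing the role of NULL. The `left_outer_join` helper is hypothetical; the rows come from the outer join examples later in these slides, where recipe R3 has produced nothing.

```python
RECIPE = [
    {"RCP": "R1", "NAME": "r1", "FILE": "r1.cr", "OWNER": "scooter"},
    {"RCP": "R2", "NAME": "r2", "FILE": "r2.cr", "OWNER": "ckw"},
    {"RCP": "R3", "NAME": "r3", "FILE": "r3.cr", "OWNER": "rst"},
]
PRODUCES = [
    {"RCP": "R1", "FRAG": "F1", "DATE": "1985-09-09"},
    {"RCP": "R1", "FRAG": "F2", "DATE": "1985-09-09"},
    {"RCP": "R2", "FRAG": "F3", "DATE": "1985-10-05"},
]

def left_outer_join(left, right, predicate):
    """Keep every left tuple; pad with None when nothing on the right matches."""
    right_attrs = list(right[0]) if right else []
    result = []
    for l in left:
        matches = [r for r in right if predicate(l, r)]
        if matches:
            result.extend({**l, **r} for r in matches)
        else:
            padding = {a: None for a in right_attrs if a not in l}
            result.append({**l, **padding})
    return result

joined = left_outer_join(RECIPE, PRODUCES, lambda rc, p: rc["RCP"] == p["RCP"])
for row in joined:
    print(row["RCP"], row["OWNER"], row["FRAG"])
```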

Relational Algebra: Example Query

Query: What recipes produce the AMP gene?

First, select the AMP gene from the GENE relation and join it to CONTAINS

TEMP1 = (CONTAINS[FRAG,GENE] times GENE[ID,NAME]) 
              where GENE.ID=CONTAINS.GENE and GENE.NAME='AMP'

Relational Algebra: Example Query

Query: What recipes produce the AMP gene?
- Second, join the result to the PRODUCES relation and select the RCP attribute
```
TEMP2 = (TEMP1 join PRODUCES)[RCP]*
```

Relational Algebra: Example Query

Query: What recipes produce the AMP gene?
- Finally, join the result to the RECIPES relation
```
ANSWER = 
         (TEMP2 join RECIPES) where TEMP2.RCP=RECIPES.RCP
```

Structured Query Language (SQL)

ANSI standard syntax for relational algebra
Supported by all major commercial relational databases
Also supported by many open-source efforts
- e.g. mysql, perl's DBI/DBD
Will only cover:
- CREATE
- INSERT
- SELECT
- JOIN
- UPDATE

SQL - CREATE

Creates database objects (databases, tables, indices)

SYNOPSIS:

CREATE DATABASE database_name

CREATE TABLE table_name
(
  column_name1 data_type,
  column_name2 data_type,
  ......
  [, PRIMARY KEY (column_name)]
  [, FOREIGN KEY (column_name) REFERENCES table_name(column_name)]
)

CREATE [UNIQUE] INDEX index_name 
      ON table_name (column_name)

SQL - CREATE

Examples:

CREATE TABLE "GENE"
( 
ID char(16), 
NAME varchar(20), 
PROTEIN varchar, 
START int,
PRIMARY KEY (ID)
);

CREATE TABLE "PRODUCES"
( 
RCP char(16), 
FRAG char(16),
DATE date, 
FOREIGN KEY (RCP) REFERENCES RECIPE(RCP),
FOREIGN KEY (FRAG) REFERENCES FRAG(ID)
);

CREATE UNIQUE INDEX GENE_ID_IDX ON GENE (ID);

SQL - INSERT

Inserts data into a table row

SYNOPSIS:

INSERT INTO "tablename" (first_column,...last_column) 
           VALUES(first_value,...last_value);

Example:

INSERT INTO GENE (ID, NAME, PROTEIN, START) 
            VALUES ('G1', 'AMP', 'MAKK...', -5);
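
From a program, the values are usually passed separately from the SQL text. The sketch below uses Python's sqlite3 module (introduced at the end of these slides) with an in-memory database; the `?` placeholders are sqlite3's parameter syntax, and the three rows are the GENE rows used in the later examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GENE (ID TEXT PRIMARY KEY, NAME TEXT, PROTEIN TEXT, START INTEGER)")

genes = [
    ("G1", "AMP", "MAKK...", -5),
    ("G2", "TET", "MYAK...", -10),
    ("G3", "NGF", "MYAK...", -1),
]
# One INSERT per tuple; sqlite3 fills in each ? and handles the quoting.
conn.executemany(
    "INSERT INTO GENE (ID, NAME, PROTEIN, START) VALUES (?, ?, ?, ?)", genes)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM GENE").fetchone()[0]
print(count)
```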

SQL - SELECT

Selects data from relational tables
Key syntax for expressing relational algebra

SYNOPSIS

SELECT [DISTINCT] column1[,column2] FROM table1[,table2]
    [WHERE "conditions"] 
    [GROUP BY "column-list"] 
    [HAVING "conditions"] 
    [ORDER BY "column-list" [ASC | DESC] ]

SELECT - Selection

Selection
- Selection of tuples based on Boolean criteria

SELECT - Projection

Projection
- Selection of attributes

SELECT - Implicit Equijoin

Join (equijoin)
- Cartesian product of two relations, restricted to the pairs where an attribute with the same domain has equal values
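
The worked examples for these three SELECT slides are not reproduced here, so the following is a sketch of what each form might look like, run through Python's sqlite3 module against an in-memory database. The GENE rows match the later examples; the CONTAINS rows are hypothetical and exist only to give the join something to match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GENE (ID TEXT PRIMARY KEY, NAME TEXT, PROTEIN TEXT, START INTEGER);
    CREATE TABLE CONTAINS (FRAG TEXT, GENE TEXT);
    INSERT INTO GENE VALUES ('G1', 'AMP', 'MAKK...', -5);
    INSERT INTO GENE VALUES ('G2', 'TET', 'MYAK...', -10);
    INSERT INTO GENE VALUES ('G3', 'NGF', 'MYAK...', -1);
    -- Hypothetical fragment contents, for illustration only.
    INSERT INTO CONTAINS VALUES ('F1', 'G1');
    INSERT INTO CONTAINS VALUES ('F3', 'G2');
""")

# Selection: the WHERE clause picks tuples.
selection = conn.execute(
    "SELECT * FROM GENE WHERE START < -3 ORDER BY ID").fetchall()

# Projection: the column list picks attributes.
projection = conn.execute(
    "SELECT DISTINCT PROTEIN FROM GENE ORDER BY PROTEIN").fetchall()

# Implicit equijoin: two tables in FROM, an equality in WHERE.
equijoin = conn.execute(
    "SELECT CONTAINS.FRAG, GENE.NAME FROM CONTAINS, GENE "
    "WHERE CONTAINS.GENE = GENE.ID ORDER BY CONTAINS.FRAG").fetchall()

print(selection)
print(projection)
print(equijoin)
```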

SQL - JOIN

Joins two or more tables together based on a join predicate. Note that the JOIN keyword is actually part of the SELECT syntax

SYNOPSIS:

SELECT column1[,column2] FROM table1
           INNER JOIN table2 ON join_predicate;

SELECT column1[,column2] FROM table1
           INNER JOIN table2 USING ( column_name);

SELECT column1[,column2] FROM table1
           NATURAL JOIN table2;

SELECT column1[,column2] FROM table1
           LEFT OUTER JOIN table2 ON join_predicate;

Where:
- join_predicate is an equality for an Equijoin, or a comparison for any other join

SQL - INNER JOIN Examples

Examples:

SELECT * FROM PRODUCES NATURAL JOIN RECIPE;

R1|F1|1985-09-09|r1|r1.cr|scooter
R1|F2|1985-09-09|r1|r1.cr|scooter
R2|F3|1985-10-05|r2|r2.cr|ckw

SELECT * FROM PRODUCES INNER JOIN RECIPE ON PRODUCES.RCP=RECIPE.RCP;

R1|F1|1985-09-09|R1|r1|r1.cr|scooter
R1|F2|1985-09-09|R1|r1|r1.cr|scooter
R2|F3|1985-10-05|R2|r2|r2.cr|ckw

SELECT * FROM PRODUCES JOIN RECIPE USING(RCP);

R1|F1|1985-09-09|r1|r1.cr|scooter
R1|F2|1985-09-09|r1|r1.cr|scooter
R2|F3|1985-10-05|r2|r2.cr|ckw

SQL - OUTER JOIN Examples

Assume we add a new row (R3) into the RECIPE relation

Outer join examples:

SELECT * FROM PRODUCES LEFT OUTER JOIN RECIPE ON PRODUCES.RCP=RECIPE.RCP;

R1|F1|1985-09-09|R1|r1|r1.cr|scooter
R1|F2|1985-09-09|R1|r1|r1.cr|scooter
R2|F3|1985-10-05|R2|r2|r2.cr|ckw

SELECT * FROM RECIPE LEFT OUTER JOIN PRODUCES ON PRODUCES.RCP=RECIPE.RCP;

R1|r1|r1.cr|scooter|R1|F1|1985-09-09
R1|r1|r1.cr|scooter|R1|F2|1985-09-09
R2|r2|r2.cr|ckw|R2|F3|1985-10-05
R3|r3|r3.cr|rst|||

SELECT -- Query Example

Query: What recipes produce the AMP gene?

First, select the AMP gene from the GENE relation and join it to CONTAINS

CREATE TABLE TEMP1 AS 
     SELECT CONTAINS.FRAG,CONTAINS.GENE,GENE.NAME FROM CONTAINS,GENE 
          WHERE GENE.ID=CONTAINS.GENE AND GENE.NAME='AMP';

Note we're doing an implicit Equi-JOIN. To do the same thing more explicitly:

CREATE TABLE TEMP1 AS 
     SELECT CONTAINS.FRAG,CONTAINS.GENE,GENE.NAME FROM CONTAINS
     INNER JOIN GENE ON GENE.ID = CONTAINS.GENE WHERE GENE.NAME='AMP';

SELECT -- Query Example 2

Query: What recipes produce the AMP gene?

Second, join the result to the PRODUCES relation and select the RCP attribute

CREATE TABLE TEMP2 AS 
     SELECT PRODUCES.RCP FROM TEMP1,PRODUCES WHERE TEMP1.FRAG=PRODUCES.FRAG;

Note we're again doing an implicit Equi-JOIN. The explicit syntax would be:

CREATE TABLE TEMP2 AS 
     SELECT PRODUCES.RCP FROM TEMP1
     INNER JOIN PRODUCES ON TEMP1.FRAG = PRODUCES.FRAG;

SELECT -- Query Example 3

Query: What recipes produce the AMP gene?
- Finally, join the result to the RECIPE table
```
SELECT DISTINCT RECIPE.RCP, RECIPE.NAME, RECIPE.FILE, RECIPE.OWNER 
     FROM TEMP2, RECIPE WHERE TEMP2.RCP = RECIPE.RCP;
```
  - Note the DISTINCT keyword to remove duplicate rows

SQL - Query Example (shorthand)

Most modern relational databases have good query optimizers

Usually no need to create intermediate tables:

SELECT DISTINCT RECIPE.RCP, RECIPE.NAME, RECIPE.FILE, RECIPE.OWNER 
     FROM GENE, CONTAINS, PRODUCES, RECIPE 
          WHERE GENE.NAME = 'AMP' AND GENE.ID = CONTAINS.GENE
               AND CONTAINS.FRAG = PRODUCES.FRAG
               AND PRODUCES.RCP = RECIPE.RCP;
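
To try the single-statement form end to end, here is a sketch using Python's sqlite3 module and an in-memory database. The PRODUCES and RECIPE rows are the ones shown in the join examples; the CONTAINS rows are assumed for illustration, with the AMP gene placed on fragments F1 and F2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GENE (ID TEXT, NAME TEXT);
    CREATE TABLE CONTAINS (FRAG TEXT, GENE TEXT);
    CREATE TABLE PRODUCES (RCP TEXT, FRAG TEXT, DATE TEXT);
    CREATE TABLE RECIPE (RCP TEXT, NAME TEXT, FILE TEXT, OWNER TEXT);

    INSERT INTO GENE VALUES ('G1', 'AMP'), ('G2', 'TET');
    -- Assumed fragment contents: AMP appears on both F1 and F2.
    INSERT INTO CONTAINS VALUES ('F1', 'G1'), ('F2', 'G1'), ('F3', 'G2');
    INSERT INTO PRODUCES VALUES ('R1', 'F1', '1985-09-09'),
                                ('R1', 'F2', '1985-09-09'),
                                ('R2', 'F3', '1985-10-05');
    INSERT INTO RECIPE VALUES ('R1', 'r1', 'r1.cr', 'scooter'),
                              ('R2', 'r2', 'r2.cr', 'ckw');
""")

QUERY = """
    SELECT {distinct} RECIPE.RCP, RECIPE.NAME, RECIPE.FILE, RECIPE.OWNER
      FROM GENE, CONTAINS, PRODUCES, RECIPE
     WHERE GENE.NAME = 'AMP' AND GENE.ID = CONTAINS.GENE
       AND CONTAINS.FRAG = PRODUCES.FRAG
       AND PRODUCES.RCP = RECIPE.RCP
"""

with_distinct = conn.execute(QUERY.format(distinct="DISTINCT")).fetchall()
without_distinct = conn.execute(QUERY.format(distinct="")).fetchall()
print(with_distinct)
print(len(without_distinct))
```

With this data, recipe R1 reaches the AMP gene through two fragments, so dropping DISTINCT returns the same recipe twice.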

SQL - UPDATE

Updates data in a database

SYNOPSIS:

UPDATE tablename 
    SET columnname='newvalue'[,nextcolumn='newvalue2'...]
        WHERE columnname OPERATOR 'value' 
            [AND|OR column OPERATOR 'value'];

Example:

UPDATE GENE SET NAME='AMP' WHERE ID='G1';

SQL - Other Useful Commands

ALTER - Alter a table after it has been created
- Add or drop columns
- Add or drop primary or foreign keys
DELETE - Delete a row from a table. Syntax is similar to SELECT.
DROP - Delete an entire table or database
SQL Functions - aggregation functions that operate on the results from a select
- Include basic statistics (STDEV,AVG,SUM,VAR), and counting functions like COUNT(column)
- Example:
```
SELECT COUNT(*) FROM RECIPE,PRODUCES 
     WHERE RECIPE.OWNER='scooter' AND RECIPE.RCP=PRODUCES.RCP;
```
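
A small sketch of aggregation in practice, using Python's sqlite3 module with the RECIPE and PRODUCES rows from the join examples. It counts the fragments produced by one owner's recipes, then uses GROUP BY to get the same count for every owner at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RECIPE (RCP TEXT, NAME TEXT, FILE TEXT, OWNER TEXT);
    CREATE TABLE PRODUCES (RCP TEXT, FRAG TEXT, DATE TEXT);
    INSERT INTO RECIPE VALUES ('R1', 'r1', 'r1.cr', 'scooter'),
                              ('R2', 'r2', 'r2.cr', 'ckw');
    INSERT INTO PRODUCES VALUES ('R1', 'F1', '1985-09-09'),
                                ('R1', 'F2', '1985-09-09'),
                                ('R2', 'F3', '1985-10-05');
""")

# COUNT(*) over the joined rows for a single owner.
scooter_count = conn.execute(
    "SELECT COUNT(*) FROM RECIPE, PRODUCES "
    "WHERE RECIPE.OWNER = ? AND RECIPE.RCP = PRODUCES.RCP",
    ("scooter",)).fetchone()[0]

# GROUP BY applies the aggregate once per owner.
per_owner = conn.execute(
    "SELECT RECIPE.OWNER, COUNT(*) FROM RECIPE, PRODUCES "
    "WHERE RECIPE.RCP = PRODUCES.RCP "
    "GROUP BY RECIPE.OWNER ORDER BY RECIPE.OWNER").fetchall()

print(scooter_count)
print(per_owner)
```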

SQL - References

Good intro tutorial:
- On-line SQL Tutorial
  - http://www.sqlcourse.com/
- On-line SQL Tutorial 2
  - http://www.sqlcourse2.com/
Another good intro:
- A Gentle Introduction to SQL
  - http://sqlzoo.net/
A useful reference site:
- W3 Schools SQL Tutorial
  - http://www.w3schools.com/sql/
Also worth a look
- Software carpentry lecture on Relational Databases
  - http://plato.cgl.ucsf.edu/Outreach/bmi219/slides/swc/lec/db.html

Object-Relational Databases

Essentially an extension of the relational database model
Preserves the tabular (relational) organization of the data
Allows developers to define more complex data types (User Defined Types, UDTs)
No support for encapsulation or inheritance
Some support for methods is provided (User Defined Functions, UDFs)
SQL object extensions already standardized (SQL3)
e.g.
- postgres
- Oracle

Object Databases

Provides persistent storage of objects
Most useful in conjunction with object-based applications
Primarily a programmer's tool, although vendors are providing SQL3 and ODBC interfaces
e.g.
- Objectivity

Types of Databases

Questions?

Recommended Reading:
- Date, C.J. An Introduction to Database Systems. Reading, Mass.: Addison-Wesley (1981)
- Codd, E.F. A Relational Model of Data for Large Shared Data Banks. CACM 13, No. 6 (June 1970)

Uses of Databases

...or why do I [should you] care about this stuff?

Three major computing issues in bioinformatics:
- Data processing -- convert raw data into results
- Visualization -- a way to visualize the results
- Data storage -- store the results (and perhaps the raw data)

Questions?

Database Access with Python

SQL provides a way to interact with a relational database...
... but how do I access my database programmatically?
Lots of ways, but we're going to discuss sqlite3
sqlite3:
- Provides access to SQLite from Python scripts
- Simple (maybe too simple...)
- Basic idea is to execute SQL commands and return the response as a Python list of rows
- Installed on plato

sqlite3 Example


#!/usr/bin/env python3

import sys
import sqlite3

try:
    # Open a connection to the database
    conn = sqlite3.connect('bmi219.db')
    cursor = conn.cursor()

    # Execute an SQL statement -- can be pretty much any SQL
    cursor.execute("SELECT NAME, PROTEIN FROM GENE")
    # fetchall returns a list of tuples, one per row
    rows = cursor.fetchall()
    for row in rows:
        print("%s, %s" % (row[0], row[1]))

    # Close the cursor and commit any changes to the database
    cursor.close()
    conn.commit()
    conn.close()

except sqlite3.Error as e:
    # Handle any errors; sqlite3 exceptions carry a single message argument
    print("Error: %s" % e)
    sys.exit(1)

AMP, MAKK...
TET, MYAK...
NGF, MYAK...

Larger sqlite3 Example


#!/usr/bin/env python3
import sys
import sqlite3

try:
  conn = sqlite3.connect('bmi219.db')

  # Get a cursor we can work with
  cursor = conn.cursor()

  # Use the execute method to pass SQL commands to the database
  cursor.execute("DROP TABLE IF EXISTS GENE")
  # Note that we use triple quotes when we need multiple lines
  cursor.execute("""
                CREATE TABLE GENE
                (
                        ID char(16),
                        NAME varchar(20),
                        PROTEIN longtext,
                        START int,
                        PRIMARY KEY (ID)
                )
  """)

  cursor.execute("""
                INSERT INTO GENE VALUES
                  ('G1','AMP','MAKK...',-5)
  """)
  cursor.execute("""
                INSERT INTO GENE VALUES
                  ('G2','TET','MYAK...',-10)
  """)
  cursor.execute("""
                INSERT INTO GENE VALUES
                  ('G3','NGF','MYAK...',-1)
  """)

  # rowcount only reflects the most recent statement, so this reports 1
  print("Rows inserted by the last statement: %d" % cursor.rowcount)

  # OK, now let's try to get some data out
  cursor.execute("SELECT NAME, PROTEIN FROM GENE")
  count = 0
  while True:
    row = cursor.fetchone()
    if row is None:
      break
    print("%s, %s" % (row[0], row[1]))
    count += 1

  # cursor.rowcount is -1 for SELECT statements in sqlite3, so count by hand
  print("Number of rows returned: %d" % count)

  # Another way to do the same thing
  cursor.execute("SELECT NAME, PROTEIN FROM GENE")
  rows = cursor.fetchall()
  for row in rows:
    print("%s, %s" % (row[0], row[1]))

  print("Number of rows returned: %d" % len(rows))

  cursor.close()
  conn.commit()
  conn.close()

except sqlite3.Error as e:
  # sqlite3 exceptions carry a single message argument
  print("Error: %s" % e)
  sys.exit(1)

SQLite3 Use

sqlite3 provides a good low-level interface
For most uses, probably want to wrap low-level SQL commands in Python objects
In the above example, a GENE might be an object
Might have methods to fetch (SELECT) or save (INSERT) a GENE
Provides some insulation from underlying SQL implementation
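
One way this wrapping might look, as a minimal sketch rather than a recommended design. The `Gene` class and its `save` and `fetch` methods are hypothetical names; the table layout follows the GENE examples above, and `INSERT OR REPLACE` is SQLite-specific syntax that relies on ID being the primary key.

```python
import sqlite3

class Gene:
    """Hypothetical wrapper around one row of the GENE table."""

    def __init__(self, gene_id, name, protein, start):
        self.gene_id = gene_id
        self.name = name
        self.protein = protein
        self.start = start

    def save(self, conn):
        """Write this gene to the database, replacing any row with the same ID."""
        conn.execute(
            "INSERT OR REPLACE INTO GENE (ID, NAME, PROTEIN, START) "
            "VALUES (?, ?, ?, ?)",
            (self.gene_id, self.name, self.protein, self.start))
        conn.commit()

    @classmethod
    def fetch(cls, conn, gene_id):
        """Return the Gene with the given ID, or None if there is no such row."""
        row = conn.execute(
            "SELECT ID, NAME, PROTEIN, START FROM GENE WHERE ID = ?",
            (gene_id,)).fetchone()
        return cls(*row) if row else None

# Usage: callers work with Gene objects and never see the SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GENE (ID TEXT PRIMARY KEY, NAME TEXT, PROTEIN TEXT, START INTEGER)")
Gene("G1", "AMP", "MAKK...", -5).save(conn)
gene = Gene.fetch(conn, "G1")
print(gene.name, gene.start)
```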