Lab 8 -- Creating and filling a Postgres database

Per the lab norm ... work on this for 80 minutes. Work with friends. Send me what you have after 80 minutes.

In this lab you will be making a set of tables within Postgres, and then putting data into those tables. My general approach to doing this is to start by making a database in which the tables will reside.

With the database created the next step analyse the data and determine table schema that fit both the data and how the data is going to be used. I will use as an example for this discussion the hurricane data.

The hurricane data is pretty annoying because it mixes the hurricane and observations in a single file. Specifically it looks like

        AL011851,            UNNAMED,     14,
        18510625, 0000,  , HU, 28.0N,  94.8W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,
        18510625, 0600,  , HU, 28.0N,  95.4W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,
        18510625, 1200,  , HU, 28.0N,  96.0W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,
        18510625, 1800,  , HU, 28.1N,  96.5W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,
        18510625, 2100, L, HU, 28.2N,  96.8W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,
        AL021851,            UNNAMED,      1,
        18510705, 1200,  , HU, 22.2N,  97.6W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,
        AL031851,            UNNAMED,      1,
        18510710, 1200,  , TS, 12.0N,  60.0W,  50, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,
        AL041851,            UNNAMED,     49,
        
    
After some analysis, I came to a conclusion about a how to structure the data. With this structure in mind, I created SQL definitions for the tables I wanted. I usually do this by creating a file, in this case named Hurricane_DDL.sql (DDL is an acronym for Data Definition Language). Here it is

\c hurricane;
drop table if exists observation;
drop table if exists hurricane;
drop type if exists stormtype;

create type stormtype as enum ('', 'TY', 'PT', 'ET', 'ST', 'TD', 'TS', 'HU', 'EX', 'SD', 'SS', 'LO', 'WV', 'DB');

create table hurricane (
       HID char(8),
       Name varchar(25),
       constraint pp_con
       primary key(HID)
       );

create table observation (
       HID char(8),
       date date,
       time time,
       type stormtype,
       latitude float,
       latitudehemi char(1),
       longitude float,
       longitudehemi char(1),
       maxSustained integer,
       constraint PK_CON
       foreign Key(HID) references hurricane(HID)
       );

This file gets used as

        psql < hurricane_DDL.sql
    
Note that I start by dropping the table definitions if they already exist. This also deletes any data that was in those tables so it is pretty aggressive. But it does mean that I am starting with a clean slate.

Finally, I write a program to read the CSV file(s) into the tables I just created. More often than not I write this program using Python. Here is my program. This pretty much follows the pattern of read a line, parse the commas, decide what to do, then execute an SQL insert.

    import psycopg2

db=1

def insertStringH(sline):
    insert = "insert into hurricane (hid, name) values('{}', '{}');"
    return insert.format(sline[0].strip(), sline[1].strip());

def insertStringO(hid, sline):
    aa = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}
    insert = "insert into observation (hid, date, time, type, latitude, latitudehemi, longitude, longitudehemi, maxsustained) values('{}', '{}', '{}', '{}', {}, '{}', {},  '{}', {});"
    return insert.format(hid.strip(), sline[0].strip(), sline[1], sline[3].strip(), sline[4][:-1], sline[4][-1], sline[5][:-1], sline[5][-1], sline[6])


if db!=0:
    cnx = psycopg2.connect(user="dbuser", database="hurricane", host="localhost", password="12345678")
    cursor = cnx.cursor()


cc=0
with open("hurdat2-1851-2019-052520.txt") as fp:
    line = fp.readline()
    hid=''
    while line:
        spl = line.strip().split(",")
        if len(spl)<6:
            hid=spl[0]
            ins = insertStringH(spl)
        else:
            ins = insertStringO(hid, spl)
        if db==1:
            try:
                cursor.execute(ins)
            except:
                print(line)
                print(ins)
                cc=cc+1
        else:
            print(ins)
        line = fp.readline()

print(cc)

if db!=0:
    cnx.commit()
    cursor.close()
    cnx.close()

Doing it yourself

If you will be using python, execute the following on the machine running postgres

        pip3 install psycopg2-binary
    

In my code above, you should replace "dbuser" with your login plus "_123". For instance, for Sarah the line

    cnx = psycopg2.connect(user="dbuser", database="hurricane", host="localhost",
would become
    cnx = psycopg2.connect(user="scoufal_123", database="scoufal", host="localhost",

Look at the two files Family and people.. (Alternately, if you want more of a challenge, look at people. This is largely identical to People except that parents in one family can be children in another. Hence, the family to people relationship is many to many.) The files define a set of families and the people in those families. None of the columns in the families table are guaranteed to be unique or unchanging. Similarly, none of the columns in the people table are unique or unchanging. It is guaranteed that the family column in the people table relates to the first column in the family table. Also note that a family may have 0 or more people and a person may be in one or more families. Finally, there are two families with the name A. This means that when you fill the people table, you will have to choose to which family you assign people.

First, do not create a whole new database. Instead put the tables for this lab into your default database (the one I created for you and is linked to your login.)

Come up with a reasonable, general purpose design for the tables to implement this data. Implement this design in two ways: first by creating an SQL DDL file that will create instances in PostgreSQL of the tables you designed. Then write -- in a language of your choice -- code that inputs the family data into PostGreSQL. (The process is essentially identical in most languages that use ODBC -- and most do. Java, C, Go, Python, ...

A thought about table creation and filling. Since you are writing the code that will fill tables, it is almost certainly a lot easier for your program to create unique keys (should you decide you need them) rather than having PostgreSQL assign them for instance using "family_id INT GENERATED ALWAYS AS IDENTITY" which was shown in class. Side note "family_id serial" is almost exactly equivalent to "family_id INT GENERATED ALWAYS AS IDENTITY".

Java

If you are more comfortable in java than python

What to hand in

Send email to gtowell@brynmawr.edu with your DDL file and your program for inputting. It will be easier for me if you do this by cutting and pasting these files into the body of the email rather than including them as attachments.