Tuesday, March 29, 2011

Databases - Basics

In the previous post (Databases Introduction)  we looked at different types of databases and simple concepts of tables. In this section, we will look at some basics of databases- tables, inner joins and outer joins. The syntax of SQL used here is for Oracle. However, it is very similar in all other databases such as MySQL, MS SQLServer, Postgres, etc..


We will create two tables by using the SQL statements below: 
  1. create table customer (cust_id number(4,0) not null, customer_name varchar2(60) not null);
  2. create table invoice (invoice_id number(4) not null, cust_id number(4) not null , invoice_total number(10,2) default 0.00, paid_ind char(1));
In the first SQL we create the table called Customer with the column names cust_id and customer_name. Observe that we do not want to store null values in each of the columns in the customer table.

In the second SQL we create the table invoice, which has its columns and also the cust_id column from the customer table. This is how we delegate that a specific invoice is associated to a specific customer.

Observe that one customer can have many invoices and hence we have a one-many relationship between the customer and invoice tables. This implies that a customer may not have any invoices i.e. a one-zero relationship between the customer and invoice tables.

We can make the above create SQLs complex by specifying the tablespace name (we are using the default tablespace assigned to the user you are logging on) and adding foreign key constraints to the cust_id in the invoice table to the customer table. This will eliminate the input of data into the invoice table for which a cust_id is non-existent in the customer table.

A simple way to input data into the tables is to use the insert SQL statement which is given below:

insert into customer values (100, 'Paradigm Solutions');
insert into customer values (102, 'Siemens Pvt Ltd');
insert into customer values (104, 'British Telecom');

Now when we do a simple select from the customer table we get:

---------------------- ------------------------------------------------------------

100                   Paradigm Solutions

102                   Siemens Pvt Ltd

104                  British Telecom

Now here are the SQL statements for inserting records into the invoice table.
insert into invoice values ( 8442, 102, 5430.00, 'Y');
insert into invoice values ( 8997, 102, 5430.00, 'Y');
insert into invoice values ( 8775, 102, 5430.00, 'N');
insert into invoice values (9876,100,2345.78,'N') ; 
insert into invoice values (9886,100,345.00,'N'); 
As an exercise, we leave to the reader to retrieve the records inserted into the table invoice.

Inner and Outer Joins

To retrieve customers and invoices for which there are invoices, we do an inner join  where  the SQL is constructed by joining the column cust_id from the table customer and invoice:
select c.cust_id,  customer_name, i.invoice_id,  i.invoice_total, i.paid_ind
from customer c, invoice i where  c.cust_id= i.cust_id;;
gives results as:
CUST_ID                CUSTOMER_NAME                       INVOICE_ID  INVOICE_TOTAL  PAID_IND
---------------- --------------------------------------------------  ---------------------- ----------------------    --------  
100                        Paradigm Solutions                                 9876                     2345.78           N
100                       Paradigm Solutions                                  9886                    345                   N  
102                      Siemens Pvt Ltd                                           8775                    5430                 N 
102                     Siemens Pvt Ltd                                            8997                    5430                 Y 
102                    Siemens Pvt Ltd                                             8442                    5430                 Y 

We can make our SQL criteria (the sql after the where keyword) complex by requesting only those which have not been paid by modifying as below:

where  c.cust_id= i.cust_id and i.paid_ind ='N'

This will return those invoices which have not paid for the customers who have invoices.

Observe that we have used aliases for table names customer, invoiced as c and i respectively. This eliminates any ambiguous column names like cust_id being specified in the SQL. This, as a result, also eliminates typing the complete table name when referring the columns in the criteria or in the select.

The above was an example of inner join where we get the records when they exist in both tables. However, if you remember that we do not get the third customer - British Telecom from the table customer as it has no invoices associated with it.  In order to obtain all the customer names and their invoices, whether they exist or not, we will write an outer join as shown:

select c.cust_id, customer_name, i.invoice_id  from customer c, invoice i where  c.cust_id= i.cust_id (+);

returns the results as shown below:
CUST_ID                         CUSTOMER_NAME                     INVOICE_ID
---------------------- ---------------------------------------------------- ----------------------

100                                Paradigm Solutions                                         9876
100                                Paradigm Solutions                                         9886
102                               Siemens Pvt Ltd                                                 8775
102                              Siemens Pvt Ltd                                                  8997
102                              Siemens Pvt Ltd                                                  8442
104                              British Telecom

Observe that in the outer join SQL we specify in the join (+) which is equivalent to writing as:

select c.cust_id, customer_name, i.invoice_id  from customer c left outer join  invoice i on i.cust_id= c.cust_id ;

The latter method is more readable and specifies whether it is left or right outer join while the (+) indicates how the tables are already joined and the right , left outer is dependent on which side of the = sign the (+) is assigned.

Nowadays in Object Oriented Programming (OOP), we can understand both of these joins specifically when using Object Relation Mapping (ORM) tools like Hibernate, topLink etc. Most of the time we are using an outer join to get the most relevant information and hence in Hibernate (of which we will see in future articles) by default it uses outer join.

Wednesday, January 12, 2011

Databases - Introduction

In the information age (which started in 1940 or so when the US census used the first computer), a database is referred to as a silo of information which has data stored in a particular way. This makes stored data relevant to employees in order to make business decisions. Databases are a specific type of model. Models are any storage repository (server) which stores data whether it is a relational database, object-oriented database, flat or just files.

In this article, we confine ourselves to relational databases like MySQL, MS SQLServer20xx, Oracle, Sybase, MS Access and HSQLDB. These are the most common relational databases currently in use.

Non-Relational Databases
Before we move onto relational databases we will first discuss non-relational ones. These demanded a lot of sort to find code by programmers in order to retrieve the data to the users. Consider the US phone book and the number 307-456-7892. In order to search the phone book to retrieve who is associated to this phone number, the code (pseudo-code) would look something like:

.... set up code
do while (!eof(PhoneBook)) {
if (recordPhoneNumber EQUALS '3074567892') then
get out of the DO WHILE;
... grab the record's last name and first name and send it back.

This means that for every phone number, we will go through reading the file. Of course, if we had the file sorted by phone numbers a priori then it will be very easy to retrieve the name. Now, if we were given the last name of the individual, then retrieving the phone number would hold the same logic. As you can already see this is CPU (Central Processing Unit) intensive and there are better ways of storing data and retrieving them in less time.

Object Oriented Databases
Object Oriented(OO) Databases store objects. Objects have attributes and methods. These can be retrieved with OO languages like SmallTalk and Java, C++. An article [1] in the reference section gives a short description of OO databases including their advantages and disadvantages. I urge the reader to skim through said article to get familiar with OO databases.

Relational Databases
A relational database is two-dimensional and stores data in rows with the columns being attributes. It is akin to a spreadsheet. A spreadsheet (or a file) which stores particular data by the last name, first name, address and the telephone number of individuals is referred to as a table and each row in the spreadsheet is generally called a record.

Consider our example in the flat file section where we need to retrieve the person associated with a telephone number. Now the pseudo code is reduced to a query in Structured Query Language (SQL) as:

SELECT * FROM person WHERE telephoneNumber = 3074567892;

The database engine then searches the table to find the person for the record with the telephone number 3074567892 and returns the record. Now if there is no record matching the number then it will return zero records.

In a database, a table can have index files which are files that sort the table on a specified column or concatenation of columns. When the query is sent to the table then the index file (in our example, the telephoneNumber Index file) will be used if it exists. If there is no index file then we are back to a full scan. Now, if we were given the last name and were asked to retrieve the phone number, then the query is basically the same except for the where clause i.e.

SELECT * FROM person WHERE lastName ='COOK';

and the database engine will utilize the lastName index file for the above query.

Primary Id
Each table stores information regarding a primary unit in a structured way. For instance, a person can have many types of email addresses, business addresses, physical addresses, mail box addresses and sometimes a permanent address. This automatically lets a database designer think in terms of having a table which stores addresses and a table called "say person" which stores the last name, first name etc of the individual.

To join the person table and the address table we define a unique identification for each person such as SSN, DL # or just a sequence. This is then stored in both tables and we are then able to join the two tables to retrieve information of the person and their addresses. In other words, we relate the person table to the address table on the key sequenceNumber. Now the query will look like this:

SELECT * FROM person, address WHERE person.sequenceNo = address.sequenceNo AND person.telephoneNumber = 3074567892

This code will return all the records from both tables; that is if they match in both tables for the specific telephone number.

Index files are created on primary keys generally to speed the process of querying (searching).

Now that the storage of data (models) has been introduced, we move onto other topics in programming and analysis. We now move into some SQL for creating, inserting data and also joins of tables in Databases Basics.


[1]An Introduction to the Object Oriented Databases article.