Databases

WARNING: It has come to my attention that this page, that contains my personal notes for my Databases module during my First Year of Computer Science at university, was indexed by Google and is being shown to students. This page is a really good resource for Databases, but be aware that this is alot more info than you need for the exam. I apologize for the confusion.

1.Introduction

File based approach to information storage

One possible way of storing information would be to store each set of data needed on a seperate text file and then implement a program to do specific queries on the text files.

Limitations to file based approach

This may seem like a perfectly good solution however there are many limitations to it.

  1. No recovery - If something goes wrong you can't recover it.
  2. Inefficient - large data sets are very inefficient to query if held in text files.
  3. No simultaneous access - It makes simultaneous access very difficult as when querying a file it must be locked. Additionally hiding files or fragments of files is a difficult task with a file based system.
  4. Separation and isolation - Each program holds it's own set of data to deal with one task and may not be aware of useful data held by another program
  5. Duplication - Different programs may hold the same data and so that data is redundant and simply wastes space/ increases chance of desynchronisation.
  6. Data dependence - Each file may have a different format, so the structure is hard coded into the program making it hard to transfer data between programs.
  7. Fixed queries - Programs are written with specific queries to meet requirements and if a new requirement is needed a new program is needed.

Database approach to information storage

Database: A collection of data related in a logical manner which is designed to provide all the necessary info for an organisation

Requirements for a DBnv

Database management system

This is a system which allows users to have complete control over the DB. Giving functions to define/create/maintain the DB

A DBMS has two features: 

Data Definition Language:  The schema. It allows users to define how the database should be. Including the data types the structures and the constraints of the data.

Data Manipulation Language: The queries. It's the language you use to manipulate the database. A common one is SQL.

DBMS also offers: 

Components of the DBMS environment

Disadvantages to using DB over file system

Whilst there are lots of very clear advantages, to be balanced lets go through the disadvantages too

Compared to a file system a DB is/has:

A noSQL DB solves these problems by using a hybrid of a file system and a DB.

2. Database schemas and planning

Transactions and concurrency control

We want for the DBMS to be able to be trusted and for all operations to be completed. It should be reliable and always be in a consistent state.

It needs 

Transaction: An action carried out which reads/updates database.

Database recovery

Concurrency control

Abstract data models

DDL: specifies entities/attributes/relationships/constraints. However it is too low level to understand for most people.

Data model: Intuitive concepts describing data

Types of data organisation

The three way of charecterizing the data is:

Structured data

Semi-structured data

Unstructured data

Relational data model

Entity-Relationship (ER) model

This is a graphical description of the DB

Notations for the ER model:

3-level ANSI-SPARC* Architecture

Derived attributes vs attributes

DB schema

The aim of the schema is to allow users to all have access to the same DB instance with customized views depending on what part they need to see themselves.

Data independence

Upper levels in the DB schema should not be affected by changes in the lower level

Logical data independence: External schemas called views don't change if change the logical structure of the data.

Physical data independence: Conceptual schema doesn't change if we change the internal schema.

Three main phases of Database Design

3. Relational model

A relational model represents a DB as a collection of relations and constraints. 

This is basically the same as the entity relationship model which we'll talk about next so I'll just gloss through this section and use it as an overview.

Terminology

Properties

4. The entity relationship model

 

This is a graphical description of the DB to allow people perhaps without such advanced knowledge to understand the db. Aka engineers.

Main components

Entity

Relationship

Crow's Feet Are Best – TDAN.com

Attribute

This is the set of all common characteristics that are shared by entity occurrences of a particular type.

5: Dependencies and Normalisation

Normalisation

This is a bottom up strategy, basically we start with all the data and attributes stored in the tables. Then from that we figure out optimal relationships such to have the best designed relational database

Database redundancy

it should have No redundancy: Every data item is stored in one place.

Data anomalies Terminology

Modification anomaly: Don't change all instances of a specific value, after modifying one.

Deletion anomaly: Losing other values because you delete one items data

Insertion anomaly: Where when new data items are added, we need to add information about other entities.

Decomposition

To fix the data anomaly problems, remove the dependencies and redundancy by splitting data into multiple tables. Where data which must stay consistent between values of same attribute value, is stored one time in a different table

Relational keys

Functional data dependencies

This describes the relationships between attributes in the same relation.

Let A and B be two sets of attributes. Then B is Functionally dependent on A if each value of A is associated with exactly one value of B

Closure of a set F of dependencies

The closure denoted F+ is the set of all functional dependencies that are implied by dependencies in F

To compute the closure F+ of F we need some interference rules

Armstrong's axioms: 

Algo for computing the closure of F+

6: Normalisation II

Normalisation process

This is a multi-stage process, where the result of each stage is called a normal form. At each stage we do a check to see if the specific criteria is satisfied and if it isn't we have to reoranise.

Unnormalized form

repeating group: Attribute values that repeat. Attributes should only have one value for each cell.

First Normal Form

Second Normal Form

Is in 1NF and there are no partial functional dependencies. This means every non-key attributes depends on the whole primary key. E.g you can't work out a non-key attribute without knowing the primary key.

Whole primary key: the set of Keys which allow you to uniquely identify the set of attributes

So, perhaps a subset is dependent on one primary key and another subset dependent on the whole primary key. If this is the case then there is a partial dependency.

To convert to 2NF we do:

Third Normal Form

This gets rid of transitive dependencies. This means getting rid of cells that are dependent on another non-candidate key.

7. Structured Query Language

SQL components

Writing SQL statements

Views

8. Relational Calculus and Algebra

Relational Calculus

Relational Algebra

With this we describe the data by the operations applied to get it.

There are 6 basic operations:

Derived operations:

9.Relational algebra in depth

Selection (σ)

σpredicate(R)

Projection (π)

πcol-1,...,col-n(R)

Union, Set difference, intersection

These are all the same as their set counterparts (except they require compatibility) so lets go quick through them.

Union

RSR \cup S

Set difference

R-SR - S

Intersection

RS

Compatibility of schemas

Inorder for union, set difference or intersection to work the schemas must match.

So same number of attributes and corresponding attributes must have same domain. Note: not same name, just same domain.

Cartesian product

R×S

Rename

ϱX(R)

Join

Joins are cartesian products plus a selection.

equijoin: this is a type of join where the selection involves the equality operator.

Natural join (

$$R \bowtie S = \Prod_{C_1, \dots, C_2} (\sigma_{R.A_1=S.A_1, \dots, R.A_k =S.A_k)(R \times S)) $$

This is an equijoin between the relations R and S, over all their attributes that have the same name, without duplicates.

Steps to do a natural join:

  1. R×SR \times S (It takes the cartesian product between R and S)
  2. For each attribute with the same name A in both R and S, select all rows where R.A = S.A
  3. Removes duplicate columns using a projection.

Outer join

 This computes the natural join of R and S, and then adds tuples which did not match between R and S.

left outer join: Adds tuples from R which did not match those in S.

Right outer join: Adds tuples from S which did not match those in R.

Full outer join: Adds tuples from either which didn't match.

Semi join

the natural join between R and S. However with only the attributes of R

10. Multi-User Architectures

Teleprocessing architecture

File-server architecture

Client server architecture

Two tier architecture

Three tier architecture

This is literally how the internet browser works.

Distributed DBMS

This has network of computers each with part of the database, which mirrors the organisational structure.

Aims to:

Distributed DBMS is the software system that manages the distributed database and makes the distribution transparent to the users.

Distributed processing vs distributed DBMS

Distributed processing is a centralised database that is accessed over a computer network. But Distributed DBMS has multiple DB fragments distributed across sites.

Design of distributed DBMS 

How to do this depends on Quantitative information and Qualitive information

 

Editors

View count: 9673