Databases | Compsci Compsci

<p><strong>WARNING:</strong> It has come to my attention that this page, which contains my personal notes for my Databases module during my first year of Computer Science at university, was indexed by Google and is being shown to students. This page is a really good resource for Databases, but be aware that it covers a lot more than you need for the exam. I apologize for the confusion.</p><h1>1. Introduction</h1><h2>File based approach to information storage</h2><p>One possible way of storing information would be to store each set of data needed in a separate text file and then implement a program to do specific queries on the text files.</p><h3>Limitations to file based approach</h3><p>This may seem like a perfectly good solution; however, there are many limitations to it.</p><ol><li><strong>No recovery </strong>- If something goes wrong you can't recover the data.</li><li><strong>Inefficient </strong>- Large data sets are very inefficient to query if held in text files.</li><li><strong>No simultaneous access</strong> - Simultaneous access is very difficult, because a file must be locked while it is being queried. 
Additionally, hiding files or fragments of files is difficult with a file based system.</li><li><strong>Separation and isolation -</strong> Each program holds its own set of data to deal with one task and may not be aware of useful data held by another program.</li><li><strong>Duplication</strong> - Different programs may hold the same data, so that data is <strong>redundant</strong>: it wastes space and increases the chance of desynchronisation.</li><li><strong>Data dependence</strong> - Each file may have a different format, so the structure is hard coded into the program, making it hard to transfer data between programs.</li><li><strong>Fixed queries </strong>- Programs are written with specific queries to meet requirements, so a new requirement means writing a new program.</li></ol><h2>Database approach to information storage</h2><p><strong>Database: </strong>A collection of logically related data which is designed to provide all the necessary info for an organisation.</p><h3>Requirements for a DB</h3><ul><li>One central repository of data shared by all users/departments</li><li>All data has a minimum amount of duplication/redundancy</li><li>Large databases may have a "data dictionary" which describes the DB data (e.g. the schema)</li></ul><h2>Database management system</h2><p>This is a system which gives users complete control over the DB by providing functions to define, create and maintain it.</p><p>A DBMS has two features: </p><p><strong>Data Definition Language:</strong>  The schema. It allows users to define how the database should be structured, including the data types, the structures and the constraints of the data.</p><p><strong>Data Manipulation Language: </strong>The queries. It's the language you use to manipulate the database. 
A common one is SQL.</p><p>A DBMS also offers: </p><ul><li><strong>Security</strong> - Users can only access what they're allowed to access</li><li><strong>Concurrency</strong> - Multiple queries can be run on the same data at the same time</li><li><strong>Recovery</strong> - If something goes wrong you can undo it.</li></ul><h2>Components of the DBMS environment</h2><ul><li><strong>Hardware - </strong>The computers storing the data</li><li><strong>Software - </strong>The DBMS managing the data</li><li><strong>Data -</strong> The actual stuff which you are storing, plus the metadata, e.g. the schema</li><li><strong>Procedures - </strong>The documented instructions on how to use the DB</li><li><strong>People </strong>- The people who run the queries and manage the DB</li></ul><h2>Disadvantages to using DB over file system</h2><p>Whilst there are lots of very clear advantages, to be balanced let's go through the disadvantages too.</p><p>Compared to a file system a DB is/has:</p><ul><li>More complex</li><li>Higher cost to implement</li><li>Additional hardware cost</li><li>Slower processing for SOME applications</li><li>Higher impact if things go wrong.</li></ul><p>A NoSQL DB tries to reduce some of these problems by sitting somewhere between a file system and a relational DB.</p><h1>2. Database schemas and planning</h1><h2>Transactions and concurrency control</h2><p>We want to be able to trust the DBMS and to know that all operations run to completion. 
It should be reliable and always be in a consistent state.</p><p>It needs:</p><ul><li>Database recovery</li><li>Concurrency control protocols<ul><li>This is where database accesses are prevented from interfering with each other.</li></ul></li></ul><p><strong>Transaction:</strong> An action which reads or updates the database.</p><h3>Database recovery</h3><ul><li>During execution of a transaction the data may be in an inconsistent state, where constraints may be violated</li><li>Committed: the transaction completed successfully and its changes are kept</li><li>Rolled back: the transaction did not complete successfully, so its changes are undone.</li></ul><h3>Concurrency control</h3><ul><li>The process of managing simultaneous operations on the DB, preventing them from interfering with each other.</li><li>This allows multiple users to edit the DB at the same time and have all their operations execute correctly.</li></ul><h2>Abstract data models</h2><p><strong>DDL:</strong> Specifies entities/attributes/relationships/constraints. However, it is too low level for most people to understand.</p><p><strong>Data model:</strong> Intuitive concepts describing data</p><h3>Types of data organisation</h3><p>The three ways of characterizing the data are:</p><ul><li>Structured data</li><li>Semi-structured data (e.g. XML)</li><li>Unstructured data</li></ul><h3>Structured data</h3><ul><li>Data represented in a strict format such as a relational data model (tables, tuples, attributes)</li><li>The DBMS ensures that everything has the right structure and maintains integrity.</li></ul><h3>Semi-structured data</h3><ul><li>Schema mixed in with data, so you don't know in advance how it's structured.</li></ul><h3>Unstructured data</h3><ul><li>No structure to the document</li><li>E.g. a text document or a webpage's HTML</li></ul><h2>Relational data model</h2><ul><li>Relations are tables (columns + rows)</li><li>The attributes are the columns</li><li>And the tuples are the rows. 
</li></ul><h3>Entity-Relationship (ER) model</h3><p>This is a graphical description of the DB</p><ul><li>It specifies the data objects and their important properties</li><li>Also the associations between the entities (relationships)</li><li>Includes constraints</li></ul><p>Notations for the ER model:</p><ul><li>Crow's foot notation</li><li>UML notation</li></ul><h2>3-level ANSI-SPARC Architecture</h2><ul><li><strong>External level: </strong> Data that users care about</li><li><strong>Conceptual level: </strong>The logical structure of the data that the DBA cares about</li><li><strong>Internal level: </strong>How the data is physically stored in the DB (data structures, algorithms)</li></ul><h3>Derived attributes vs attributes</h3><ul><li>Derived means it is calculated from a stored attribute by applying a formula</li><li>Attributes are actual values stored in the DB</li></ul><h2>DB schema</h2><ul><li><strong>DB schema:</strong> Describes the overall structure of the DB</li><li><strong>DB instance: </strong>Describes the data at a particular moment</li></ul><p>The aim of the schema is to allow all users to access the same DB instance, with customized views depending on which part each of them needs to see.</p><h3>Data independence</h3><p>Upper levels in the DB schema should not be affected by changes in the lower levels.</p><p><strong>Logical data independence: </strong>External schemas (called views) don't change if we change the logical structure of the data.</p><p><strong>Physical data independence: </strong>Conceptual schema doesn't change if we change the internal schema.</p><p><img src="https://ibrecap.com/images/user_images/data independence1610966874.png" alt="" width="456" height="218"></p><h2>Three main phases of Database Design</h2><ul><li><strong>Conceptual design: </strong>Make a high level model of the data<ul><li>This identifies the users' requirements</li><li>Is independent of physical needs</li><li>Gives a fundamental understanding of the 
system</li></ul></li><li><strong>Logical design</strong>: Make a<em> relational data model</em> of the data<ul><li>Use the conceptual design to map out the entities/relationships</li><li>Normalise data</li></ul></li><li><strong>Physical design: </strong>Describe the database implementation. Specify storage structures for optimum performance</li></ul><h1>3. Relational model</h1><p>A relational model represents a DB as a collection of relations and constraints. </p><p>This overlaps heavily with the <strong>entity relationship model </strong>which we'll talk about next, so I'll just skim through this section and use it as an overview.</p><h2>Terminology</h2><ul><li><strong>Relation</strong> - Table with rows and columns which logically stores entity occurrences as tuples</li><li><strong>Attribute </strong> - A named column of a relation with a unique name. It stores properties of entities</li><li><strong>Domain of attribute </strong>- Allowed values</li><li><strong>Tuple</strong> - Row of relation<ul><li>Stores the attribute values for a given entity occurrence</li></ul></li><li><strong>Cell </strong>- Intersection of row and column<ul><li>A specific attribute value for a specific entity occurrence</li></ul></li><li><strong>Degree -</strong> Number of attributes a given relation has<ul><li>Number of properties an entity has</li></ul></li><li><strong>Cardinality - </strong>Number of tuples in a relation<ul><li>Number of entity occurrences an entity has</li></ul></li><li><strong>Normalised -</strong> Means the relation is appropriately structured to reduce redundancy and fit certain rules. These are the normal forms.</li></ul><h2>Properties</h2><ul><li>Relation name is distinct</li><li>Attributes have distinct names</li><li>All values of an attribute come from the same domain</li><li>Each cell has one atomic value</li><li>Each tuple is distinct so no duplicate tuples</li><li>Ordering of attributes and tuples doesn't matter.</li></ul><h1>4. 
The entity relationship model</h1><p> </p><p>This is a graphical description of the DB that lets people without such advanced knowledge understand the DB (aka engineers).</p><ul><li>Set of requirements</li><li>Types of things you want the data to represent</li><li>Attributes of those things</li></ul><h2>Main components</h2><h3>Entity</h3><ul><li>Thing that is of enough concern to be represented separately.</li><li>Represented by rectangles</li><li><strong>Entity occurrence: </strong>One uniquely identifiable occurrence of an entity</li></ul><h3>Relationship</h3><ul><li>Named association between two entity types, which has some meaning in the context of the database</li><li>Represented by labelled line</li><li>Cardinality: how many entity occurrences of an associated entity type are related to a single entity occurrence?<ul><li>One-to-one</li><li>One-to-many</li><li>Many-to-many</li></ul></li><li>Use crow's foot notation to show the cardinality (the labelled line splits off depending on cardinality)</li><li>Optionality and participation<ul><li>If it participates optionally it has <strong>partial participation</strong>, else it has <strong>total participation</strong></li><li>Total participation represented by vertical bar</li><li>Partial participation represented by circle</li></ul></li></ul><p><img src="http://tdan.com/wp-content/uploads/2016/07/stewart06012008_3.gif" alt="Crow's Feet Are Best – TDAN.com"></p><h3>Attribute</h3><p>An attribute is a characteristic shared by all entity occurrences of a particular type.</p><ul><li>Primary keys are underlined</li><li>Attributes represented by labelled ellipses attached to rectangles</li><li>OR all attributes in lower part of entity rectangle.</li><li>Simple attribute has one component, composite attribute has multiple components. E.g. address is composite.</li><li>Derived attribute: calculated by a function from other stored data. 
E.g. age comes from DOB</li></ul><h1>5. Dependencies and Normalisation</h1><h2>Normalisation</h2><p>This is a bottom up strategy: we start with all the data and attributes stored in the tables, and from that we work out how to split them into relations so that we get a well designed relational database.</p><h3>Database redundancy</h3><p>It should have <strong>no redundancy:</strong> every data item is stored in one place.</p><ul><li>This minimises the space required</li><li>Simplifies the maintenance of the database</li><li>If it was stored in two places then every time we updated it we would need to change two elements</li><li>Dependencies between attributes cause redundancy. (Knowing one non-key attribute should not let you work out any other attribute.)</li></ul><h3>Data anomalies Terminology</h3><p><strong>Modification anomaly: </strong>After modifying one copy of a value, the other copies of that value are not changed, so the data becomes inconsistent.</p><p><strong>Deletion anomaly: </strong>Losing other values because you delete one item's data</p><p><strong>Insertion anomaly: </strong>When new data items are added, we are forced to add information about other entities as well.</p><h3>Decomposition</h3><p>To fix the data anomaly problems, remove the dependencies and redundancy by splitting data into multiple tables. 
Data that must stay consistent across every row sharing the same attribute value is stored once, in a separate table.</p><h3>Relational keys</h3><ul><li><strong>Candidate keys:</strong> Minimal (not minimum) set of attributes whose values uniquely identify the tuples</li><li><strong>Primary key:</strong> The candidate key chosen to identify the rows</li><li><strong>Alternate key:</strong> Keys which are not selected as primary but are candidate keys</li><li><strong>Simple key:</strong> Key consisting of only one attribute</li><li><strong>Composite key:</strong> Key consisting of several attributes.</li></ul><h2>Functional data dependencies</h2><p>This describes the relationships between attributes in the same relation.</p><p>Let A and B be two sets of attributes. Then B is <strong>functionally dependent</strong> on A (written A → B) if each value of A is associated with exactly one value of B.</p><ul><li><strong>Determinant</strong>: The set of attributes on the LHS of the functional dependency</li><li><strong>Dependent:</strong> Set of attributes on the RHS of the functional dependency</li><li><strong>Full dependency: </strong>B depends on A and not on any proper subset of A</li><li><strong>Partial dependency: </strong>B depends on A and on at least one proper subset of A</li><li><strong>Transitive dependency: </strong>If B depends on A and C depends on B then C depends on A. 
This is BAD!</li></ul><h3>Closure of a set F of dependencies</h3><p>The closure, denoted F+, is the set of all functional dependencies that are implied by the dependencies in F.</p><p>To compute the closure F+ of F we need some inference rules.</p><p><strong>Armstrong's axioms </strong>(the first three), plus rules derived from them:</p><ul><li>Reflexivity: If B is subset of A then A → B</li><li>Augmentation: If A → B, then A,C → B, C</li><li>Transitivity: If A → B and B → C, then A → C</li><li>Decomposition: If A → B, C then A → B and A → C</li><li>Union: If A → B and A → C, then A → B, C</li><li>Composition: If A → B and C → D, then A, C → B, D</li></ul><h3>Algo for computing the closure F+ of F</h3><ul><li>For every functional dependency f in the set F+ (which starts out equal to F)<ul><li>Apply the rules of reflexivity and augmentation to f</li><li>Add the new functional dependencies to the set F+</li></ul></li><li>For each pair of functional dependencies in F+, if another dependency is implied by the pair using transitivity, then add that dependency to the set F+</li><li>Keep going till F+ does not change any more.</li></ul><h1>6. Normalisation II</h1><h2>Normalisation process</h2><p>This is a multi-stage process, where the result of each stage is called a normal form. At each stage we check whether the specific criteria are satisfied, and if they aren't we have to reorganise.</p><h3>Unnormalized form</h3><p><strong>Repeating group: </strong>Attribute values that repeat. Attributes should only have one value for each cell.</p><ul><li>An unnormalized table can contain one or more repeating groups</li><li>Two rows may be the same</li></ul><h3>First Normal Form</h3><ul><li>Removes all repeating groups by either flattening the table (creates redundancy) or creating a new table to store repeated values.</li><li>1NF gets rid of repeats, which makes it a relational database, but doesn't fix dependencies.</li></ul><h3>Second Normal Form</h3><p>Is in 1NF and there are no <strong>partial</strong> functional dependencies. This means every non-key attribute depends on the whole primary key. 
I.e. you can't work out a non-key attribute from only part of the primary key.</p><p><strong>Whole primary key:</strong> The full set of attributes which together allow you to uniquely identify a tuple</p><p>So, perhaps one subset of the attributes is dependent on just part of the primary key and another subset is dependent on the whole primary key. If this is the case then there is a partial dependency.</p><p>To convert to 2NF we do:</p><ul><li>Remove the partially dependent attributes</li><li>And then place them in a new relation, together with the part of the key they depend on.</li></ul><h3>Third Normal Form</h3><p>Is in 2NF and gets rid of transitive dependencies. This means getting rid of non-key attributes that depend on another non-key attribute.</p><ul><li>Remove the transitively dependent attributes by moving them into new tables.</li></ul><h1>7. Structured Query Language</h1><h2>SQL components</h2><ul><li>SQL - Common database language that is easy to learn</li><li>Data definition language  - Component of SQL that is used to change the database metadata</li><li>Data manipulation language - Component of SQL that is used to change data in the database</li></ul><h2>Writing SQL statements</h2><ul><li>Reserved words - Commands in SQL that must be spelt correctly (these should be capitalised)</li><li>User-defined words - Words made by user</li><li>Case insensitive - You should probably use capitals for reserved words</li></ul><h2>Views</h2><ul><li>Essentially a subset of the main table which isn't stored, but is instead generated using a predefined query. </li><li>Used to make queries simpler.</li><li>Used to restrict access to the database.</li></ul><h1>8. 
Relational Calculus and Algebra</h1><h2>Relational Calculus</h2><ul class="points"><li>Relational calculus is a query language which tells us what data we want but not how to get it.</li></ul><ul><li>It uses the existential quantifier "there exists" and the universal quantifier "for all".</li><li>We give it free variables and bound variables and expect an output which matches the requirements.</li></ul><h2>Relational Algebra</h2><p>With this we describe the data by the operations applied to get it.</p><p>There are 6 basic operations:</p><ul><li><strong>Selection</strong> (σ) - Selects the subset of a relation that matches a predicate.</li><li><strong>Projection</strong> (<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>π</mi></math>) - Returns subset of relation with <strong>specified attribute columns</strong> and removes duplicates.</li><li><strong>Rename </strong>- Returns the relation with renamed attributes.</li><li><strong>Union</strong> - Combines the tuples of two relations which are union-compatible (same number of attributes and corresponding attributes have same datatype) </li><li><strong>Set difference</strong> - Removes elements in one relation that are present in the other relation (must be union-compatible)</li><li><strong>Cartesian product</strong> - Returns relation that is a concatenation of every tuple in R with every tuple in S (no compatibility needed!)</li></ul><p>Derived operations:</p><ul><li><strong>Intersection </strong> - Returns elements that are present in both relations (must be union-compatible)</li><li><strong>Division - </strong>R(x,y) div S(y) gives all distinct values of x from R that are associated with all values of y in S.</li><li><strong>Join</strong> - Joins columns of two tables according to a predicate</li></ul><h1>9. Relational algebra in depth</h1><h2>Selection (σ)</h2><p><strong><math 
xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>σ</mi><mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>d</mi><mi>i</mi><mi>c</mi><mi>a</mi><mi>t</mi><mi>e</mi></mrow></msub><mo>(</mo><mi>R</mi><mo>)</mo></math></strong></p><ul><li>This selects part of a relation according to the predicate. </li><li>Returns a "horizontal slice" of the relation which matches the conditions (e.g. all rows WHERE x=1).</li><li>It cannot add/remove columns</li></ul><h2>Projection (<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>π</mi></math>)</h2><p><strong><math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>π</mi><mrow><mi>c</mi><mi>o</mi><mi>l</mi><mo>-</mo><mn>1</mn><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><mi>c</mi><mi>o</mi><mi>l</mi><mo>-</mo><mi>n</mi></mrow></msub><mo>(</mo><mi>R</mi><mo>)</mo></math></strong></p><ul><li>This takes the data and returns only the specified columns of that data.</li><li>Like a "vertical slice"</li></ul><h2>Union, Set difference, intersection</h2><p>These are all the same as their set counterparts (except they require compatibility) so let's go quickly through them.</p><h3>Union</h3><p><strong><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>R</mi><mo>∪</mo><mi>S</mi></mrow><annotation encoding="LaTeX">R \cup S </annotation></semantics></math></strong></p><ul><li>Outputs the union of two relations</li><li>The result will not contain duplicates, even if the same tuples appeared in both relations.</li></ul><h3>Set difference</h3><p><strong><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>R</mi><mo>-</mo><mi>S</mi></mrow><annotation encoding="LaTeX">R - S </annotation></semantics></math></strong></p><ul><li>Returns the rows of R that do not also appear in S, i.e. removes from R any rows it has in common with S.</li></ul><h3>Intersection</h3><p><strong><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>R</mi><mo>∩</mo><mi>S</mi></math></strong></p><ul><li>Returns only the tuples which are common to both relations</li></ul><h3>Compatibility of schemas</h3><p>In order for 
union, set difference or intersection to work, the schemas must match.</p><p>So they need the same number of attributes, and corresponding attributes must have the same domain. <strong>Note:</strong> not same name, just same domain.</p><h2>Cartesian product</h2><p><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>R</mi><mo>×</mo><mi>S</mi></math></p><ul><li>Returns the concatenation of each row in R with each row in S. </li><li>Order does not matter since it returns ALL possible combinations.</li><li>If two attributes have the same name they are prefixed with their relation name.</li></ul><h2>Rename</h2><p><math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>ϱ</mi><mi>X</mi></msub><mo>(</mo><mi>R</mi><mo>)</mo></math></p><ul><li>This renames a relation to something else.</li><li>Can be used to create a copy of a relation when doing a join</li><li>Can also be used to simplify queries so that we don't need to use the full relation name every time we reference it.</li></ul><h2>Join</h2><p>Joins are cartesian products plus a selection.</p><p><strong>Equijoin: </strong>This is a type of join where the selection only uses the equality operator.</p><h3>Natural join (<math xmlns="http://www.w3.org/1998/Math/MathML"><mo>⋈</mo></math>) </h3><p>$$R \bowtie S = \pi_{C_1, \dots, C_n} \left( \sigma_{R.A_1 = S.A_1 \land \dots \land R.A_k = S.A_k} (R \times S) \right)$$</p><p>This is an <strong>equijoin </strong>between the relations R and S, over all their attributes that have the same name, with the duplicate columns removed.</p><p>Steps to do a natural join:</p><ol><li><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>R</mi><mo>×</mo><mi>S</mi></mrow><annotation encoding="LaTeX">R \times S</annotation></semantics></math> (Take the cartesian product of R and S)</li><li>For each attribute A with the same name in both R and S, select all rows where R.A = S.A</li><li>Remove the duplicate columns using a projection.</li></ol><h3>Outer join</h3><ul><li>Left outer join: R ⟕ S</li><li>Right outer join: R ⟖ 
S</li></ul><p>This computes the natural join of R and S, and then adds the tuples which did not match between R and S, padded with null values.</p><p>Left outer join: Adds tuples from R which did not match those in S.</p><p>Right outer join: Adds tuples from S which did not match those in R.</p><p>Full outer join: Adds tuples from either which didn't match.</p><h3>Semi join</h3><p>The natural join between R and S, but keeping only the attributes of R.</p><h1 id="anonymous_element_24">10. Multi-User Architectures</h1><h2>Teleprocessing architecture</h2><ul><li>The traditional architecture.</li><li>Many terminals connected to the central computer</li><li>Terminal sends messages to the central computer</li><li>All data processing done in the central computer</li></ul><h2>File-server architecture</h2><ul><li>Processing distributed around a computer network</li><li>Every workstation has its own DBMS and its own user application</li><li>The workstation requests the files it needs from the file server, which acts like a shared hard disk</li><li>This requires sending whole tables from the file server to the user's workstation, which causes a large amount of network traffic. </li><li>A full DBMS must be stored on each workstation, and concurrency/recovery is more difficult.</li></ul><h2>Client server architecture</h2><ul><li>Client requires a resource and the server provides the resource.</li></ul><h3>Two tier architecture</h3><ul><li>Client does presentation of data</li><li>Server supplies data services to the users.</li><li>User gives a request to the client, which generates the SQL query and sends it to the server. Then the server accepts it and sends the result to the client. 
Then the client formats the result for the user.</li><li>Has increased performance and reduced costs, including reduced communication costs.</li></ul><h3>Three tier architecture</h3><ul><li>Tier 1 - user interface</li><li>Tier 2 - application server</li><li>Tier 3 - database server</li></ul><p>This is essentially how a web application works, with the browser as tier 1.</p><h2>Distributed DBMS</h2><p>This has a network of computers, each holding part of the database, which mirrors the organisational structure.</p><p>Aims to:</p><ul><li>Make all data accessible to all units</li><li>Store the data close to the location where it's used most.</li></ul><p>The distributed DBMS is the software system that manages the distributed database and makes the distribution transparent to the users.</p><ul><li>The database is split into fragments</li><li>Each fragment is stored on one or more computers under the control of a separate DBMS.</li><li>The computers are linked by a communications network</li><li>Sites have local autonomy.</li><li>Sites have access to global applications</li><li>Not all sites have local applications but all sites have access to global applications.</li></ul><h3>Distributed processing vs distributed DBMS</h3><p>Distributed processing is a centralised database that is accessed over a computer network, whereas a distributed DBMS has multiple DB fragments distributed across sites.</p><h2>Design of distributed DBMS </h2><ul><li>Fragmentation - How to break a relation into fragments</li><li>Allocation - How to allocate the fragments optimally.</li><li>Replication - Which fragments are stored at multiple sites (redundantly duplicated data)</li></ul><p>How to do this depends on <strong>quantitative</strong><strong> information </strong>and <strong>qualitative information</strong></p><p> </p>
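<p>To make fragmentation, allocation and replication a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the <code>Staff(staff_id, name, branch)</code> relation, the site names and the helper functions <code>fragment</code> and <code>reconstruct</code> are all made up for this example and are not part of any real DBMS. Each fragment is built with a selection on the rows (often called horizontal fragmentation, a term not used above), and the original relation is recovered as the union of the fragments.</p>

```python
def fragment(relation, predicates):
    """Split a relation into named fragments, one selection per predicate."""
    return {name: {row for row in relation if pred(row)}
            for name, pred in predicates.items()}


def reconstruct(fragments):
    """Rebuild the relation as the union of all its fragments."""
    result = set()
    for rows in fragments.values():
        result |= rows
    return result


# Hypothetical Staff(staff_id, name, branch) relation, stored as a set of tuples.
staff = {
    (1, "Ann", "London"),
    (2, "Bob", "Leeds"),
    (3, "Cat", "London"),
}

# Fragmentation: the two predicates cover every row exactly once.
fragments = fragment(staff, {
    "staff_london": lambda row: row[2] == "London",
    "staff_other": lambda row: row[2] != "London",
})

# Allocation: which (made-up) site stores which fragments.
# Replication: staff_london is deliberately stored at two sites.
allocation = {
    "site_london": ["staff_london"],
    "site_leeds": ["staff_other", "staff_london"],
}

# A fragment is replicated if more than one site holds it.
replicated = sorted(
    name for name in fragments
    if sum(name in held for held in allocation.values()) > 1
)

print(reconstruct(fragments) == staff)
print(replicated)
```

<p>The design point is that the predicates must cover every row, otherwise the union of the fragments would lose data. Deciding which site gets which fragment is the part that depends on how the data is actually used.</p>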