LM14 – Database (DBMS)


First note that when I speak about DBMS I am assuming it is a Relational Database Management System or RDBMS but often we omit the “R” and just write DBMS.  Also, texts will present other models (e.g. flat file, hierarchical, network, object oriented) but the pervasive model is the RDBMS so I will limit my coverage to RDBMS.  Ok, so r
eal quick, although nothing I do is real quick  🙂 … you will find a complete DBMS text in this menu system and while much of it is beyond the scope of this course I recommend you return to this resource in your upper level DBMS and IS Analysis & Design courses.

As previously presented DBMS are ubiquitous.  DBMS form the critical foundation of Information Systems (individual business systems to research libraries to our Worldwide financial management).  The WWW has become a dynamic entity (built on an n-tier architecture) where content is dynamically maintained on and extracted from a back-end DBMS and processed and formated for Web presentation.  Consider that HTML5 even facilitates/incorporates the implementation of client-side databases to improve personalization, experience and efficiency.  DBMS are also used in networking and system administration to manage users and resources so they are pervasive in all of computing.

First, why do we need DBMS

Today, the success of organizations depends on their ability to:

1. Acquire and maintain accurate timely data about their operations

2. Manage data effectively

3. Use data to analyze and guide their activities

=> We just defined IS

Organizations need flexible data management systems to:

1. Get the most from large complex data sets

2. Simplify the tasks of managing this data

3. Extract useful information in a timely fashion

The alternative to DBMS is to use ad hoc approaches – i.e. applications that cannot carry from one domain to another (Vertical application).  As an example we could store the data and files and write application-specific code to manage it but this is not flexible, scalable or reusable (i.e. we need a Horizontal application).  To illustrate this let’s explore the difference.

File System vs. DBMS

To understand the need for DBMS let’s consider the following scenarios and apply the CS concepts we have learned this semester.

1. Look at the following flat file data table where orders are taken and a sequential order number is generated.  First, this should make you realize why when you call into customer service they typically ask you for your order number first.

OrderNumberCustomerIDLastNameFirstNamePhoneEmail
100135LoobyJames228-9867j@jl.com
100237LabAddie376-0012a@al.com
100339PyranneesSam987-6543s@sp.com
100435LoobyJames387-8391j@jl.com
100547BlackThunder499-8876b@bt.com

We can sort the table based on any one column but then every other column is randomly ordered and this is the way it will be saved in storage.

From the LM on storage – what type of search do we need to perform if the data is randomly ordered?

From the LM on OS – where does all processing take place.  What does this mean for a very large data set.

Putting these 2 items together, could Amazon get away with this basic data flat file approach?

2.  Now, lets understand ACID – Atomicity, Consistency, Isolation & Durability: http://www.ciss100.com/wp-content/uploads/dbms/ACID.pdf

3. Lastly, let’s look at some additional shortcomings of a flat file approach:  http://www.ciss100.com/wp-content/uploads/dbms/Customers.pdf

Let’s put it all together so please consider the following:  a company has large collection (1 Terabyte) of data on employees, departments, products, sales and so on and 1 TB may be small when we consider product literature and multimedia, customer service information (e.g. think about “this call may be recorded for quality…”, ) and the emergent Big Data.

Employees need to access the data concurrently (Question: what happens to a file if 2 people open it – Answer: typically one person gets a read only copy so full concurrent functionality is not realized).

Questions about data must be answered quickly.  We can easily store data in a collection of operating system files but we probably do not have 1 TB of main memory to hold data – recall data must be in memory to be accessed and acted upon (i.e. you cannot modify something on disk as it must be brought into memory to be accessed and then written back to disk if modified), therefore we must store data in storage device and bring relevant parts into main memory for processing as needed.  Also, recall a file containing records can only be sorted on a single field so as an example, if the records are sorted by last name, the SSNs, addresses, phone numbers, etc are all in random order.  The result, if you wish to search for a particular SSN, you are relegated to transferring data from the hard drive sequentially and then performing a sequential search on all records to find the SSN.

Changes made to the data by different users must be applied consistently (ACID demonstration above).   A DBMS must provide for atomic transaction execution in isolation and the data in memory must be written to persistent storage beyond what the OS typically performs.

Access to certain parts of data (e.g. salaries) must be restricted.  The OS cannot restrict access on the field level in a  file

We need a program to identify and address items and we must write special programs to answer each question that users ask about data.  We must protect data from inconsistent changes made by concurrent data access and this concurrency must be programmed for each new program.  We must also ensure data is restored to a consistent state if the system crashes while changes are being made.

Operating systems provide only password mechanism security and therefore is not sufficiently flexible to enforce security problems where users have access to different subsets of data.

Data redundancy and inconsistency <- duplication of data often results in inconsistencies

Data isolation <- data scattered in various files in different formats

Integrity problems <- consistency constraints difficult to add afterwards

Result => The dedicated resulting program is likely to be highly complex and inflexible.

DBMS Description

A DBMS is a collection of interrelated data (Information) and a set of programs to access this data.  A DBMS typically describes the activities of one or more related organizations and has 2 components:

1. Data commonly referred to as the database

2. Software designed to assist with maintenance and utilization is Management System.  This includes the data description language (DDL) and data manipulation language (DML)

 

DBMS Abstraction

The gap between the way users think of their data the way data is ultimately stored => hides the storage and access details.

3 layers of abstraction supported by the DBMS

1. External schema – Users perception & data access.

2. Conceptual schema – Designed by DBA => normalized data (DB has single conceptual schema)

3. Physical schema – Physical storage & access (DB has single physical schema)

External Schema – View Level

Allows data access to be customized (and authorized) at level of individual users or groups.  Forms, reports and queries for many different users both inside and outside the organization.

Any given database has exactly one conceptual schema and 1 physical schema

It has just one set of stored relations (a relation is a stored table)

May have many external schemas each tailored to particular group of users

External schema design guided by end user requirements

Example – Reports allow students to view course schedules by times, modality (e.g. on-campus vs. distance learning), by faculty name or even by campus for large institutions.  Forms allow students to register for courses but both reports and forms are based on queries.

Logical or Conceptual Schema

Conceptual Database Design – choice of relations and choice of fields for these relations not always obvious.  The process of arriving at good conceptual schema requires business analysis and DB normalization.

Example – In a very basic sample university database – relations (tables) contain information about:

Entities such as students, faculty, courses, classes, classrooms, etc.

Relationships such as student enrollments in courses

All student entities described using records and students relation

Each collection of entities and collection of relationships described as a relation

Consider the following conceptual schema for the above business example:

Students(SID: string, LastName: string, FirstName: string, DOB: date, GPA: real)

Faculty(FID:string,LastName: string, FirstName:string, Salary: real)

Courses(CID: string, CourseName: string, Credits: integer)

Enrolled(SID: string, CID: string, Grade: string)

Teaches(FID:string, CID: string)

DBMS

In the diagram above we can see the primary key/foreign key relationships. Also note the relations or tables exist outside the parenthesis whereas the attributes (domains, fields or columns) have a type or domain associated with them.  Note that we can disambiguate identical field names/domain names by fully qualifying the table and domain name (e.g. Faculty.LastName and Student.LastName).

Physical Schema – Specifies storage details – stored on secondary storage devices

File organizations (storage schema)

Auxiliary data structures – e.g. indexes to speed data retrieval operations

Example – In University DB above, it beneficial to search on courses, students, courses, instructor, etc.

DBMS Structure

A DBMS is partitioned into modules much like an operating system and in fact, it needs OS functionality to access and manage the data effectively.  Each module deals with specific responsibilities of overall systems

Unsophisticated users (e.g. customers travel agents)

Sophisticated users (DB administrators programmers)

Web forms

Application front ends

SQL interface

Plan executor – parser operator evaluator optimizer (query eval engine)

Transaction manager files and access methods recovery manager

Lock manager

Buffer manager

Disk space manager

Index files

System catalog

Data files

 

Advantages of DBMS

Data Independence – Application programs should be independent from details of data representation and storage

Efficient Data Access – DBMS utilizes a variety of sophisticated techniques to store and retrieve data efficiently.  This is especially important for data is stored on external storage devices.  Recall you can only sort records based on a single field or attribute.  If you sort records based on last name, they are unsorted on SSNs and vice versa.  To find a particular record based on SSN you would have to search the 1 TB sequentially.  Now recall this means bringing  bringing every hard drive cluster/sector into memory to read every record sequential.

Data Integrity & Security – Data always accessed through DBMS can enforce integrity constraints on data.  As an example, before inserting salary information for employee a DBMS checks that the department budget is not exceeded.  Equally important, a DBMS can enforce access controls that governs when/what data is visible to different classes of users (e.g. restrict salary information to HR and directors).

Data Administration – Data redundancy and inconsistency can be improved by centralizing administration. Experienced professionals (this means you) understand the nature of the data being managed that includes:

1. How different groups use it

2. Organize data to minimize redundancy

3. Fine-tune storage and retrieval

Concurrent Access and Crash Recovery – Atomicity or indivisibility. DBMS schedules concurrent accesses the data.  Users perceive that data is accessed by one user at a time and protects and recovers from failure.

Reduced Application Development Time – High-level interface to data facilitates quick development of applications.  These applications likely to be more robust than if developed from scratch.

 

DBMS Disadvantages

Is there ever reason not to use a DBMS -> a few

1. Applications with tight real-time constraints

2. Applications with just a few well-defined critical operations requiring efficient code

 

Terminology and Definitions

Entities (or tuples) such as students and faculty courses and classrooms

Relationships between entities such as student’s enrollment in courses, use of rooms etc.

Instance – snapshot of database

Example Instance

Sid name login age GPA

5322 Jones HYPERLINK “mailto:Jones@CS” Jones@CS 19 3.4

3344 Smith HYPERLINK “mailto:Smith@cs” Smith@cs 22 2.9

**note date of birth more correct in age since age changes or is computed**

Each row in the students relation is a record or tuple that uniquely describes a student.  In this sense the schema may be regarded as a template for describing a student

Integrity Constraints 

We can make the data description more precise by specifying integrity constraints (e.g. conditions that the records in relation must satisfy)

Example: Every student must have unique sid.  We cannot capture this information by simply adding another field to students schema and DBMS provides ability to specify uniqueness of values in a field.  As a result a DBMS increases the accuracy we may define in our data

 Database Languages

Data Definition Language (DDL) – File contains metadata – data about data

Data Manipulation Language (DML) – DBMS allows users to create modify and query data (SQL)

Transaction Management

Example: Consider database that holds information about airline reservations

At any instant – possible several agents looking up information about available seats and making new reservations.  When several users access a database concurrently DBS must order their requests to avoid conflicts.  DBMS must protect users from effects of system failures.  Ensure all data restored to consistent state when system restarted after crash

Transaction

Any one execution of a user program in a DBMS and seen as basic unit of change. Partial transactions are not allowed and a group of transactions equivalent to serial execution of all transactions

Concurrent execution (OLTP or Multiple Access)

DBMS must schedule concurrent accesses to data allowing each user must be able to ignore the fact others are accessing data

Example (demonstrated in lecture capture above) – Program that deposits cash into account is submitted to DBMS.  At same time another program debits money from same account.  Must serialize the transactions e.g. not interleaved

Locking Protocol – Set of rules to be followed by each transaction enforced by DBMS

Incomplete Transactions and System Crashes

Transactions may be interrupted for running to completion for a variety of reasons.  DBMS must ensure changes made by incomplete transactions removed from database

Example – If DBMS is in the middle of transferring money for account a to account b and account a has been debited however the system crashes before account b is incremented => must restore system

=> System Log records all writes to the database used to bring database to consistent state after system crash to reperform/commit or all transactions

To reduce this time periodically forcing information to disk – this is called a checkpoint

 

Textbook Chapter 14 Presentation w/MS Access Application

 

Lastly, you will here me use specific terminology and cite DBMS theory so here is a nice free RDBMS text based on Edgar Codd’s work:

http://www.gitta.info/LogicModelin/en/text/LogicModelin.pdf

In the subsequent pages of this sub-menu structure you will find the core of DBMS theory and application presented in APA format (to additionally provide you with APA reference papers) distilled from the following references.

Leave a Reply