Chapter 8, Databases
What is a database? 
  - a collection of information stored in an organized form on a computer ("electronic filing cabinet") 
  
- methods for organizing, storing, retrieving, communicating and managing 
  information 
  What are databases used for -- what would your life be like without databases?  Have you... 
  - ...called anyone on your cell phone recently? 
  
- ...send someone a text message recently?
  
- ...logged into your Otterbein account recently?
  
- ...logged into Facebook recently?
  
- ...used an ATM recently? 
  
- ...used a credit card recently? 
  
- ...had your Cardinal Card scanned recently? 
  
- ...have you purchased anything online recently? 
  
- ...have you flown on a commercial airline recently? 
If you can answer yes to ANY of these 
questions, you have interacted with databases that have information about you!Basic concepts and terminology 
  - database is a collection of tables, typically stored as a file 
	
- table is one 2-d grid of related information called records
- record is the information relating to one person, other entity, or 
  event. 
  
- field is a discrete piece of information across the records 
      
- each field has a specfic type and format: numeric, textual, date, picture, 
        computed (like spreadsheet formulas) 
- key field is a field designated as having a unique value for each record 
  
- Database analogy to spreadsheets
    - A database is like a workbook
- A table is like a worksheet
- A record is like a row
- A field is like a column
- A field in one specific record is like a cell
- The analogy does not go far, as you will soon see
Why Spreadsheets are not enough
Q: If databases are so much like spreadsheets, why not just use spreadsheets?
A: When the collection of information is complex, you run into problems with
maintaining consistency and storing redundant information.
Example: Information for course registration, spreadsheet style
| Name | Email | Section | Room | 
|---|
| Mary Lamb | ml@oc.edu | 1040 | Towers 114 | 
| Mary Lamb | ml@oc.edu | 2915 | Roush 213 | 
| Mary Lamb | ml@oc.edu | 3210 | Towers 110 | 
| Joe Kline | jk@oc.edu | 2915 | Roush 213 | 
| Joe Kline | jk@oc.edu | 1040 | Towers 114 | 
| Joe Kline | jk@oc.edu | 5678 | Battelle 210 | 
| . . . | . . . | . . . | . . . | 
Each row represents one student registered for one section.
In reality, more data are required for each registration but this will illustrate.
- What if Mary gets married and changes her name?  All rows containing her name have 
to be found and changed.  If you miss even one, the data are inconsistent
- What if a section changes rooms?  All rows for that section have to be found
and changed. If you miss even one, the data are inconsistent
- Notice that Mary's and Joe's names are saved multiple times, their email addresses
are saved multiple times, the room assignments are saved multiple times.  This wastes space.
Databases, and relational databases in particular, will address these problems.  See modified example below.
Relational databases 
  - A database may consist of many related tables.  
    
- Consider a student database.  There is a table containing transcript information, another table containing 
      financial aid information, another table containing class rosters. 
    
- Some information is required by all three: student name, ID, class rank, 
  and major. 
    
- If this common info is replicated in all three, then any change to it 
      (say, student changes major) requires all three tables to be updated. 
    
- In complex systems, this becomes very error-prone. 
    
- Solution: place common information into a separate table, where other 
      tables can point to it.  Change only needs to be made in one place! 
    
- Since tables point to each other, we say they are related and call it 
      a relational database. 
    
- In this example, "pointing to" is accomplished by storing the student's 
      ID in all the tables. 
Example: Reconfigure the registration information above into a relational database
  
  
  | 
   | Students | 
|---|
 | ID | Name | Email | 
|---|
 | 109999 | Mary Lamb | ml@oc.edu |  | 108888 | Joe Kline | jk@oc.edu |  | . . . | . . . | . . . |  |  | 
| Registrations | 
|---|
 | ID | Section | 
|---|
 | 109999 | 1040 |  | 109999 | 2915 |  | 109999 | 3210 |  | 108888 | 2915 |  | 108888 | 1040 |  | 108888 | 5678 |  | . . . | . . . |  |  | 
| Sections | 
|---|
 | Section | Room | 
|---|
 | 1040 | Towers 114 |  | 2915 | Roush 213 |  | 3210 | Towers 110 |  | 5678 | Battelle 210 |  | . . . | . . . |  | 
  
There are now three tables.  The ID field is key field for the Students table
and the Section field is key field for the Sections table.  Each row in the Registrations
table represents one student registered for one section.  
Given a record in the 
Registrations table: 
- To get the student's name
and email, just use the ID field value to search the Students table and find it!
- To get
the section's room number, just use the Section field value to search the Sections table
and find it!
- The Registrations table is related to the Students table through the ID field
- The Registrations table is related to the Sections table through the Section field
If the student name changes, only one record needs to be changed!  If the room
assignment changes, only one record needs to be changed! Data are consistent because
there is only one copy.  And the duplication problem also disappears!
Ordering Records   
  
    - for student records, key value used to be social security number, now unique ID number assigned
    
- each record must have different value in key field 
    
- the database is ordered (sorted) based on key field value, to make 
    searches faster 
		  
			- If records are not ordered, must perform sequential search, looking at each record one by one
- If records are ordered, can perform binary search: check the middle record then determine if
			desired record is in first half or second half.  Then repeat on just that half.  Keep repeating until
			you find it or discover it is not there.
			
- If table has 1 million records, sequential search can involve looking at all 1 million.  Binary search
			involves at most 20!
			
- (math note: 20 is approximately the logarithm base 2 of 1 million, in other words, 1 million is about 220)
			
 
- imagine if Columbus White Pages listed names randomly instead of 
    alphabetically? 
    
- imagine if Columbus White Pages were listed in order based on phone 
    number instead of name? 
    
- interesting fact:  Iceland (population about 300,000) has one 
    telephone book, and people are listed alphabetically by their first 
    name.  Their last name is formed by taking their father's first name 
    and appending "son" if a boy and "dottir" if a girl.  Women do not 
    change names upon marriage. 
Finding data
 
  - browsing is a method for retrieving information from a very small 
  database - just peruse it. 
  
- query is a method for asking for specific information from 
  database. 
  
    - every time you submit a search request to a web search engine, you are 
    making a query 
    
- there are many query techniques and languages, more on this later 
  
 
- data mining refers to sophisticated new techniques for discovering 
  patterns and trends in huge databases (patterns may be unknown in advance and
	are not part of the database structure)
 
Search Engine Queries
  - You explored this topic in Project 1.
- Every Web search engine has an "advanced search" feature.
- Fine-tuning your search results in fewer but better results. 
  
- All allow the use of Boolean logic. 
  
    - join two terms with AND to get hits only if both terms appear (most 
    search engines insert an implied AND between terms) 
    
- join two terms with OR to gets hits if either or both appear 
    
- these can be combined, e.g.  database AND (query OR 
    browse) 
 
- use of quotes "" around a phrase to match the exact phrase, e.g.  "database query techniques"
  
- use of asterisk * to match character sequence within a word.  e.g. 
  trai* matches any word starting with trai, such as trail, 
  trailer, train, training, trainer, trainee, etc. 
  
- ability to limit search to a specifc server or web site -- Google feature 
  that allows you to specify search term(s) then "site:" followed by site to 
  search.  So to search for the term "registration" only at otterbein's 
  site the search would be:  registration  
  site:www.otterbein.edu 
  
- How can web searches be completed so quickly???  Exact details are 
  tightly held secrets, but basically the search sites have software that 
  automatically "surfs" from page to page and stores words from each page into 
  highly-indexed databases. 
SQL: a language for forming database queries 
  - Structured Query Language, pronounced "S-Q-L" or "sequel" 
  
- has become the lingua franca of database management products (lingua franca: 
  a common language used by speakers of different languages) 
  
- SQL contains a few specific commands for filtering information contained 
  in a database, yet queries are somewhat English-like in appearance 
  
    - SELECT command specifies which fields to display from matching records 
    
- FROM command specfies which database table to search 
    
- WHERE command specifies search criteria to filter records from the 
    database 
 
- think of database as mathematical set of records and the WHERE command 
  defines subsets whose intersections (AND) and unions (OR) determine matches 
- Some SQL examples for database table StudentList that has fields called FirstName, LastName, Age, Credits :
	- SELECT FirstName
 FROM StudentList
 WHERE LastName = "Jones"
- SELECT FirstName, LastName
 FROM StudentList
 WHERE (Age > 20 AND Credits < 12)
Microsoft Access as illustration 
    - Access is the DBMS (DataBase Management System) part of Microsoft Office 
      suite. 
    
- Widely used for "small" databases and easy to use 
  
- the entire database is stored in one file, identified by ".accdb" (Access 2007) or ".mdb" (Access 1997-2003)
  filename extension 
  
- the database consists of 7 different types of objects: 
  
    - a table is one grid of information.  Each row is a record, 
    each column is a field. 
    
- a query is a stored query specification (e.g. an SQL command). 
    
- a form is a nicely structured data entry form so you can enter 
    data more easily (instead of directly into the grid) 
    
- a report is a nicely structured data display form so you can view 
    data more easily 
    
- a page is a web page which can be generated from database 
    contents 
    
- a macro is a stored command sequence, so you can easily automate 
    repetitious tasks 
    
- a module is a computer program you can write using Visual Basic 
    language to perform special processing. 
 
Here is a view of 
the Northwind database, with the list of tables shown. 
 
You can get an idea of the nature of Northwind just by reading the names of 
the various tables.  This is a relational database, and the relationships 
are shown in this illustration.  The list of identifiers in each box is the 
list of fields in that table, and the lines represent relationships. 
 
 
  
  
Databases and privacy 
The textbook briefly describes a plethora of database-related privacy 
horror-stores, most caused by criminals creating or stealing an "identity" or 
mistakes made by credit bureaus. 
What are some of the privacy concerns related with databases?
Did you know that the U.S. Constitution does not explicitly address the right 
to privacy? 
Do you feel the greater privacy threat is government, or the private sector? 
Sometimes it is possible to uniquely identify you and reveal information 
about you by combining different pieces of information gleaned from multiple 
unrelated databases.  This is called record matching.  
Geneologists have long depended on this technique to learn about their 
ancestors.  The government depends on this technique to identify potential terrorists. 
Sometimes even summaries of anonymous survey results can uniquely identify 
you.  For example, my previous university compiles and publishes an annual 
report based on faculty survey information.  One of the summary categories 
is average salary by age group (20-29, 30-39, ..., 70-above).  There is one 
professor in the "70 - above" age group.  The survey is not anonymous for 
him.