Chapter 8, Databases
What is a database?
- a collection of information stored in an organized form on a computer ("electronic filing cabinet")
- methods for organizing, storing, retrieving, communicating and managing
information
What are databases used for -- what would your life be like without databases? Have you...
- ...called anyone on your cell phone recently?
- ...send someone a text message recently?
- ...logged into your Otterbein account recently?
- ...logged into Facebook recently?
- ...used an ATM recently?
- ...used a credit card recently?
- ...had your Cardinal Card scanned recently?
- ...have you purchased anything online recently?
- ...have you flown on a commercial airline recently?
If you can answer yes to ANY of these
questions, you have interacted with databases that have information about you!
Basic concepts and terminology
- database is a collection of tables, typically stored as a file
- table is one 2-d grid of related information called records
- record is the information relating to one person, other entity, or
event.
- field is a discrete piece of information across the records
- each field has a specfic type and format: numeric, textual, date, picture,
computed (like spreadsheet formulas)
- key field is a field designated as having a unique value for each record
- Database analogy to spreadsheets
- A database is like a workbook
- A table is like a worksheet
- A record is like a row
- A field is like a column
- A field in one specific record is like a cell
- The analogy does not go far, as you will soon see
Why Spreadsheets are not enough
Q: If databases are so much like spreadsheets, why not just use spreadsheets?
A: When the collection of information is complex, you run into problems with
maintaining consistency and storing redundant information.
Example: Information for course registration, spreadsheet style
Name | Email | Section | Room |
Mary Lamb | ml@oc.edu | 1040 | Towers 114 |
Mary Lamb | ml@oc.edu | 2915 | Roush 213 |
Mary Lamb | ml@oc.edu | 3210 | Towers 110 |
Joe Kline | jk@oc.edu | 2915 | Roush 213 |
Joe Kline | jk@oc.edu | 1040 | Towers 114 |
Joe Kline | jk@oc.edu | 5678 | Battelle 210 |
. . . | . . . | . . . | . . . |
Each row represents one student registered for one section.
In reality, more data are required for each registration but this will illustrate.
- What if Mary gets married and changes her name? All rows containing her name have
to be found and changed. If you miss even one, the data are inconsistent
- What if a section changes rooms? All rows for that section have to be found
and changed. If you miss even one, the data are inconsistent
- Notice that Mary's and Joe's names are saved multiple times, their email addresses
are saved multiple times, the room assignments are saved multiple times. This wastes space.
Databases, and relational databases in particular, will address these problems. See modified example below.
Relational databases
- A database may consist of many related tables.
- Consider a student database. There is a table containing transcript information, another table containing
financial aid information, another table containing class rosters.
- Some information is required by all three: student name, ID, class rank,
and major.
- If this common info is replicated in all three, then any change to it
(say, student changes major) requires all three tables to be updated.
- In complex systems, this becomes very error-prone.
- Solution: place common information into a separate table, where other
tables can point to it. Change only needs to be made in one place!
- Since tables point to each other, we say they are related and call it
a relational database.
- In this example, "pointing to" is accomplished by storing the student's
ID in all the tables.
Example: Reconfigure the registration information above into a relational database
Students |
ID | Name | Email |
109999 | Mary Lamb | ml@oc.edu |
108888 | Joe Kline | jk@oc.edu |
. . . | . . . | . . . |
|
|
Registrations |
ID | Section |
109999 | 1040 |
109999 | 2915 |
109999 | 3210 |
108888 | 2915 |
108888 | 1040 |
108888 | 5678 |
. . . | . . . |
|
|
Sections |
Section | Room |
1040 | Towers 114 |
2915 | Roush 213 |
3210 | Towers 110 |
5678 | Battelle 210 |
. . . | . . . |
|
There are now three tables. The ID field is key field for the Students table
and the Section field is key field for the Sections table. Each row in the Registrations
table represents one student registered for one section.
Given a record in the
Registrations table:
- To get the student's name
and email, just use the ID field value to search the Students table and find it!
- To get
the section's room number, just use the Section field value to search the Sections table
and find it!
- The Registrations table is related to the Students table through the ID field
- The Registrations table is related to the Sections table through the Section field
If the student name changes, only one record needs to be changed! If the room
assignment changes, only one record needs to be changed! Data are consistent because
there is only one copy. And the duplication problem also disappears!
Ordering Records
- for student records, key value used to be social security number, now unique ID number assigned
- each record must have different value in key field
- the database is ordered (sorted) based on key field value, to make
searches faster
- If records are not ordered, must perform sequential search, looking at each record one by one
- If records are ordered, can perform binary search: check the middle record then determine if
desired record is in first half or second half. Then repeat on just that half. Keep repeating until
you find it or discover it is not there.
- If table has 1 million records, sequential search can involve looking at all 1 million. Binary search
involves at most 20!
- (math note: 20 is approximately the logarithm base 2 of 1 million, in other words, 1 million is about 220)
- imagine if Columbus White Pages listed names randomly instead of
alphabetically?
- imagine if Columbus White Pages were listed in order based on phone
number instead of name?
- interesting fact: Iceland (population about 300,000) has one
telephone book, and people are listed alphabetically by their first
name. Their last name is formed by taking their father's first name
and appending "son" if a boy and "dottir" if a girl. Women do not
change names upon marriage.
Finding data
- browsing is a method for retrieving information from a very small
database - just peruse it.
- query is a method for asking for specific information from
database.
- every time you submit a search request to a web search engine, you are
making a query
- there are many query techniques and languages, more on this later
- data mining refers to sophisticated new techniques for discovering
patterns and trends in huge databases (patterns may be unknown in advance and
are not part of the database structure)
Search Engine Queries
- You explored this topic in Project 1.
- Every Web search engine has an "advanced search" feature.
- Fine-tuning your search results in fewer but better results.
- All allow the use of Boolean logic.
- join two terms with AND to get hits only if both terms appear (most
search engines insert an implied AND between terms)
- join two terms with OR to gets hits if either or both appear
- these can be combined, e.g. database AND (query OR
browse)
- use of quotes "" around a phrase to match the exact phrase, e.g. "database query techniques"
- use of asterisk * to match character sequence within a word. e.g.
trai* matches any word starting with trai, such as trail,
trailer, train, training, trainer, trainee, etc.
- ability to limit search to a specifc server or web site -- Google feature
that allows you to specify search term(s) then "site:" followed by site to
search. So to search for the term "registration" only at otterbein's
site the search would be: registration
site:www.otterbein.edu
- How can web searches be completed so quickly??? Exact details are
tightly held secrets, but basically the search sites have software that
automatically "surfs" from page to page and stores words from each page into
highly-indexed databases.
SQL: a language for forming database queries
- Structured Query Language, pronounced "S-Q-L" or "sequel"
- has become the lingua franca of database management products (lingua franca:
a common language used by speakers of different languages)
- SQL contains a few specific commands for filtering information contained
in a database, yet queries are somewhat English-like in appearance
- SELECT command specifies which fields to display from matching records
- FROM command specfies which database table to search
- WHERE command specifies search criteria to filter records from the
database
- think of database as mathematical set of records and the WHERE command
defines subsets whose intersections (AND) and unions (OR) determine matches
- Some SQL examples for database table StudentList that has fields called FirstName, LastName, Age, Credits :
- SELECT FirstName
FROM StudentList
WHERE LastName = "Jones"
- SELECT FirstName, LastName
FROM StudentList
WHERE (Age > 20 AND Credits < 12)
Microsoft Access as illustration
- Access is the DBMS (DataBase Management System) part of Microsoft Office
suite.
- Widely used for "small" databases and easy to use
- the entire database is stored in one file, identified by ".accdb" (Access 2007) or ".mdb" (Access 1997-2003)
filename extension
- the database consists of 7 different types of objects:
- a table is one grid of information. Each row is a record,
each column is a field.
- a query is a stored query specification (e.g. an SQL command).
- a form is a nicely structured data entry form so you can enter
data more easily (instead of directly into the grid)
- a report is a nicely structured data display form so you can view
data more easily
- a page is a web page which can be generated from database
contents
- a macro is a stored command sequence, so you can easily automate
repetitious tasks
- a module is a computer program you can write using Visual Basic
language to perform special processing.
Here is a view of
the Northwind database, with the list of tables shown.
You can get an idea of the nature of Northwind just by reading the names of
the various tables. This is a relational database, and the relationships
are shown in this illustration. The list of identifiers in each box is the
list of fields in that table, and the lines represent relationships.
Databases and privacy
The textbook briefly describes a plethora of database-related privacy
horror-stores, most caused by criminals creating or stealing an "identity" or
mistakes made by credit bureaus.
What are some of the privacy concerns related with databases?
Did you know that the U.S. Constitution does not explicitly address the right
to privacy?
Do you feel the greater privacy threat is government, or the private sector?
Sometimes it is possible to uniquely identify you and reveal information
about you by combining different pieces of information gleaned from multiple
unrelated databases. This is called record matching.
Geneologists have long depended on this technique to learn about their
ancestors. The government depends on this technique to identify potential terrorists.
Sometimes even summaries of anonymous survey results can uniquely identify
you. For example, my previous university compiles and publishes an annual
report based on faculty survey information. One of the summary categories
is average salary by age group (20-29, 30-39, ..., 70-above). There is one
professor in the "70 - above" age group. The survey is not anonymous for
him.