COMP 3400 Project 3: The Hybrid Sort-Merger
Spring 2017
Due: Thursday 23 March 2017
Worth 50 points of your 200 project points
This project uses the C language on Linux. "Individual with a partner"
Introduction
The "bounded buffer" process synchronization problem
is embodied in Unix by the pipe construct, which manages
the synchronization for you. You will utilize pipes to solve
the following problem.
This program will sort and merge the contents of two ASCII files, sending the
result either to a file or to standard output. It is called a
hybrid because the sorting will be performed by shell commands
activated within your C program, and the merging will be handled
by your code.
Basics of composing, compiling and running C programs from a Linux command interpreter
You will develop this project using Cygwin.
I am giving you a flash drive that has Cygwin pre-installed. The Cygwin home page describes it as
"a large collection of GNU and Open Source tools which provide functionality similary to a Linux
distribution on Windows."
Although Linux has a graphical user interface (several, actually), you will
use a command interface for this project. By working with the textual command-based
interface you are immersed into a development environment used by many operating
system developers. If you have used DOS command interfaces on Windows systems,
you will be familiar with the concepts. The interface issues a prompt, you type
in a command, the command is carried out, its results (if any) are displayed,
and another prompt is issued. Unix/Linux command interfaces are called shells,
and several are available. The default for our installation is bash,
the Bourne Again SHell. This is a pun on the Bourne shell developed for Unix by Steven Bourne.
Here's how to get to the Linux command interface.
- NOTE: If you use Cygwin under multiple user IDs (e.g. on both a campus computer and a personal computer), you
will have to move some files around. The first time you launch Cygwin, it will automatically set up a
Linux-named home folder based on your ID. If you do this in the lab, the ID will be your Otterbein login ID and
your Linux home folder will be
/home/yourOtterID
, where home
maps into the Cygwin folder on your
flash drive (e.g. E:\cygwin\home
). If you launch Cygwin from a computer where your login ID is different, it will set up another
folder with the new ID. If you maintain two folders, it's up to you to keep their contents consistent.
You can copy/paste files as usual using Windows Explorer. You should probably not do this while Cygwin is running.
- Insert the flash drive and open its file system.
- Open (double-click) the
cygwin
folder.
- Look for the
Cygwin.bat
file. This is a "batch" file, a DOS command script, that will launch Cygwin. Important!
It assumes the flash drive is mounted on the E: drive. If it is not, you need to edit it (Notepad is fine) to change the drive letter.
- Run (double-click)
Cygwin.bat
.
- A command window should pop up, with
#
as its prompt.
- Optional but reassuring: Type in the
pwd
command (Print Working Directory). It should respond with /home/yourID
,
where yourID is, um, your ID!
- You are ready to start entering Linux commands. In Linux-world, folders are not folders, they are directories.
Once there, there are a few basic things you need to do to compose, compile and run
C programs.
- You need a text editor for composing your programs. Several are
available (and we can discuss this if you like), but the simplest to use is
called nano. Here is a separate one-page
PDF handout on the basics of using nano. Spoiler Alert: since Linux directories and files map directly
into the Windows file system, you can also use any Windows text editor you like. This includes jGRASP,
which recognizes the C language and will color highlight. Just be careful to coordinate your Windows work
with your Linux work.
- To compile a C program, type the command gcc -o prog
prog.c, substituting the name of your program for prog.
This compiles prog.c in the current directory and saves the executable
file in file prog.
You can name the executable anything you like, but by convention it
is given the same name as the source program only without any filename extension.
- The C compiler will produce warnings and error messages. They refer to the
program by line number. Use the editor to correct errors (In nano, Ctrl-c
will tell you what line number the cursor is on).
- Once your program compiles, you run it by simply typing the name of the
executable file, e.g. ./prog, along with any
command arguments it requires (initially none). Note: the "
./
" before
the executable name denotes "here", the current directory. Linux will not
automatically look there. We can discuss the PATH environment variable sometime.
- For help on a bash command or C function, you have a couple of options. One
is to Google it. The other, "pure Linux",
option is to type man topic,
where topic is the name of the command or function. Help will
be shown one screen at a time. Use the space bar to advance to the next screen.
Press the q key (for quit) to return to the command prompt. There is additional
help available on selected topics using the info command rather than man.
Specifications
Your program will be run with the command:
smerge file1 file2 [ file3 ]
Your program must produce sorted versions of file1 and file2
contents, then merge their lines so the result is also sorted, then output the
result either to file3 if specified, or to standard output (the screen).
The square brackets [ and ] indicate that file3 is optional; they are not part of the
command. Each file consists of one or more lines of text; each line is terminated
with the end-of-line character. Detailed specifications are given here.
The program must:
- Verify the command arguments. Notice that main has this header:
int main(int argc, char *argv[])
Inside main, argc contains the count of command line arguments, including
the command name. It should have the value 3 or 4, depending on whether
or not a destination file is specified. The second parameter argv
is an array of strings, each of which contains one word from the command line.
argv[0] contains "smerge". The files whose names are in argv[1]
and argv[2] must exist or else the program will terminate with an
appropriate error message. If a destination file is given in argv[3]
and it already exists, it will be overwritten with no message to the user.
- Call pipe() twice to create two
pipes. The pipe() function requires one argument, an array of two int.
The two arrays are provided for you (pipe1 and pipe2).
You do not need to initialize the array elements because their values will be produced by the function.
The pipe() function returns an int that indicates the success or failure of the operation.
After a successful call, the argument array elements [0] and [1] will contain integer file
descriptors that represent the read and write pipe channel, respectively.
- Call fork() twice
to create two child processes. The parent and children proceed
"in parallel". Since the first fork creates a child, structure your logic so that the second fork
is performed only by the original parent.
- The first child:
- closes the read file descriptor of the first pipe
- closes both the read and write file descriptors of the second pipe
- creates a temporary file using mkstemp().
This is a function that creates and opens a file whose name is randomly generated based on a template that you provide.
- calls system(). This is a function that takes a string parameter
where the string is a shell command, and carries out the command. You will give it the command to read the first
file (file1) and write the sorted version of its contents to the temporary file. This is done using the shell
command sort.
The shell command needs to be something like "sort file
> tempfile", where file is the input file,
and tempfile, which will receive the sorted contents, is the name of the file returned by the call
to mkstemp(). The ">" symbol is essential; it redirects
output to the file rather than the screen. Your program will have to build the command string before
calling system(). You can build formatted strings pretty nicely using
sprintf(), which has the formatting features of printf() but
puts the results in a string instead to the screen.
- opens the temporary file for reading, then writes its contents to the first pipe (use a loop that reads then writes 20 bytes at a time).
- When finished writing, the child should close the pipe, then
close and delete the temporary file.
- The second child similarly sorts and outputs the second file (file2)
via the second pipe in a similar manner (described in step 4).
- Let's get back to the parent! After the two forks, the parent will read
lines from the two pipes and write the sorted results to the output file or
screen. Specifically, it:
- closes the write end of both pipes
- reads a line from each pipe. I am providing you a function to read and return one line of
text from a given file descriptor.
- performs a "while" loop that terminates when both pipes are empty. In each iteration of this loop, it
- compares the two lines using the strcmp() function,
- writes the "smaller" to file3 or the screen using the write() function.
When your program runs, standard output (the screen) is automatically opened for output at file
descriptor 1. So all you need is an int variable to contain either
the standard output file descriptor (which is 1) or the file3 file descriptor (which is the
return value from the open() or creat() function), then write to that
file descriptor.
- reads the next line from the smaller one's pipe
This is just like the merge operation in the merge sort you studied in the olden days. I will give you a
quick demo if you have forgotten it or have not studied it before.
- closes the read end of both pipes and
- terminates gracefully.
Additional Specifications
The functions and system calls below are more fully described and illustrated
with example programs in a supplementary page. Note that the example programs,
and smerge.c itself, will not compile or run on a Windows system.
- Use only creat(), open(),
read(), write()
and close() for I/O, including output
to the screen. Use open() to open
an existing file and creat() to create
and open a file that does not exist. Each of them returns a file descriptor
of type int, which is subsequently used as the first argument for
read(), write(),
and close(). Pipe()
also returns file descriptors, so you use read()
and write() with pipes as well as
with files. When finished with a file descriptor, call close().
Delete a file using unlink(). When
creating a file with creat(), use
the file name as the first argument and 0755 as the second argument (must
have the leading zero to be interpreted as octal).
- Your C program should use system calls in this list. The header file is
given for each one. To use open(), for example, your program needs
to do: #include <fcntl.h> at the top. Use Google to get detailed
information about the use and operation of these functions.
pipe() |
<unistd.h> |
close() |
<unistd.h> |
fork() |
<unistd.h> |
read() |
<unistd.h> |
perror() |
<stdio.h> |
write() |
<unistd.h> |
system() |
<stdlib.h> |
mkstemp() |
<stdlib.h> |
open() |
<fcntl.h> |
unlink() |
<unistd.h> |
creat() |
<fcntl.h> |
|
|
- Make the program as robust as you can. Check for erroneous return codes
from functions and output appropriate messages to standard error. The perror()
function can be used for this.
- When a pipe is created, file descriptors are created for both a reader and
a writer. Proper pipe etiquette requires that the writer close its read pipe
before proceeding, and the reader first closes its write pipe. Pipes must
be created before the forks, so the parent and child will share the
file descriptors.
- Your solution must be well-structured. To help you with this, I am providing a stub
smerge.c. Right click to download. Save it in your Cygwin home folder.
- The provided smerge.c defines
functions having the specifications shown below. The getline() function is
complete; the others are stubs that you need to complete.
|
// Read one line of text into char array, excluding newline character.
// Array terminated with '\0'.
// Return: 0 if end of file, 1 otherwise.
// Parameter: fd is file descriptor from which to read,
// Parameter: line is char array to contain
the text being read. |
int getLine(int fd, char *line) |
|
// Merge contents of two input streams and write results to output stream.
// Parameter: fd1 is file descriptor for one input stream,
// Parameter: fd2 is file descriptor for the other input stream,
// Parameter: out is file descriptor for output stream. |
void merge(int fd1, int fd2, int out) |
|
// Sort contents of specified file and write to specified pipe.
// Parameter: file is char array containing file name,
// Parameter: outPipe is file descriptor for the write pipe |
void sortAndPipe(char *file, int outPipe) |
|
// Perform a child's task: do pipe maintenance, call sortAndPipe() function.
// Parameter: myPipe array containing my pipe's file descriptors
// Parameter: otherPipe array containing other pipe's file descriptors
// Parameter: fileName char array containing my input file name |
void child(int *myPipe, int *otherPipe, char *fileName) |
|
// Perform the parent's task: create output file if necessary,
// do pipe maintenance, call merge() function.
// Parameter: argc argument count received by main()
// Parameter: argv argument list received by main()
// Parameter: pipe1 array containing first pipe's file descriptors
// Parameter: pipe2 array containing second pipe's file descriptors
|
void parent(int argc, char *argv[], int *pipe1, int *pipe2) |
|
// Check smerge command line for validity.
// Return: 0 if command line valid else -1
// Parameter: argc argument count received by main()
// Parameter: argv argument list received by main() |
int checkArgs(int argc, char *argv[]) |
|
// Perform main task: validate arguments, create two pipes, fork two children,
// parent calls parent(), both children call child().
// Return: 0 if command worked else 1
// Parameter: argc number of command arguments, including command name
// Parameter: argv array containing command words |
int main(int argc, char *argv[]) |
|
Recommended Approach
It would be wise to adopt a top-down approach to this project. Write your
program to fork off a couple of child processes which simply announce their
existence then exit. Then you know the fork part is working. Then add the pipes.
Have each child process write a single string to its pipe, and have the parent
read them then shut down the pipes. At this point, you know that both the fork
and pipe parts are working. Then work on getting the command line arguments
into your program (if you don't already know how to do this). Then have the
children read from their respective files and echo file contents to the parent
. . . and so forth. The idea is to start with something simple, get it working,
then refine one step at a time, testing and debugging as you go.
Scoring
The maximum 50 points are allocated as follows:
- 25 points Functionality (works correctly given valid command arguments)
- 12 points Program structure (structured code, follows described algorithm, correct
use of fork, pipe and I/O system calls)
- 8 points Robustness (handles invalid command arguments, non-existing
files, and error return values gracefully)
- 5 points Documentation, including comments and self-documenting identifiers
To Turn In
Send me your completed smerge.c file as an email attachment to psanderson@otterbein.edu.
[ COMP 3400
| Peter Sanderson
| Math Sciences home page
| Otterbein
]
Last updated:
Peter Sanderson (PSanderson@otterbein.edu)