COMP 530: Lab 1: Parser for a Shell

Due 11:59 PM, Tuesday, September 13, 2022

Introduction

This lab will be a deeper dive into C programming on Linux, and will serve as a building block for the larger shell assignment you will do in Lab 2. This will also be a warm-up for more intensive programming assignments this semester. We will continue using the same Docker infrastructure and container as in Lab 0.

Working in a group

You may complete this assignment alone or in a team of up to three (you plus two others), per the course collaboration policy. However, you are expected to understand all components of the work that is handed in, and you may find it useful to (re-)work independently any portions that a teammate completes on your behalf. Moreover, this assignment will also help you warm up on C programming, and is an important building block for future labs.

Because this code is extended in Lab 2, you are expected to work in the same teams for Labs 1 and 2, except in extenuating circumstances. In the event you cannot continue with the same team, please reach out to the course instructor. Your team for Lab 1 does not need to be the same as Lab 0. For Labs 3 and following, you are free to switch teams as you please.

Once you join the group, you will be asked to create a team or join an existing team on GitHub. Once a team is created, you can add members to it.

Getting the starter code

You will need to click on this link to create a private repository for your code.

Once you have a repository, you will need to clone your private repository (see the URL under the green "Clone or Download" button, after selecting "Use SSH"). For instance, if your private repo is called thsh-team-don:

git clone git@github.com:comp530-f22/thsh-team-don.git

Pointer Refresher

One of the hardest parts of moving from programming in a higher-level language, like Java, to C is dealing with pointers. We will start with a few ungraded, but highly recommended, exercises to refresh your understanding of pointers.

Exercise 0. Read about and practice programming with pointers in C. Complete this tutorial from Yang (Max) Hu.

The best reference for the C language is The C Programming Language by Brian Kernighan and Dennis Ritchie (known as 'K&R'). We recommend that students purchase this book (here is an Amazon Link). There is a copy on reserve in the library.

Read 5.1 (Pointers and Addresses) through 5.5 (Character Pointers and Functions) in K&R. Then download the code for pointers.c, run it, and make sure you understand where all of the printed values come from. In particular, make sure you understand where the pointer addresses in lines 1 and 6 come from, how all the values in lines 2 through 4 get there, and why the values printed in line 5 are seemingly corrupted.

There are other references on pointers in C, though not as strongly recommended. A tutorial by Ted Jensen that cites K&R heavily is available in the course readings.

We also recommend reading the Ksplice pointer challenge as a way to test that you understand how pointer arithmetic and arrays work in C.

Warning: Unless you are already thoroughly versed in C, do not skip or even skim this reading exercise. If you do not really understand pointers in C, you will suffer untold pain and misery in subsequent labs, and then eventually come to understand them the hard way. Trust us; you don't want to find out what "the hard way" is.

Shell Overview

The goal of the shell labs is to become familiar with low-level Unix/POSIX system calls related to process and job control, file access, and IPC (pipes and redirection). In Lab 2, you will write a mini-shell with basic operations (a small subset of Bash's functionality).

In this lab (Lab 1), we will implement the necessary parsing components for this shell. In Lab 2, you will actually implement process and job control. But for now, we will just implement building blocks for the shell.

This shell will be called thsh, or Tar Heel SHell.

Helpful and allowed interfaces

You are welcome to use any standard C version, including C99 or C11, as well as K&R, ANSI, or ISO C.

Don't spend time writing a full parser in yacc/lex: use plain str* functions to do your work, such as strtok(3). You may use any system call (section 2 of the man pages) or library call (section 3 of the man pages) for this assignment, other than system(3).

Finding programs

Shells provide a nicer command-line environment by automatically searching common locations for commands. For instance, a user may type ls, and the shell will automatically figure out that the binary is actually located at /bin/ls. On Linux, the list of paths to search automatically is stored in the environment variable PATH.

You can inspect your environment variables using the printenv command:

$ printenv
TERM=xterm
SHELL=/bin/bash
PATH=/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/porter/bin
...

In the case of ls, the shell should try each of the colon-delimited values in the PATH list as a prefix to find the command, e.g., ls should be checked as /usr/lib/lightdm/lightdm/ls, /usr/local/sbin/ls, /usr/local/bin/ls, /usr/sbin/ls, /usr/bin/ls, /sbin/ls, /bin/ls, /usr/games/ls, and /home/porter/bin/ls.

In general, environment variables are passed from the parent through the envp argument to main().

Exercise 1. (2 points) Complete the init_path() function in jobs.c.

Implement code to parse the PATH environment variable and populate a table of path prefixes, which your shell will later use to search for binaries. Complete the starter code for init_path that you have been given in the file jobs.c. Each entry in the table should correspond to one path prefix. You have also been given a test harness in test_env.c. This test harness just prints the parsed table in a standard manner, which the grading script can check. Do not change the test harness code.

Note that you can test different PATH values on the command line using the following syntax, and with the expected output below:

$ PATH=/foo:/bar ./test_env
===== Begin Path Table =====
Prefix  0: [/foo]
Prefix  1: [/bar]
===== End Path Table =====

Be sure to handle an empty PATH correctly, as well as to remove trailing slashes. It is not necessary to validate whether each entry in PATH actually exists or not; we will do this in Lab 2.

Challenge! (1 bonus point) Add support for a "NULL" entry in the PATH environment variable, i.e., two colons with nothing between them. For example: PATH=/bin:/sbin::/usr/bin. In this example, the third entry should be converted to the current working directory --- "." would suffice.

Most challenge problems do not have automatic grading; this one does. For a NULL input, the desired output from test_env would be a line with a single dot as the path, similar to:

Prefix  0: [.]

Note that strtok cannot handle this case. Also, the autograder for the main lab does not expect you to handle this case.

Command-Line Parsing

The main part of the assignment will be to complete a C function that parses a stream of characters from standard input (i.e., the keyboard). Specifically, you will be parsing shell syntax, including pipes and input/output redirection. For now, you will complete helper functions and just use a provided test harness; you will use your implementation of this function in Lab 2 to complete a working shell.

For simple commands, the syntax is simple: one or more whitespace characters separate tokens in the input. The parser's job is to identify the start of each token, and return an array of pointers to the start of these tokens. Things are slightly more complicated with special characters and pipelines, which we will explain shortly.

For instance, the command ls -l should have two tokens: ls and -l.

In this exercise, you will complete the implementation of parse_line in parse.c. The output will be a two-dimensional array of commands and tokens. The first level of the array will be pipeline stages, which we will explain next. For simple commands, there will be one valid entry (entry zero), and the second entry in the returned array will be NULL. Each top-level entry is itself an array of pointers to the tokens of that stage. Suppose the name of the return value is commands. For an input like ls -l, commands should include:

commands[0] = ["ls", "-l", '\0']
commands[1] = ['\0']

Redirection Support

One of the most powerful features of a Unix-like shell is the ability to compose a series of simple applications into a more complex workflow. The key feature that enables this composition is output redirection.

Redirection is accomplished by three special characters: '<', '>', and '|'. These should appear after the command. You will need to include logic in your parsing code that identifies these characters and handles them specially, rather than treating them as "normal" tokens. Note that you do not need to actually implement this redirection yet; you will only detect these special characters.

The first two characters direct input from a file into a program, and output from a program into a file, respectively.

[/home/porter] thsh> ls -l >newfile
[/home/porter] thsh> cat < newfile
newfile
thsh
...

In the example above, the standard output of ls -l is directed to a file, named newfile. If this file didn't exist previously, the shell created it. Note that the ls program does not know it is writing to a file, and is not passed the string '>newfile' as an argument. Similarly, the contents of newfile are passed to the cat program as its standard input.

In principle, you can redirect handles other than standard in and out; for this lab assignment, we will simplify this --- you need only implement support for redirecting standard in and standard out.

Specifically, in parse_line, there are two arguments infile and outfile, that are pointers to a pointer to a character array. This style of using "double pointers" lets a function return more than one value. For instance, suppose one has a pointer to a string (tmp) that should be set as infile. One can assign this as follows:

	*infile = tmp;

From the caller's perspective, one can pass the address of an initially NULL pointer to a function, which that function may then populate. For instance, in the parser_tester.c file:

    char *infile = NULL;
    char *outfile = NULL;
    ...
    ret = parse_line(buf, length, parsed_commands, &infile, &outfile);
    ...
    if (infile) {
      printf("Input redirection to file [%s]\n", infile);
    }

In this example, after calling parse_line the value of infile may change to a pointer to a non-NULL string. The caller must check the value before using.

Recall that in this lab, your job is only to identify and properly parse these special arguments that should go to the shell itself, and be hidden from the command binary. In Lab 2, you will actually implement redirection. For file redirection, just ensure that the infile and outfile parameters to parse_line are handled correctly, and that the redirection characters ("<" and ">") and file name are NOT included in the parsed command structure. For instance, consider the following examples from the parser_tester utility provided in the starter code:

$ ./parser_tester
ls -l > newfile
Pipeline Stage 0: [ls] [-l]
Output redirection to file [newfile]
Command [ls] is not a built-in.
cat < newfile
Pipeline Stage 0: [cat]
Input redirection to file [newfile]
Command [cat] is not a built-in.

Pipelines

The final special character you will need to handle is the pipe ("|"). The idea of a pipe is that one can specify a series of commands, such that the output of the first command is sent as input to the second command, the output of the second to the third, and so on. For example, one might list a directory, search for all file names that include the letter "d" using grep, and then count them using the word count (wc) utility:

ls | grep "d" | wc -l

The shell actually creates three child processes and connects their input and output handles appropriately. For now, your job is simply to parse this pattern correctly; you will implement pipes in Lab 2.

Specifically, the output of parse_line is a two-dimensional array of commands and tokens. A line of input should render one top-level, or command, entry for each step in a "pipeline" as above. For instance, in the example above, if the returned structure is named commands, it should look like:

commands[0] = ["ls", '\0']
commands[1] = ["grep", "\"d\"", '\0']
commands[2] = ["wc", "-l", '\0']
commands[3] = ['\0']

Note: When file redirection and pipelines are combined, you can assume that only the first stage will have input redirection, and only the last stage will have output redirection. For example:

grep "d" < in.txt | grep -v "html" | wc -l > out.txt

The semantics of input file redirection after the first stage, or output before the final stage are not defined and will not be tested. For example, you need not handle cases like:

ls | grep -v "html" > foo.txt | wc -l
grep "d" < in.txt | grep -v "html" < a.txt | wc -l

Comments

Finally, you will need to handle comments in your parser. Specifically, a comment starts with a '#' character, and any text between a comment character and newline should be ignored.

Hint: Note that these special characters are not sensitive to whitespace. Having no whitespace between a token and a special character is allowed, as is having a lot of whitespace.

Exercise 2. (6 points) Complete the parse_line() function in parse.c, as described above.

You can test this function using the parser_tester utility provided to you. Type sample inputs, and it will print the parsed commands in a standard format, which the grading script can check. Do not change the test harness code. You may test any pattern of inputs you like at the command line, and use Control+D when you are finished to terminate the test program.

Formal Grammar

Some students may find a formal grammar for the parsing of the shell helpful (similar to those given out in COMP 401, 431, and 522). If you do not, feel free to ignore this section. This is expressed in the Backus-Naur Form, and specifies what your shell should be able to parse (the user's line of input without the newline character '\n' is <input>):

<input> ::= <command-list> | <background-command-list>
<background-command-list> ::= <command-list> <nullspace> <bg>
<bg> ::= "&"
<command-list> ::= <command> | <piped-list-of-commands>
<command> ::= <simple-command> | <compound-command>
<piped-list-of-commands> ::= <command> | <command> <nullspace> <pipe> <piped-list-of-commands>
<nullspace> ::= "" | <sp>
<pipe> ::= "|"
<sp> ::= <whitespace> | <whitespace> <sp>
<whitespace> ::= " " | "\t"
<simple-command> ::= <arg-list>
<arg-list> ::= <word> | <word> <sp> <arg-list>
<word> ::= <char> | <char> <word>
<char> ::= any character (ascii or unicode) that are not in the set {'|', ' ', '\t', '<', '>', '&'}
<compound-command> ::= <possible-io-list> <nullspace> <simple-command> <nullspace> <possible-io-list>
<possible-io-list> ::= "" | <io-list>
<io-list> ::= <io-op> | <io-op> <nullspace> <io-list>
<io-op> ::= <input-redirection> | <output-redirection>
<input-redirection> ::= <in> <nullspace> <word>
<output-redirection> ::= <stdout> | <fd-out>
<stdout> ::= <out> <nullspace> <word>
<fd-out> ::= <fd> <out> <nullspace> <word>
<fd> ::= positive integer
<in> ::= "<"
<out> ::= ">"

Note that this grammar includes a few features, such as specifying the file descriptor for redirection, arbitrary redirection, and background commands, that are only needed in challenge problems in Lab 1.

You are only required to parse this subset of full bash grammar (full grammar is here if you are curious: http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/utilities/V3_chap02.html#tag_18_10).

If your parser is more robust than this (e.g., if your parser correctly ignores whitespace in the beginning/end of the line) you will not be penalized, and may be rewarded with bonus credit (feel free to highlight extra sophistication in challenge.txt).

Built-In Commands

There are a small number of commands that are directly implemented in a shell, rather than as separate binaries. Typically, these are commands that affect the shell itself, rather than the system as a whole.

For instance, the exit command terminates the currently running shell. Running the exit() system call in a child process will indeed terminate the child, but the only way to terminate the currently running shell is to issue the exit() system call within the shell itself.

Thus, every shell needs to support a small number of built-in commands. We provide a framework for built-in commands in builtin.c. We declare a table of struct builtin, with one entry per built-in command. Each entry defines the command itself and the handler function that implements that command. From builtin.c:

static struct builtin builtins[] = {{"cd", handle_cd},
				    {"exit", handle_exit},
				    {'\0', NULL}};

This table includes two entries: one for exit and one for cd. We provide an implementation of handle_exit and handle_cd, which you do not need to change.

Exercise 3. (2 points) Complete the handle_builtin() function in builtin.c.

This function should detect when a command matches an entry in the builtin table and call the appropriate handler.

Using the parser_tester utility, the correct behavior for a series of cd, ls, and exit commands is as follows:

$ ./parser_tester
cd
Pipeline Stage 0: [cd]
Command [cd] is a built-in, returned 42.
ls
Pipeline Stage 0: [ls]
Command [ls] is not a built-in.
exit
Pipeline Stage 0: [exit]

Note that cd does not need to do anything except return 42 (for now!), but exit should actually terminate the process.

Hand-In Procedure

You will be handing in the code via gradescope. You should have been added to the class; if not, please contact the instructors as soon as possible. If you work in a team, you should only submit one copy of your code in gradescope; you can add your teammates to the handin. We recommend handing in directly from your github repository to the assignment. Gradescope will run the autograding program, giving you immediate feedback on the assignment. You may hand in more than once and we will take the most recent, applying lateness penalties as appropriate (out-of-band).

In the event of any discrepancies with the autograder environment, please contact course staff as soon as possible.

Generally, unless the homework assignment specifies otherwise, you should compile your program using the provided Makefile (e.g., by just typing make on the console). Do not add any special command line arguments ("flags") or compiler options to the Makefile.

Note: We do not have an automated way to calculate late penalties. These will be applied manually at the end of the semester.

The program should be neatly formatted (i.e., easy to read) and well-documented. The Style Guide gives additional guidance on lab code style.

Make sure you put your PID(s) in a header comment in every file you submit.

If you complete any challenge problems, please describe the solution and how to demonstrate it in challenge.txt. Note that we have a separate submission option in gradescope for submitting challenge problems, in order to accommodate later submissions without charging late hours. Even if you are submitting on time, please submit your challenge problems a second time through the appropriate challenge assignment --- we will only manually grade assignments handed in through the challenge option.

This completes the lab.


Last updated: 2022-11-03 10:50:49 -0400