CSSA Intro to research

Computer Science Students Association
unc chapel hill

Introduction to doing research

This is an introduction to research for new graduate students.
Many thanks to the authors Michele Weigle and David Ott. Thank you also to Prof. Ketan Mayer-Patel and Mark Lindsey for their suggestions. The original texts are at: reading, experiments and CVS.
Table of contents:

Hints on Finding and Reading Academic Papers

Types of Academic Papers

How do I get a copy of a particular paper?

Which Papers to Read, and How to Read Them

Hints for maintaining references to papers

Experimental Research Hints

Essential Tools

Archival Data Storage is Plentiful

Recommended Reading

Hints for using the CVS software version control

Why use CVS?

Some basic commands

Useful CVS Links

Hints on Finding and Reading Academic Papers

Not long after you start your graduate studies at UNC, you'll be asked to read your first academic paper. It might be a course professor or your research advisor who asks you to do it, but it's guaranteed to happen.
While it might seem tough going at first, reading papers is one of the fun things about grad school. It's a chance to learn what's going on with research groups elsewhere, and to keep up with new ideas and approaches. As time goes on, you'll find that there's a core body of important papers in your subfield that provide a basis for newer work going on in the present. Knowing this body of work is key to having a sound understanding of your subfield.
Here are a few hints for the uninitiated on academic papers.

Types of Academic Papers
Papers typically come in one of three flavors: a workshop paper, a conference paper, or a journal paper.
Workshop papers often consist of incomplete ideas or work-in-progress reports. Workshops tend to be small, and sometimes just have authors and a select, invited crowd who can give helpful advice. Some workshops are nearly as strong as conferences, though. For example, the International Workshop on Quality of Service (IWQoS) is a strong workshop in networking.
Conference papers are where it's at. Nearly everything good has been published at a conference at some point. IEEE and ACM conferences tend to be fairly good. In contrast, conferences that are regional are often not as good. Such conferences usually have names that include "directional" (e.g., The Southwest Conference on Multimedia) or "geographical" (e.g., Pan-Asian Conference on Multimedia) words in their title. This is not a hard-and-fast rule, of course, since some regional conferences are top notch.
Journal articles have the highest quality and are usually very thorough. But it takes a long time for a paper to be published in a journal. So while the information is thoroughly discussed, the ideas are often no longer new. Journal papers also tend to be fairly long. Nearly every journal article is based on some conference papers, so it's often better to find the original conference papers to quickly digest the main ideas.

How do I get a copy of a particular paper?
Life is good. Getting hold of a particular paper in Computer Science nowadays is usually very easy because of the Web.
I can nearly always find a paper by going to http://www.google.com and searching on the title in quotes. For example, if I'm looking for a paper entitled "Architectural Considerations for a New Generation of Protocols" by D. Clark and D. Tennenhouse, I'll search on "architectural considerations for a new generation".
The resulting hits are instructive. A very large number of papers in Computer Science are found in the database at http://citeseer.org, run by the NEC Research Institute in Princeton, New Jersey. This database is great because it not only gives bibliographic information and lets you download the paper in various formats, it also contains information on which papers cite this paper as a reference. This lets you sleuth for related papers and get a sense of the importance of this paper to a larger body of work in the area.
Another common type of hit is the homepage of one of the paper's authors. Most researchers in Computer Science have some form of home page which lists their publications and allows you to download them in PostScript or PDF format. This is great for getting to know the work associated with a particular individual.
Similarly, research groups often have a publications page that makes available various papers published by its members. For the above example, the Advanced Networking Architecture Group at MIT lists this paper on their publications page.
If you're lucky, the journal, conference, or workshop that published the paper put it online for you to download. That's rather lucky, however, as many conferences and journals charge a fee before you can access their papers online.
UNC libraries have a number of journal publications online at their e-journal site. This is a great way to browse papers when you're not sure what you're looking for or you're just trying to stay up-to-date with recent conference proceedings.
Finally, it happens every once in a while that the paper you need to read is so old, it isn't available in electronic format on the Web. In that case, you'll need to walk over to the Brauer Math-Physics Library in Phillips Hall (next door to Sitterson Hall), find the publication and photocopy it. Note that conference and workshop proceedings are usually kept in the stacks, while journals are kept in the reference section.

Which Papers to Read, and How to Read Them
Reading a paper in its entirety is a fairly serious time commitment. The situation seems particularly daunting for new students who feel overwhelmed by the number of papers in their subfield. How does one come up to speed?
Take heart. Reading every paper in your subfield in its entirety isn't necessary.
Professor Ketan Mayer-Patel suggests a series of steps for deciding whether to read a paper, and how much time to commit to it. The idea is to start at the top of the list and proceed to the next step only if the paper warrants it.

Guess about the paper's relevance by the frequency of its citation. This can be done by looking at http://citeseer.org, or observing the bibliographies of other important papers informally. If the paper isn't very relevant, you'd be better off spending your time on one that is.

Check the title, where and when it was published, and who its authors are. Papers by key people and in key conferences and journals give the paper a higher priority. Recent work should take precedence over older, outdated work. (Although sometimes it's important to know seminal papers from the past in your subfield.)

Read the abstract and the first page. Does their problem approach make sense and address the issue you're interested in? Don't waste your time reading a paper in detail if it lacks applicability to your problem and approach.

Read the section headings. What direction do they go with their approach, and what are their main contributions?

Look at the pictures. What do various diagrams and plots actually show?

Skim the paper. Look for the main conclusions and contributions. Avoid spending time on proofs, detailed derivations, etc.

NOTE: Everything up to this point can be done in less than 10 minutes.

If the paper still seems valuable after the above steps, read it carefully.

Hints for maintaining references to papers

During the course of grad school, you'll read lots of papers. You'll even want to remember something about many of these papers. Starting an annotated bibliography early will help you when you're ready to write a paper, your proposal, the related work section of your dissertation, anything.
You can either start a Word document with references and comments, or use BibTeX. If you have any inkling that you want to write papers and/or your dissertation in LaTeX, use BibTeX for your references. (Here's my BibTeX setup.) If you want to use Word, several folks at UNC use EndNote for organizing bibliographies (it's not free, but is sold at Student Stores).
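If you go the BibTeX route, one way to keep your comments with each reference is BibTeX's standard annote field. Here's a minimal sketch in Python that formats such an entry. The helper name bibtex_entry, the citation key, and the annotation text are made up for illustration; the title and authors are the example paper from earlier in this guide.

```python
def bibtex_entry(key, entry_type, fields):
    """Format one BibTeX entry; `fields` maps field names to plain strings."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)

# Hypothetical key and annotation; only the title and authors come from the guide.
entry = bibtex_entry("clark-tennenhouse", "misc", {
    "author": "D. Clark and D. Tennenhouse",
    "title": "Architectural Considerations for a New Generation of Protocols",
    "annote": "Why I read it, what the main idea was, what to cite it for.",
})
print(entry)
```

Paste the printed entry into your .bib file and fill in the venue and year once you've looked them up.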

Experimental Research Hints

Write It Down!
At some point, you will run an experiment and later need to reference it in a paper or progress report (or someone else in your research group will need to understand it). You will be very happy if you have logged some of the following:

motivation for the experiment

expectations you had before running the experiment

parameters used in the experiment

general impression of the results

Lab notebooks are valuable, but web pages are great places to record this type of information. Not only can you search for keywords, but you can refer others to your log and link in other documents (plus, if it's in AFS-space, it's backed up). Often people will put pictures of graphs and other results on their experiment web page. Here's my online experiment log.
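If a web page feels like too much ceremony, even a small script that appends to a text file will do. The sketch below is one possible shape, not a standard tool: the function names, the file name experiments.log, and the sample experiment are all made up. It refuses to write an entry unless all four items from the list above are present.

```python
import os
import tempfile

# The four items from the list above; every entry must supply all of them.
FIELDS = ("motivation", "expectations", "parameters", "impression")

def format_entry(title, date, **notes):
    """Return one plain-text log entry, refusing to skip any field."""
    missing = [name for name in FIELDS if name not in notes]
    if missing:
        raise ValueError("missing log fields: " + ", ".join(missing))
    lines = [f"== {title} ({date}) =="]
    lines += [f"{name}: {notes[name]}" for name in FIELDS]
    return "\n".join(lines) + "\n\n"

def append_entry(path, entry):
    """Append an entry to the log file, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(entry)

# Hypothetical experiment, written to a throwaway directory for the demo.
entry = format_entry(
    "buffer-size sweep", "2004-03-01",
    motivation="check whether larger buffers reduce loss",
    expectations="loss drops as buffer size grows",
    parameters="buffer=16,32,64 packets; 5 runs each",
    impression="loss dropped, but delay grew noticeably",
)
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "experiments.log")
    append_entry(log_path, entry)
    with open(log_path, encoding="utf-8") as log:
        print(log.read(), end="")
```

In real use you would point append_entry at a file in your own (backed-up) space instead of a temporary directory.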

Essential Tools
gnuplot is my favorite plotting utility (FAQ, Intro from UNI). Lots of people at UNC also use Matlab.

awk is a great little scripting language. I use it mainly for separating columns from data files. Here's an example:
# test.dat
123 abc 456 efg 789 hij
321 cba 654 gfe 987 jih

% awk '{print $1,$3,$5}' < test.dat

123 456 789
321 654 987

% awk '{if ($1==123) print}' < test.dat

123 abc 456 efg 789 hij
Here are a couple of awk tutorials: rice.edu, canberra.edu.au
Learn Perl or Python.
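For comparison, here is roughly what the two awk one-liners above look like in Python. This is only a sketch: the function names are made up, the data is the test.dat sample from above, and unlike awk it compares the first field as a string and will raise an error on a line with too few columns.

```python
# Same two lines as test.dat in the awk example.
DATA = """\
123 abc 456 efg 789 hij
321 cba 654 gfe 987 jih
"""

def select_columns(lines, columns):
    """Like awk '{print $1,$3,$5}': keep the given 1-based columns."""
    for line in lines:
        fields = line.split()
        yield " ".join(fields[c - 1] for c in columns)

def rows_where_first_equals(lines, value):
    """Like awk '{if ($1==123) print}', but comparing as strings."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] == value:
            yield line

rows = DATA.splitlines()
print(list(select_columns(rows, [1, 3, 5])))
# prints ['123 456 789', '321 654 987']
print(list(rows_where_first_equals(rows, "123")))
# prints ['123 abc 456 efg 789 hij']
```

For a quick column grab, awk is still less typing. The Python version starts to pay off once you need to do arithmetic or plotting on the columns afterwards.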
Archival Data Storage is Plentiful
After you've run lots of experiments, you may find you need additional disk space. ATN offers a mass storage system that uses SAM-FS. After your data has been in the mass storage system for 14 hours, it is backed up to tape. Note that this system is only for long-term storage of seldom-used files.

Recommended Reading

The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling by Raj Jain (UNC libraries, amazon.com)

Writing for Computer Science: The Art of Effective Communication by Justin Zobel (amazon.com)

Hints for using the CVS software version control

CVS ("Concurrent Versions System") is a version control system for software development projects. It allows you to keep change histories on individual source files, and to tag a particular snapshot of all files as belonging to the same release. It also supports other operations useful in the context of team programming.
CVS is part of the GNU Project and is freely distributed in the open source community. (The GNU Project provides many commonly used UNIX utilities, such as emacs, make, gcc, and gdb. See their "Free Software Directory" for a complete list.)

Why use CVS?
CVS solves a number of problems in the context of software development.

It provides change histories. Suppose you make several changes to a source file and it ends up breaking something? CVS allows you to keep a change history so that you can review how a source file has been modified. This is particularly important in team contexts when other people are changing the same set of files.

It supports rollbacks. Along with source file change histories, CVS saves previous versions of each source file so that a developer can roll back to any previous version at will. This can be a powerful tool when debugging.

It provides release tagging. At some point, a snapshot of the current working system can be tagged with a particular version number. This will add a tag to all source files, and allow a developer to roll back the entire project to a particular point for debugging or other purposes.

It backs up previous versions of source code in a compact format. One could always save a copy of source files at various times to create a change history and allow rollbacks. This is inefficient, however, since multiple copies of things that don't change are also saved and the disk space needed quickly grows. CVS provides a compact solution by internally saving only the differences between file versions, but still allowing any version to be restored in its entirety.

It provides merging capabilities. Two developers working on the same source file independently can merge their changes. Source code can also be branched, and the branches merged together at some later time.
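To make the "save only the differences" idea concrete, here is a small Python sketch using the standard difflib module. It is not CVS's actual storage format; the function names make_delta and apply_delta and the sample file contents are invented for illustration. The point is only that one full copy plus a short delta is enough to rebuild another version in its entirety.

```python
import difflib

def make_delta(old, new):
    """Record only the regions where `new` differs from `old`."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_delta(old, delta):
    """Rebuild the new version from the old lines plus the stored delta."""
    result, pos = [], 0
    for i1, i2, replacement in delta:
        result.extend(old[pos:i1])   # unchanged lines before this edit
        result.extend(replacement)   # inserted or replaced lines
        pos = i2                     # skip the old lines the edit removed
    result.extend(old[pos:])         # unchanged tail of the file
    return result

# Two hypothetical versions of a tiny source file, one line per list item.
v1 = ["int main() {", "    return 0;", "}"]
v2 = ["int main() {", '    puts("hi");', "    return 0;", "}"]

delta = make_delta(v1, v2)           # store v1 in full, v2 only as a delta
print(delta)
print(apply_delta(v1, delta) == v2)  # v2 can be restored in its entirety
```

Swapping the arguments to make_delta gives the delta in the other direction, which is how a rollback to the older version could be rebuilt from the newer one.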

Some basic commands
Here are a few commands to give you an idea of how to use CVS. This list is hardly complete, but is meant simply to illustrate several typical operations. (See the links in the next section for a more detailed treatment of CVS commands.)
First, set the CVSROOT environment variable to your CVS repository directory. This is where CVS will store source files, differences, version information, etc.
setenv CVSROOT mycvsrootdir
To create a source repository, cd to your source directory and use:
cvs import -m MESSAGE MODULE VENDOR_TAG RELEASE_TAG
Once a repository is created, you can cd to any directory and "checkout" the source tree. This will copy it into your directory.
cvs checkout [-r REV] [-d DIR] MODULE
To add source files to a project:
cvs add -m MESSAGE file1 file2 file3
Remember that "checkout" in CVS means copying a source tree into your current working directory. Unlike some other source control systems, there is no concept of checking a file out before working on modifications in CVS. Instead, you simply make the modifications, and then "commit" the changes using:
cvs commit
To create a snapshot, you tag all files in the module:
cvs tag TAG_NAME
To look at a file's change history (its revisions and their commit messages):
cvs log file1

Useful CVS Links
GNU CVS manual
One of many online tutorials
CVS FAQ
CVS FAQ page written by UNC's DiRT research group

University of North Carolina at Chapel Hill
Computer Science, Sitterson Hall
Chapel Hill, NC 27599-3175 USA
