Problems with Version 1

Let's discuss some key issues with the first version of AFS (AFSv1).

Key Problems

A few key problems with this first version of AFS motivated the designers to rethink their file system. To study the problems in detail, the designers of AFS spent a great deal of time measuring their existing prototype to find what was wrong. Such experimentation is a good thing because measurement is the key to understanding how systems work and how to improve them; obtaining concrete, good data is thus a necessary part of systems construction. In their study, the authors found two main problems with AFSv1:

  • Path-traversal costs are too high: When performing a Fetch or Store protocol request, the client passes the entire pathname (e.g., /home/remzi/notes.txt) to the server. The server, in order to access the file, must perform a full pathname traversal, first looking in the root directory to find home, then in home to find remzi, and so forth, all the way down the path until finally the desired file is located. With many clients accessing the server at once, the designers of AFS found that the server was spending much of its CPU time simply walking down directory paths (see the path-traversal sketch after this list).

  • The client issues too many TestAuth protocol messages: Much like NFS and its overabundance of GETATTR protocol messages, AFSv1 generated a large amount of traffic to check whether a local file (or its stat information) was valid with the TestAuth protocol message. Thus, servers spent much of their time telling clients whether it was OK to use their cached copies of a file. Most of the time, the answer was that the file had not changed (see the TestAuth sketch after this list).
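
To make the path-traversal cost concrete, here is a minimal, self-contained sketch of component-by-component resolution on the server. It is invented for illustration and is not the real AFSv1 code; the node_t structure and the lookup_child() and resolve_path() helpers are hypothetical. The point is simply that every Fetch or Store carrying a full pathname like /home/remzi/notes.txt repeats this walk from the root.

    /*
     * Path-traversal sketch: a toy in-memory directory tree and a resolver
     * that walks one component at a time, the way an AFSv1-style server must
     * for every Fetch or Store that carries a full pathname. All names here
     * (node_t, lookup_child, resolve_path) are invented for illustration.
     */
    #include <stdio.h>
    #include <string.h>

    typedef struct node {
        const char   *name;
        struct node **children;   /* NULL-terminated list; NULL for a plain file */
    } node_t;

    /* One directory lookup: a linear scan over the directory's entries. */
    static node_t *lookup_child(node_t *dir, const char *name) {
        if (dir == NULL || dir->children == NULL)
            return NULL;
        for (int i = 0; dir->children[i] != NULL; i++)
            if (strcmp(dir->children[i]->name, name) == 0)
                return dir->children[i];
        return NULL;
    }

    /* Resolve an absolute pathname component by component, starting at the
     * root. With many clients sending full paths, the server spends much of
     * its CPU time inside this loop. */
    static node_t *resolve_path(node_t *root, const char *pathname) {
        char copy[256];
        strncpy(copy, pathname, sizeof(copy) - 1);
        copy[sizeof(copy) - 1] = '\0';

        node_t *cur = root;
        for (char *comp = strtok(copy, "/"); comp != NULL && cur != NULL;
             comp = strtok(NULL, "/"))
            cur = lookup_child(cur, comp);   /* root -> home -> remzi -> ... */
        return cur;                          /* NULL if any component is missing */
    }

    int main(void) {
        /* Tiny tree mirroring the example path: / -> home -> remzi -> notes.txt */
        node_t notes = { "notes.txt", NULL };
        node_t *remzi_kids[] = { &notes, NULL };
        node_t remzi = { "remzi", remzi_kids };
        node_t *home_kids[] = { &remzi, NULL };
        node_t home  = { "home", home_kids };
        node_t *root_kids[] = { &home, NULL };
        node_t root  = { "/", root_kids };

        node_t *f = resolve_path(&root, "/home/remzi/notes.txt");
        printf("resolved: %s\n", f != NULL ? f->name : "(not found)");
        return 0;
    }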
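
The TestAuth problem is about message volume rather than traversal work: before using a cached file, the client asks the server whether its copy is still valid, even though the usual answer is "yes". The sketch below is again invented for illustration, with rpc_testauth() standing in for one network round trip; it shows how every open of an unchanged, cached file still costs the server a message.

    /*
     * TestAuth-chatter sketch: even when the cached copy is valid (the common
     * case), an AFSv1-style client sends a validity check to the server before
     * using it. rpc_testauth() is a hypothetical stand-in for one network
     * round trip; the real protocol details are not reproduced here.
     */
    #include <stdbool.h>
    #include <stdio.h>

    /* Pretend RPC: one message to the server per call. Here it always reports
     * that the file has not changed, which is what servers usually answered. */
    static bool rpc_testauth(const char *path) {
        printf("TestAuth -> server: is %s still valid?\n", path);
        return true;
    }

    /* Simplified open of a file already in the local cache: even a cache hit
     * costs a server interaction. */
    static void afs_open_cached(const char *path) {
        if (rpc_testauth(path))
            printf("  using cached copy of %s\n", path);
        else
            printf("  re-fetching %s from the server\n", path);
    }

    int main(void) {
        /* Every open of the same, unchanged file generates another message,
         * which is why servers spent so much time answering TestAuth. */
        for (int i = 0; i < 3; i++)
            afs_open_cached("/home/remzi/notes.txt");
        return 0;
    }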

TIP: MEASURE THEN BUILD (PATTERSON’S LAW)

One of our advisors, David Patterson (of RISC and RAID fame), used to always encourage us to measure a system and demonstrate a problem before building a new system to fix said problem. By using experimental evidence, rather than gut instinct, you can turn the process of system building into a more scientific endeavor. Doing so also has the fringe benefit of making you think about how exactly to measure the system before your improved version is developed. When you do finally get around to building the new system, two things are better as a result:

  • First, you have evidence that shows you are solving a real problem.
  • Second, you now have a way to measure your new system in place, to show that it actually improves upon the state of the art.

And thus we call this Patterson’s Law.

There were actually two other problems with AFSv1: load was not balanced across servers, and the server used a single distinct process per client, thus inducing context switching and other overheads. The load imbalance problem was solved by introducing volumes, which an administrator could move across servers to balance load. The context-switch problem was solved in AFSv2 by building the server with threads instead of processes. However, for the sake of space, we focus in this chapter on the two main protocol problems above that limited the scale of the system.

Improving the Protocol

The two problems above limited the scalability of AFS; the server CPU became the bottleneck of the system, and each server could only service 20 clients without becoming overloaded. Servers were receiving too many TestAuth messages, and when they received Fetch or Store messages, were spending too much time traversing the directory hierarchy. Thus, the AFS designers were faced with a problem:

THE CRUX: HOW TO DESIGN A SCALABLE FILE PROTOCOL

How should one redesign the protocol to minimize the number of server interactions, i.e., how could the number of TestAuth messages be reduced? Further, how could the protocol be designed to make these server interactions efficient? By attacking both of these issues, a new protocol would result in a much more scalable version of AFS.
