Storing Files on Tape

Understand the different techniques used to solve the storing files on tape problem efficiently.

Introduction to storing files

Suppose we have a set of nn files that we want to store on a magnetic tape. In the future, users will want to read those files from the tape. Reading a file from tape isn’t like reading a file from a disk; first, we have to fast-forward past all the other files, and that takes a significant amount of time. Let L[1..n]L[1 .. n] be an array listing the lengths of each file; specifically, file ii has length L[i]L[i]. If the files are stored in order from 11 to nn, then the cost of accessing the kkth file is:

cost(k)=ki=1L[i]cost(k) = \underset{i=1}{\overset{k}\sum}L[i]

Expected cost of accessing a random file

The cost reflects the fact that before we read file kk, we must first scan past all the earlier files on the tape. If we assume for the moment that each file is equally likely to be accessed, then the expected cost of searching for a random file is:

E[cost]=nk=1cost(k)n=1nnk=1 ki=1L[i]E[cost] = \underset{k=1}{\overset{n}{\sum}}\frac{cost(k)}{n} = \frac{1}{n} \underset{k=1}{\overset{n}{\sum}} \space \underset{i=1}{\overset{k}{\sum}}L[i]

If we change the order of the files on the tape, we change the cost of accessing the files; some files become more expensive to read, but others become cheaper. Different file orders are likely to result in different expected costs. Specifically, let π(i)π(i) denote the index of the file stored at position ii on the tape. Then the expected cost of the permutation ππ is:

E[cost(π)]=1nnk=1 ki=1L[π(i)]E[cost(\pi)] = \frac{1}{n} \underset{k=1}{\overset{n}{\sum}} \space \underset{i=1}{\overset{k}{\sum}}L[\pi(i)]

Finding the best order to store files on tape

Which order should we use if we want this expected cost to be as small as possible? The answer seems intuitively clear: sort the files by increasing length. But intuition is a tricky business.

Lemma 1. E[cost(π)]E[cost(π)] is minimized when L[π(i)]L[π(i+1)]L[π(i)] ≤ L[π(i + 1)] for all ii.

Proof: Suppose L[π(i)]>L[π(i+1)]L[π(i)] > L[π(i + 1)] for some index ii. To simplify notation, let a=π(i)a=π(i) and b=π(i+1)b=π(i+1). If we swap files aa and bb, then the cost of accessing aa increases by L[b]L[b], and the cost of accessing bb decreases by L[a]L[a]. Overall, the swap changes the expected cost by (L[b]L[a])/n(L[b]− L[a])/n. But this change is an improvement because L[b]<L[a]L[b] < L[a]. Thus, if the files are out of order, we can decrease the expected cost by swapping some unordered pairs of files.

This is our first example of a correct greedy algorithm. To minimize the total expected cost of accessing the files, we put the file that is cheapest to access first and then recursively write everything else; no backtracking, no dynamic programming, just make the best local choice and blindly plow ahead. If we use an efficient sorting algorithm, the running time is clearly O(nlogn)O(n \log n) plus the time required to actually write the files. To show that the greedy algorithm is actually correct, we’ve proved that the output of any other algorithm could be improved by some sort of exchange.

Let’s generalize this idea further. Suppose we are also given an array F[1..n]F [1 .. n] of access frequencies for each file; file ii will be accessed exactly F[i]F[i] times over the lifetime of the tape. Now the total cost of accessing all the files on the tape is:

cost(π)=nk=1(F[π(k)].ki=1L[π(i)])=nk=1 ki=1L[π(i)]\sum cost(\pi) = \underset{k=1}{\overset{n}{\sum}} \begin{pmatrix} & F[\pi(k)] . \underset{i=1}{\overset{k}{\sum}} L[\pi(i)]\\ \end{pmatrix} = \underset{k=1}{\overset{n}{\sum}} \space \underset{i=1}{\overset{k}{\sum}}L[\pi(i)]

As before, reordering the files can change this total cost. So what order should we use if we want the total cost to be as small as possible? (This question is similar in spirit to the optimal binary search tree problem, but the target data structure and the cost function are both different, so the algorithm must be different, too.)

We already proved that if all the frequencies are equal, we should sort the files by increasing size. If the frequencies are all different, but the file lengths L[i]L[i] are all equal, then intuitively, we should sort the files by decreasing access frequency with the most-accessed file first. In fact, this is not hard to prove (hint, hint) by modifying the proof of Lemma 1. But what if the sizes and the frequencies both vary? In this case, we should sort the files by the ratio L/FL/F.

Lemma 2. cost(π)\sum cost(\pi) is minimized when L[π(i)]F[π(i)]L[π(i+1)]F[π(i+1)]\frac{L[\pi(i)]}{F[\pi(i)]} \leq \frac{L[\pi(i+1)]}{F[\pi(i+1)]} for all ii

Proof: Suppose L[π(i)]/F[π(i)]>L[π(i+1)]/F[π(i+i)]L[π(i)]/F[π(i)] > L[π(i + 1)]/F[π(i + i)] for some index ii. To simplify notation, let a=π(i)a = π(i) and b=π(i+1)b = π(i+1). If we swap files aa and bb, then the cost of accessing aa increases by L[b]L[b], and the cost of accessing bb decreases by L[a]L[a]. Overall, the swap changes the total cost by L[b]F[a]L[a]F[b]L[b]F[a] − L[a]F[b]. But this change is an improvement because

L[a]F[a]>L[b]F[b]L[b]F[a]L[a]F[b]<0.\frac{L[a]}{F[a]} > \frac{L[b]}{F[b]} \Leftrightarrow L[b]F[a] - L[a]F[b] <0.

Thus, if any two adjacent files are out of order, we can improve the total cost by swapping them.

Create a free account to access the full course.

By signing up, you agree to Educative's Terms of Service and Privacy Policy