The first physical wave was all about decentralized connectivity. IMPs, and later routers, let a file be broken into many small blocks, or packets, and sent independently to a distant machine. The blocks took different routes and arrived out of sequence, sometimes duplicated or retransmitted, but no matter: they were all stitched back together at the receiving end. The distinguishing feature of this architecture was that no one machine controlled the journeys of these blocks. If a router broke down or a T1 communication line was severed, the emergent behavior of all those routers was to find another way to get the packets delivered. It worked in the face of disaster or misconfiguration, and the process of delivery was abstracted completely from the many applications that depended upon it.
The next physical wave is about decentralized storage. We have cheap hard drives and the servers to hold them, and we have lots of data to protect. The problem is that we manage the data in an old-fashioned, centralized way. Napster and its progeny were on the right track, but they were all about sharing: providing access through thousands of copies of a (music) file. That’s no good here. Today’s user demands privacy but wants the same convenience of machine independence.
Visualize this: In the same way that the TCP/IP protocols split up a file into blocks for transfer, let’s do it for storage. We’ll compress the data for efficiency, encrypt it, split it into thousands of anonymous blocks and store them on many different ‘block servers’. The block servers will be like stripped-down web servers, only smart enough to accept a block for storage under a 64-bit number and give it back later when presented with that same 64-bit number. If you break into one of these servers, what will you see? Hundreds of millions of encrypted blocks of exactly the same size, each addressed by one of these numbers.
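To make the block server idea concrete, here is a minimal in-memory sketch. The class name BlockServer, the put/get method names and the 4096-byte block size are all assumptions made for illustration; the post does not specify an interface or a block size, and a real server would sit behind a network protocol and write to disk.

```python
class BlockServer:
    """Toy block server: stores opaque, fixed-size blocks under 64-bit numbers."""

    BLOCK_SIZE = 4096  # assumed size; the post only says all blocks are the same size

    def __init__(self):
        # Maps a 64-bit address to a block. The server knows nothing else:
        # no owner, no filename, no ordering between blocks.
        self._blocks = {}

    def put(self, address: int, block: bytes) -> None:
        """Accept a block for storage under the given 64-bit number."""
        if not 0 <= address < 2**64:
            raise ValueError("address must fit in 64 bits")
        if len(block) != self.BLOCK_SIZE:
            raise ValueError("every block must be exactly BLOCK_SIZE bytes")
        self._blocks[address] = block

    def get(self, address: int) -> bytes:
        """Give the block back when presented with the same number."""
        return self._blocks[address]  # raises KeyError if nothing is stored there
```

The server deliberately has no notion of files or users, which is what keeps the stored blocks anonymous.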
Next, place some intelligence on the client computers that use this space. When it is time to store a file, software will create those blocks and send them to the block servers. But how will it decide which block server to use, and at which 64-bit number on that server to place each block? I’m sure you can dream up strategies, such as placing blocks at an ascending sequence of addresses on the next available server, but I suspect most of these ideas will require some central authority that regulates where everyone’s blocks must go in order to prevent conflicts. That’s no good for the next wave. We cannot grow a centralized storage system efficiently without running into technological (scaling) or political problems. We’ve tried that. It’s not working.
Here’s how we do it: Create a ‘storage schedule’ for a file on the fly, at the instant of storage, based on (1) the user’s privately held encryption key and (2) the complete pathname of the file to be stored. The schedule is a list of block servers and 64-bit numbers, one pair per block, computed with the ‘one-way’ functions that cryptographers have developed over the last 30 years. Store the blocks. Discard the schedule. At some time in the future when you want to retrieve the file, recreate the schedule from (1) the encryption key and (2) the file’s pathname, then use it to go to each block server in the list and ask for the particular block.
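The schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not name a one-way function, so HMAC-SHA256 stands in for it here, and the function name storage_schedule, the way the digest is split into a server index and an address, and the example key and pathname are all hypothetical.

```python
import hashlib
import hmac


def storage_schedule(key: bytes, pathname: str, num_blocks: int, num_servers: int):
    """Return one (server_index, address) pair per block of the file.

    Assumption: HMAC-SHA256 plays the role of the post's 'one-way' function.
    """
    schedule = []
    for block_index in range(num_blocks):
        # Bind the output to the pathname and the block's position in the file.
        message = pathname.encode("utf-8") + b"\x00" + block_index.to_bytes(8, "big")
        digest = hmac.new(key, message, hashlib.sha256).digest()
        # First 8 bytes choose the server, next 8 bytes are the 64-bit address.
        server_index = int.from_bytes(digest[:8], "big") % num_servers
        address = int.from_bytes(digest[8:16], "big")
        schedule.append((server_index, address))
    return schedule


# Storing and retrieving use the same call, so nothing needs to be saved.
when_storing = storage_schedule(b"my private key", "/home/me/taxes.xls", 4, 100)
when_retrieving = storage_schedule(b"my private key", "/home/me/taxes.xls", 4, 100)
assert when_storing == when_retrieving
```

Because the schedule is a pure function of the key and the pathname, it can be thrown away after storage and recomputed on any machine that holds the key.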
Let’s think about the ramifications of this technique. First, the ‘one-way’ functions spread blocks statistically evenly, so the servers all receive very nearly the same number of blocks; our hardware people will love us because all the equipment is used to peak efficiency. Secondly, the blocks of a file are located through a direct numerical calculation, unlike conventional solutions that require two or three database lookups. This eliminates the need for expensive IT staff to manage complicated mission-critical database servers. Thirdly, we have a storage algorithm with two variables. If we hold the encryption key constant and vary the file’s pathname, we have a hierarchical file system that depends only on that one encryption key and can grow to any size (as long as we add more block servers when they fill up). If we vary the encryption key but use the same file pathname, the schedule is still unique, so any number of independent file systems can co-exist on the same block servers. That gives unbounded scalability.
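The even-spreading claim is easy to check empirically. The sketch below assumes HMAC-SHA256 as a stand-in for the one-way function (the post does not name one); the helper names server_for and load_per_server and the made-up pathnames are hypothetical. It places the blocks of many files and counts how many land on each server.

```python
import hashlib
import hmac
from collections import Counter


def server_for(key: bytes, pathname: str, block_index: int, num_servers: int) -> int:
    """Pick a server for one block (assumed one-way function: HMAC-SHA256)."""
    message = pathname.encode("utf-8") + b"\x00" + block_index.to_bytes(8, "big")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def load_per_server(key: bytes, num_files: int, blocks_per_file: int, num_servers: int) -> Counter:
    """Count how many blocks each server receives for a set of made-up files."""
    counts = Counter()
    for file_number in range(num_files):
        pathname = f"/docs/file{file_number}.dat"  # hypothetical pathnames
        for block_index in range(blocks_per_file):
            counts[server_for(key, pathname, block_index, num_servers)] += 1
    return counts


counts = load_per_server(b"my private key", num_files=500, blocks_per_file=100, num_servers=10)
for server_index in sorted(counts):
    print(server_index, counts[server_index])
```

With 50,000 blocks over 10 servers, each count should come out close to 5,000. The balance is statistical rather than exact, and it improves in relative terms as the number of blocks grows.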
The real issue in everyone’s mind, however, is privacy. Why would I place my personal data on someone else’s server? Why would I trust someone to hold my data? Let’s analyze the security of this architecture. With conventional technology, looking for customer data is a bit like breaking into a bank. Once the thief gets through the ‘door’ by hacking through the security or bribing the sys admin, the file system is laid out before them and they can easily get to the specific ‘safety-deposit box’, or file, of a customer. Hackers can then use their formidable skills and resources (e.g. botnets) to try to ‘break open the box’ by cracking the encryption of that file. Now consider our system. We let them into the safe. A hacker can ask any block server for a block by specifying a 64-bit number. The problem they face is that they don’t know where to look. To rebuild a file without the encryption key that created the schedule, the hacker has to make 2^63, or roughly 9 billion billion, guesses at a hundred different servers, decrypt each block and reassemble the blocks in the correct order. This is like searching 100 haystacks to find a specific blade of grass: unlike a needle, the right block does not look any different from its neighbors. Such an attack would take longer than the age of the universe.
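The arithmetic behind the guessing figure can be worked through directly. The sketch below reads 2^63 as the expected number of tries to hit one particular address in a 64-bit space (half of 2^64), which is an interpretation rather than something the post states. The guessing rate of one million requests per second and the function name expected_search_years are assumptions chosen for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_search_years(address_bits: int = 64,
                          guesses_per_second: float = 1_000_000.0) -> float:
    """Expected years to find ONE specific block on ONE server by blind guessing.

    On average the attacker must try half of the address space.
    """
    expected_guesses = 2 ** (address_bits - 1)
    return expected_guesses / guesses_per_second / SECONDS_PER_YEAR


print(2 ** 63)  # 9223372036854775808, i.e. about 9 billion billion
print(expected_search_years())
```

This covers a single block on a single server. The total effort scales up with the number of servers to search and the number of blocks in the file, and scales down with whatever guessing rate the attacker can really sustain.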
With this technology it is finally possible to create the user-centric Internet of storage. It is possible to place all data into a distributed and homogeneous store of anonymous blocks with complete privacy for all participants. The protection of data will no longer require the machine-centric point of view of the past and users will comfortably store their data ‘on the net’ in complete confidence. It only makes sense.
Wednesday, July 4, 2007