How To Set Up a Tahoe-LAFS Distributed Filesystem on Ubuntu 14.04 Servers
Tahoe-LAFS is a distributed “least-authority” filesystem that can easily be scaled out into a robust, network-aware storage grid. Much as RAID provides redundancy at the disk level, Tahoe-LAFS encrypts files and splits them into pieces distributed among multiple servers.
In this guide, we will set up a Tahoe-LAFS system using three Ubuntu 14.04 VPS instances. This will allow us to distribute our data among 3 separate servers, giving us redundancy and a bigger storage pool than any single server.
Install the Tahoe-LAFS Components
The Ubuntu 14.04 repositories contain the most recent version of Tahoe-LAFS at the time of this writing (1.10.0). We can use this to install the components on each of our servers.
sudo apt-get update
sudo apt-get install tahoe-lafs
This will install the Tahoe-LAFS Python package in a system-accessible location.
To isolate our configuration, let’s create an unprivileged user specifically for dealing with our Tahoe-LAFS configuration:
sudo adduser tahoe_user
Choose a password for your user. Next, we’ll transition to our new user for the remainder of this guide:
su - tahoe_user
Configure an Introducer to Manage Node Connections
The first step that we need to take in configuring our cluster is to set up our “introducer”. The introducer is simply a node that functions as a hub to connect clients and storage nodes.
The introducer (or the introducer component if the machine also contains a storage node) does not actually transfer any files or store any data. Its sole responsibility is to collect the addresses of the storage locations and give them to the clients who request access to the cluster. The actual data does not pass through the introducer.
We will only need to run through this step on our first machine. All of the other servers will use this same introducer to connect to the cluster and discover each other.
To get started, we can run our first Tahoe command. This will initialize our Tahoe introducer within a hidden directory in our home directory called .introducer:
tahoe create-introducer ~/.introducer
This will create the directory and put some configuration files inside. We won’t have to adjust any of these settings, because they’re pretty basic.
To bring the introducer online, we just need to tell Tahoe to start this component:
tahoe start ~/.introducer
Your introducer is now up and you can begin creating and connecting your storage nodes to this “grid” (the name of the collective set of nodes working together).
But first, we need to extract one vital piece of information from our new introducer: the FURL address that our storage nodes will use to connect. The introducer writes this into a file in its base directory, which you can view by typing:

cat ~/.introducer/private/introducer.furl

(On older Tahoe-LAFS releases, this file may be located at ~/.introducer/introducer.furl instead.)

This is the network address that we need in order to configure our storage nodes, so copy and paste it somewhere that you will have access to later.
Configure Storage Nodes to Hold Data
Now that we have our introducer component configured and online, we need to start making our storage components so that we will actually have a pool of storage available to write to.
In a production environment, you’ll probably want to have your introducer node on a completely separate VPS instance from your storage units. You would also want to distribute your data across a much larger pool than we will be demonstrating. For this guide however, we will have 3 storage nodes, one of which also has our introducer.
On each of your servers, create a regular storage node by typing:

tahoe create-node
This will create a general folder structure similar to our introducer directory, but with different information. It is located in ~/.tahoe by default.
We need to adjust the configuration file before we can continue. Open it with your editor:

nano ~/.tahoe/tahoe.cfg
First, create a unique name for each node. This will be used to identify each server and can give you clues when debugging issues:

nickname = node-name
Next, we’ll have to adjust the web.port parameter. Currently, it only accepts connections from the local computer. You can change it to 0.0.0.0 to allow connections on all of your interfaces, or bind it to your server’s specific IP address:

web.port = tcp:3456:interface=0.0.0.0
Next, we need to find the introducer.furl parameter and set it to the introducer FURL that we copied during our introducer configuration. It should look something like this (tub_id stands in for the long identifier string your own introducer generates):

introducer.furl = pb://tub_id@10.128.1.228:57972,127.0.0.1:57972/ysslmcvu23f7opkxj4db5r5mdan3pj6m
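As an aside, the connection hints embedded in a FURL can be pulled apart with standard shell tools. This is only an illustrative sketch; the FURL below is a made-up example, not a value from a real introducer:

```shell
# A FURL has the form pb://<tub-id>@<hint>,<hint>,.../<swissnum>
# (hypothetical example value; substitute your own introducer.furl contents)
furl='pb://tub_id@10.128.1.228:57972,127.0.0.1:57972/ysslmcvu23f7opkxj4db5r5mdan3pj6m'

# Strip the scheme/tub-id prefix and the trailing swissnum, leaving the
# comma-separated list of host:port connection hints.
echo "$furl" | sed 's|^pb://[^@]*@||; s|/[^/]*$||'
# prints: 10.128.1.228:57972,127.0.0.1:57972
```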
The next parameters that we need to adjust are the commented-out shares.needed, shares.happy, and shares.total settings. Let’s go into a bit more depth about what these parameters mean:
What do the Share Parameters Mean?
The shares.* parameters are perhaps some of the most important configuration decisions you’ll have to make while setting up your Tahoe-LAFS system. They control how much redundancy you have and how much space your data will take up.
In short, when a file is uploaded, it is broken into “shares”. Each share, when combined with a certain number of other shares, can be used to recreate the file.
The shares.total parameter dictates exactly how many shares are created when a file is uploaded. By default, this is “10”.

The shares.needed parameter is the number of shares needed to recreate the entire file. By default, this is “3”. If this were set to “1” instead, each share would be a complete copy of the file, so the file would effectively be stored in full 10 times.
These two parameters combine to create the share policy for the client. The default values dictate that when a file is uploaded, it will be divided into 10 different shares. These shares will be redundant so that any 3 of them can be used to recreate the file.
This means that dividing shares.total by shares.needed gives us the storage overhead for a file. In the default arrangement (10 divided by 3), your files will take up roughly 3.3 times as much storage space as storing the file on a regular disk.

This may seem like a lot, but you also gain the ability to lose any 7 of those shares and still reconstruct the complete file.
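The arithmetic above is easy to sanity-check. A quick sketch using the default values from this section (the variable names are ours, not Tahoe’s):

```shell
needed=3   # shares.needed: shares required to rebuild a file
total=10   # shares.total: shares created per upload

# Expansion factor: raw storage consumed relative to the original file size.
awk -v n="$needed" -v t="$total" 'BEGIN { printf "expansion factor: %.1fx\n", t/n }'
# prints: expansion factor: 3.3x

# Fault tolerance: shares that can be lost while the file stays recoverable.
echo "shares that can be lost: $((total - needed))"
# prints: shares that can be lost: 7
```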
The last parameter, shares.happy, is used to configure how your shares are distributed between servers. While the other shares.* parameters deal with individual shares, the shares.happy parameter deals with actual servers.

It sets the minimum number of servers that must be available to receive shares of a file. If fewer servers than this are available, the upload will fail. This ensures that your data is spread out over a reasonable number of nodes. By default, this is set to “7”.
Since we only have 3 nodes available, we’ll need to adjust some of these defaults. Because each server can hold multiple shares of the same file, the only parameter we really need to change is shares.happy, which must be less than or equal to the number of storage servers.
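To make that constraint concrete, here is a trivial sketch of the check Tahoe-LAFS effectively applies at upload time (the variable names are ours, chosen for illustration):

```shell
servers=3        # storage nodes connected to the grid
shares_happy=3   # the value we will set in tahoe.cfg

# An upload can only succeed when shares.happy <= available servers.
if [ "$shares_happy" -le "$servers" ]; then
    echo "uploads can satisfy shares.happy"
else
    echo "uploads will fail: shares.happy exceeds available servers"
fi
# prints: uploads can satisfy shares.happy
```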
For our purposes, we will uncomment all three of these lines and change shares.happy to “3”, our total number of servers:
shares.needed = 3
shares.happy = 3
shares.total = 10
These are fine for demonstration, but you’ll want to tune them more depending on your needs (do you need very high redundancy, or is disk space more of an issue?).
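Taken together, the edits made to ~/.tahoe/tahoe.cfg on each node might look like the fragment below (node-name is a placeholder you should vary per server; the section names follow the standard tahoe.cfg layout):

```ini
[node]
nickname = node-name
web.port = tcp:3456:interface=0.0.0.0
# introducer.furl goes in the [client] section on recent versions;
# paste the value from your introducer here.

[client]
shares.needed = 3
shares.happy = 3
shares.total = 10
```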
When you are finished editing the files on each node, you can start each storage node by typing:

tahoe start
Interacting with the Web Interface
After all of your nodes are up, you can access the Tahoe-LAFS filesystem by browsing to any of your VPS instances on port 3456, the port we set in the web.port parameter.
You should get a web page that shows all of your nodes and an interface to your grid:
This is an overview of your nodes. Because Tahoe-LAFS doesn’t assume that you are running on your own hardware or that you are the only operator, it does not include an interface to query data by name. It assumes that the owner will keep the retrieval information that is given upon uploading.
Let’s try to use the interface a little bit to demonstrate.
First, let’s create a directory by clicking the “Create a directory” button along the left-hand side. The default selection (SDMF) is fine for our purposes:
You will be taken to the directory page that looks something like this:
This directory does not have a name in the standard sense that we are used to. Instead, it is identified by an ID (actually, a few IDs). Because it doesn’t have a name and Tahoe-LAFS doesn’t implement searching for security reasons, we will be unable to get back to this directory unless we store its information now.
You can either bookmark this page in the browser, or you can record the IDs associated with the directory. To get the IDs, click on the “More info on this directory” link at the top of the page:
You will be taken to a page with the storage index, the object type, and most importantly, the access caps or Tahoe-URIs. These are listed in a table in the center of the page:
This might seem like a lot of information required to access a simple directory, and it is. For the directory, we have an access URI for read and write access, and a separate one for read-only access that we can give out.
This is the information that we need to come back to this directory. You can copy and paste this somewhere. Click on the “JSON” link for a more digestible presentation if you are scripting:
Inside of the directory, we can upload files easily using the available buttons. You can make them immutable (read-only), or writeable by selecting the radio buttons. You will not have to keep track of all of the URIs for these files as long as they are kept in a directory you have the URI of.
If you want more capabilities, there are a variety of alternative front-ends to Tahoe-LAFS listed here. Of particular interest may be the command line interface and the SFTP interface, which can help you manage your files in a much easier way. For instance, the CLI allows you to create aliases for your directories and files so that you don’t have to keep track of URIs.
By now, you should have a good idea of how to set up a distributed, encrypted filesystem across a variety of different storage media, whether you trust the administrators or not. This allows you to avoid some of the issues of trust inherent in having others manage your data.
By Justin Ellingwood