1. It should work great for huge files, especially over LANs. Right now a 1-byte change means a re-download, but one of the first things we're doing is introducing diff-based downloads :)
2. It should
3. We're not 100% sure how we're going to charge for it yet, but anyone who signs up for the beta will be grandfathered into whatever system we end up using (i.e. you won't have to pay)
4. I'm actually going to address this in a separate post, but we've designed it so that you'll actually stream files from one device to the other (based on a least-recently-used policy), so in effect you should have access to ~104GB of data. (See the sketch after this list.)
5. This relates to #4, but is up to you, largely. If you have computers that are up 24/7, great. If you don't, you can leverage our cloud servers for better availability.
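Regarding #4: here's a minimal, purely illustrative sketch (Python, my own toy code, not AeroFS's) of how a least-recently-used policy lets a small device keep only its hottest files locally while colder files stay on, and stream back from, the bigger device:

    from collections import OrderedDict

    class LruFileCache:
        """Toy sketch of the LRU policy described in #4: keep hot files on the
        small device and evict cold ones; the evicted bytes stay on another
        device and are streamed back on demand."""

        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.files = OrderedDict()            # path -> size, oldest first

        def touch(self, path, size):
            """Record an access; evict least-recently-used files if over capacity."""
            if path in self.files:
                self.files.move_to_end(path)      # mark as recently used
                return
            self.files[path] = size
            self.used += size
            while self.used > self.capacity and len(self.files) > 1:
                _evicted, sz = self.files.popitem(last=False)
                self.used -= sz                   # evicted file would later be streamed from a peer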
#4 is interesting—but how do you define "devices?"
I have, say, 500GB of music. It doesn't fit on my laptop, so it "lives" on an external HD. Could it sync the laptop with the HD so that I always still have access to a set chunk—say, 10GB—of my most-recently-used music files? This is a capability I've been waiting for something to support ever since I bought the drive.
A device is anything that can run AeroFS. We can (and plan to) do exactly what you described for caching data between _devices_, but we haven't really considered the use case where you want access to your most recently used external HDD data (in this case, the external HDD is really more like a part of your laptop device).
Could you run multiple instances of AeroFS on the one device and assign different stores to each? That might be a reasonable workaround, assuming that it's not more work than doing it "properly".
I am working on something very similar in my free time as a side project. Nothing to show yet, but I am currently playing with rsync to support incremental, diff-based syncs.
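For what it's worth, here's roughly the shape of a block-level diff sync (a simplified, fixed-offset version of what rsync does; real rsync also uses a rolling checksum to match blocks at arbitrary offsets). This is just an illustrative sketch, not anyone's shipping code:

    import hashlib

    BLOCK = 4096  # fixed block size for this toy version

    def block_digests(data):
        """Receiver: digest each block of its (old) copy and send the list to the sender."""
        return [hashlib.md5(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)]

    def changed_blocks(new_data, old_digests):
        """Sender: return (block_index, bytes) pairs that must actually be transmitted;
        everything else is reused from the receiver's existing copy."""
        delta = []
        for i in range(0, len(new_data), BLOCK):
            idx = i // BLOCK
            block = new_data[i:i + BLOCK]
            if idx >= len(old_digests) or hashlib.md5(block).digest() != old_digests[idx]:
                delta.append((idx, block))
        return delta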
The streaming part is neat, providing access to data well beyond the storage capacity of your current machine. ZumoDrive (former YC company) also focuses on the streaming aspect.
This is just me throwing out a toy idea here, but consider that most computers have large chunks of unused space. Now draw a parallel to projects like Folding@Home, which make use of unused CPU cycles for computation. Wouldn't it be nice to make use of the world's unused storage in exchange for providing your own? This post made me think of that, since it sounds like such a 'p2p supercloud' could be just another peer type: an "opt in to share 100GB, get 80GB of redundant off-site backup for free" kind of system.
Now, I am not asking you to implement this peer type (although that would rock my world if you did), but would it be possible for someone to implement it themselves? In other words will you be providing a 'peer API'?
80GB of redundant storage = 160GB at least. More than that, really, because you can hardly count on any one node, so you need more than two copies.
This means you should get more like 10-20GB per 100GB you commit, otherwise the cloud simply will not have enough space.
Then consider that even with many nodes holding your data, there is a decent chance all of them are offline at any given time. You need many, many nodes for the odds to be small enough, which means the best solution is to use the storage you committed as one of the nodes, so it is always available to you. At that point it really transforms into a cloud backup system rather than a cloud file system.
You wouldn't store full copies; you'd stripe the data across multiple machines using some type of error-correcting code, like Reed-Solomon, which has less than 2x overhead.
You are talking about RAID5. However, RAID5 is useless if more than a few disks go offline at the same time.
RAID1/10 is most useful when there's a higher chance of multiple disks failing at a time, or when the odds of multiple disks failing in your RAID5, while low, are unacceptable.
Of course there are other things at work when you talk RAID0/5/10, but this is a large part of it.
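To put rough numbers on the erasure-coding overhead mentioned above (my own back-of-the-envelope, not anything AeroFS has said):

    def storage_overhead(data_shards, parity_shards):
        """Reed-Solomon style layout: the file is split into `data_shards` pieces
        plus `parity_shards` parity pieces; any `data_shards` of them are enough
        to reconstruct the file, so up to `parity_shards` peers can vanish."""
        return (data_shards + parity_shards) / data_shards

    print(storage_overhead(10, 4))    # 1.4x overhead, survives any 4 peers disappearing
    print(storage_overhead(10, 10))   # 2.0x overhead, survives any 10 -- vs 3.0x for three full copies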
I've used Wuala on Linux and have found it quite rough around the edges.
* not completely decentralized or open source. If Wuala goes out of business, your data may not be recoverable.
* web interface is lacking (poor folder navigation/listing)
* doesn't work without X on Linux
* even with X, not all features are available through the command line or API, though some are
* the interface that is provided is clunky
* the status messages leave me wondering where in the process an update is. If a piece of software can't reliably tell me where it is in a process, I can't trust that the process is happening the way I expect.
Yep, Wuala is not completely open source and it is a business. Your worry that they might go out of business is mitigated quite a bit by the fact that they were acquired by LaCie [1] over a year ago.
I suppose Linux definitely isn't their main market, and while the web interface is lacking at the moment, they are working on an overhaul.
I wonder about this too, but it could make sense in offices where bandwidth is LAN speed, uptime is predictable, and disks are larger than necessary and unlikely to be full of personal media.
Although I have no good ideas about where it would be beneficial, apart from being fun.
"Each AeroFS device has its own 1024bit RSA key pair, which is certified by us to be authentic."
That suspiciously reads like the AeroFS people get a copy of your key. If that's the case then it's only marginally more secure than DropBox. Hope I'm reading that wrong...
We generate a temporary password for the user being invited and encode it in the invitation code sent to the user's email address. We use this temp password to verify the user when he/she signs up, and destroy the password immediately after. During initial setup, the user's device generates its own key pair and sends a CSR (certificate signing request) to us for certification.
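In case it helps picture the flow, the device-side key pair + CSR step would look something like this. This is a generic sketch using the Python `cryptography` package, not AeroFS's implementation; the 1024-bit size just mirrors the number quoted above (2048+ would be typical today), and the device identifier is made up:

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    # Generate the device's own key pair locally; the private key never leaves the device.
    key = rsa.generate_private_key(public_exponent=65537, key_size=1024)

    # Build a CSR containing only the public key and an identifier for the device.
    csr = (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"my-device-id")]))
        .sign(key, hashes.SHA256())
    )

    # This PEM blob is what gets sent to the certifying server for signing.
    csr_pem = csr.public_bytes(serialization.Encoding.PEM)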
I'm curious to know more of the technical details of how this works, like what protocols and technologies you're using. If that information isn't too sensitive, that is. :-)
We developed a lot of the protocols and technologies ourselves, and could talk about them for hours :) Let me know if you have a specific area you want me to discuss.
In short, AeroFS is a decentralized data management system running on top of p2p overlay networking.
The overlay network layer presents to the data management layer a transport-agnostic view of the Internet, and addresses peers using network-independent identifiers. In this way, data management can talk to any peer regardless of network topologies and firewall restrictions, as if the world is flat :)
The data management layer controls data versioning and update propagation in a fully decentralized way. As I described in another comment, we use version-vector-like data structures to track versions and manage conflicts. We use modified epidemic algorithms (http://portal.acm.org/citation.cfm?id=41841) for fast update propagation. AeroFS distinguishes between peers and super peers. Super peers can help with update propagation and peer communication in many ways.
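As a rough illustration of why epidemic propagation is fast (my own toy simulation, not AeroFS's actual algorithm): in push gossip, every peer that already has an update forwards it to one random peer each round, and the whole network converges in roughly log(n) rounds:

    import random

    def rounds_to_full_propagation(n_peers, seed=0):
        """Toy push-gossip simulation: each round, every peer holding the update
        pushes it to one randomly chosen peer. Returns the number of rounds
        until all peers have it -- typically O(log n)."""
        random.seed(seed)
        has_update = [False] * n_peers
        has_update[0] = True                  # the update starts on a single peer
        rounds = 0
        while not all(has_update):
            holders = [i for i, h in enumerate(has_update) if h]
            for _ in holders:
                has_update[random.randrange(n_peers)] = True
            rounds += 1
        return rounds

    print(rounds_to_full_propagation(1000))   # usually converges in a dozen or so rounds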
A lot of research went into building group communication toolkits like Spread and Ensemble; today they are used mainly in server/cluster environments. I was playing with re-purposing one of these (JGroups) for my pet project. The toolkit implements the abstractions I need (a peer communication channel, peer discovery, etc.). It also provides various protocol stacks, so I can use a different one for streaming movies vs. syncing files.
RE: "it just works" - you're right of course, and we're definitely trying to make it as simple as possible.
There are some features we're going to implement down the road that can be done better with p2p solutions though (aggregated storage across devices, for example), so I hope you give us a chance! :)
On the other hand, I do. You don't have to be as big to be the biggest fish in a smaller pond. (Or as 'The Dip' puts it, to be the best in the world start by shrinking the world)
Absolutely. This project has the potential to remedy a problem that, to date, I have not been able to solve.
I have computers in multiple geographically diverse locations and need a large amount of data (terabytes) to always appear in each location. Other requirements:
1. Direct sync between my devices, with no third-party cloud involved
2. Fast local sync when two devices are detected on the same subnet
3. Since individual files can be 20 GB in size, interrupted synchronizations should automatically resume when the connection is re-established (without having to start over from the beginning)
4. Encrypted transmission of data, but not encrypted on disk
5. When renaming or moving a file from one folder to another, the system should be smart enough to detect that there's no need to re-transmit that file (i.e., it just needs to rename it or move it to the new location on all other devices)
6. Ability to throttle upstream/downstream bandwidth on a per-device basis
Neither Dropbox nor CrashPlan -- nor any other tool -- has been able to meet all of these requirements.
In short, this is a very exciting and welcome development. I sincerely hope that this problem will soon be solved!
1) How does this system handle two devices behind separate NATs? (aka a work device and a home device.)
2) What is the conflict resolution protocol if a file is modified in two or more locations? (Newest wins, automatic duplication for manual resolution, etc.)
1) We use ICE/STUN as well as relaying for firewall penetration. 2) We use a modified version of version vectors (http://en.wikipedia.org/wiki/Version_vector) and accompanying algorithms to detect and resolve conflicts. In a decentralized system, conflict management boils down to managing the causal relationships between distributed updates, and version vectors were invented for exactly that :)
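A minimal sketch of the detection part (my own illustration of plain version vectors, not AeroFS's modified scheme): each device keeps a per-device update counter, and a conflict is exactly the case where neither vector dominates the other:

    def compare(vv_a, vv_b):
        """Compare two version vectors (device id -> update counter).
        Returns 'a_newer', 'b_newer', 'equal', or 'conflict' when the
        updates were concurrent and neither side dominates."""
        devices = set(vv_a) | set(vv_b)
        a_ahead = any(vv_a.get(d, 0) > vv_b.get(d, 0) for d in devices)
        b_ahead = any(vv_b.get(d, 0) > vv_a.get(d, 0) for d in devices)
        if a_ahead and b_ahead:
            return "conflict"
        if a_ahead:
            return "a_newer"
        if b_ahead:
            return "b_newer"
        return "equal"

    # The same file edited independently on two devices -> concurrent updates -> conflict
    print(compare({"laptop": 2, "desktop": 1}, {"laptop": 1, "desktop": 2}))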
However, having looked at the Wikipedia page on version vectors, it appears to be a mechanism for detecting conflicts. I was interested in how you resolve them.
A simple example is a zip file that I add file A to on one computer and later file B to on another computer. When I sync up do I end up with a zip containing no new files, file A, file B, both files or a corrupt zip file. (Does the answer change if the zip file is encrypted?)
I see. There are two categories of conflicts to resolve: meta conflicts (like when you rename a file to "foo" on device A and meanwhile rename it to "bar" on B) and data conflicts (i.e. the example you gave).
We will formally describe meta conflict resolution in a separate post. Because resolution for data conflicts is very application specific, we will publish an API to allow application developers to write their own conflict resolvers. Meanwhile, we will try to provide resolvers for popular file types by default.
From the end user's view, in most cases conflicts are automatically resolved without being noticed. User intervention is required if automatic resolution fails or the user wants to manually merge.
Can you go into a little more detail about sharing?
Are my files kept on my devices when shared or does it implicitly mean shared files are read/write accessible by others?
If so, is there a way of managing permissions?
We have implemented full-fledged access control, including file ownership, read/write permissions on data/metadata, list/add/remove permissions on directories, etc. But we've disabled it in the interface to keep the user experience as simple as possible.
Later on we may enable them based on use cases and user feedback. Our API will include ACL management as well. Currently files are read/write accessible once shared.
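Purely as a mental model of the permission types described above (entirely hypothetical field names; AeroFS hasn't published this API):

    from dataclasses import dataclass, field

    @dataclass
    class AclEntry:
        """Hypothetical per-user permissions, mirroring the categories above."""
        read_data: bool = True
        write_data: bool = False
        write_metadata: bool = False
        list_children: bool = False
        add_children: bool = False
        remove_children: bool = False

    @dataclass
    class SharedFolder:
        owner: str
        acl: dict = field(default_factory=dict)   # user id -> AclEntry

    folder = SharedFolder(owner="alice")
    folder.acl["bob"] = AclEntry(write_data=True)  # "currently files are read/write once shared"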
File syncing that includes mobile devices seems to be becoming increasingly important as their storage space grows. I'm excited about AeroFS and am looking forward to seeing more posts about the technical aspects.
What about performance? Since many connections these days are asymmetric DSL, I assume download/upload speeds may be poor if only a few hosts are involved and they're all on ADSL.
1) Does it work well with huge files, 1GB+ etc.? Will a 1-byte change mean a complete re-download on all devices?
2) Does it work well with 100k small files in deeply nested folders?
3) Will you charge for software and/or support?
4) What happens when one of the devices doesn't have enough storage? 4GB SSD laptop vs. 100GB HDD.
5) Will any of my computers have to be up 24/7?