1. It should work great for huge files, especially over LANs. Right now a 1-byte change means a re-download, but one of the first things we're doing is introducing diff-based downloads :)
2. It should
3. We're not 100% sure how we're going to charge for it yet, but anyone who signs up for the beta will be grandfathered into whatever system we end up using (i.e. you won't have to pay)
4. I'm actually going to address this in a separate post, but we've designed it so that you'll actually stream files from one device to the other (based on a least-recently-used policy), so in effect you should have access to ~104GB of data. (See the sketch after this list.)
5. This relates to #4, but is up to you, largely. If you have computers that are up 24/7, great. If you don't, you can leverage our cloud servers for better availability.
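Regarding #4: here's a minimal, purely illustrative sketch (Python, my own toy code, not AeroFS's) of how a least-recently-used policy lets a small device keep only its hottest files locally while colder files stay on, and stream back from, the bigger device:

    from collections import OrderedDict

    class LruFileCache:
        """Toy sketch of the LRU policy described in #4: keep hot files on the
        small device and evict cold ones; the evicted bytes stay on another
        device and are streamed back on demand."""

        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.files = OrderedDict()            # path -> size, oldest first

        def touch(self, path, size):
            """Record an access; evict least-recently-used files if over capacity."""
            if path in self.files:
                self.files.move_to_end(path)      # mark as recently used
                return
            self.files[path] = size
            self.used += size
            while self.used > self.capacity and len(self.files) > 1:
                _evicted, sz = self.files.popitem(last=False)
                self.used -= sz                   # evicted file would later be streamed from a peer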
#4 is interesting—but how do you define "devices?"
I have, say, 500GB of music. It doesn't fit on my laptop, so it "lives" on an external HD. Could it sync the laptop with the HD so that I always still have access to a set chunk—say, 10GB—of my most-recently-used music files? This is a capability I've been waiting for something to support ever since I bought the drive.
A device is anything that can run AeroFS. We can (and plan to) do exactly what you described for caching data between _devices_, but we haven't really considered the use case where you want access to your most recently used external HDD data (in this case, the external HDD is really more like a part of your laptop device).
Could you run multiple instances of AeroFS on the one device and assign different stores to each? That might be a reasonable workaround, assuming that it's not more work than doing it "properly".
I am working on something very similar in my free time as a side project. Nothing to show yet, but I am currently playing with rsync to support incremental, diff-based syncs.
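For what it's worth, here's roughly the shape of a block-level diff sync (a simplified, fixed-offset version of what rsync does; real rsync also uses a rolling checksum to match blocks at arbitrary offsets). This is just an illustrative sketch, not anyone's shipping code:

    import hashlib

    BLOCK = 4096  # fixed block size for this toy version

    def block_digests(data):
        """Receiver: digest each block of its (old) copy and send the list to the sender."""
        return [hashlib.md5(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)]

    def changed_blocks(new_data, old_digests):
        """Sender: return (block_index, bytes) pairs that must actually be transmitted;
        everything else is reused from the receiver's existing copy."""
        delta = []
        for i in range(0, len(new_data), BLOCK):
            idx = i // BLOCK
            block = new_data[i:i + BLOCK]
            if idx >= len(old_digests) or hashlib.md5(block).digest() != old_digests[idx]:
                delta.append((idx, block))
        return delta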
The streaming part is neat, providing access to data well beyond the storage capacity of your current machine. ZumoDrive (former YC company) also focuses on the streaming aspect.
This is just me throwing out a toy idea here, but consider that most computers have large chunks of unused space. Now draw a parallel to projects like Folding@Home, which make use of unused CPU cycles for computation. Wouldn't it be nice to make use of the world's unused storage in exchange for providing your own? This post made me think of that, since it sounds like such a 'p2p supercloud' could be just another peer type: an "opt in to share 100GB, get 80GB of redundant off-site backup for free" kind of system.
Now, I am not asking you to implement this peer type (although that would rock my world if you did), but would it be possible for someone to implement it themselves? In other words will you be providing a 'peer API'?
80GB of redundant storage = 160GB at least. More than that, really, because you can hardly count on any one node, so you need more than two copies.
This means you should get more like 10-20GB per 100GB you commit, otherwise the cloud simply will not have enough space.
Then consider that even with many nodes holding your data, there is a decent chance all of them are offline at any given time. You need many, many nodes for the odds to be small enough, which means the best solution is to use the storage you committed as one of the nodes, so it is always available to you. At that point it really transforms into a cloud backup system rather than a cloud file system.
You wouldn't store full copies; you'd stripe the data across multiple machines using some type of error-correcting code, like Reed-Solomon, which has less than 2x overhead.
You are talking about RAID5. However, RAID5 is useless if more than a few disks go offline at the same time.
RAID1/10 is most useful when there's a higher chance of multiple disks failing at a time, or when the odds of multiple disks failing in your RAID5, while low, are unacceptable.
Of course there are other things at work when you talk RAID0/5/10, but this is a large part of it.
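To put rough numbers on the erasure-coding overhead mentioned above (my own back-of-the-envelope, not anything AeroFS has said):

    def storage_overhead(data_shards, parity_shards):
        """Reed-Solomon style layout: the file is split into `data_shards` pieces
        plus `parity_shards` parity pieces; any `data_shards` of them are enough
        to reconstruct the file, so up to `parity_shards` peers can vanish."""
        return (data_shards + parity_shards) / data_shards

    print(storage_overhead(10, 4))    # 1.4x overhead, survives any 4 peers disappearing
    print(storage_overhead(10, 10))   # 2.0x overhead, survives any 10 -- vs 3.0x for three full copies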
I've used Wuala on Linux and have found it quite rough around the edges.
* not completely decentralized or open source. If Wuala goes out of business, your data may not be recoverable.
* web interface is lacking (poor folder navigation/listing)
* doesn't work without X on Linux
* even with X, not all features are available through the command line or API, though some are
* the interface that is provided is clunky
* the status messages leave me wondering where in the process an update is. If a piece of software can't reliably tell me where it is in a process, I can't trust that the process is happening the way I expect.
Yep, Wuala is not completely open source and it is a business. Your worry that they might go out of business is mitigated quite a bit by the fact that they were acquired by LaCie [1] over a year ago.
I suppose Linux definitely isn't their main market, and while the web interface is lacking at the moment, they are working on an overhaul.
I wonder about this too, but it could make sense in offices where bandwidth is LAN speed, uptime is predictable, and disks are larger than necessary and unlikely to be full of personal media.
Although I have no good ideas about where it would be beneficial, apart from being fun.
"Each AeroFS device has its own 1024bit RSA key pair, which is certified by us to be authentic."
That suspiciously reads like the AeroFS people get a copy of your key. If that's the case then it's only marginally more secure than DropBox. Hope I'm reading that wrong...
We generate a temporary password for the user being invited and encode it in the invitation code sent to the user's email address. We use this temp password to verify the user when he/she signs up, and destroy the password immediately after. During initial setup, the user's device generates its own key pair and sends a CSR (certificate signing request) to us for certification.
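In case it helps picture the flow, the device-side key pair + CSR step would look something like this. This is a generic sketch using the Python `cryptography` package, not AeroFS's implementation; the 1024-bit size just mirrors the number quoted above (2048+ would be typical today), and the device identifier is made up:

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    # Generate the device's own key pair locally; the private key never leaves the device.
    key = rsa.generate_private_key(public_exponent=65537, key_size=1024)

    # Build a CSR containing only the public key and an identifier for the device.
    csr = (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"my-device-id")]))
        .sign(key, hashes.SHA256())
    )

    # This PEM blob is what gets sent to the certifying server for signing.
    csr_pem = csr.public_bytes(serialization.Encoding.PEM)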
I'm curious to know more of the technical details of how this works, like what protocols and technologies you're using. If that information isn't too sensitive, that is. :-)
We developed a lot of the protocols and technologies ourselves, and could talk about them for hours :) Let me know if you have a specific area you want me to discuss.
In short, AeroFS is a decentralized data management system running on top of p2p overlay networking.
The overlay network layer presents to the data management layer a transport-agnostic view of the Internet, and addresses peers using network-independent identifiers. In this way, data management can talk to any peer regardless of network topologies and firewall restrictions, as if the world is flat :)
The data management layer controls data versioning and update propagation in a fully decentralized way. As I described in another comment, we use version-vector-like data structures to track versions and manage conflicts. We use modified epidemic algorithms (http://portal.acm.org/citation.cfm?id=41841) for fast update propagation. AeroFS distinguishes between peers and super peers. Super peers can help with update propagation and peer communication in many ways.
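As a rough illustration of why epidemic propagation is fast (my own toy simulation, not AeroFS's actual algorithm): in push gossip, every peer that already has an update forwards it to one random peer each round, and the whole network converges in roughly log(n) rounds:

    import random

    def rounds_to_full_propagation(n_peers, seed=0):
        """Toy push-gossip simulation: each round, every peer holding the update
        pushes it to one randomly chosen peer. Returns the number of rounds
        until all peers have it -- typically O(log n)."""
        random.seed(seed)
        has_update = [False] * n_peers
        has_update[0] = True                  # the update starts on a single peer
        rounds = 0
        while not all(has_update):
            holders = [i for i, h in enumerate(has_update) if h]
            for _ in holders:
                has_update[random.randrange(n_peers)] = True
            rounds += 1
        return rounds

    print(rounds_to_full_propagation(1000))   # usually converges in a dozen or so rounds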
A lot of research went into building group communication toolkits like Spread and Ensemble; today they are used mainly in server/cluster environments. I was playing with re-purposing one of these (JGroups) for my pet project. The toolkit implements the abstractions I need (a peer communication channel, peer discovery, etc.). It also provides various protocol stacks, so I can use a different one for streaming movies vs. syncing files.
RE: "it just works" - you're right of course, and we're definitely trying to make it as simple as possible.
There are some features we're going to implement down the road that can be done better with p2p solutions though (aggregated storage across devices, for example), so I hope you give us a chance! :)
On the other hand, I do. You don't have to be as big to be the biggest fish in a smaller pond. (Or as 'The Dip' puts it, to be the best in the world start by shrinking the world)
Absolutely. This project has the potential to remedy a problem that, to date, I have not been able to solve.
I have computers in multiple geographically diverse locations and need a large amount of data (terabytes) to always appear in each location. Other requirements:
1. Direct sync between my devices, with no third-party cloud involved
2. Fast local sync when two devices are detected on the same subnet
3. Since individual files can be 20 GB in size, interrupted synchronizations should automatically resume when the connection is re-established (without having to start over from the beginning)
4. Encrypted transmission of data, but not encrypted on disk
5. When renaming or moving a file from one folder to another, the system should be smart enough to detect that there's no need to re-transmit that file (i.e., it just needs to rename it or move it to the new location on all other devices)
6. Ability to throttle upstream/downstream bandwidth on a per-device basis
Neither Dropbox nor CrashPlan -- nor any other tool -- has been able to meet all of these requirements.
In short, this is a very exciting and welcome development. I sincerely hope that this problem will soon be solved!
1) How does this system handle two devices behind separate NATs? (aka a work device and a home device.)
2) What is the conflict resolution protocol if a file is modified in two or more locations? (Newest wins, automatic duplication for manual resolution, etc.)
1) We use ICE/STUN as well as relaying for firewall penetration. 2) We use a modified version of version vectors (http://en.wikipedia.org/wiki/Version_vector) and accompanying algorithms to detect and resolve conflicts. In a decentralized system, conflict management boils down to managing the causal relationships between distributed updates, and version vectors were invented for exactly that :)
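A minimal sketch of the detection part (my own illustration of plain version vectors, not AeroFS's modified scheme): each device keeps a per-device update counter, and a conflict is exactly the case where neither vector dominates the other:

    def compare(vv_a, vv_b):
        """Compare two version vectors (device id -> update counter).
        Returns 'a_newer', 'b_newer', 'equal', or 'conflict' when the
        updates were concurrent and neither side dominates."""
        devices = set(vv_a) | set(vv_b)
        a_ahead = any(vv_a.get(d, 0) > vv_b.get(d, 0) for d in devices)
        b_ahead = any(vv_b.get(d, 0) > vv_a.get(d, 0) for d in devices)
        if a_ahead and b_ahead:
            return "conflict"
        if a_ahead:
            return "a_newer"
        if b_ahead:
            return "b_newer"
        return "equal"

    # The same file edited independently on two devices -> concurrent updates -> conflict
    print(compare({"laptop": 2, "desktop": 1}, {"laptop": 1, "desktop": 2}))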
However, having looked at the Wikipedia page on version vectors, it appears to be a mechanism for detecting conflicts. I was interested in how you resolve them.
A simple example is a zip file that I add file A to on one computer and later file B to on another computer. When I sync up do I end up with a zip containing no new files, file A, file B, both files or a corrupt zip file. (Does the answer change if the zip file is encrypted?)
I see. There are two categories of conflicts to resolve: meta conflicts (like when you rename a file to "foo" on device A and meanwhile rename it to "bar" on B) and data conflicts (i.e. the example you gave).
We will formally describe meta conflict resolution in a separate post. Because resolution for data conflicts is very application specific, we will publish an API to allow application developers to write their own conflict resolvers. Meanwhile, we will try to provide resolvers for popular file types by default.
From the end user's view, in most cases conflicts are automatically resolved without being noticed. User intervention is required if automatic resolution fails or the user wants to manually merge.
Can you go into a little more detail about sharing?
Are my files kept on my devices when shared or does it implicitly mean shared files are read/write accessible by others?
If so, is there a way of managing permissions?
We have implemented full-fledged access control, including file ownership, read/write permissions on data/metadata, list/add/remove permissions on directories, etc. But we've disabled it in the interface to keep the user experience as simple as possible.
Later on we may enable them based on use cases and user feedback. Our API will include ACL management as well. Currently files are read/write accessible once shared.
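Purely as a mental model of the permission types described above (entirely hypothetical field names; AeroFS hasn't published this API):

    from dataclasses import dataclass, field

    @dataclass
    class AclEntry:
        """Hypothetical per-user permissions, mirroring the categories above."""
        read_data: bool = True
        write_data: bool = False
        write_metadata: bool = False
        list_children: bool = False
        add_children: bool = False
        remove_children: bool = False

    @dataclass
    class SharedFolder:
        owner: str
        acl: dict = field(default_factory=dict)   # user id -> AclEntry

    folder = SharedFolder(owner="alice")
    folder.acl["bob"] = AclEntry(write_data=True)  # "currently files are read/write once shared"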
File syncing that includes mobile devices seems to be becoming increasingly important as their storage space grows. I'm excited about AeroFS and am looking forward to seeing more posts about the technical aspects.
What about performance? Since many connections these days are asymmetric DSL, I assume download/upload speeds may be poor if only a few hosts are involved and they're all on ADSL.
1) Does it work well with huge files, 1GB+ etc.? Will a 1-byte change mean a complete re-download on all devices?
2) Does it work well with 100k small files in deeply nested folders?
3) Will you charge for software and/or support?
4) What happens when one of the devices doesn't have enough storage? 4GB SSD laptop vs. 100GB HDD.
5) Will any of my computers have to be up 24/7?