Airbnb open-sources Caravel: data exploration and visualization platform (github.com/airbnb)
422 points by caravel on March 31, 2016 | 92 comments



If anyone involved with this is around, I'm curious why Airbnb would build something like this - cost, performance, features, all of the above? Data querying and visualization is a pretty crowded field with a lot of commercial options to choose from[1].

I'm not knocking Caravel (it looks amazing) just curious why build vs buy in this case.

[1] Tableau, Looker, Periscope, Chartio, Qlikview, Gooddata are just some that come to mind.


Well, none of the solutions mentioned are open source.

Free as in beer is one incentive, as licenses are not cheap, and vendors know when they have you locked in and tend to milk everything they can.

More importantly, software for which we don't have control over the source is a risk. In this day and age anyone that cares enough should be able to push a bugfix/hotfix overnight. What if you'd have to wait for entire quarters or years for Tableau to parallelize their "live mode", or to get connectivity to Presto to work?

What if you want to integrate a new type of visualization that isn't supported? What if you want to integrate with your anomaly detection framework or your A/B testing framework or other internal or external facing applications?

Since this is a common need for most companies, it makes sense to have an open source solution that we can all use and collaborate on.


"Free as in beer is one incentive as licenses are not cheap, and vendors know when they have you locked down and tend to milk everything they can."

Free as in beer is never the answer. A project like this takes multiple engineer-years to build and maintain. That's hundreds of thousands of dollars, at least. How much is a site license? Are you sure? Have you negotiated the rate?

Even for "expensive" services, buying it from someone else is almost always cheaper than paying someone to maintaining it yourself, because expensive services are usually expensive for a good reason: they're niche, and finding someone with the expertise to build it is expensive. And having the source is for a product so that you can customize it is certainly a better answer, but it rarely happens, in practice. It's why we have gobs of open-source Apache-foundation products that nobody in their right mind wants to host in-house, unless they absolutely have to.

Developers have a real, well-documented resistance to paying for things, and it sucks. Because in reality, most development of open-source tools happens when someone gets paid to maintain the tool. If they don't, the tool falls into disrepair. Open-source software isn't free -- it's just paid for by someone else.


I work at a Fortune 100 and we use a fairly well known warehousing database. I can tell you we spend a serious amount of money for our licenses, infrastructure, employee talent, staff-aug contractors and management overhead running this commercial solution. Much of that human resource I would classify as overcompensated and non-productive.

We also use some of the previously mentioned visualization technologies and get generally poor results by all measures.

If AirBNB has a few engineers around who are part of solving a unique challenge AirBNB has, then they are probably getting a better deal than we are. I'd wager their productivity per resource is significantly better, not to mention licensing, infrastructure and control over product features. That said, not everyone is AirBNB and has the ability to attract and retain that type of talent.


But what's the alternative? Write your own warehousing database? Likely not. Even AirBnB hasn't done that.

I'm not saying that you wouldn't choose a good open source alternative if it exists. I'm saying that you shouldn't run off and implement something that exists commercially if your only reason is that the commercial thing costs money.


Yeah, building our own isn't a good option for us, for a few reasons:

1. Our problem set is not novel, we're more or less like many others out there and collectively we are a market that a vendor can build a product for

2. Since we have significantly more revenue than a startup like AirBNB we can rationalize and amortize an ongoing expense

3. We actively try not to do things that aren't our core competency

4. Even if we wanted to, we have plenty of human resources but not the talent to build our own analytic DB or data visualization tools

I think AirBNB is likely a much different company than most, in its approach to talent attraction/retention, the novelty of its problem set, and its ability and tolerance to make a multimillion dollar capital investment in analytic database and visualization tools. In their case the upfront cost of building a custom thing and the value prop likely make sense.


There's also an in-house cost in maintaining vendor software and integrating it. That's on top of the usual 20% maintenance fee.

Building a community can also help distribute the costs. But more importantly, it's a labour of collaboration instead of a client/vendor struggle.


If the cost of integrating the service is higher than the cost of building it, then sure, don't buy the service.

But if you find this to be a frequent occurrence, you're almost certainly evaluating costs incorrectly.


> than paying someone to maintain it yourself

You do realize that this is probably THE reason to open-source a piece of custom software after you've built it, right?

...I imagine that they expect the community will do X% of the maintenance work now, so they can save Y%, which is a win-win for both the company and the community. Also, you increase the probability that any new hires will be at least mildly familiar with your in-house stack if you open-source some of it (think TensorFlow).

Free as in beer is the best answer when you have reasonable expectations that the community will really embrace the product and you'll get back "free as in beer" upgrades and bugfixes to the software you won't otherwise have the budget to properly maintain (and properly document, btw! think of all the free tutorials that will be written for this after it gets popular!) :)


> Developers have a real, well-documented resistance to paying for things

This is why you never try to target developers, and if you do, you have to make sure the initial cost to use it is negligible. This is what Atlassian does very well with their pricing structure.

In the past, I developed in-house tools for massive enterprise companies, and the common theme was always "we can do it better and cheaper." Sometimes that's true, sometimes it's not. I noticed this trend slowly tapering off before I left to start my startup, and this was due to the availability of better open source solutions.

Technical people always overestimate what they are capable of, present company included, and if you are set on selling to developers, you have to make sure the barrier to use is negligible. By at least trying it, they will either come to the conclusion that what you have created is a piece of shit, or realize it's not worth their time and effort to duplicate/maintain it.


They mention that it was originally paired with Druid. The data volume that Druid excels at (and that AirBnB must have) is orders of magnitude larger than what Tableau and Looker do well with. It's probably just built for bigger-than-SQL OLAP use cases.


It's worth distinguishing between the tools that leave the data in your data warehouse (Caravel, Periscope, Mode, Looker (where I work)), and those that have their own data stores (Good Data, Qlikview, etc.). Tableau can connect directly to your datastore, but it's happier if it can operate on data that's stored locally in-memory.

Anyway, the ones where you bring your own database can scale as far as the database can take you.


Note that Tableau doesn't play well with Presto, which Airbnb uses extensively. No possibility of using the "live mode".


Not arguing that Airbnb should use Looker, but fwiw, Looker (where I work) does in fact connect to Presto fine.


Looker scales with plenty of data for us at Snapchat; it's more about the underlying database than the BI tool.


Note that this user used to work at Looker.


Yep, I did.


Looker performance comes from the underlying storage of data. You can store massive amounts of data on, say, a Redshift cluster and still be performant. Most visualization tools that I know of are directly tied to the storage tier in terms of performance.


I don't think Looker supports Druid.


That's correct. We (I work at Looker) are written to leverage the analytical capabilities of SQL. So we support all kinds of SQL implementations, from totally standard (PG, MySQL, MS) to MPPs (Redshift, BigQuery) to SQL-on-Hadoop (Hive, Impala, Presto, Spark).

But the JDBC -> Druid connectors that exist look pretty janky. So if someone builds a stable connector, I suspect we'd support it. But for the moment, no Druid.


To throw out an open source one: Metabase http://www.metabase.com https://github.com/metabase/metabase

Disclaimer: I work on Metabase.


Not really related to your product, but I find the trend of using emojis in commit messages really distracting. It's probably okay-ish very, very rarely for something huge, but a commit log that reads like Slack chat... I'm not sure about it.


Can't speak for Airbnb, but I'm not sure that any of the front-end clients that you mentioned (disclosure: I work at Looker), can talk to Druid. So if Airbnb already had a Druid warehouse in place, they may have decided it was easier to roll their own front-end than migrate to a different backend.


Druid is definitely part of the equation.

Larger, data-driven companies with significant engineering teams prefer not relying on 3rd party, closed-source vendors. That can represent a significant risk and a blocker for deeper integration with other internal applications when needed.

Not that building always wins over buying, but the balance shifts relative to the size of the company.

Also, when using open source on the receiving end of the equation, you want to be a good citizen and contribute back to the ecosystem. It ties to pride and passion, and reflects a strong engineering culture, which can help with recruiting.


Your points resonate with me very strongly, but I do have a counter argument.

I work on Google Cloud, where we have closed source (BigQuery), open source (Dataproc), and closed source that makes open source rock (Dataflow/Beam). There are merits to each.

BigQuery is serverless and multi-tenant and can't really exist outside of Google Cloud due to its intrinsic dependency on low-level services that don't exist elsewhere (and for the same reasons we couldn't directly externalize Borg, choosing to create an OSS clone in Kubernetes).

It is not unusual for me to hear from folks that they spent 6+ months building a performant and sizable Presto cluster, say. Then there's continuous management, tinkering, configuration, and optimization projects. I hear this from companies that one would consider technologically sophisticated.

By contrast, 40% of all of BigQuery's Petabyte customers scale to these levels without ever talking to us. We just find them on consumption reports. On multiple occasions we've had "surprise" load tests of millions of rows per second streamed into BigQuery, and it just works. BigQuery is also HA out of the box at no additional cost, which is a great luxury.

So sometimes if you need to scale analytics to Petabytes, the option is to just consume a managed and cost-effective service, or tinker with OSS, where there are significant operational tradeoffs. On the other hand, as you said, you build pride and culture. It's also far from an automatic shoo-in that OSS gives you better TCO (against the old closed-source guard, yes, but not so much BQ). Thus, the relationship between company size and value of technology can invert at higher levels.

With all that, I'd love to see a Caravel-BQ plugin :)

(PS: Kudos to Druid for introducing streaming ingest. BigQuery also sees value in streaming ingest, having GA'd our own Streaming API in March of 2015. And kudos to Airflow.)


Roughly how many petabyte customers are you talking about?


There is a study by a market analyst (Wayne Eckerson) about building vs buying BI; he has some good insights, based on surveys, as to why some companies choose to build and some choose to buy.

71% of those who chose to build the BI tools said they built because "We can customize the functionality better"

51% of those who buy say "Buying enables us to provide best-in-class BI functionality"

The study: http://www.jaspersoft.com/sites/default/files/confirmation_f...


It is even more crowded than that on the commercial side [0]. But having such a tool as open source benefits us all, especially when trying to connect to obscure or non-standard data sources.

[0] https://en.wikipedia.org/wiki/Online_analytical_processing#M...


part of it is probably recruiting


I am pretty sure they used multiple (and then most likely at some point too many) of those tools


A few more for your list: Leftronic (service) or Grafana (package).


investor's money, baby!


aesthetics


I really like the Python style you guys have adopted.

Grouping imports into standard lib, third party, and local is a strong pattern that I don't see done consistently in many repos. Likewise with your use of wrapping long imports with ()s and a single tab.
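To illustrate, that grouping and wrapping convention looks roughly like this (the module and class names here are just illustrative, not necessarily taken verbatim from the Caravel codebase):

    # standard library
    import json
    from datetime import datetime

    # third party
    import sqlalchemy as sqla
    from flask import Flask

    # local
    from caravel import models, utils

    # long import lists wrapped in parentheses with a single indent
    from caravel.models import (
        Dashboard, Database, Slice)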

Any chance of sharing your Python style guide? My startup is Python based (Django and Flask) and would really appreciate it!





I believe it's called a "Sankey Diagram", as denoted in the dropdown menu at the upper left.

Here is the original demo[1] from Mike Bostock, D3's author.

[1]: https://bost.ocks.org/mike/sankey/


Though they have been around much longer than D3 has... https://en.wikipedia.org/wiki/Sankey_diagram


I believe it's called a Sankey diagram:

http://bl.ocks.org/d3noob/5028304


Sankey Diagrams. Useful for complex funnels!


We always just called them flow diagrams. I didn't know they had a specific name.


Where is the data from? It's a great overview of CO2 emissions, etc.


The code is so clean and simple. This is great PR for the company. I want to work there.


I think so too!

I did a fair bit of Ruby a few years ago, but I'm new to Python CRUD apps and am trying to improve my knowledge here. Is defining all models in the same file[1] conventional in Python apps? Rails used to have separate files for each model. And most Ruby apps that I have seen advocate the one-class-one-file convention.

[1] https://github.com/airbnb/caravel/blob/master/caravel/models...


All models in one models.py file is common for Flask and Django.

If you use multiple apps within one Django project or the equivalent in Flask (Blueprints), that extends to one models.py per app (where a "project" is a collection of "apps").

Sometimes you'll see one file per model (with a models/__init__.py that imports them for use; sketched below). While I think it keeps dependency imports for each model very cleanly separated, you end up with a lot of redundancy importing the same basic pieces in every model file.
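A minimal sketch of that layout, assuming a Flask-SQLAlchemy-style shared db object (all file, module, and class names here are hypothetical):

    # models/user.py
    from myapp.database import db  # hypothetical shared SQLAlchemy instance

    class User(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        name = db.Column(db.String(120))

    # models/__init__.py
    # re-export the models so callers can keep writing `from myapp.models import User`
    from myapp.models.user import User
    from myapp.models.post import Post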


It depends on how many models, but usually 1 app == 1 models file (taking the app structure from Django), unless they are expected to grow.

For example, you could have a comments app that contains several models: Comment, Thread, Report, etc.; those can be in the same file. To continue with the Django example, I would personally prefer having a models folder in the comments app and one file per model, as some can get really big.

I also do 1 file / model in Flask, minus some specific cases where it just makes sense to have them in the same file.


It depends on the project.

If you have a small number of models (e.g. <= 5), then it's fine to have them all in one file, as you will not benefit from multiple files, really.

When your application is growing, you can split the models into multiple files, grouped by feature (e.g. users.py, content.py, etc).

I prefer this, as models are usually very small, and switching from one file to another can quickly become annoying when working on related models. However, it may be different for large classes.


Not from the number of lines of code. Some of them exceed 1000.


# of lines does not indicate good or poor coding


I'm wondering - what's the effort required to build such a BI tool (tables, charts, maps) these days assuming reusing open source components and focusing on SQL-speaking datastores? Could a small team of experienced devs accomplish such a feat in a year?



Is it covering the very beginning, when Caravel wasn't open source?


I really need a great data explorer/dashboard for my Postgres-based systems. I was going to use Shiny, but this looks really nice -- I hope the docs can be built out very soon. Can anyone comment on other competing products? In the commercial space, I like Looker but it's too pricey.


Really depends on your needs. There are lots of options out there that are happy to talk to Postgres, but each has different strengths and weaknesses. If all you need is a way to basically share and visualize the output of SQL queries and everyone who's using the tool can write SQL well, then look at Periscope or Mode.

If you're ok pulling the data out of Postgres into memory locally and mostly care about manipulation and beautiful dataviz, then look at Tableau.

If you're mostly interested in more data sciency/ML stuff, then Shiny or something else that's R-based is a good option.

If you're interested in being able to embed your business logic into the tool so that non-SQL folks can build their own queries and everybody's relying on the same data definitions, that's where Looker (disclosure: where I work) excels.


Take a look at http://www.metabase.com/ (open source, made by Expa, the incubator started by Uber's co-founder).


Second this recommendation. Very easy to get up and running - may not be able to handle some complex use cases, but for the basics it's fantastic.


It must have come very far in the past few months then...based on my experimentation, it was far from being ready for production use if you needed anything beyond the most basic of SQL queries. That said, I'm rooting for it (and now, caravel). We use Pentaho, and it's....meh. A lot of moving parts, a big learning curve, and a user experience that's just ok. I'm hoping Metabase (and now Caravel) will both succeed...the open source BI space could really use an influx of simpler, beautiful tools.

A side note: I love that Caravel is written in Python. Metabase switched from Python to Clojure, and I personally believe that to be a barrier to entry for contributing. I've started down the Clojure path a number of times only to stop because I see it as something which would be difficult to impose on my team...Lisp is just so different from what most enterprisey development teams are used to. Python, on the other hand, is easier to justify. I've found things I wanted to help fix in Metabase, but having to learn idiomatic Clojure just to submit a patch is a turn-off.


Hey, Sameer from the Metabase team.

If you don't mind, I'd love to hear why you felt we were far from being ready for production.

Our main challenge has been trading off ease of installation and use against the inevitable feature creep that causes the lots-of-moving-parts and learning-curve issues you find meh about Pentaho. You're right in that we've focused on making the most basic of SQL queries (via our non-SQL tool) usable by anyone in the company on their own, vs essentially producing an analytics SDK like Pentaho/Jaspersoft.

On the language front, we made a conscious decision to optimize for ease of installation and low maintenance overhead. While I agree it's made contributing less accessible, it's been amazing how the port has made Metabase more stable and easier to install. We run a bunch of instances for people, and our ops footprint has been silly small.


I'd love to...but I can't recall the password I used when I set up the H2 db my attempts are stored in, and SMTP password recovery isn't working (I probably didn't configure it).

If I recall, it was difficult to understand how to properly format the results of a raw SQL query for graphing. Also, even once it was graphing properly, saving the graph to a dashboard wouldn't scale properly and would render in a very jumbled manner. Perhaps these issues have been fixed in newer versions....I'll give it another look next week.


Take a look at http://www.viurdata.com. A simple to use product with transparent pricing, and you can use drag & drop or add your own SQL queries.

Disclaimer: I am one of the founders.


Hey! I run marketing for Periscope Data, a data explorer/dashboard product. We have a lot of customers using postgres and get compared to Looker a lot. We focus on optimizing for the analyst whereas tools like Looker focus on business users. We have a lot of features for business users, but chart creation is all SQL based.

Our site is here: https://www.periscopedata.com/ and if you have any questions, shoot me an email at jon@periscopedata.com.


Do you have a self-hosted option? What's your pricing?


Word on the street is Periscope costs $1000/mo for unlimited users, up to 1B rows.


+1 Having some upfront pricing info would be great


Hey, this is very interesting, I will take a look. I am trying to shift a whole $700m division, and then hopefully a $3b segment, onto a new workflow paradigm for data (automatically refreshed dashboards instead of sending around Excel files; focus on building models and not on pasting data into spreadsheets), and unfortunately for now Tableau is my only option. I feel very uncomfortable going with a closed solution, since I know that we will have lots of edge cases and that being able to do your own coding is in the end the best way to deal with those. License cost is also incredibly high: we are talking $200 per user per year at minimum, which means 200 to 400 thousand per year for a big organization with between 1000 and 2000 users.


I'm a huge proponent of the idea of centralizing the data model. That's the core idea behind Looker (where I just came to work after 3 years as a customer), and I agree it's a hugely powerful change from the world of everybody-in-their-own spreadsheet.

On your other point, though, to echo the build vs. buy discussion from above, I think it's a bit misleading to say "oh, we'll just use an open-source solution and that'll be cheaper." Because if open source means a couple of internal developers and an analyst, that's easily $300k+/year in salaries that you might not spend if you were using a vendor.

Anyway, given your particular statement of the problem you're facing, I'd humbly suggest you take a look at Looker. The data modeling layer that's core to Looker is meant to solve EXACTLY that problem, by leaving your data where it lives and then embedding your business logic in the layer that sits between end users and the data.


Wasn't Panoramix already open source?


It was (Panoramix got renamed to Caravel); it's just officially supported, maintained, and grown by Airbnb now.


It is not Java, but Python. What a surprise!


Have you considered using Re:dash[1] before writing your own tool?

[1] https://github.com/getredash/redash


Redash has only been around for 2.5 years, and AirBNB engineers probably thought it was too early-stage to take a look at.


Seems a little bit sparse on the documentation, or am I missing something?


We'll be providing short user training videos very soon.


As someone who is the core audience for this tool, can I say that I strongly prefer clear documentation over videos? Videos are way too hard to maintain and end up being stale the minute after you post them in fast-moving projects. I can't text-search a video and I can't be linked directly to an answer in a StackOverflow response.

Written documentation is vastly superior to videos in my opinion.


To add: you can also skim text MUCH more quickly for the piece you want, compared to video.

I absolutely loathe video for any analytics-related documentation. It rarely adds any real value over text outside of live webinars where I can ask questions.


Seconded. Videos are not useful as documentation.


Would very much appreciate that as well as walk-through docs.

Got it up and running easily enough, and connected to Redshift. But it seemed like creating a new "slice" required custom JSON params to define it. Unless I missed something?

edit: yep, missed something. Can "explore" a table by clicking its link in the table listing.


Is there an API for building viz in code similar to something like Bokeh, or is it end user viz design only?


Having some trouble getting it working on Windows, which I see you don't currently support (I need to create caravel_config.py to get past the fabmanager installation step). This looks really interesting but I might wait until someone has posted Windows instructions.


I got it to work on Windows using the Anaconda Python 2.7 installation; keep in mind that the caravel commands in the docs have to be run from your install dir, e.g. change dir to

<yourPythonInstallDir>\Lib\site-packages\caravel\bin

then run as

python caravel db upgrade


Great looking product! Has anyone figured out how to join tables yet, or do you need to define views in your sql database?


It's Apache licensed, so does that mean I can use it directly in my company, replacing existing commercial products like Tableau / Looker?


What are the supported data sources? I saw SQL tables and I imagine flat files; can you write calls to web service endpoints?


From the readme file:

Database Support

Caravel was originally designed on top of Druid.io, but quickly broadened its scope to support other databases through the use of SqlAlchemy, a Python ORM that is compatible with most common databases[1].

[1]http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html


Specifically, SQLAlchemy includes dialects out of the box for: Firebird, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite, Sybase.

http://docs.sqlalchemy.org/en/rel_1_0/dialects/index.html


And it has adapters for Hive, Presto, Redshift and Google BigQuery.
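For a rough sense of how those connections are specified, SQLAlchemy uses URL-style connection strings. A minimal sketch, with placeholder hosts, credentials, and database names:

    from sqlalchemy import create_engine

    # dialects that ship with SQLAlchemy itself
    pg = create_engine("postgresql://user:password@dbhost:5432/analytics")
    my = create_engine("mysql://user:password@dbhost/analytics")
    lite = create_engine("sqlite:////tmp/analytics.db")

    # Presto, Hive, Redshift and BigQuery dialects come from separate adapter
    # packages and follow the same URL scheme once installed (e.g. "presto://...")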


Awesome - great work!

A tutorial on how to link it to a mysql database would be greatly appreciated :)


They're using SQLAlchemy as a database abstraction layer which supports MySQL out of the box.

So, you just need to set the config param SQLALCHEMY_DATABASE_URI like this:

https://github.com/airbnb/caravel/blob/1b4e750b2aa111445703d...

The configuration guide explains it further:

https://github.com/airbnb/caravel/blob/master/docs/installat...
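Following that suggestion, a minimal sketch of the relevant line in caravel_config.py (host, credentials, and database name are placeholders):

    # caravel_config.py
    # SQLAlchemy URI format: dialect://username:password@host:port/database
    SQLALCHEMY_DATABASE_URI = 'mysql://caravel_user:caravel_pass@localhost:3306/caravel'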


Is this like piwik?


Piwik is more of a Google Analytics replacement. It's a package that contains a data visualization platform, a data storage engine, and a data emitter (website tag) all in one.

This is just a data visualization platform. You need to bring your own data store and data.


Any pictures anywhere?



