Introducing Wakari - Scientific Python in the cloud

mileswu · on Nov 9, 2012

I can see this being useful for teaching environments, but I'm not so sure how useful this would be for actual scientific research.

Most universities tend to have their own computing cluster, which will make storing the (potentially large) datasets easier, and furthermore use of the CPU time is free!

Javascript plotting for the next set of graphs I need to produce for my research though is something I'm actually pondering about. It has the ability to be interactive which can be really helpful.

pwang · on Nov 9, 2012

> Most universities tend to have their own computing cluster, which will make storing the (potentially large) datasets easier, and furthermore use of the CPU time is free!

We plan on making the entire Wakari stack available to deploy on internal clusters. (Obviously there are many businesses that don't want to upload their data to a third-party service.)

> Javascript plotting for the next set of graphs I need to produce for my research though is something I'm actually pondering about. It has the ability to be interactive which can be really helpful.

Those plotting features are very alpha at this point, but they are all components that will be in Bokeh (https://github.com/ContinuumIO/bokeh), which will be available as a standalone plotting system for Python.

paddy_m · on Nov 9, 2012

We see it being helpful for research in a couple of ways. You have all of the great Python ecosystem at your fingertips for analysis. The really exciting part for scientists is the way you can seamlessly share code. Instead of having to install someone else's complete coding environment with all of the compilation steps, you will be able to simply see their code run.

draven · on Nov 9, 2012

Useful for heavy computation on smallish datasets, but not for moderate computation on big datasets. Here at work the datasets are several gigabytes so this would not be a solution.

After reading the blog post this looks like an enhanced IPython notebook with some additional packages. It could be a good platform to test-drive their technology though.

(I hope this comment does not come across as being too negative --- the tech is cool --- but I really wonder what the use case for this tool is.)

pwang · on Nov 9, 2012

> Useful for heavy computation on smallish datasets, but not for moderate computation on big datasets. Here at work the datasets are several gigabytes so this would not be a solution.

This is just a closed beta for the very first version of the product. We absolutely intend to make it a powerful platform for computation on big datasets.

Many people are storing data in S3, and many datasets are sourced from locations on the web or remote data providers, anyway. With Wakari, you can have extremely efficient access to those datasets via IOPro (http://continuum.io/iopro), which is adding indexing on S3 data in the next version.

We will also be adding sharing, collaboration, and publishing features in the next version. I completely understand that it might not fit every situation (or your particular situation right now), but I hope that eventually you'll find these other features fairly compelling.

> After reading the blog post this looks like an enhanced IPython notebook with some additional packages.

Not the case at all. We provide a fully sandboxed Linux environment, accessible via the browser, with a complete installation of Anaconda Pro. The IPython notebook is an additional feature of Wakari, and not vice versa.

One thing we haven't showcased very much is the fact that you have concurrent access to many different Python environments. Python 2.6 + Numpy 1.5? No problem. Want to try out Numpy 1.7 on Python 3.3? Just change the drop-down, and you get another shell with that. Want to play around with Travis Oliphant's Numba compiler for Python (https://store.continuum.io/cshop/numbapro), but don't want to go through the hassle of installing LLVM, LLVM-Py, and Numba on your own system? You're just one button click away.
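A small sketch for checking which environment a given shell actually landed in. This is not part of Wakari: `describe_environment` is a made-up helper name, and it assumes a modern Python (3.8 or later) where `importlib.metadata` is available, so it would not run on the 2.6 environment mentioned above.

```python
import sys
from importlib import metadata


def describe_environment(packages=("numpy", "scipy", "numba")):
    """Return the interpreter version and the installed version of each package.

    Packages that are not installed in this environment map to None.
    """
    report = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Print one line per entry so two shells can be compared side by side.
    for name, version in describe_environment().items():
        print(name, version or "not installed")
```

Running the same script in two different environments should make the interpreter and library differences obvious at a glance.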

Like I said, I can understand that Wakari isn't going to solve everyone's problems right now, but it's fairly cool even in this initial version, and it'll only improve over time. :-)

mileswu · on Nov 9, 2012

> We absolutely intend to make it a powerful platform for computation on big datasets.

The datasets I run on tend to be several terabytes and all the clusters that I use utilize various distributed filesystems (e.g. Lustre [1], dCache [2] or XRootD [3]) to store these files across local storage on all the nodes of the cluster.

I think for big datasets some kind of support for these distributed filesystems will be necessary as most private (scientific) computing clusters use them. The complication is of course that to use these filesystems, one needs to build support for each access protocol into a Python module (off the top of my head I'm not sure if they exist yet; perhaps they do). While most of these filesystems do offer standard POSIX 'mounting', this is usually not recommended for performance reasons.

Perhaps I'm outside of the intended use case anyway. To a certain extent this problem is solved for my field of research (of course nothing ever works all the time heh), as all LHC/OSG Computing Grid centers run the same Scientific Linux distribution with the theory being that any code that works on one should work on another.

But this sounds great for teaching classes and labs! I look forward to using it.

[1] http://wiki.lustre.org/index.php/Main_Page
[2] http://www.dcache.org/
[3] http://xrootd.slac.stanford.edu/

pwang · on Nov 10, 2012

Thanks for the feedback!

You're right that for scientific use cases, the data access story defines the mode and methods of computation. Your use case of having a large dataset stored across many files on a distributed FS is one of the cornerstone motivations for building Blaze, a next-generation distributed Numpy:

https://speakerdeck.com/sdiehl/blaze-next-generation-numpy

http://vimeopro.com/continuumanalytics/pydata-nyc-2012

Rather than traditional "load all the data into memory" sort of approaches, Blaze is inherently out-of-core, and allows the user to define mappings of an index space onto local or remote files, and then manipulate that structure at the Python prompt as easily as if it were a small matrix stored in memory. There are existing PGAS approaches that are similar in spirit, but they tend to invoke heavyweight MPI machinery or make assumptions about the regularity or structure of the dataset that is being distributed across the cluster. Our guidance in designing and implementing Blaze is to make simple distributed things easy, and hard things possible; we are not trying to solve decades of distributed computing and linear algebra problems as a start-up. :-)

So, hopefully you can see that with Blaze as your data access mechanism, and Wakari as a web-based front-end, you should be able to do large scale compute from within a web browser.
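To make the "mapping of an index space onto files" idea concrete, here is a toy sketch in plain Python. It is not Blaze and does not use any Blaze API: `FileBackedArray` and `write_floats` are names invented for this example, the files are assumed to be local and to hold raw little-endian float64 values, and nothing here is distributed.

```python
import bisect
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # one little-endian float64 per element


def write_floats(path, values):
    """Write values to path as raw float64, one element after another."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ITEM.pack(float(v)))


class FileBackedArray:
    """Read-only 1-D array whose elements live in several files on disk."""

    def __init__(self, paths):
        self.paths = list(paths)
        # starts[k] is the global index of the first element stored in paths[k].
        self.starts = []
        n = 0
        for p in self.paths:
            self.starts.append(n)
            n += os.path.getsize(p) // ITEM.size
        self.length = n

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Find the file holding global index i, then seek to the element.
        k = bisect.bisect_right(self.starts, i) - 1
        with open(self.paths[k], "rb") as f:
            f.seek((i - self.starts[k]) * ITEM.size)
            return ITEM.unpack(f.read(ITEM.size))[0]

    def mean(self):
        """Mean over all files, reading one element at a time (simple, not fast)."""
        if self.length == 0:
            return float("nan")
        return sum(self[i] for i in range(self.length)) / self.length


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    first, second = os.path.join(d, "a.bin"), os.path.join(d, "b.bin")
    write_floats(first, [0, 1, 2])
    write_floats(second, [3, 4])
    arr = FileBackedArray([first, second])
    print(len(arr), arr[3], arr.mean())
```

The point of the sketch is only that the caller indexes one logical array while the data stays on disk in separate files; a real system would batch reads and handle remote storage.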

dbaupp · on Nov 9, 2012

Will you (or do you) offer PyPy support too? It is looking to be the fastest Python implementation around, and even more so as its support for Numpy matures.

pwang · on Nov 9, 2012

Possibly; it is not on our immediate roadmap, but is something we can add if there is enough demand.

If you are interested in accelerating Numpy, have you seen Numba? One decorator can net you a several hundred times speedup over pure Python, and in many cases, a significant speedup over Numpy: http://www.slideshare.net/teoliphant/numba-lightning

http://continuum.io/blog/the-python-and-the-complied-python

https://vimeo.com/continuumanalytics/review/53105906/9f5dcbb...
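For a rough idea of what "one decorator" looks like, here is a minimal sketch. It is not taken from the linked talks: it uses the `njit` decorator from current Numba releases, `sum_of_squares` is just an illustrative function, and the fallback to plain Python when Numba is not installed is an assumption added so the snippet runs anywhere. No speedup figure is implied by this example.

```python
try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the function uncompiled.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in [0, n) with an explicit loop, the kind Numba compiles."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    print(sum_of_squares(4))
```

The function body is ordinary Python; the only change needed to compile it is the decorator line.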

rajbot · on Nov 9, 2012

> Javascript plotting for the next set of graphs I need to produce for my research though is something I'm actually pondering about. It has the ability to be interactive which can be really helpful.

If you haven't already, you should check out the wonderful IPython HTML Notebook:

http://ipython.org/ipython-doc/dev/interactive/htmlnotebook....

jh73 · on Nov 9, 2012

It's certainly true that a lot of principal investigators think this, but computing time is never free. A lot of the clusters have to win their own grants or fight for funding from the university. Nevertheless it looks like they are planning to include private cloud integration.

I'm really looking forward to using this.

paddy_m · on Nov 9, 2012

We will be coming out with a private cloud version. That's an interesting use case, managing your own cluster in a university. We anticipate demand for private cloud from private industry where security criteria are different.

w1ntermute · on Nov 9, 2012

And so continues the trend of software with Japanese names. Wakari, or 「わかり」, means "understanding" or "comprehension".

tkf · on Nov 9, 2012

Actually it's not a complete word. So, it's like "understan" or "comprehen". Probably because it's still beta.

w1ntermute · on Nov 10, 2012

Nope, it actually is a complete word: http://tangorin.com/general/%E3%82%8F%E3%81%8B%E3%82%8A

You probably came to that conclusion from the fact that the word 「分かります」 means "to understand". Actually, 「分かります」 is made up of two parts, 「分かり」 (the same word that started this whole discussion), the nominalized form of the verb 「分かる」 ("to understand"), and the polite suffix 「ます」. 「分かり」 (or 「わかり」, as it's often written when nominalized) can of course be used by itself as well.

> Probably because it's still beta.

From a software branding perspective, that would be a really stupid thing to do. It would severely damage your SEO, and you'd be stuck renaming all kinds of resources and files after you go out of beta.

wesm · on Nov 10, 2012

Whoa there cowboy. I'm pretty sure that tkf (if he's the same one who's submitted pull requests to my projects) is from the land of the rising sun. And his second statement was a joke. And funny IMHO

w1ntermute · on Nov 10, 2012

> I'm pretty sure that tkf (if he's the same one who's submitted pull requests to my projects) is from the land of the rising sun.

Him being Japanese doesn't automatically make him an authority on Japanese grammar. The small Midwestern town I grew up in is chock full of white-as-bleach European Americans whose ancestors have exclusively spoken English for several generations. Most of them couldn't tell you the difference between an adjective and an adverb if their lives depended on it.

The same goes for many of the foreign (read: white) English "teachers" in Japan. Most of them barely graduated from college, couldn't find any other work, and so they decided to go to Japan to teach English. In actuality, they should be going back to America and spending a couple years in middle school remedial English classes.

In fact, just to confirm, I got out my copy of Koujien (an authoritative Japanese dictionary comparable to Merriam-Webster's Collegiate Dictionary or the Oxford English Dictionary) to look up 「わかり」. And sure enough, there it is:

> わかり【分り】

> 　わかること｡さとること｡のみこみ｡会得えとく｡了解。

> And his second statement was a joke. And funny IMHO

Well, that went over my head. If so, then his first statement was probably meant as a joke as well.

nine_k · on Nov 9, 2012

「分かり」 even, so that it's not completely phonetic.

w1ntermute · on Nov 9, 2012

It can also be written like that, but the Google Japanese IME, which orders options according to the frequency with which they occur in the pages they index, gives 「わかり」 as the first option.

One common mistake made by over-eager foreign learners of Japanese is to try to write everything they can in kanji. In actuality, a lot of stuff is left in kana.

protonormal · on Nov 9, 2012

I'm a big fan of the concept, but this is hardly new (although the interactive graphs are great)

For anyone who wants something like this, check out the IPython notebook and Sage:

http://ipython.org/ http://sagemath.org/

stefanu · on Nov 9, 2012

Actually, if you look at it more closely and try it, you will see that Wakari includes IPython Notebook - in the right pane you can choose one of three different Python interpreters: python (pure), ipython or ipython notebook.

The ipython notebook is running remotely - in the cloud, not on your local machine. You have access to your explorative analytical workspace from anywhere and its state is persistent.

pwang · on Nov 9, 2012

This is our MVP, so there are a lot of shared features with some other existing projects. One of our "deep tech" features I don't think anyone else has is the ability to quickly and dynamically switch between different Python environments, meaning that you can easily try out new versions of libraries or test your code with new versions of Python.

We are also iterating on the overall UI and user workflow, to really facilitate the data exploration & analysis process with Python.

burcin · on Nov 10, 2012

lmonade (http://www.lmona.de) is another (free) scientific software distribution that tries to address these problems. The technology underlying lmonade, basically Gentoo Linux, can handle this.

    $ eselect python list
    Available Python interpreters:
      [1]   python2.6
      [2]   python2.7 *
      [3]   python3.1
      [4]   python3.2

pwang · on Nov 11, 2012

Thanks for the pointer! However, note that we offer all of the Scientific Python stack built against each of these interpreter versions (and against different Numpy/Scipy versions) as well. :-)

capkutay · on Nov 9, 2012

I enjoyed using Sage for my linear algebra/probability class. It has all the lightweight math functions built in, but you still have to write enough code to actually understand the problem you're solving.

anonymouz · on Nov 9, 2012

Also keep an eye on https://salv.us/; I think it is/will be more or less Sage run on a large cluster.

jderick · on Nov 9, 2012

I notice this seems to be a web-based portal to a Linux environment? If so, anyone care to explain the advantages over, say, just offering a VNC server to the same Linux env?

paddy_m · on Nov 9, 2012

Great question. This is our first release, and we will be fleshing out features, but here are a couple of advantages to our approach. We will be able to have much finer-grained sharing than is possible with VNC. Eventually you will be able to publish a single interactive plot, or a single environment to collaborate on. For many new users, VNC is a non-starter because of the setup involved. VNC works really well for controlling a single computer, but it doesn't help you much with managing clusters, Python execution environments, sharing, or many of the other things that we are doing with Wakari.

akoumjian · on Nov 9, 2012

I'm very excited to give this a try. Congrats on the MVP release.

pwang · on Nov 9, 2012

Thank you!

hogu · on Nov 9, 2012

and you've been approved!