It seems to me that journalists and activists could benefit from software that could help track networks, timelines, etc., and organize unstructured data the way an intelligence analyst would. Anything like that out there?
I've been working on a project called Aleph (https://github.com/pudo/aleph, live: http://data.occrp.org/), which is targeted at investigative reporters. It's supposed to handle a related set of problems, which is data integration of diverse public records and journalistic lead generation (i.e. what to investigate next). The road map for the coming two years includes some visual analytics similar to what I guess most people use Palantir for (we'll have to keep it very simple).
The big problem in NGO space is obviously finding engineers: we've got some budget for this stuff, but getting people to join us, work at a lower salary and then code their heart out (because teams are tiny) is hard.
On the fun side, you get to see your code applied to fighting real-world problems every day :)
There isn't any 10-ish-year-old, low-technical-risk analytical platform. The wheel is remade for every customer; Palantir is not a software company but a services one.
I think you're asking the wrong question. There is nothing special about Palantir software that you can't find in the open source world already. What Palantir does is just throw a bunch of bodies at the problem to "integrate" things.
If you truly want to provide an open source alternative then curating some jupyter notebooks with the right libraries and integration is all it takes. A properly curated set of R libraries with a nice interface would also do the trick.
The problem with Neo4j is that the end results are great, but the ingestion pipeline (especially for unstructured data) is very hard to make general purpose.
The ICIJ used a combination of Apache Tika, Nuix, Tesseract and a bunch of other components when loading data into Neo4j before interrogating it within Linkurious.
It's also worth noting that the Panama data set is riddled with data quality issues (even if this is understandable given the size of the team compared to the scale of the problem).
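To make the data quality point a bit more concrete, here is a minimal sketch of one common cleanup step before loading entities into a graph: collapsing spelling variants of the same name. This is an assumption about the kind of issue involved, not a description of what the ICIJ actually did; the suffix list, function names and sample names are all hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "corp", "llc"}

def normalize_name(name):
    """Lowercase, strip accents and punctuation, and drop legal-form suffixes."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def group_duplicates(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)

# Made-up sample records, not taken from any real data set.
raw = ["ACME Holdings Ltd.", "Acmé Holdings LIMITED", "Borealis Inc"]
for key, variants in group_duplicates(raw).items():
    print(key, "<-", variants)
```

Real data needs far more than this (transliteration, fuzzy matching, manual review), which is part of why a general-purpose ingestion pipeline is so hard.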
My first job out of university was with a Palantir competitor (Detica at the time I was hired, then BAE Systems Detica, then finally BAE Systems Analytics or something like that[1]). There's nothing general purpose about either of their platforms (nor the similar offering from SAS). Companies like that just throw a bunch of fresh graduates at the data, and they hand-write loads of custom ETL code for every data set. A lot of times, even the "analytics" are just shitty little pattern matches over tiny subgraphs formed from the data, the vast majority of which are also coded anew by the "analysts"[2] for each data set. Data quality issues were handled with a massive case analysis during ETL.
There's really no magic going on in such products — the only part that's really general purpose is the GUI used to view the end results. The rest of it is just a bunch of lowly peons doing a ton of gruntwork to hammer the data into a form that said GUI will accept.
[1]: That last name change was after I'd left. I didn't even make it a full year at the company before my conscience got the better of me and I quit.
[2]: An "analytics" job at Detica was really just a half-step above data entry. It was mind-numbing and soul-eating. There was a very high turnover rate because even new graduates were overqualified for the position, and almost everyone was miserable.
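For anyone wondering what a "pattern match over a tiny subgraph" looks like in practice, here is a rough sketch in plain Python. It is not Detica's code or anyone's product; the payment data, the names and the choice of pattern (three parties paying each other in a circle) are hypothetical and only meant to show how small these checks can be.

```python
from collections import defaultdict

# Hypothetical payment edges (payer, payee); purely illustrative.
PAYMENTS = [
    ("Shell A", "Shell B"),
    ("Shell B", "Shell C"),
    ("Shell C", "Shell A"),
    ("Shell C", "Vendor D"),
]

def find_payment_triangles(payments):
    """Return each directed 3-cycle a -> b -> c -> a once, starting at its smallest name."""
    pays = defaultdict(set)
    for payer, payee in payments:
        pays[payer].add(payee)
    triangles = set()
    for a in list(pays):
        for b in pays[a]:
            for c in pays.get(b, ()):
                if a in pays.get(c, ()) and len({a, b, c}) == 3:
                    # Rotate so the smallest name comes first, which dedupes rotations.
                    cycle = (a, b, c)
                    i = cycle.index(min(cycle))
                    triangles.add(cycle[i:] + cycle[:i])
    return sorted(triangles)

print(find_payment_triangles(PAYMENTS))
```

The pattern itself is trivial; the expensive part is the ETL that gets messy source records into clean (payer, payee) pairs in the first place.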
In the UK we have https://fullfact.org , an independent fact checking charity. It isn't quite the same thing, in that it is more of a service than software, but it is open source friendly; for example, it was the topic of a recent Lucene hackathon.
Most of the core platforms these companies use are built on open source. I think the fundamental reason there isn't something comparable in the free and open arena is that, at the end of the day, to corral the data volumes you will find yourself well on your way down the road to hell.
There is a minimum physical hardware cost in the multi-million dollar range just to get started.
Not to mention the labor cost to architect and operate said hardware. Anything at petabyte scale needs a minimum skeleton crew of maybe 3-4 people (if they are top talent; more if they aren't) across sysadmin, architect and data engineering roles to keep things production worthy. The industry is still light on people with those skills, so on top of the bare hardware you are probably talking about 600 to 800K in annual labor cost as a starting point.
This would all be before you get that much usage; the minute utilization becomes high, that skeleton crew won't cut it any more. The sysadmin will need to turn into an on-call ops team. Data engineering will need specialists for on-boarding vs analytics vs access layer.
If you are doing anything controversial, don't forget the importance of a solid security organization, which is highly challenging in the distributed computing space.
These are just the technical hurdles. Depending on the data you would be bringing in and on the locality, there are quite possibly legal barriers as well.
So again, it's not so much that this isn't possible, more that there hasn't been someone willing to endow the kind of funding that would be required to scale something of this nature solely for the public good.
Nah. I'm thinking of the people trying to understand the relationship between (for example) Trump and the Russians. They don't have that much data because they're not drinking the firehose of wiretaps. They're relying on news reports, contacts, financial documents, etc. It's a question of helping people connect the dots.
We are starting a program at Faraday (http://faraday.io) to offer complimentary access to journalists and educators. Faraday certainly isn't Palantir, but it's similar.
If you're a journalist or educator and want to check it out, shoot me an email, I'm andy@
I dunno, are there non-state actors with the capacity to infiltrate and disrupt militarized adversaries at a global scale in furtherance of the foreign policy motives of a superpower on the world stage?
I wonder if they'd release under the GPL3 or BSD licence. Maybe they'll let me fork them on GitHub.
Not exactly what you're asking for, but http://www.jplusplus.org/en/ has mailing lists and is a community of journalists looking through data. I'm sure they have alternatives or suggestions.
I'm thinking about software that does link analysis - this person works with that person, got paid by other person, etc. Basically, computer support for intelligence analysts, but in the public service.
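As a rough illustration of the "connect the dots" idea, here is a minimal sketch using only the Python standard library. It stores facts as (subject, relation, object) triples and uses breadth-first search to find the shortest chain of relationships between two entities. The names, relation labels and helper functions are all made up for the example; a real tool would add sources, dates and confidence to each link.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relation, object) facts; none come from a real data set.
FACTS = [
    ("Alice", "works_with", "Bob"),
    ("Bob", "paid_by", "Acme Holdings"),
    ("Acme Holdings", "owned_by", "Carol"),
    ("Dave", "works_with", "Alice"),
]

def build_graph(facts):
    """Index facts as an undirected adjacency map, keeping the relation label on each edge."""
    graph = defaultdict(list)
    for subject, relation, obj in facts:
        graph[subject].append((relation, obj))
        graph[obj].append((f"inverse of {relation}", subject))
    return graph

def find_connection(graph, start, goal):
    """Breadth-first search: return the shortest list of (entity, relation, entity) hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None

graph = build_graph(FACTS)
for hop in find_connection(graph, "Dave", "Carol"):
    print(" -> ".join(hop))
```

The graph search is the easy bit. Getting news reports, contacts and financial documents into clean triples like these is where most of the human effort goes.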
Are you certain that you want a competitor to Palantir? It sounds like you might want i2 Analyst's Notebook from IBM, or something like Maltego from Paterva. Maltego is also free (ish), which is cool.
Because that capability is too valuable. So if you're interested in writing code that can do that, you can find someone to make your life comfortable on the condition that they get to keep your code. If a repo of code that can do that starts growing, someone will come along offering its authors candy :-).
As it gets a bit more commonplace, or if tools to do some of the heavy lifting get out.
Keep an eye on programs like the one at CU Boulder; it should produce some interesting research that moves this along outside of the halls of industry.
Given their pricing (and the pricing of nearest competitors), combined with the relatively solved technical problem areas they operate in (connect to a data source, do some human driven NLP, draw a graph, show a map, map everything to some ontology), what they do is very ripe for some serious open source disruption.
The easy part is really the interface and information displays; the harder part (and where they make the lion's share of their money) is in data connection services and software customization.
Building an open source Palantir tool wouldn't be all that difficult; in fact, a great many organizations just build some subset of that tool using readily available open source components, with tighter coupling to their business needs. But these efforts are fractured and disorganized, and there isn't a great centralized open source tool that really replicates their system.
Should there be? I think the general problem of pulling together lots of information into a common pool, then being able to annotate that data and map it to a semantic model, is useful, and it generalizes well. But at the same time, many, many sources of data are already available in nice semantically organized ways, with simpler interfaces (think IMDB, Pouet, Wikipedia, etc.), so it's not quite clear that their approach offers enough payoff over these easier methods.
There are companies out there (Sumo Logic is one) which provide free access depending on unstructured data volume. But if you have huge volumes of data, in the terabytes, and you want to manage it using open source, you might end up spending so much money scaling and supporting the system that you should buy a paid version instead.
Look around in their 293 videos on that channel and I'm sure you'll see a wide variety of their products, I just pulled out the first one that popped up and looked like a demo. Remember, companies need to advertise to find new customers, so they can't be too secretive.
(My biggest tip to someone looking to interview at a company is to find their youtube channel and watch a few videos of them demo their product before you interview. You will be an expert on their product compared to 95% of the people they interview, which will make you stand out. If they publish white papers, read one or two that look interesting to you. If they publish white papers and none of them look interesting to you, that's a pretty important sign too. Sure, bone up on how to find a cycle in a directed graph, but spend an hour or two on the company themselves and it will be very rewarding.)