Hacker News

Is Google's HR data set really "Big Data" or just "data?"

Seems like it would fit into a normal database. Or maybe even an unwieldy Excel spreadsheet.




I quote my friend who works in "Big Data":

"sometimes I think Big Data is just Excel on 128GB of RAM"


Big data is a misnomer; complex data is a better description. Having a terabyte of simple data with 2 columns is really not that difficult to analyze and won't give you much information, whereas a few hundred MB of data with complex relationships and many dimensions can yield tons of information and is far more difficult to analyze.

Difficulty in "big data" should be about its horizontal breadth (covering many aspects of a system) rather than its vertical depth (covering one aspect of a system in great resolution).


Not to go all senselessly pedantic, but doesn't Excel have a limit of like 55,000 rows?


The devil is in the details. Big Data is really a massive cluster of VMs running maxed out Excel spreadsheets, and instrumented to restart automatically and restore from redundant backup, a la RAID, when the Excel process crashes one of the Windows VMs.


If you're being pedantic, it's 1,048,576 rows from Excel 2007, 65,536 rows before that.


2^20 rows! :)
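For the curious, both of Excel's documented row limits really are exact powers of two (a throwaway Python sanity check, nothing Excel-specific):

```python
# Excel's documented worksheet row limits:
# 65,536 rows (2^16) through Excel 2003, 1,048,576 rows (2^20) since Excel 2007.
old_limit = 2 ** 16
new_limit = 2 ** 20
print(old_limit)  # 65536
print(new_limit)  # 1048576
```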


Current limit is 1,048,576 per worksheet. However there is a tool called Power Pivot which lets you get around that limit and do analysis on larger data sets.


I didn't see this response when I replied to the parent. Power Pivot is pretty great when you can use it.


Not since 2007, when it got bumped to over 1 million. http://office.microsoft.com/en-us/excel-help/excel-specifica...


He's saying his file has way more rows than that, so maybe they upped the limit in more recent versions of Excel? (I think he also wrote a bunch of VBA and hooked into some external systems too.)


Not when you use it with Power Pivot.


16,384 columns and 1,048,576 rows, actually.


2^14 columns * 2^20 rows * 8 bytes per cell = 128GB. Bang on.
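That back-of-envelope arithmetic checks out exactly, assuming 8 bytes (one double-precision float) per cell, which is the commenter's assumption rather than how Excel actually stores cells:

```python
# A maxed-out Excel worksheet, at one float64 per cell:
columns = 2 ** 14           # 16,384
rows = 2 ** 20              # 1,048,576
bytes_per_cell = 8
total_bytes = columns * rows * bytes_per_cell
print(total_bytes // 2 ** 30, "GiB")  # 128 GiB, bang on
```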


Just data. Working with "big data" is just boring old business intelligence in 99.9% of the cases.


Is it just me, or is this kind of comment one of the most common on HN lately whenever a story about data analysis comes up?

We get it, big data isn't really "big" unless you're talking giga(tera?)bytes.

Don't take it upon yourself to educate every single person who misuses the term. It's not worth it. :)


I guess it's a natural reaction to people wanting to jump on that "big data" bandwagon. Kind of like size envy I guess? So sad...


True, the point is that many people writing these stories can't really tell (or don't care about) the difference. "Big data" is a sexy term, so they go with it regardless of whether it's actually relevant.

Most of the people here do, so these comments are really preaching in the wrong place...


Probably not, but the year that I joined they had processed a million resumes. So they probably have some level of data (ranging from phone screen only to on-site interview) on anywhere from 8 to 12 million engineering candidates. For the folks who have come on-site there might be 5-8k words of text in their file; for phone screens, probably less than 1k, depending on whether they include a code sample or not. Most of the folks they processed at the time didn't get to on-site interviews, so it probably skews to the lower end.

It's "not" big data in the sense that it needs a cluster to process, but it is a pretty large sample set of the current population of engineers who might want to work there.


But then you'd have a tech company that isn't spruiking a buzzword.


They've reported receiving 1 million applications per year. If even a fraction of those get interviewed (with 1-5 interviews per candidate), that's a good chunk of data. Correlate that with regular performance reviews of 30k employees... I'd say that's a small Big Data problem.


30k? Data. Not Big Data

And "Small Big Data" is probably data as well.


He's not talking about 30k rows, he's talking about 30k people. It could easily be big data if you monitor & document their every working moment, but they probably aren't doing that so you're probably right.


Yes, 30k people, so it's what? Some interview reports, some performance reviews, HR report/history of the employee?

It really doesn't look like something big.


1 million applications received. Say 10% of those go into some sort of evaluation process = 100k assessments/year. Say 10% of those go through an interview panel of (on average) 3 interviews = 30k assessments/year. For 30k employees with (say, on average) 2 assessments per year = 60k assessments/year.

So that's 1 million CVs per year on which to do some sort of evaluation, and roughly 200k individual assessments per year. Over the past five years, that's roughly 6 million data points.

Since there's no hard-and-fast rule on this, that's why I called it small Big Data.
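For what it's worth, the estimate above reproduces in a few lines (the 10% pass rates, the 3 interviews per panel, and the 5-year window are all the commenter's stated assumptions, not Google's numbers):

```python
applications_per_year = 1_000_000
evaluations = applications_per_year // 10          # 100k assessments/year
panel_interviews = (evaluations // 10) * 3         # 10k candidates x 3 interviews = 30k/year
employee_reviews = 30_000 * 2                      # 60k performance assessments/year
assessments_per_year = evaluations + panel_interviews + employee_reviews  # 190k, "roughly 200k"
five_year_total = (applications_per_year + assessments_per_year) * 5
print(five_year_total)  # 5950000 -- roughly 6 million data points
```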


Even if it's 100 million rows, that's something a single beefy server running SQL Server 2012 can handle. That's not big data.

Big data is a million times 100 million rows.


> Big data is a million times 100 million rows.

[citation needed]

This whole thread is pointless. There is no definition of Big data.



Definitely not. And it's a good example of how useful POD ("plain old data") can be. They ask 6 team members 18 questions about what they think of their boss and give those 108 datapoints to her and it's tremendously valuable.


It's "deep learning" data, hahaha.



