This actually hit my previous company in a software context.
We would number our hotfixes sequentially. Many would be items demanded by a single client, so would get deployed as hotfixes only to that customer's site, and just rolled into the main trunk for the next quarterly release for everyone else. Clients would always be notified about hotfixes going onto their live sites.
One savvy client noticed the hotfix numbering sequence. Naturally, that led to quite a number of extremely awkward discussions, as they would regularly ask why our software needed so many hotfixes (tens per week) and why they weren't entitled to all of them right away.
Solution: a new policy to randomly generate hotfix numbers. Which of course led to the next problem, that now the sequence was not obvious from the names, so dependent hotfixes would sometimes get deployed in the wrong order. Why can't anything be easy...
Just name the hotfixes by day (140222). It monotonically increases, and if you do multiple hotfixes in one day, suffix a/b/c etc. Generally you're unlikely to get up to b or c, and there's no clue to how many previous versions there have been.
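One way to read that scheme, as a throwaway sketch (the helper name and the "first fix of the day gets no suffix" rule are my own assumptions):

    from datetime import date
    from string import ascii_lowercase

    def hotfix_name(d, nth_today=0):
        stamp = d.strftime("%y%m%d")   # e.g. 140222
        # first fix of the day has no suffix; later ones get a, b, c, ...
        return stamp + ("" if nth_today == 0 else ascii_lowercase[nth_today - 1])

    print(hotfix_name(date(2014, 2, 22)))      # 140222
    print(hotfix_name(date(2014, 2, 22), 2))   # 140222b (third fix that day)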
Leads to longer timestamps though, when you include the HHMM. If you're generally not doing more than one per day, this makes it a little easier - usually six digits instead of ten.
Except that a hotfix number is a human-interaction number, not an automation device. Speaking as someone who has worked on the support phones, I'd much, much, much rather someone only have to read back a six-digit number than a ten-digit one.
That was proposed (with a numeric suffix rather than letter), but rejected because we often would do 3 to 5 in a day, and somebody decided that would still leak too much information about our hotfix workflow. Another aspect would be that hotfixes would sometimes go several days between packaging and deployment (for all your usual megacorp red-tapey risk-aversey reasons), and we didn't want to explain to clients why 20120214 wasn't deployed until 2012-02-21.
I'm sure you (or your employer, for that matter) have better things to do than twisting your internal policies to a client's (dis)liking.
Just tell them to fuck off -- in diplomatic speak -- that's what your boss (or a boss) is paid for.
> why our software needed so many hotfixes
"well. software does not fall from the skies. humans are involved. we can stop fixing shit before you find out and complain, take your money and leave you with crackers. decide!"
I've worked in enterprise software and when customers are paying tens of millions of dollars per year for your software, it's difficult to tell them to fuck off... even more so when your customers all know each other.
Nobody said it's easy; we are saying that it must be done, or you and the customer will both be unhappy.
I think that far fewer software businesses fail due to telling their customers to fuck off than vice versa.
At least that is what my anecdotal evidence says. I know of a couple of software businesses that failed because they couldn't get themselves to say no to their customer.
I know of none that told their customer to fuck off and failed.
> i'm sure you (or your employer for that matter) have better things to do than twisting your internal policies to client's dis/liking
I'm sort of at a loss here. The customer is always right, right? If you're deploying software, everything the customer sees is what they're paying for. If it matters to them, and it bothers them (which I can see in this case), I think the good companies fix it, and the bad companies have "better things to do"
"The customer is always right" is an incredibly broken methodology. It works great for a mom&pop shop in a small town… not so much in any other situation.
You could always have a fixed-length, randomly generated number prefix, with the numbers after that being sequential, making the sequence obvious internally while obfuscating it from the customer.
For example, if you did a four-digit prefix, your first three hotfixes might be:
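A made-up illustration of the scheme (the prefix digits are pure noise; only the trailing digits carry the real sequence):

    import random

    def hotfix_id(sequence_number):
        prefix = random.randint(1000, 9999)        # four random digits, meaningless
        return f"{prefix}{sequence_number:03d}"    # trailing digits are the true order

    for n in range(1, 4):
        print(hotfix_id(n))   # e.g. 7412001, 2093002, 5810003 (invented output)

Internally you sort on the last three digits; the customer just sees seven-digit noise.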
What you need is random but monotonically increasing sequential version numbers. Add a random number to your last version number, and that is your new version number. ;)
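In sketch form (the size of the jump is arbitrary):

    import random

    def next_version(last):
        return last + random.randint(1, 50)   # random positive jump keeps it monotonic

    v = 1000
    for _ in range(5):
        v = next_version(v)
        print(v)   # always increasing, but the gaps give little away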
Very good idea. Timestamps do not reveal how many hotfixes have been made and are chronological by default. To be sure, there could of course be a central instance providing the stamps and ensuring that no two fixes get the same one (when several fixes could be made in parallel).
There is some practical relevance to software development here. One shouldn't expose sequential IDs (a.k.a. serial numbers) to the public for anything non-public.
I see this Hacker News post has a numerical ID in the URL, for example; I can estimate the size of Hacker News given enough of these numbers... More directly, I can modify that numerical ID to crawl Hacker News.
Many sites do this; it's generally better to generate a (random or hashed or generated from a natural key) 'slug' to use as the key instead. For example, Amazon generates a unique, non-sequential, 10-digit alphanumeric string for each item in their catalog.
I'd advise against modifying that numerical ID. Recent history shows this is a punishable offense (hacking!) under the lovely CFAA and supported by the incompetent judicial system.
Other sites like Pastebin use a sequential ID too, but they convert the number to another base (like from base 10 to base 43). It's shorter too. (I haven't found the link to the related Stack Overflow discussion.)
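The trick in sketch form, with an arbitrary 62-character alphabet standing in for whatever base they actually use:

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def encode(n):
        if n == 0:
            return ALPHABET[0]
        out = []
        while n:
            n, rem = divmod(n, len(ALPHABET))
            out.append(ALPHABET[rem])
        return "".join(reversed(out))

    def decode(s):
        n = 0
        for ch in s:
            n = n * len(ALPHABET) + ALPHABET.index(ch)
        return n

    print(encode(123456789))   # "8m0Kx" -- 5 characters instead of 9 digits
    print(decode("8m0Kx"))     # 123456789

It only shortens the key, though; the IDs are still perfectly sequential once decoded.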
I joined in 2005, when you still needed to be part of a school which had a Facebook network. At that time the ID for most users was made up of a network number (which was in the 4-digit range by the time my school was added) and then a 5-digit user code. So you could (and still can, given the right info) figure out the first person to join from your school, as well as when your school got a network relative to some others nearby.
You can still use this to find some mundane information about Facebook's early userbase. For example, the first non-Harvard schools added were, in order, Stanford, Yale, Cornell, Dartmouth, U.Penn, Columbia, etc., and that the first people added to most of those networks either grew up with or attended the same high school as Zuck.
The flipside is that you can give off the impression of having a large user base/product catalog/etc if you number things sequentially... but start at a large non-round number.
It seems like one could use the same technique to estimate the initial (lowest-observable) serial number...
From the article:
If starting with an initial gap between 0 and the lowest
sample (sample minimum), the average gap between samples is
(m - k)/k; the -k being because the samples themselves are
not counted in computing the gap between samples.
Perhaps someone with a better grasp on the math can confirm that this makes 'obfuscating size by starting with a higher serial number' an ineffective mechanism?
Yes. If you're only looking at the gaps between the numbers, adding a constant offset to the serial numbers would have no effect on the estimate.
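A quick sanity check of that (everything here is invented): an estimate built purely from the spread between samples doesn't move when every serial gets the same offset, while the usual max-based estimate gets dragged up with it.

    import random

    def gap_estimate(samples):                 # uses only the spread between samples
        k = len(samples)
        return (max(samples) - min(samples)) * (k + 1) / (k - 1) - 1

    def max_estimate(samples):                 # the classic m + m/k - 1
        m, k = max(samples), len(samples)
        return m + m / k - 1

    N, offset, k = 300, 100000, 15
    serials = random.sample(range(offset + 1, offset + N + 1), k)

    print(gap_estimate(serials))                        # ~300, offset or not
    print(gap_estimate([s - offset for s in serials]))  # exactly the same value
    print(max_estimate(serials))                        # ~100300: fooled by the offset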
On the other hand, if instead of ordering them sequentially I roll a die and add the number of spots to the previous serial number, I think I can trick you into thinking I have three times as many tanks as I actually do. In fact, I feel quite confident of it.
If I find sufficiently many of your tanks, the distribution of the differences in serial numbers would start showing that we aren't talking about a random sample from 1…n
For example, having seen 250 IDs in the 1…1000 range and 200 in the 1001…2000 range, the next ID in the 1…2000 range I see should fall in the 1…1000 range with probability 750/(750 + 800) ~= 0.48 in the 'normal' case, and around 36/(36 + 86) ~= 0.30 with your method of doling out IDs.
And I think it would be a factor of 3.5 (the expected number of pips on a single die roll). That's why I expect your method to dole out 286 out of every 1000 IDs.
But it would require me to check for this, would depend on finding such discrepancies in the samples, and would increase the variance of my estimates for a given sample size.
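A back-of-the-envelope simulation (all parameters invented) does land near that 3.5x figure when the usual serial-number estimate is applied blindly to die-roll IDs:

    import random

    def die_roll_ids(count):
        ids, current = [], 0
        for _ in range(count):
            current += random.randint(1, 6)   # mean jump is 3.5
            ids.append(current)
        return ids

    true_count, k = 1000, 30
    observed = random.sample(die_roll_ids(true_count), k)
    naive = max(observed) * (1 + 1 / k) - 1   # assumes plain 1..N serials
    print(naive / true_count)                 # hovers around 3.5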
I don't have the math (or time (or ambition!)) to confirm or deny this, but I think it's not really meant to stand up to serious scrutiny - just enough for someone to glance at an invoice or a URL or a receipt and be more impressed/less wary. (See fecak's reply to my comment.)
Some businesses do this with invoice numbers to give the false impression of being more successful (more invoices should mean more sales and customers) than they actually are.
I did this myself with personal checks. I asked my bank to give me personal checks starting at some moderately sized number so that I wouldn't be giving anyone check #0000001.
The military do this as well. The British SAS when first formed was designated "L Detachment, Special Air Service" as part of a disinformation campaign to persuade the Germans there were paratroopers, and at least 11 other detachments of the same size, in North Africa.
So there are ways to game the statistical analysis.
Nonsequential IDs can also help you avoid inadvertently baking in assumptions about your IDs being small, which are then broken as you scale up and cause sorrow and woe.
For example, Twitter uses sequential(ish) ID numbers for tweets. Havoc ensued when they crossed 2^32 tweets, because lots of software out there ended up being written with the assumption that a tweet ID could fit into a 32-bit integer. It was true for some years, and then suddenly it wasn't anymore, and everything broke.
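In miniature, with invented IDs: anything that packed a tweet ID into a fixed 32-bit field kept working right up until the IDs no longer fit.

    import struct

    ok_id  = 4294967295   # 2**32 - 1, still fits an unsigned 32-bit field
    new_id = 4294967296   # 2**32, does not

    print(struct.pack(">I", ok_id))   # b'\xff\xff\xff\xff'
    struct.pack(">I", new_id)         # raises struct.error: doesn't fit in 32 bits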
Yes, there are a huge number of sequences out there with no particular need to be sequential. Apple store invoices. UPS tracking numbers (sharded by shipper). Lots of similar things that can provide a glimpse into the inner workings of what are often publicly traded companies with good reason not to leak such information.
It's astounding how accurate they were using only statistical methods:
> Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded an estimate of 270 produced in February 1944, substantially more than had previously been suspected.
> German records after the war showed production for the month of February 1944 was 276.
The analysis of tank wheels yielded an estimate for the
number of wheel molds that were in use. A discussion with
British road wheel makers then estimated the number of
wheels that could be produced from this many molds...
Huh, I once visited a military base where people on the trip wanted to be photographed with a tank. The soldiers said it was OK, as long as somebody obscured the tank's serial number by standing in front of it. I wonder if their training in this respect was inspired by this history!
(But if so, why not print the serial numbers inside the tank, not outside? Or maybe encrypt or HMAC them?)
The answer was that when things like tanks get moved, it's by rail or ship. When figuring out what goes where, having to climb up on a railcar to figure out whose tank it was gets old.
The Soviets used to semi-randomly change ship pennant numbers and other identifying marks on their equipment to throw off NATO efforts to develop accurate strength estimates, orders of battle, and movement pictures.
As a less intentional version of this - at early stages of the 1973 war, Israeli forces in certain areas tended to consist of whatever scraps of forces arrived at the front first. This led Egyptian and Syrian intelligence to wildly overestimate Israeli force sizes, since they would sometimes see the identifying marks of multiple brigades when facing a very small force.
So, I HMAC the serial numbers of the tank. Now you have a series of very large numbers, evenly distributed.
Someone else can do the proof, but I'm pretty sure the total number of tanks is inversely proportional to the smallest difference between two observed HMACs (or between the highest observed HMAC and the highest possible HMAC).
EDIT: Actually, scratch that. That only works if you see all the tanks...
I don't understand HMAC well enough to know whether it solves this, but there seems to be a trade-off between keeping it secure by introducing randomness, and making people do lookups on the other end (which would have been super slow then, and I don't know how feasible hash tables are when dealing with punch cards).
Why is encryption weak due to the simple pattern of increasing numbers? Encrypting increasing numbers is how CTR and GCM mode work; those are arguably some of the best modes of operation we have commonly available.
A MAC of a message m can only be computed with the knowledge of a key K. Specifically, with a cryptographic hash function h,
HMAC(K, m) = h(K + a || h(K + b || m)),
where + is addition mod 2 (xor), || is concatenation and a and b are constants. (This construction takes into account possible length extension attacks on h.)
Given that h is secure, knowledge of any reasonable number of pairs (m, HMAC(K, m)) does not allow you to recover K, and without K, you cannot compute HMAC(K, m) for known m, i.e. enumerate all the possible MACs for serial numbers.
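A concrete version of the above using Python's hmac module (the key and the serial numbers are of course made up):

    import hashlib
    import hmac

    KEY = b"factory secret"   # without this, tags for other serials can't be computed

    def tag_for(serial):
        mac = hmac.new(KEY, str(serial).encode(), hashlib.sha256)
        return mac.hexdigest()[:12]   # truncated here just to keep the output short

    for serial in (1017, 1018, 1019):
        print(serial, tag_for(serial))   # consecutive serials, unrelated-looking tags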
Don't remember where I read it at least 12 years ago, but someone talked about an April Fools prank where they released three pigs in their high school, with numbers 1, 2, and 4 written on them. Allegedly the administrators spent weeks looking for number 3.
The "weeks looking for number 3" part sounds apocryphal to me.
Snopes [0] has one real example and one television example of this prank. In the real example, the students were caught on camera, and there was no long search for the remaining livestock.
Thanks for actually researching this. As I remember, I read this on bash.org, so I was pretty skeptical of the story just because of the source. It's one of those stories I'd like to believe was true just because I like it.
That's the version from 20 years ago. Those people have grown up and become school administrators, so it doesn't work as well anymore. The new plan is to release pigs numbered 1, 2, 5, 6, 7, 10 and -2.
Book seems nice, but its discussion of the German tank problem sets up code to calculate posteriors from priors that are more detailed than the Bayesian argument in the Wikipedia article. That is useful, but you don't get the intuition that the answer is the maximum ID you saw, inflated by a factor that depends on the expected ID gap. There are some assumptions in the Wikipedia Bayesian analysis, but they are less determining than the ones in the book.
There isn't really 'a' correct frequentist or Bayesian answer to the problem; it's more two different ways of thinking about the problem, which could well get you the same numerical results (though they might not).
The frequentist way of thinking about it is to ask what you mean by "more correct", i.e. what properties do you want an optimal estimator to have? Another way of putting this is: if you were to set up a simulation where the real answer is known and data is sampled, and then you judge estimators by how close they get (according to some penalty function scoring closeness) when you run this simulation 10,000 times, which estimator would score the best? The estimator with the minimum variance of all unbiased estimators (the MVUE) will do optimally under some definitions of optimal; the MLE is another one that is optimal for other definitions. Note that they're both frequentist and give different answers.
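A minimal version of that simulation experiment (all numbers arbitrary), pitting the sample maximum (the MLE) against m + m/k - 1 (the MVUE) under squared-error scoring:

    import random

    def mle(samples):                 # maximum likelihood: just the sample maximum
        return max(samples)

    def mvue(samples):                # minimum-variance unbiased: m + m/k - 1
        m, k = max(samples), len(samples)
        return m + m / k - 1

    N, k, runs = 276, 10, 10000
    sq_err = {"mle": 0.0, "mvue": 0.0}
    for _ in range(runs):
        s = random.sample(range(1, N + 1), k)
        sq_err["mle"]  += (mle(s) - N) ** 2
        sq_err["mvue"] += (mvue(s) - N) ** 2

    for name, total in sq_err.items():
        print(name, (total / runs) ** 0.5)   # RMSE: the MLE sits biased low; the MVUE scores better here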
The Bayesian analysis of the situation is that it basically comes down to your choice of prior: the observed information is not in itself sufficient to produce a single "best" estimate, but rather you combine it with your prior distribution to produce an estimate. The Bayesian could end up producing exactly the same estimate as the estimators this article labels "frequentist". The Bayesian argument would be that what these estimators are really doing under the language of MVUE/MLE/etc. is implicitly choosing priors, whereas the Bayesian would explicitly choose one. The Bayesian would also probably not really like the simulation-experiment idea (which is a pretty directly frequentist thought experiment).
> The Bayesian argument would be that what these estimators are really doing under the language of MVUE/MLE/etc. is implicitly choosing priors
I'm curious, which choice of priors corresponds to the frequentist answer? It looks like it comes down to the distribution of (n | k)? Seems like something that must've been studied.
Unsurprisingly, German Wikipedia has (sadly) some badly behaved admins who delete pages because they are not "relevant". These admins, who delete pages seemingly at random, are called "Löschnazi" by the community [1].
My observation is that one has to resort to the English version, because either the page is missing or it is biased toward Germany (even though German is also the native language of Austria, Switzerland, etc.).
German Wikipedia of course also has a lot of good efforts, like WikiData, the geolocation sub-project, etc. Hopefully they can kick the badly behaved admins out soon... the reader-to-author ratio is already alarming.
I don't think it has any significance in that context to be honest. I've got a German background & have seen this posted a couple of times. Every time I think...of all the events/things in WW2 this is what you pick as interesting?
I remember my theoretical stats teacher showing us this problem. It's used all the time in ecology. His example used it to estimate the number of alligators in Louisiana swamps. They tag the alligators, release, and then using the tags they re-capture over subsequent years, they can get an estimate of how many alligators exist in the wild!
The German tank problem applies when all the alligators come pre-tagged with serial numbers :) Capture-recapture is a slightly different problem: https://en.wikipedia.org/wiki/Mark_and_recapture
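For contrast, a sketch of the mark-and-recapture estimate (Lincoln-Petersen; the alligator counts are invented):

    def lincoln_petersen(marked, caught_later, recaptured):
        # N is estimated as (tagged in round 1 * caught in round 2) / tagged seen again
        return marked * caught_later / recaptured

    print(lincoln_petersen(200, 150, 12))   # ~2500 alligators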
So here's an idea. Conventional intelligence was off by quite a bit, spurring the Allies to overproduce tanks (which was possible thanks to the absurd American industrial capacity), which then allowed the Allies to cleanly overwhelm the order of magnitude fewer tanks they actually came into kinetic contact with.
In case anyone is interested in the numbers, USA produced in the region of 50,000 Sherman tanks during the Second World War. 5-10x more than Germany. Plus the British tanks. It didn't take all that long to reach Berlin!
I first read about this work a few years ago, but had I encountered it before college, I think I might have majored in statistics. Such powerful results - feel like magic.
I encountered a slightly different problem trying to find the size of the union of a bunch of sets. We ended up just storing the smallest k int64 hashes of the items in each set, and computing 2^64 / ((largest hash - smallest hash) / (k - 1)) as an estimate of the size of the union.
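Roughly what that looks like (the hash function and k are stand-ins, and the union is built naively here instead of streamed):

    import hashlib
    import heapq

    def hash64(item):   # any roughly uniform 64-bit hash will do
        return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")

    def estimate_union_size(sets, k=256):
        smallest = heapq.nsmallest(k, {hash64(x) for s in sets for x in s})
        # average gap between the k smallest hashes ~ 2**64 / (union size)
        return 2**64 * (k - 1) / (smallest[-1] - smallest[0])

    a = set(range(0, 60000))
    b = set(range(30000, 100000))
    print(estimate_union_size([a, b]))   # close to the true union size of 100000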
This is a very old (and nice) problem in streaming algorithms. The solution currently used in most places is HyperLogLog[1], which basically uses the distribution of log(minimum value of hash) for a set of hashes.
I think the most important information is the table:
  Month          Statistical estimate   Intelligence estimate   German records
  June 1940      169                    1,000                   122
  June 1941      244                    1,550                   271
  August 1942    327                    1,550                   342