De-anonymizing programmers from binaries (2017) (arxiv.org)
103 points by pvitz on Aug 9, 2022 | 27 comments


Related:

De-anonymizing programmers from executable binaries - https://news.ycombinator.com/item?id=16598962 - March 2018 (39 comments)

When coding style survives compilation: De-anonymizing programmers from binaries - https://news.ycombinator.com/item?id=10806956 - Dec 2015 (67 comments)


Not directly relevant but it got me thinking:

Has anyone tried to use the source code and white paper to figure out who Satoshi Nakamoto might be?

If you can figure it out from binaries, surely there is a lot more info. And you have the github and the blogosphere to compare.


I bet everything he has ever done including the Bitcoin Whitepaper has been analysed very precisely. Satoshi must be an expert in hiding.


> Satoshi must be an expert in hiding.

There are allegations that Satoshi may have accidentally slipped up and leaked an IP address that was not a Tor exit node or other anonymous proxy.

Could either be Satoshi fucking up and not using Tor all the time (has happened to other 'anonymous' entities) or perhaps they needed a clearnet connection for some reason and managed to use another internet connection not attached to any identifiers that would lead back to them despite that.


he's dead.


One point in favour of that theory is that he has billions of dollars in bitcoin and has never attempted to spend any of it. Or maybe he has lost the keys.


Two of the top candidates for Satoshi are known to have died as well: Hal Finney and Len Sassaman.


If he or she were dead, but their writing and code are extant, presumably they could still be identified.


This uses 600 'candidate' programmers. But I wonder how much harder it becomes when applied to, e.g., an arbitrary piece of GitHub gist code. As the number of candidates increases (with many writing in the same style) I'd imagine the problem becomes enormously more difficult.


With Copilot I'm sure the data exists and it's a matter of modeling it out. The fact that you can't configure Copilot to do celebrity coding by fine-tuning it on a particular person's or organization's repositories is actually surprising to me.


This is a good point.

The confidence level of the identification will be reduced as the number of candidates increases.
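
To make that scaling argument concrete, here is a toy sketch. It is not the paper's method: the Gaussian "style vectors", the nearest-centroid rule, and every name and parameter below (`simulate_accuracy`, `dims`, `noise`) are assumptions made up for illustration. Each author gets a random point in a small feature space, each "binary" is that point plus noise, and a sample is attributed to whichever author's point is closest.

```python
import math
import random


def simulate_accuracy(num_candidates, dims=3, noise=1.0, samples_per_author=10, seed=0):
    """Toy experiment: fraction of samples attributed to the right author."""
    rng = random.Random(seed)
    # Hypothetical "style centroid" per author, drawn from a standard normal.
    centroids = [[rng.gauss(0.0, 1.0) for _ in range(dims)]
                 for _ in range(num_candidates)]
    correct = 0
    total = 0
    for author, centre in enumerate(centroids):
        for _ in range(samples_per_author):
            # One "binary" by this author: their centroid plus random noise.
            sample = [c + rng.gauss(0.0, noise) for c in centre]
            # Attribute the sample to the nearest centroid.
            guess = min(range(num_candidates),
                        key=lambda i: math.dist(sample, centroids[i]))
            correct += guess == author
            total += 1
    return correct / total


if __name__ == "__main__":
    for n in (5, 50, 200):
        print(n, round(simulate_accuracy(n), 3))
```

With a fixed feature space, adding authors packs the centroids closer together, so more samples land nearer to someone else's centroid than to their own. Real stylometric features are far higher-dimensional, so the drop-off would look different, but the direction of the effect is the same.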


I am thinking of a future when every piece of code can be traced back to a common ancestor because everyone is using a tool like copilot and there is no identifying signature. De-anonymization is only going to become more difficult.


How effective is, say, movfuscator against this?

https://github.com/xoreaxeaxeax/movfuscator


That's a good point. In general there are probably a ton of easy ways to adversarially rewrite the binary against a deanonymizer like this. You could probably even make a program that rewrites your own code into someone else's style to frame them.


Their experiment with different optimization levels keeps symbol information intact, and they do not mention whether debug information was enabled. Stripping the binary with no optimization reduces accuracy by 24%, but they did not report the accuracy for an O3 + stripped binary, so I guess it is probably not that good, as this is so obvious that they should have tried it.

Interesting research anyway.


Also, LTO might have a severe impact there.


Basically, to outsmart this algorithm you can use a deniability attack.

You just say that someone imitated your style. It's not like the binary has a cryptographic signature of the person who compiled it, and even then you can say that someone stole your private key.


This is very interesting in terms of threat intel and attributing malware to attackers.


Must be pretty brain-dead malware programmers to be IDed based on this.


I don't even want to know what the GDPR implications of this are...


None. There is a difference between actually storing personal data and it being possible to forensically analyze data not eligible for protection and potentially correlate other similar data that in turn is tagged with metadata.

The second party has the obligation, not the first. Ultimately all risk of exposure of personal data derives from the second party. For example, if you mail an executable to a client and put some code on GitHub under your own real name, the holder of the exe has no obligation, because it is impossible for your identity to be exposed by it or indeed by an infinite number of similar executables.

It is only when combined with your GitHub profile, where you willingly shared a work sample and your real info, that you could possibly be exposed.


Hinges on how loosely a data protection authority is willing to read "filing system" because screw you, I'd think.


Technically, it means that any system holding binaries is capable of holding data identifying a user, which has crazy GDPR implications.

Practically, I don't expect this to have any impact beyond state surveillance, where obtaining a binary for a virus (or, you know, DRM-defeat code) can identify its creator against any public code they have posted elsewhere.


Most malware is packed and/or obfuscated. I'd imagine this defeats fingerprinting relatively handily since the binary is rewritten. I'm sure this technique is used to catch particularly dumb adversaries, but against anyone with a hint of operational security it wouldn't work at all. Moreover, what's stopping a determined adversary from rewriting the binary with a signature that matches another person? Using this as a targeting method would have a lot of collateral damage.
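
A toy illustration of the packing point, with loud caveats: the paper works on disassembly and decompiled-code features, not raw bytes, and real packers are far more elaborate. Here a single-byte XOR stands in for a packer and a byte-bigram histogram stands in for a fingerprint; `bigram_profile`, `cosine_similarity`, and `xor_pack` are hypothetical helpers invented for this sketch.

```python
from collections import Counter
import math


def bigram_profile(data: bytes) -> Counter:
    """Count overlapping byte bigrams: a crude stand-in for a binary fingerprint."""
    return Counter(zip(data, data[1:]))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def xor_pack(data: bytes, key: int) -> bytes:
    """Hypothetical 'packer': XOR every byte with a single-byte key."""
    return bytes(byte ^ key for byte in data)


if __name__ == "__main__":
    # Made-up ASCII payload standing in for a code section.
    original = b"push rbp; mov rbp, rsp; call helper; pop rbp; ret; " * 4
    packed = xor_pack(original, 0x80)
    print(cosine_similarity(bigram_profile(original), bigram_profile(original)))
    print(cosine_similarity(bigram_profile(original), bigram_profile(packed)))
```

The on-disk bytes of the "packed" payload share no bigrams with the original, so a byte-level fingerprint sees nothing familiar. The flip side is that XOR is trivially reversible: an analyst who unpacks first gets the original bytes back, so how well packing holds up depends on how hard the unpacking is.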


Anyone whose job it is to evade detection will pull their output through a scrambler. What this will catch is small-time criminals, probably in minority groups (frequently categorised as "high risk" by police).


But is a developer a user? I think saying that is quite the jump.


This implies compilers are not as efficient as they could be and work needs to be done on that. Style in a binary is waste.

If it's true.

Are they sure it's not from text within the programs or other fingerprints?

I wish they gave examples of the fingerprints. It's hard to even know how to move forward without that.



