"The take home message should be that identifying someone in a group of ten people requires very little effort. Anyone with access to even low dimensional data, such as basic demographic, can do that. This is not very surprising."
"To summarize this point, the title says: “Identification of individuals by trait prediction using whole- genome sequencing data” but most of the trait predictions is carried by ethnicity of the individual (genomic PCs) rather than the trait specific SNPs."
Summarizing: DNA can estimate ethnicity, which can be used to predict a range of traits, including facial structure. These results would be far more impressive if they were able to predict faces in data composed of a single ethnicity. HLI may have overstated their results, but nevertheless they raise important points about re-id.
The fact that identical twins look similar suggests that most of the variation in facial identity is going to be driven by a heritable element. If you could fully execute the program encoded by a person's DNA, I'd expect to see a reproduction of their face.
However, that depends on correctly mapping the program to its output. I'm skeptical that we're at the point where we can model this correctly. I think this is why you see that all of the predicted faces in the Venter paper look like a generic/averaged face.
They also probably grow up together, have similar diets, speak the same language... there are lots of correlated environmental factors with twins aside from their DNA.
A better comparison would be separated twins (widely separated). But I'm not sure if anyone has looked at the relative differences between twins raised together vs apart.
> A better comparison would be separated twins (widely separated).
The two people in The Parent Trap are twins and were widely separated, yet they still looked identical, so much so that their parents could not even tell them apart.
There is a documentary, I think on Netflix, about twins who found each other by chance on the internet. I think they were Korean and were adopted by two families, one in California and another in France.
> The fact that identical twins look similar suggests that most of the variation in facial identity is going to be driven by a heritable element.
I don't see how you get to that conclusion. If a set of twins inherited nothing from their parents they could still be identical. The DNA could be generated randomly and then duplicated.
This is just a logical argument, as clearly everyone inherits many things from their parents.
Anonymous participants in genome research have been re-identified in the past [0]. In one of these examples, people were re-identified merely using zip code, date of birth and gender. Everyone should assume that they can be identified once they have revealed even a few personal datapoints. The genome contains millions of datapoints.
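To make the zip code / date of birth / gender point concrete, here is a minimal sketch that counts how many records in a table are unique on a given set of quasi-identifiers. The records are made up for illustration, and the `unique_fraction` helper is a hypothetical name, not something from the cited work.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)

# Made-up records, purely illustrative.
records = [
    {"zip": "02138", "dob": "1960-07-31", "sex": "F"},
    {"zip": "02138", "dob": "1960-07-31", "sex": "M"},
    {"zip": "02138", "dob": "1985-01-12", "sex": "F"},
    {"zip": "02139", "dob": "1985-01-12", "sex": "F"},
    {"zip": "02139", "dob": "1985-01-12", "sex": "F"},
]

zip_only = unique_fraction(records, ["zip"])
all_three = unique_fraction(records, ["zip", "dob", "sex"])
print(zip_only, all_three)
```

Nobody in this toy table is unique on zip code alone, but most are once all three fields are combined, which is the mechanism behind that kind of re-identification.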
We actually do have some idea about necks and I believe noses - the important elements are DNA enhancers. There was a presentation at the Society for Developmental Biology in 2012 on this topic, though I can't recall the scientist who discussed this...
Could you just throw loads of genomes (about 1.5 GB of data each) and photos of people's faces at a deep learning model and train it? It would probably do okay, unless of course environmental factors are more important than genetics here. Judging by how different siblings tend to look, I'd say genetics isn't the whole story.
In theory, yes. In reality, we don't have enough data for that to work. A model that takes the whole genome as input is excessively expressive and would overfit, finding spurious correlations everywhere. In the near term, we still need to preprocess the genome to extract lower-dimensional features of interest.
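A toy illustration of the overfitting point, under loud assumptions: the "SNPs" and the "trait" below are both coin flips, so there is nothing real to learn. With few samples and many features, some feature still matches the training labels very well by chance, and then falls back to coin-flip accuracy on fresh data. The function name and sizes are hypothetical.

```python
import random

def spurious_best_feature(n_train=20, n_features=5000, n_test=1000, seed=0):
    """Pick the random binary feature that best matches random labels."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_train)]

    # Search many pure-noise features for the best training agreement.
    best_acc = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_train)]
        acc = sum(f == y for f, y in zip(feature, labels)) / n_train
        best_acc = max(best_acc, acc)

    # On new individuals the winning feature is still noise, so we simulate
    # it as fresh coin flips against fresh coin-flip labels.
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    test_feature = [rng.randint(0, 1) for _ in range(n_test)]
    test_acc = sum(f == y for f, y in zip(test_feature, test_labels)) / n_test
    return best_acc, test_acc

train_acc, test_acc = spurious_best_feature()
print(f"best training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

A whole genome has far more than 5000 candidate features, so the effect only gets worse unless the input is reduced first or the sample count grows a lot.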
You could use Monte Carlo to search for an effective compression of the input data. That's singular value decomposition, if I'm not mistaken. Dimensionality reduction is a hot topic in coding theory. Ideally, an understanding of the biological process involved would certainly help here. DNA is thought to be highly compressed and self-modifying, so a smaller encoding is unlikely. Therefore, separating out the DNA sequences involved in the effect under scrutiny might fail on the possibly highly random inputs without good techniques and heuristics. Effectively, pre-processing could involve recompilation and all the techniques used in software analysis, only then live debugging is to be taken literally.
Reduction to spectra can be used to achieve sparseness that covers the error margin with probabilistic precision, including invariants and the mentioned external factors, and maybe also data acquisition errors from cost-effective (read: cheap) methods.
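As a rough illustration of the SVD idea mentioned above, here is a sketch that reduces a synthetic genotype matrix to a single number per person. It uses plain power iteration to find the leading singular vector (the first principal component) rather than a full SVD, to stay dependency-free; in practice one would more likely call something like `numpy.linalg.svd`. The two simulated populations, their allele frequencies, and all function names are assumptions made up for the example.

```python
import random

def simulate_genotypes(n_per_pop=30, n_snps=200, seed=1):
    """Two synthetic populations with independent allele frequencies (0/1/2 genotypes)."""
    rng = random.Random(seed)
    rows = []
    for _pop in range(2):
        freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
        for _ in range(n_per_pop):
            # Each genotype is the sum of two Bernoulli draws (two alleles).
            rows.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
    return rows

def first_pc_scores(rows, n_iter=50, seed=0):
    """Project each row onto the leading right singular vector of the centered matrix."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(m)]
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]         # X v
        v = [sum(scores[i] * x[i][j] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return [sum(a * b for a, b in zip(row, v)) for row in x]

rows = simulate_genotypes()
scores = first_pc_scores(rows)
pop_a, pop_b = scores[:30], scores[30:]
print(min(pop_a), max(pop_a), min(pop_b), max(pop_b))
```

In this simulation the single component mostly encodes which population a person came from, which is the same kind of signal the genomic PCs in the Erlich critique carry.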
From Erlich [0]:
"The take home message should be that identifying someone in a group of ten people requires very little effort. Anyone with access to even low dimensional data, such as basic demographic, can do that. This is not very surprising."
"To summarize this point, the title says: “Identification of individuals by trait prediction using whole- genome sequencing data” but most of the trait predictions is carried by ethnicity of the individual (genomic PCs) rather than the trait specific SNPs."
Summarizing: DNA can estimate ethnicity, which can be used to predict a range of traits, including facial structure. These results would be far more impressive if they were able to predict faces in data composed of a single ethnicity. HLI may have overstated their results, but nevertheless they raise important points about re-id.
[0] http://www.biorxiv.org/content/early/2017/09/07/185330.1