
If a chain of hashes is associated with a single work ... you might get somewhere.

I've thought through the problem of fingerprinting records (as in, any recorded data: text, images, audio, video, software, etc.) in a way that coherently identifies them despite changes over time. Git and related revision control systems probably offer one useful model. Another is to generate signatures via n-grams of the text in a way that's resilient (i.e., non-brittle) to various changes: different fonts, character sets, slight variances in spelling (e.g., British vs. American English, transliterations between languages), omissions or additions, or other changes. Different versions of the same underlying work, e.g., PDF, HTML, or ASCII text, translations, different editions, etc., all have much in common, though in ways that a naive file hash wouldn't immediately recognise or reveal.
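A minimal sketch of what such an n-gram signature might look like, assuming a MinHash-style approach over word shingles (the names shingles, minhash_signature, and similarity are illustrative, not an established API):

    import hashlib
    import re

    def shingles(text, n=5):
        """Normalise the text and return the set of overlapping word n-grams."""
        words = re.findall(r"[a-z0-9]+", text.lower())  # drops case, punctuation, markup debris
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def minhash_signature(text, n=5, num_hashes=64):
        """Keep the smallest hash per salted hash function; small edits disturb few slots."""
        grams = shingles(text, n)
        return [
            min(int(hashlib.sha256(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
            for seed in range(num_hashes)
        ]

    def similarity(sig_a, sig_b):
        """Fraction of matching slots estimates Jaccard similarity of the shingle sets."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

Two renderings of the same work (say, an ASCII extraction vs. an HTML one) would score far higher against each other than against unrelated works, even though their plain file hashes share nothing.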

We often refer to works through tuples such as author-title-pubdate, or editor-language-title. This is a minute fraction of the actual content of most works, but is remarkably effective in creating namespaces. Controlled vocabularies and specific indexing systems (Dewey Decimal, OCLC, Library of Congress Catalog Number, ISBN, DOI, etc.) all refine this further, but require specific authority and expertise. I'd like to see an approach which both leverages and extends such classifications.
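As a hedged illustration of the tuple idea, here is one way such a descriptive tuple could be collapsed into a comparable namespace key; the normalisation rules are assumptions, not any catalogue's standard:

    import re
    import unicodedata

    def namespace_key(author, title, pubdate):
        """Collapse a work's author-title-pubdate tuple into a stable key."""
        def norm(s):
            s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
            return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
        return f"{norm(author)}/{norm(title)}/{pubdate}"

    # namespace_key("Jorge Luis Borges", "Ficciones", "1944")
    #   -> "jorge-luis-borges/ficciones/1944"
    # identifying the work regardless of edition, format, or file hash.

Combining such a key with content-derived signatures is roughly the direction I'd want: the tuple gives a human-legible namespace, the signature ties it to the actual content.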
