Oh boy, this gives me a chance to talk about one of the gems of astronomy software which deserves to be better known: HEALPix tessellation!
HEALPix stands for 'Hierarchical Equal Area isoLatitude Pixelization'. It is a scheme that was developed to analyze signals that cover the entire sky, but with variable density.
Like HTM or Hilbert curves, this can be used to organize spatial data.
The tessellation looks kind of funny but has many good features - it has no discontinuities at the poles, and every pixel has equal area. And with the "nested" HEALPix formulation, pixels are identified by integers. Pixel IDs are hierarchical based on leading bits: each pixel splits into four children, so a descendant's ID is just the ancestor's ID with two extra bits appended per level. For example, pixel 106 (=0110 1010) contains pixel 1709 (=0110 1010 1101), two levels down. This lets you do some marvelous optimizations in queries if you structure your data appropriately. Nearest-neighbor searches can be extremely quick if things are HEALPix-indexed - and so can radius searches and arbitrary polygon searches.
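To make the leading-bits trick concrete, here is a minimal sketch in plain Python (my own illustration, not healpy's API), using the pixel IDs from the example above:

```python
# In the nested scheme, each pixel at one level splits into 4 children
# at the next, so a child's ID is parent * 4 + {0, 1, 2, 3} and
# containment is just a right shift.
def contains(parent, child, level_diff):
    """True if nested pixel `parent` contains `child`, which lives
    `level_diff` resolution levels further down."""
    return (child >> (2 * level_diff)) == parent

print(contains(106, 1709, 2))   # True: 1709 >> 4 == 106

# All descendants of a pixel occupy one contiguous ID range, so a
# plain B-tree over pixel IDs can answer "everything inside this
# pixel" with a single range scan:
d = 2
lo, hi = 106 << (2 * d), (106 + 1) << (2 * d)
print(lo <= 1709 < hi)          # True: the range is [1696, 1712)
```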
HEALPixels are used today for more than just their original intent. LSST will use them for storing all-sky data and point source catalogs, for example.
I experimented with geospatial Hilbert curves as a Postgres extension [0] for PostGIS using the S2 [1] spherical geometry library. S2 uses a scale-free cell coverage pattern that is numbered using six interlocking space-filling Hilbert curves, one per face of a cube projected onto the sphere [2]. This approach is similar to what is used in this article; S2 uses quadrilateral cells rather than hexagons, but the idea is generally the same.
Having both high-level (cell) and low-level (cell ID) geometries made it a very powerful library, allowing projection from the Hilbert space into a Postgres spatial index (SP-GiST), including the various trees noted in this article. It appears to still be quite active in development.
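For a feel of why the cell hierarchy maps so well onto an ordinary index, here is a rough sketch using the s2sphere package, a pure-Python port of S2 (the calls shown are from that port as I remember them; treat the details as approximate):

```python
import s2sphere

point = s2sphere.LatLng.from_degrees(52.3676, 4.9041)
leaf = s2sphere.CellId.from_lat_lng(point)   # level-30 leaf cell
coarse = leaf.parent(10)                     # level-10 ancestor

# The hierarchy is encoded in the 64-bit cell ID itself, so "is this
# point inside that cell?" becomes an integer range check - which is
# what lets a B-tree or SP-GiST serve spatial containment queries.
print(leaf.level(), coarse.level())          # 30 10
print(coarse.contains(leaf))                 # True
print(coarse.range_min().id() <= leaf.id() <= coarse.range_max().id())  # True
```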
Ah - I've used S2 in the past. Great work. It scales really well for larger datasets.
Interesting that you mention the use of multiple Hilbert curves as well. We also experimented with two Hilbert curves, rotated by 90 degrees. This helps to get around what we've dubbed the "Hilbert Equator" problem, where two objects that are close in space end up far apart on the curve because they sit on opposite sides of one of the major fault lines in the fractal (for lack of a better word).
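To illustrate, here's a sketch of the two-curve trick using the classic iterative Hilbert transform; the demo points are an arbitrary pick straddling the curve's central fault line:

```python
def xy2d(n, x, y):
    """Distance of cell (x, y) along a Hilbert curve over an n x n
    grid (n a power of two). Standard iterative formulation."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                    # rotate/flip quadrant into place
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def two_curve_gap(n, a, b):
    """Min curve-distance gap over the original grid and a copy
    rotated by 90 degrees."""
    rot = lambda p: (p[1], n - 1 - p[0])
    g1 = abs(xy2d(n, *a) - xy2d(n, *b))
    g2 = abs(xy2d(n, *rot(a)) - xy2d(n, *rot(b)))
    return min(g1, g2)

# Two adjacent cells on opposite sides of the central fault line:
n, a, b = 256, (127, 0), (128, 0)
print(abs(xy2d(n, *a) - xy2d(n, *b)))   # 54613: most of the curve apart
print(two_curve_gap(n, a, b))           # 1: adjacent on the rotated curve
```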
There are always caveats with these types of indexing methods, especially if you require dynamic high-performance indexing and support for rectangles/polygons. You can only move the edge cases around; there is no way to eliminate them. This allows you to tailor an indexing scheme for specific workload assumptions, but it obviously breaks down if you need an algorithm that generalizes to many workloads and data models. There isn't just one type of edge case with this kind of indexing; there are several, which may or may not be relevant depending on what you are trying to do.
Some research from the 1980s showed it is only possible to mitigate bounded categories of edge case, thereby improving generality, by indexing on complex higher-dimensionality embeddings. Mitigating more categories requires more dimensions and more complexity. However, no one could figure out how to construct these embeddings for even basic cases or deal with more practical curse of dimensionality issues, so that is largely forgotten (the researchers themselves made comments to the effect that they didn't think a tractable solution was possible). I've never even been able to find that literature in electronic form, unfortunately.
Like AI, it is an interesting open-ended problem space. You can prove that an elegant optimal solution is not tractable so it ends up being a search for asymptotically optimal algorithms that become exponentially more complex the closer you get to optimal.
Tackling these boundary problems is dealing with the literal "edge cases" of geospatial indexing, and they exist everywhere, so this is a good reason to use an existing library where the authors have already solved them.
Hexagons are cool, but they are not necessarily the bestagon for a spherical geometry, since you cannot break a hexagon into smaller hexagons, whereas an S2 cell comes from a "cube" with spherical sides, and HEALPix uses a rhombic dodecahedron [0], both of which can be split into smaller divisions of themselves.
Not to discourage you from your experimentation, it's all trade offs and you might find a good one. Good luck!
> you cannot break a hexagon into smaller hexagons
Actually you can.
1. Think of hexagons as six equilateral triangles sharing a center point.
2. Place one smaller, vertically flipped equilateral triangle in each original triangle.
3. Each original hexagon center point is now the center of a smaller (1/2 linear dimension, 1/4 area) hexagon.
4. New small hexagons replace each of the original hexagon's six edges. Since each edge is shared between two hexagons, this adds three edge-centered hexagons per original hexagon.
So each new hexagon has 1/4 the area of the original ones (and 1/2 the linear dimensions). This results in 2x the linear dimension resolution, 4x the area resolution.
Square grids could also be refined the same way: retain a half-sized (in linear terms) square at the center of each original square, and turn each original edge and corner into new squares, with the same 1/2 linear and 1/4 area scaling.
Does that end up with specific points that have the same difficulty, at the intersections of the two equators? I suppose even if that is the case, a point is easier to position over a dead area than a line would be.
It's an important point. We've ended up using two Hilbert curves, rotated by 90 degrees, to tackle this problem. If a pair is close on either curve, it is close in 2D space.
As mentioned in other replies, this doesn't fully solve a lot of the cases. I think Z-order curves might actually be better here: they are slightly less accurate individually, but in a way that would seemingly minimize the worst-case results when taking the minimum of the two. But at that point you are probably better off using a single Z-order curve and using the coordinates to check the Manhattan or Chebyshev distance with bitwise operations.
I would second the use of Z-order curves. Hilbert curves were famously more optimal on spinning disk in a specific context a few decades ago. Inertia has kept them top-of-mind even though modern systems no longer have the problems Hilbert curves were better at solving, and they are more complex to use than e.g. Z-order curves.
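To make the simplicity claim concrete, here is a sketch of Morton encode/decode using the standard magic-number bit spreading, plus the coordinate-based Chebyshev check mentioned in the parent comment (nothing here comes from a particular library):

```python
def part1by1(x):
    """Spread the 16 bits of x out to the even bit positions."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def compact1by1(x):
    """Inverse of part1by1: gather the even bits back together."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x

def morton_encode(x, y):
    return part1by1(x) | (part1by1(y) << 1)

def morton_decode(z):
    return compact1by1(z), compact1by1(z >> 1)

# Chebyshev distance between two keys: decoding is cheap enough
# (a few shifts and masks) to do inline, no coordinate store needed.
za, zb = morton_encode(100, 200), morton_encode(103, 197)
(xa, ya), (xb, yb) = morton_decode(za), morton_decode(zb)
print(max(abs(xa - xb), abs(ya - yb)))   # 3
```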
The problem of finding the shortest path on a map (a la Google Maps etc.) is often solved with approaches like "hub labeling" and "contraction hierarchies" rather than simple spatial networks. Things like Dijkstra's algorithm are only efficient if your input is the raw graph and a pair of points, and you only want one path, once.
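For reference, the baseline those techniques beat is a plain Dijkstra search, which explores the graph afresh for every query; contraction hierarchies spend preprocessing time adding shortcut edges so that each query touches only a tiny fragment. A toy sketch (graph and weights made up):

```python
import heapq

def dijkstra(graph, source, target):
    """graph: {node: [(neighbor, edge_weight), ...]}"""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue                     # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

roads = {"A": [("B", 5), ("C", 2)], "C": [("B", 1)], "B": [("D", 4)]}
print(dijkstra(roads, "A", "D"))         # 7, via A -> C -> B -> D
```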
Here is one of the recent advancements in performing similarity (including nearest neighbor) search on sparse high dimensional data using learned indices:
https://arxiv.org/abs/2204.10028
The paper basically deals with the most adverse case but the approach should work in lower dimensions and in data with a non-uniform density distribution as well.
The purpose of using space-filling curves is to convert coordinates into 1-dimensional data, so that they can be indexed by an ordinary database index, whether distributed or not. The problem is that this is much slower than a "real" spatial index like an R-tree, and the supported queries are rather limited (a sketch of the pattern follows below).
On the other hand, spatial data can be easily partitioned, rendering the aforementioned method less useful. Also, I've never seen anyone need more than a single instance of Postgres to index all their spatial data, unless you are dealing with spatiotemporal data, which can grow infinitely large. Then again, spatiotemporal data can also be easily partitioned.
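Here is that sketch (plain Python standing in for a B-tree; the box and point counts are arbitrary). The post-filter is where the slowdown relative to an R-tree shows up:

```python
from bisect import bisect_left, bisect_right
import random

def interleave(x, y, bits=16):
    """Naive Morton encode: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# The "index" is just points sorted by their 1-D key.
random.seed(42)
pts = [(random.randrange(1024), random.randrange(1024)) for _ in range(20000)]
index = sorted((interleave(x, y), x, y) for x, y in pts)
keys = [k for k, _, _ in index]

# Rectangle query: Morton codes are monotonic in each coordinate, so
# every point in the box has a key between z(x0, y0) and z(x1, y1).
# That range also covers many cells outside the box, hence the
# post-filter - the false positives are why this is slower than an
# R-tree unless the box is split into tighter key ranges (the
# BIGMIN/LITMAX refinement).
x0, y0, x1, y1 = 100, 100, 140, 160
lo = bisect_left(keys, interleave(x0, y0))
hi = bisect_right(keys, interleave(x1, y1))
hits = [(x, y) for _, x, y in index[lo:hi]
        if x0 <= x <= x1 and y0 <= y <= y1]
print(f"scanned {hi - lo} candidates for {len(hits)} hits")
```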
I knew about hexagons but not about R-trees and Hilbert curves, and then combining the two. Even though that was an inbound content article, it was at least informative.
More HEALPix resources:
- Original NASA/JPL site: https://healpix.jpl.nasa.gov/
- Popular Python implementation: https://healpy.readthedocs.io/en/latest/
- Good PDF primer: https://healpix.jpl.nasa.gov/pdf/intro.pdf
And an experimental database being built on HEALPix for extremely large data volumes (certainly many TB, maybe single-digit PB): https://github.com/astronomy-commons/hipscat