[GH-2830] Improve Geography query support - core#2831
[GH-2830] Improve Geography query support - core#2831zhangfengcdt wants to merge 32 commits intoapache:masterfrom
Conversation
…ation with lazy-parsed JTS and S2 caches. This enables GeoArrow-compatible serialization while maintaining full S2 functionality on demand.
- Functions.distance(Geography, Geography) → Spheroid.distance() — returns meters - Functions.area(Geography) → Spheroid.area() — returns m² - Functions.length(Geography) → Spheroid.length() — returns meters - All route through WKBGeography.getJTSGeometry() (no S2 parse needed)
also, added getShapeIndexGeography() as a third lazy cache
- null type ambiguity - level 1 functions missing Geography dispatch
…hole winding order
|
@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece. |
I am still working on it. I will clean up the benchmark codes and also for this PR, it will focus on building the core architecture of the WKB-based Geography serialization with cached S2. I will keep a few core ST functions and tests and move other to individual PRs following the merging. |
|
ST Function Performance: Geography vs Geometry (cached objects, ns/op)
|
paleolimbot
left a comment
There was a problem hiding this comment.
This seems like it is headed in the right direction!
The WKBGeography is the right direction I think (In C++ I call this a GeoArrowGeography and it's slightly more general but the same idea). As written, I am not sure it is able to make anything faster (I am guessing that Spark already handles filter()s and won't materialize a Java object unless it will actually be used for something).
This may not be true in Java, but in C++ the S2Polygon and maybe S2Polyline has an internal index for itself and also one for each loop (lazily constructed, but almost always needed for initializing from WKB to figure out the nesting of shells/holes). The optimization of implementing the S2Shape interface directly on WKB is pretty much 100% to avoid those extra index builds (so that for any given function there's at most one shape index per object that is built). I don't know how flexible the S2Shape is in Java (in C++ implementing it was not too bad). I think you would need to do something similar to see improved performance (but maybe the first thing you want to do is get the functions and tests in place).
| if (result == null) { | ||
| try { | ||
| WKBReader reader = new WKBReader(); | ||
| result = reader.read(wkbBytes); |
There was a problem hiding this comment.
I am guessing that this line is very expensive (it is in C++). The recent redo of s2geography in C++ that I just did is mostly about avoiding this line (e.g., by manually iterating over edges and using lower-level S2 primitives.
There was a problem hiding this comment.
We now avoid this for Point, LineString, and Polygon by implementing WkbS2Shape (an S2Shape on pre-converted S2Point arrays) and building the ShapeIndex directly from it.
| @Override | ||
| public S2Region region() { | ||
| return getS2Geography().region(); | ||
| } |
There was a problem hiding this comment.
I optimized this one for points using the S2PointRegion for points to avoid constructing Geography. That might not matter here (in general when implementing the functions I also made every attempt to avoid accessing the Region to avoid the index build)
| /** Spherical containment test using S2 boolean operations. */ | ||
| public static boolean contains(Geography g1, Geography g2) { | ||
| if (g1 == null || g2 == null) return false; | ||
| Predicates pred = new Predicates(); | ||
| return pred.S2_contains(toShapeIndex(g1), toShapeIndex(g2), s2Options()); | ||
| } |
There was a problem hiding this comment.
This one has a similar array of optimizations for small geometries, including for small polygons (I probably should have put the small polygon optimization in the distance path as well but I ran out of will)
@paleolimbot Thanks for the review and they are really helpful. I have implemented some of the optimizations. Here's what changed and the benchmark results.
The major improvement: WkbS2Shape avoids building 3 redundant S2ShapeIndex instances (one per S2Loop, one for S2Polygon, and only the ShapeIndexGeography one is actually used). For st_contains false case, it becomes slower because S2BooleanOperation.Contains() has different internal execution paths for true vs false results. Not sure if sedona-db also does this (S2Polygon for polygon predicates), which may see similar tradeoffs. But here is the results: |
|
Cool! In s2geography there's quite a bit of code to avoid the shape index build at all for ST_Distance() and each individual predicate, which may bring your results closer to on par. Notably, there's the possible intersection check which should help the "point is not in polygon" case. |
- point container return false immediately - point target use S2ClosestEdgeQuery instead of building ShapeIndexTarget
Thanks for pointing to the s2geography optimizations. I implemented the following two optimization for now:
I also tried to implement the MayIntersect that might help with the contains false case, but looks like the MayIntersects path (~700ns) is more expensive than the cached ShapeIndex path (~220ns) in a single call. In this PR, I was thinking to optimize the hot path, which may benefit more with cached ShapeIndex (good for joins). For the code path optimizations, adding mayIntersects pre-filters may be better; potentially adding checking covering intersection pre-filtering as well. I'd prefer to add them in following PRs with a configurable flags to avoid the hot path gets slower. There seems to be tradeoffs we need to make default and also custom sedona configs for users to tune these optimizations since the geography has too much overhead compared to the geometry functions. What do you think? |
|
That sounds good to me! I the work I did in C++ was mostly to expose all of the possible optimizations but I haven't tuned them yet (and the tuning is probably much different in Java than it is in C++). The tuning/extreme optimization is also much better done in the presence of integration testing against BigQuery/PostGIS (which is in the works on the SedonaDB side). The point optimizations are nice because they truly are easy outs (and point/polygon is I think common). One of the things that may have affected my initial implementation is that SedonaDB benchmarks mostly have one Scalar argument (where the cost of computing the covering is amortized over many more rows). I didn't implement a S2LatLngRect bound comparison but I think computing that might be cheaper. |
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes #<issue_number>What changes were proposed in this PR?
Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a full set of Geography ST functions.
Core architecture:
Geography functions (3 new):
Docs: API docs for all 3 new functions in docs/api/sql/geography/
Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR
How was this patch tested?
Did this PR include necessary documentation updates?