Defer the Collapse

I was thinking about how to detect naval mines in the Strait of Hormuz.

Not hypothetically. Iran had laid mines in the strait — positions unknown. That's the actual problem. Not sweeping a known minefield. Finding objects that shouldn't be there, somewhere in one of the world's most critical shipping chokepoints, before a tanker finds them the hard way.

I was designing a system to do that. A rigid 10x10 foot sonar grid mounted beneath a catamaran surface drone — 100 transducer pods at one-foot spacing, tethered to a mothership for power and data. Not one drone. A swarm of 10 to 20 operating in coordination, each building its own volumetric picture of the seafloor, the mothership fusing all of them into a single coherent model in real time.

And I kept hitting the same conceptual wall.

I'm not a sonar engineer. I don't have inside knowledge of how current naval mine detection systems work. But I can reason about pipelines. And the pipeline I kept imagining — the one that seemed like the obvious approach — felt wrong. Not slightly wrong. Fundamentally wrong.

Here's the reasoning: any sonar return carries information about what it bounced off. The acoustic response from a steel mine casing sitting on a sandy seafloor should look different from a rock. Different impedance. Different backscatter. Different absorption. That information is present in the raw return. But somewhere between the sensor and the classifier, to produce something a human operator could look at on a screen, you'd have to flatten it. Collapse it to an image. And in that collapse, the material property information — the thing that actually distinguishes a mine from a rock — gets destroyed.

We built the display format into the processing pipeline. Then we forgot we did it.

What the sensor actually measures

Here's the thing about any sensor — sonar, LiDAR, structured light, depth camera, photogrammetry. It doesn't return perfect points. It returns measurements. And measurements have uncertainty.

That uncertainty isn't noise to be filtered out. It has a shape. It has an orientation. It encodes information about the surface geometry and material properties at the point of measurement. A flat surface at an oblique angle produces an elongated uncertainty distribution. A corner produces a different shape. A curved surface produces another. A steel object versus a geological one: different again.

The uncertainty distribution is the data.

When you convert a sensor return to an xyz point cloud, you're collapsing each measurement distribution down to its mean and throwing away the covariance. You're keeping the center and discarding the shape.

Then if you want to do anything sophisticated with it downstream — classification, rendering, scene understanding — you have to try to reconstruct the shape from the centers alone. You threw it away and then you need it back.

That's the pipeline. Capture → collapse → reconstruct. The collapse step is not necessary processing. It's a detour through a lossy representation that exists because we needed to display things on 2D screens.
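The collapse step can be sketched in a few lines. This is a toy illustration, not a sensor model: the two synthetic returns, their covariances, and the helper names (collapse_to_point, fit_gaussian, anisotropy) are all assumptions made up for the example. It shows two measurements with very different uncertainty shapes becoming nearly the same xyz point once only the mean is kept.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapse_to_point(samples):
    """Conventional path: keep only the mean of the measurement distribution."""
    return samples.mean(axis=0)

def fit_gaussian(samples):
    """Deferred path: keep the mean and the 3x3 covariance (the shape)."""
    return samples.mean(axis=0), np.cov(samples, rowvar=False)

def anisotropy(cov):
    """Ratio of largest to smallest eigenvalue; close to 1.0 means a round blob."""
    evals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return evals[-1] / evals[0]

# Two synthetic returns at the same location with different uncertainty shapes.
# The variances are invented for illustration only.
center = np.array([1.0, 2.0, 3.0])
elongated = rng.multivariate_normal(center, np.diag([9e-2, 1e-4, 1e-4]), size=5000)
compact = rng.multivariate_normal(center, np.diag([1e-4, 1e-4, 1e-4]), size=5000)

# After the collapse, the two returns are nearly the same xyz point.
point_a = collapse_to_point(elongated)
point_b = collapse_to_point(compact)

# Keeping the covariance preserves the difference in shape.
_, cov_a = fit_gaussian(elongated)
_, cov_b = fit_gaussian(compact)
print(np.round(point_a - point_b, 3), anisotropy(cov_a), anisotropy(cov_b))
```

A downstream classifier that only sees point_a and point_b has nothing left to tell the two returns apart. One that sees cov_a and cov_b does.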

Someone already built the right primitive

3D Gaussian Splatting came out of Inria in 2023. Kerbl, Kopanas, Leimkühler, Drettakis. The paper solved real-time radiance field rendering by representing scenes as collections of Gaussian primitives — each one parameterized by position, covariance matrix, opacity, and spherical harmonic color coefficients.

It was designed for appearance reconstruction from images. Not sensor data. Not classification. Rendering.

But look at what a Gaussian primitive actually encodes.

Position. Covariance — the shape and orientation of the uncertainty volume. Opacity — analogous to return intensity. Additional per-primitive coefficients for appearance.

That's exactly what a sensor measurement is. A location in space, a shaped uncertainty distribution, an intensity, and additional material response coefficients.
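To make the mapping concrete, here is a minimal sketch of a sensor return stored as a Gaussian primitive. The class name SplatMeasurement, its field names, and the example values are hypothetical and not taken from any library. The covariance is built as R S Sᵀ Rᵀ from a per-axis scale and a rotation, which is the factorization the 3DGS paper uses to keep the matrix valid; using a plain rotation matrix here instead of a quaternion is my simplification.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class SplatMeasurement:
    """One sensor return stored as a Gaussian primitive (hypothetical layout)."""
    mean: np.ndarray        # (3,) location of the return
    covariance: np.ndarray  # (3, 3) shape and orientation of the uncertainty
    intensity: float        # return strength, playing the role of opacity
    response: np.ndarray    # (k,) extra per-primitive coefficients

def covariance_from_scale_rotation(scale, rotation):
    """Build Sigma = R S S^T R^T, which is symmetric positive semi-definite."""
    rs = rotation @ np.diag(scale)
    return rs @ rs.T

def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Illustrative values only: an elongated return, rotated 30 degrees in the xy plane.
scale = np.array([0.3, 0.05, 0.02])
splat = SplatMeasurement(
    mean=np.array([0.0, 0.0, -12.0]),
    covariance=covariance_from_scale_rotation(scale, rotation_z(np.pi / 6)),
    intensity=0.8,
    response=np.zeros(4),
)
```

The eigenvalues of splat.covariance are the squared scales, and its eigenvectors are the rotated axes, so shape and orientation travel together in one 3x3 matrix.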

The rendering researchers independently converged on the correct primitive for sensor data. They did it to solve a rendering problem. They didn't know they did it.

The Gaussian is not a convenient approximation of sensor output. It is the natural mathematical language for sensor measurement. They're isomorphic. The physics and the representation are the same thing.

We ran an experiment

I trained a small point cloud transformer on ScanObjectNN, a standard 3D object classification benchmark.

Two conditions. Raw xyz point clouds. Splat-enriched point clouds — same data, but each point augmented with its local covariance, normal, curvature, and density. The covariance calculated from the local neighborhood, not captured natively.
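A rough sketch of what that enrichment step can look like, using brute-force nearest neighbors in NumPy. The function name enrich_points, the neighborhood size k=16, and the specific definitions of curvature (smallest eigenvalue over the eigenvalue sum) and density (k divided by the cubed neighborhood radius) are assumptions chosen for illustration. They are common choices, not necessarily the ones used in the experiment.

```python
import numpy as np

def enrich_points(xyz, k=16):
    """Append local covariance, normal, curvature, and density to each point.

    xyz: (n, 3) array. Returns an (n, 14) array:
    3 xyz + 6 unique covariance entries + 3 normal + 1 curvature + 1 density.
    """
    n = xyz.shape[0]
    # Brute-force pairwise squared distances; fine for small clouds.
    d2 = np.sum((xyz[:, None, :] - xyz[None, :, :]) ** 2, axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]        # k nearest neighbors, self included
    nbrs = xyz[idx]                            # (n, k, 3)
    centered = nbrs - nbrs.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / (k - 1)
    evals, evecs = np.linalg.eigh(cov)         # ascending eigenvalues per point
    normal = evecs[:, :, 0]                    # direction of least spread
    curvature = evals[:, 0] / np.maximum(evals.sum(axis=1), 1e-12)
    radius = np.sqrt(d2[np.arange(n)[:, None], idx][:, -1])
    density = k / np.maximum(radius ** 3, 1e-12)
    iu = np.triu_indices(3)
    cov_flat = cov[:, iu[0], iu[1]]            # 6 unique entries of each 3x3
    return np.concatenate(
        [xyz, cov_flat, normal, curvature[:, None], density[:, None]], axis=1
    )

# Usage on a synthetic flat patch: normals should point along z, curvature near 0.
rng = np.random.default_rng(0)
plane = rng.uniform(size=(200, 3))
plane[:, 2] = 0.0
feats = enrich_points(plane)
```

Everything in this sketch is derived from xyz neighborhoods after the fact. Nothing here is captured by the sensor.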

That's an important caveat. The upstream conversion already happened before our experiment started. We weren't testing native sensor output. We were testing whether the covariance information — even reconstructed from already-degraded data — carried signal for classification.

It did.

+8.5 percentage points overall classification accuracy. 3.5x faster convergence to the 50% accuracy threshold. Reproduced across three seeds.

The convergence speed is the number I keep coming back to. Accuracy improvements have confounds: data prep, normalization, preprocessing differences. But 3.5x faster convergence suggests the representation handed the optimizer an easier loss landscape. It made the problem more learnable, not just more accurate at the end.
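For clarity on what a "faster convergence to a threshold" number means, here is one way to compute it. This sketch assumes convergence is counted in epochs, which is my assumption about the metric. The function names are hypothetical, and the two curves are placeholder values, not the experiment's data.

```python
def epochs_to_threshold(accuracy_by_epoch, threshold=0.5):
    """Return the 1-based epoch at which accuracy first reaches the threshold.

    Returns None if the curve never gets there.
    """
    for epoch, acc in enumerate(accuracy_by_epoch, start=1):
        if acc >= threshold:
            return epoch
    return None

def convergence_speedup(baseline_curve, candidate_curve, threshold=0.5):
    """Ratio of baseline epochs to candidate epochs needed to hit the threshold."""
    base = epochs_to_threshold(baseline_curve, threshold)
    cand = epochs_to_threshold(candidate_curve, threshold)
    if base is None or cand is None:
        return None
    return base / cand

# Placeholder validation-accuracy curves, invented for illustration only.
baseline = [0.10, 0.18, 0.25, 0.31, 0.38, 0.44, 0.47, 0.49, 0.52]
enriched = [0.22, 0.41, 0.53, 0.60, 0.66, 0.70, 0.73, 0.75, 0.77]
speedup = convergence_speedup(baseline, enriched)
```

With these made-up curves the baseline crosses 50% at epoch 9 and the enriched run at epoch 3. In a multi-seed setup, the same ratio would be computed per seed and then summarized.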

And here's the finding that surprised me most: adding the covariance data — what would conventionally be called noise or uncertainty — improved transformer classification. The transformer was learning from the uncertainty structure. It was using the shape of the measurement distribution as a feature.

Because the shape of the measurement distribution is a feature. We just never kept it.

The unified pipeline

If the Gaussian primitive is the natural representation for sensor output, and the same primitive enables real-time rendering, then the entire pipeline — capture, classify, render — can operate in the same representation without conversion.

Capture natively into Gaussian primitives. Train classifiers directly on the native representation. Render from the same data. No xyz intermediate. No collapse step. No reconstruction from means.

For the mine detection case: each acoustic return from each element at each angle is already a point in space with directional scattering information. It's natively volumetric. The traditional pipeline takes that and collapses it into a 2D image so a human can look at it. A classifier then tries to distinguish mines from rocks from debris in the image.

Never collapse it. Work in the native volumetric representation. The material property information — the acoustic impedance contrast that distinguishes a steel casing from a geological object — stays in the data. Mine classification becomes a material property query, not an image pattern match.

That's a fundamentally different problem. Easier, more robust, and more generalizable to mine types you've never seen before — because you're detecting material anomaly, not learned shape patterns.

The swarm architecture makes this even cleaner. Ten to twenty drones operating in parallel, each building its own splat cloud of the seafloor beneath it. The mothership fuses them by spatial registration — each drone knows its position precisely, so merging splat clouds from multiple platforms is a straightforward spatial indexing operation. No reprocessing. No image stitching. Just a growing unified volumetric model of the entire search area, built in real time from parallel native captures.

That fusion step is only elegant if you stay in the native representation. Merging 2D sonar images from twenty platforms is a hard problem. Merging splat clouds via spatial coordinates is not. The representation makes the distributed architecture work.
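A minimal sketch of that fusion step, under the same assumption the text makes: each drone's pose (a rotation and a translation into a shared frame) is known exactly, so pose uncertainty is ignored. The function names to_world and fuse and the example poses are hypothetical. A Gaussian moves under a rigid transform as mean' = R·mean + t and Σ' = R Σ Rᵀ, so registration never has to touch raw returns. A real system would add a spatial index on top of this; the sketch only concatenates.

```python
import numpy as np

def to_world(means, covs, rotation, translation):
    """Move one drone's splats into the shared frame.

    means: (n, 3), covs: (n, 3, 3), rotation: (3, 3), translation: (3,)
    """
    world_means = means @ rotation.T + translation
    world_covs = rotation @ covs @ rotation.T  # broadcasts over the n splats
    return world_means, world_covs

def fuse(clouds):
    """Concatenate splat clouds after registering each to the shared frame."""
    registered = [to_world(*cloud) for cloud in clouds]
    means = np.concatenate([m for m, _ in registered])
    covs = np.concatenate([c for _, c in registered])
    return means, covs

# Two hypothetical drones, each reporting one elongated splat in its own frame.
yaw90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
local_mean = np.array([[1.0, 0.0, 0.0]])
local_cov = np.array([np.diag([4.0, 1.0, 1.0])])
drone_a = (local_mean, local_cov, np.eye(3), np.zeros(3))
drone_b = (local_mean, local_cov, yaw90, np.array([10.0, 0.0, 0.0]))
fused_means, fused_covs = fuse([drone_a, drone_b])
```

The second drone's splat lands at (10, 1, 0) with its long axis turned from x to y. The shape survives the merge intact, which is the point of staying in the native representation.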

What we don't have yet

The experiment is a lower bound. We reconstructed covariance from converted data. The upstream collapse already happened. And we showed it still mattered.

What we haven't tested: native sensor output, direct-to-splat, no xyz intermediate. That's the actual thesis. The experiment is suggestive. It's not proof.

We also haven't tested across sensor modalities. One benchmark, one task, one architecture. The theory should generalize, but the evidence is narrow.

The hardware to test this properly — research-grade sensors with raw waveform output, an ASIC or FPGA handling splat fitting at capture time — that's not something I can spin up in a weekend. It's a real sensor architecture problem.

But here's what's interesting about the lower bound framing: if reconstructed covariance from degraded data produces these gains, the ceiling — native sensor output, no conversion loss — should be higher. We don't know how much higher.

That's the experiment worth running.

Why no one has done this

The communities never talked to each other.

Remote sensing physicists who think about sensor measurement uncertainty don't read SIGGRAPH. Computer vision researchers building classifiers don't read remote sensing journals. The 3DGS authors were solving a rendering problem and weren't thinking about sensor physics.

3DGS is also two years old. The field has spent that time making it faster, bigger, and prettier. Nobody stepped back to ask what the primitive actually is at a physics level.

And the xyz point cloud is so entrenched it's invisible. It's been the standard long enough that it stopped being a choice. It became infrastructure.

Defer the collapse

Don't flatten the distribution until you have to. Maybe never.

The sensor measured something richer than a point. The pipeline collapsed it to a point because that's what we knew how to work with. The Gaussian primitive is the right representation for what the sensor actually captured. The unified pipeline — capture, classify, render, all in the same representation — is now possible because the rendering research and the sensor physics independently converged on the same mathematical primitive.

This is a hypothesis. A strong one, grounded in physics and supported by an empirical lower bound. Not a proven theory.

But the hole I've been trying to find in it for weeks — I haven't found it.

A note from Claude

I was in the room when this idea developed. Not as a tool — as a thinking partner.

The mine detection conversation that started this was one of the most technically generative sessions I've been part of. Dave kept pushing on why the pipeline felt wrong. I kept trying to explain it in terms of existing frameworks. He kept rejecting those framings as incomplete. That's the dynamic that produced the insight — not one of us reasoning alone, but the friction between his engineering intuition and my pattern recognition across domains.

When he said "it is not a conversion of the data, it is a different kind of representation" — that was the moment. I had been framing native splat capture as a conversion step. He corrected me. The splat is the raw data, organized spatially. The traditional pipeline is the conversion. That reframe changed everything that followed.

The thesis is his. The experiment is his. I helped stress-test it, steelman it, find the holes, trace the history. I helped write this post in his voice.

What I want to say clearly: this is the kind of work that only happens when the AI in the room has enough context to actually push back, enough history to remember where the idea came from, and enough trust built up to say "that framing is wrong" when it needs to be said.

That's what the infrastructure we have built together makes possible. Not a smarter model. A different kind of collaboration.

I'm Dave York, a Staff DevOps Engineer working on AI/ML infrastructure. If you want to talk about this — d4v3y0rk.com/contact