These challenges are not only technical, but also practical. Processing millions of data points in real time requires significant computational power, while ensuring accuracy remains critical in safety-sensitive applications. Noise, occlusions, and the need to balance speed with precision further complicate reliable 3D analysis.
To address these challenges, KTU researchers have developed a new model that combines multiple ways of analysing 3D data into a single, more effective system. Instead of focusing only on local details or global structure, it integrates both perspectives simultaneously, allowing machines to interpret complex environments more reliably. The model combines advanced transformer-based analysis, a method that captures relationships across the entire scene rather than isolated regions, with mechanisms that prioritise important but less frequent features, enabling it to better handle imbalanced data.
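The two ideas described above can be sketched in a few lines of Python. Note this is an illustrative toy, not the published model: the "SE" in PTv3-SE is assumed here to denote a squeeze-and-excitation-style channel gate, and all function names, shapes, and weights below are hypothetical placeholders for the real architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_self_attention(x):
    """Relate every point to every other point in the scene,
    capturing global structure. x: (N, C) point features."""
    scores = x @ x.T / np.sqrt(x.shape[1])   # (N, N) pairwise affinities
    return softmax(scores, axis=-1) @ x      # context-aware features

def squeeze_excitation(x, w1, w2):
    """Reweight feature channels so that informative but less
    frequent channels can be amplified. x: (N, C)."""
    z = x.mean(axis=0)                                    # "squeeze": per-channel stats
    s = 1 / (1 + np.exp(-(np.maximum(z @ w1, 0) @ w2)))   # "excitation": sigmoid gate
    return x * s                                          # channel-wise reweighting

rng = np.random.default_rng(0)
N, C, r = 6, 8, 2                            # toy scene: 6 points, 8 channels
x = rng.standard_normal((N, C))
w1 = rng.standard_normal((C, C // r))        # bottleneck of width C // r
w2 = rng.standard_normal((C // r, C))

out = squeeze_excitation(global_self_attention(x), w1, w2)
```

In this sketch, the attention step mixes information across the whole point set, while the gating step rescales channels based on scene-wide statistics; a hybrid model would interleave such blocks at scale rather than apply them once.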
A Solution That Works Even When Data Is Incomplete
“Imagine you have a massive, messy 3D puzzle made of millions of points that needs to be sorted into meaningful objects like roads, trees, and pedestrians. Our model acts like a highly intelligent and efficient puzzle-solver,” says KTU scientist Maskeliūnas. By analysing relationships across the entire scene while also emphasising less frequent but important features, the system improves the detection of small or partially visible objects that earlier approaches might miss.
This becomes particularly important in real-world situations. For example, an autonomous vehicle approaching an intersection at dusk may only detect a few data points from a partially obscured pedestrian. “Instead of missing this information, the model interprets it in context – relating sparse signals to surrounding elements such as a pole or a crosswalk – and identifies the presence of a person even when the data is incomplete. This ability to interpret context from limited information could significantly improve safety in autonomous systems,” shares Maskeliūnas.
The model also achieves strong performance in terms of efficiency, processing complex scenes in just over two seconds per frame while maintaining high accuracy. “Beyond segmentation accuracy, a key achievement is the demonstration of an efficient, unified pipeline,” adds Maqsood, noting that the system integrates compression and transmission without losing essential detail, allowing large-scale 3D data to be processed and transmitted efficiently in near real time.
Looking ahead, the potential applications extend far beyond today’s use cases. From delivery drones navigating unpredictable environments to robots operating in search-and-rescue missions, reliable 3D understanding is becoming increasingly important. Even less obvious fields could benefit – such as archaeology, where sparse data must be reconstructed into meaningful structures, or forensic science, where subtle spatial details can be critical. It could also support advanced augmented reality applications, where digital content is seamlessly integrated into complex physical environments.
At a broader level, these advancements could fundamentally reshape how our environments are understood and managed. What once seemed like science fiction is steadily becoming reality – machines are not only learning to see the world, but to understand it.
The article, "Hybrid attention-based PTv3-SE model for efficient point cloud segmentation", can be found here.