Paper Summary: NPMs: Neural Parametric Models for 3D Deformable Shapes

Karan Uppal
8 min read · Sep 14, 2023

--

Palafox, Pablo, et al. "NPMs: Neural Parametric Models for 3D Deformable Shapes." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Link to original paper

Abstract: Parametric 3D models have enabled a wide variety of tasks in computer graphics and vision, such as modeling human bodies, faces, and hands. However, the construction of these parametric models is often tedious, as it requires heavy manual tweaking, and they struggle to represent additional complexity and details such as wrinkles or clothing. To this end, we propose Neural Parametric Models (NPMs), a novel, learned alternative to traditional, parametric 3D models, which does not require hand-crafted, object-specific constraints. In particular, we learn to disentangle 4D dynamics into latent-space representations of shape and pose, leveraging the flexibility of recent developments in learned implicit functions. Crucially, once learned, our neural parametric models of shape and pose enable optimization over the learned spaces to fit to new observations, similar to the fitting of a traditional parametric model, e.g., SMPL. This enables NPMs to achieve a significantly more accurate and detailed representation of observed deformable sequences. We show that NPMs improve notably over both parametric and non-parametric state of the art in reconstruction and tracking of monocular depth sequences of clothed humans and hands. Latent-space interpolation as well as shape/pose transfer experiments further demonstrate the usefulness of NPMs. Code is publicly available at this https URL.

1. Introduction

The title alone raises a lot of questions: What are deformable shapes? What are parametric models? How are neural networks used for this kind of modeling? Let's take them one by one.

What are deformable shapes?
Deformable models are curves or surfaces defined within an image domain that can move under the influence of forces.

What are parametric models?
This is a way of representing a 3D shape by mapping a parameter domain to the 3D surface. A simple 2D example: x = cos(t), y = sin(t) is a parametric representation of a circle.
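
As a tiny illustration, the snippet below sweeps the parameter t to sample points on that circle (NumPy only):

```python
import numpy as np

# Sample the unit circle from its parametric form x = cos(t), y = sin(t).
t = np.linspace(0.0, 2.0 * np.pi, 100)
circle = np.stack([np.cos(t), np.sin(t)], axis=-1)  # (100, 2) points on the curve
```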

Constructing such a model for deformable shapes, or even rigid shapes, is quite complex and often requires manual intervention and domain knowledge. This paper proposes Neural Parametric Models (NPMs), which learn both a shape and a pose representation that can be used like a traditional parametric model to fit new observations. The authors demonstrate their performance on reconstruction and tracking of monocular depth sequences, as well as on shape and pose transfer and latent-space interpolation. Their main contributions can be stated as follows:

  • A learned alternative to hand-crafted parametric 3D models that disentangles 4D dynamics into latent shape and pose spaces built on implicit functions, without object-specific constraints.
  • Once learned, these spaces can be optimized over at test time to fit new observations, much like fitting a traditional parametric model such as SMPL.
  • Notably improved reconstruction and tracking of monocular depth sequences of clothed humans and hands compared to both parametric and non-parametric state of the art.

2. Related Work

“DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation” (link) proposes a feed-forward network that predicts an SDF value for a query location, conditioned on a latent code representing the shape, and is trained in an auto-decoder fashion. For an in-depth look at how it works, read this blog.

However, DeepSDF produces static surfaces that are not controllable. This paper proposes to use the representation power of DeepSDF to learn representations for both shape and pose, to enable controllable 3D models.

3. Method

Neural Parametric Models (NPMs) are a learned approach to construct parametric 3D models from a dataset of different posed identities. A brief overview is as follows:

The dataset consists of meshes featuring a set of shape identities from the same class category in different poses. However, it must satisfy two constraints:

  1. Each shape identity is posed canonically (e.g., in a T-pose)
  2. Each shape identity has several posed or deformed instances for which surface correspondences to the canonical shape are available

3.1 Learned Shape Space

The shape space is learned in a similar manner to the DeepSDF paper: an MLP, trained in an auto-decoder fashion, predicts the implicit SDF of each shape identity in its canonical pose.

All shapes are first normalized to reside within a unit bounding box and then made watertight. For each shape in the training set, points and their SDF values are sampled in two ways:

  1. Near-surface points, obtained by adding random noise to surface points
  2. Uniformly sampled points in the unit bounding box

Using this sampled data, the Shape MLP is trained.
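
To make this concrete, here is a minimal PyTorch-style sketch of a DeepSDF-like shape auto-decoder. The layer sizes, latent dimension, and loss weights are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeMLP(nn.Module):
    """Predicts SDF(x) for a canonically posed identity, conditioned on a shape code."""
    def __init__(self, shape_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(shape_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, shape_code, xyz):
        # shape_code: (B, shape_dim), xyz: (B, 3) query points in the unit bounding box
        return self.net(torch.cat([shape_code, xyz], dim=-1))

# Auto-decoder training: one latent code per shape identity, optimized jointly with the MLP.
num_identities, shape_dim = 100, 256
shape_codes = nn.Embedding(num_identities, shape_dim)
nn.init.normal_(shape_codes.weight, std=0.01)
model = ShapeMLP(shape_dim)
optim = torch.optim.Adam(list(model.parameters()) + list(shape_codes.parameters()), lr=5e-4)

def train_step(identity_ids, xyz, sdf_gt):
    # identity_ids: (B,) identity index of each sample; xyz: (B, 3); sdf_gt: (B, 1)
    pred = model(shape_codes(identity_ids), xyz)
    loss = F.l1_loss(pred, sdf_gt)                                        # SDF reconstruction term
    loss = loss + 1e-4 * shape_codes(identity_ids).pow(2).sum(-1).mean()  # shape-code regularizer
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```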

3.2 Learned Pose Space

The pose space is learned by an MLP that predicts a deformation field, that is, it maps points from the canonical pose to the corresponding locations in the space of the deformed pose. This prediction is conditioned on both a latent shape code and a latent pose code.

This MLP is trained on a set of deformation fields from each shape's canonical pose to its arbitrary poses. But how do we obtain the dataset for training such a model?

They sample surface points on the previously normalized canonical shape of each identity, storing the barycentric weights of each sampled point. Each point is then randomly displaced a small distance d along the normal of the corresponding mesh triangle. Then, for each posed shape available for that identity, they compute the corresponding point using the stored barycentric weights and d to sample the posed mesh. This yields a deformation field defined near the surface.
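
As a rough illustration of this correspondence step, the NumPy sketch below constructs one deformation-field sample from a canonical and a posed mesh that share the same triangulation; the function name and arguments are hypothetical.

```python
import numpy as np

def sample_flow_point(verts_canon, verts_posed, faces, face_id, bary, d,
                      normal_canon, normal_posed):
    """Return one (canonical point, flow vector) pair near the surface.

    verts_canon / verts_posed: (V, 3) vertices of the canonical and posed meshes
    faces: (F, 3) shared triangulation; face_id: index of the sampled triangle
    bary: (3,) barycentric weights of the sampled surface point
    d: signed offset applied along the triangle normal
    """
    tri_c = verts_canon[faces[face_id]]        # (3, 3) canonical triangle
    tri_p = verts_posed[faces[face_id]]        # (3, 3) same triangle in the posed mesh
    p_canon = bary @ tri_c + d * normal_canon  # point near the canonical surface
    p_posed = bary @ tri_p + d * normal_posed  # corresponding point near the posed surface
    return p_canon, p_posed - p_canon          # flow = deformation-field sample
```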

The training is carried out as follows:
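
Concretely, the pose MLP can be trained as an auto-decoder over per-frame pose codes, conditioned on the shape code of the corresponding identity. The sketch below illustrates this; the layer sizes, latent dimensions, and loss weights are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseMLP(nn.Module):
    """Predicts the flow of a canonical point, conditioned on shape and pose codes."""
    def __init__(self, shape_dim=256, pose_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(shape_dim + pose_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),   # flow vector from canonical to posed space
        )

    def forward(self, shape_code, pose_code, xyz_canonical):
        return self.net(torch.cat([shape_code, pose_code, xyz_canonical], dim=-1))

# One pose code per posed frame, optimized jointly with the MLP (auto-decoder).
num_frames, pose_dim = 5000, 256
pose_codes = nn.Embedding(num_frames, pose_dim)
nn.init.normal_(pose_codes.weight, std=0.01)
pose_mlp = PoseMLP()
optim = torch.optim.Adam(list(pose_mlp.parameters()) + list(pose_codes.parameters()), lr=5e-4)

def pose_train_step(shape_code, frame_ids, xyz_canonical, flow_gt):
    # shape_code: (B, shape_dim) code of each sample's identity; flow_gt: (B, 3) ground-truth flow
    pred_flow = pose_mlp(shape_code, pose_codes(frame_ids), xyz_canonical)
    loss = F.mse_loss(pred_flow, flow_gt)                                 # flow reconstruction
    loss = loss + 1e-4 * pose_codes(frame_ids).pow(2).sum(-1).mean()      # pose-code regularizer
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```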

3.3 Inference-time Optimization

Once our latent representations of shape and pose have been constructed, one can leverage these spaces at test time by traversing them to solve for the latent codes that best explain an input sequence of L depth maps. The authors thus fit NPMs to the input data by solving for the unique latent shape code and the L per-frame latent pose codes that best explain the whole sequence of observations.

The first step is to obtain initial estimates of the shape code and the pose codes. This is done via two 3D convolutional encoders, one for the shape code and one for the pose codes. Both encoders take as input the back-projected depth observation in the form of a partial voxel grid, which is passed through 3D convolutional and fully connected layers to output a latent code estimate. The encoders are trained to regress the latent shape and pose codes learned from the training set.
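
A rough sketch of what such a 3D convolutional encoder could look like; the grid resolution, channel counts, and latent dimension here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CodeEncoder3D(nn.Module):
    """Maps a partial voxel grid (back-projected depth) to an initial latent-code estimate."""
    def __init__(self, latent_dim=256, grid_res=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
        )
        self.fc = nn.Linear(64 * (grid_res // 8) ** 3, latent_dim)

    def forward(self, voxel_grid):
        # voxel_grid: (B, 1, grid_res, grid_res, grid_res) partial grid from one depth map
        return self.fc(self.conv(voxel_grid))

# Trained with a regression loss against the codes learned during auto-decoder training,
# e.g. loss = F.mse_loss(encoder(voxels), target_code).
```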

Once the initial shape code is available, the canonical shape surface can be extracted. The authors state that they sample surface points from this canonical surface and add random displacements to them; these serve as the input points for the optimization.

Then, they minimize the following loss over the shape code and its respective pose codes:

The exact details about the loss can be found in the paper.
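
While the exact loss is given in the paper, its overall structure can be sketched as a per-frame data term plus regularizers on the code norms (and, plausibly, a temporal smoothness term across consecutive pose codes). In the hedged sketch below, data_term is a hypothetical placeholder for the paper's reconstruction term, and the weights and dimensions are made-up values.

```python
import torch

# Hypothetical placeholders standing in for the real inputs.
L, shape_dim, pose_dim = 100, 256, 256
shape_code = torch.zeros(1, shape_dim, requires_grad=True)   # encoder-initialized in practice
pose_codes = torch.zeros(L, pose_dim, requires_grad=True)    # one per observed frame
lam_s, lam_p, lam_t = 1e-4, 1e-4, 1e-2                       # illustrative weights

optimizer = torch.optim.Adam([shape_code, pose_codes], lr=1e-3)

def data_term(shape_code, pose_code, frame_idx):
    """Hypothetical stand-in for the paper's reconstruction term, which compares the
    deformed implicit surface against the back-projected depth observation of this frame."""
    return torch.tensor(0.0)

for step in range(200):
    loss = lam_s * shape_code.pow(2).sum() + lam_p * pose_codes.pow(2).sum()    # code regularizers
    loss = loss + lam_t * (pose_codes[1:] - pose_codes[:-1]).pow(2).sum()       # temporal smoothness (assumed)
    for l in range(L):
        loss = loss + data_term(shape_code, pose_codes[l], l)                   # per-frame data term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```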

4. Experiments

The authors evaluate NPMs on the task of reconstruction from monocular depth sequence observations. They also demonstrate shape and pose transfer and lastly, show interpolation between learned shape and pose spaces.

The authors present a comprehensive comparison with state-of-the-art methods on clothed human datasets and show general applicability by learning an NPM for hands.

For evaluation, the authors report the following metrics: Intersection over Union (IoU), Chamfer-l2 (C-l2) and End-Point Error (EPE).
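
For reference, here is a small sketch of the symmetric Chamfer-l2 distance between two point sets; IoU and EPE follow their usual definitions, with exact details in the paper.

```python
import torch

def chamfer_l2(points_a, points_b):
    """Symmetric Chamfer-l2 distance between point sets of shape (N, 3) and (M, 3)."""
    d = torch.cdist(points_a, points_b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.pow(2).mean() + d.min(dim=0).values.pow(2).mean()
```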

4.1 Model Fitting to Monocular Depth Sequences

Real human data

The authors present a comparison to the state-of-the-art on monocular depth data rendered from the CAPE dataset. NPMs clearly outperform previous approaches on all 3 metrics. The authors state that their approach to learn shape and pose spaces provides both effective shape and pose regularization over the manifolds, while capturing local details. This results in more accurate reconstruction and tracking performance. A qualitative comparison can be seen below:

Effect of the encoder initialization

To gauge the effectiveness of the encoder initialization for the shape and pose codes, the authors ran experiments with no shape encoder, no pose encoder, and neither encoder. In place of the encoder-predicted initialization, they use the average shape and pose latent codes from the training set. They note that the code estimates provided by the encoders result in a closer initialization and lead to improved reconstruction and tracking performance.

Synthetic human data

The authors also evaluate on synthetic sequences from the DeformingThings4D dataset, wherein NPMs again outperform the previous approaches on all three metrics. Qualitative comparison can be seen below:

4D point cloud completion

The authors also compare their approach with a previous model, OFlow, on its 4D point cloud completion task. NPMs receive only a monocular sequence of depth maps as input, whereas OFlow is given densely sampled point cloud trajectories. Even with this more partial data, NPMs achieve significantly better performance.

Hand registration

We have seen that NPMs can be constructed from various datasets of posed identities; here, the authors demonstrate their applicability to hand data generated with the MANO model. They state that NPMs accurately capture both the global structure and smaller-scale details.

4.2 Shape and Pose Transfer

Because shape and pose are represented in separate latent spaces, NPMs enable transfer experiments: a fixed pose sequence can be re-targeted to a new shape identity (shape transfer), or new poses can be applied to a fixed shape identity (pose transfer).
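
In practice, transfer amounts to recombining latent codes: keep one identity's shape code and decode it with pose codes fitted to another sequence. A minimal sketch, reusing the hypothetical pose_mlp interface from the Section 3.2 sketch:

```python
# Hypothetical fitted quantities:
#   shape_code_B: (1, 256) shape code of identity B
#   pose_codes_A: (L, 256) pose codes fitted to sequence A
#   canon_pts_B:  (N, 3) points sampled from identity B's canonical surface (e.g., via Marching Cubes)
posed_frames = []
for l in range(pose_codes_A.shape[0]):
    flow = pose_mlp(shape_code_B.expand(canon_pts_B.shape[0], -1),
                    pose_codes_A[l].expand(canon_pts_B.shape[0], -1),
                    canon_pts_B)
    posed_frames.append(canon_pts_B + flow)  # identity B's surface driven by sequence A's poses
```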

4.3 Latent-Space Interpolation

Similar to the DeepSDF paper, the latent spaces of shape and pose can be traversed to obtain novel shapes and poses.
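
A small sketch of linear interpolation between two latent codes; each intermediate code is decoded with the shape or pose MLP as usual.

```python
import torch

def interpolate_codes(code_a, code_b, num_steps=10):
    """Linearly interpolate between two latent codes (shape or pose) of shape (latent_dim,)."""
    alphas = torch.linspace(0.0, 1.0, num_steps).unsqueeze(-1)
    return (1.0 - alphas) * code_a + alphas * code_b  # (num_steps, latent_dim)

# Each interpolated shape code can be decoded to a mesh by evaluating the shape MLP on a
# dense grid and running Marching Cubes; interpolated pose codes are decoded via the pose MLP.
```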

4.4 Limitations

While NPMs demonstrate great potential for constructing and fitting learned parametric models, several limitations remain. The authors note that their implicit representation of shape and pose deformation can struggle with very flat surfaces, since these enclose little volume and require a precisely defined inside and outside. They also state that while NPMs can capture fine-scale details, high-frequency details (e.g., sharp edges) remain challenging.

5. Conclusion

Neural Parametric Models enable the construction of learned parametric models with disentangled shape and pose representations that can accurately represent 4D sequences of dynamic objects. The authors demonstrate their performance on both real-world and synthetic data, showing how learned implicit functions can expressively capture local details in shape and pose.

6. Final Words

The paper is quite comprehensive and well written. The approach builds upon DeepSDF to handle deformable shapes and outperforms existing benchmarks. However, one thing to note is that both training (~8 days) and inference-time optimization (~4 hours for a 100-frame sequence) take a significant amount of time on a GeForce RTX 3090.

7. References

  • Palafox, Pablo, et al. "NPMs: Neural Parametric Models for 3D Deformable Shapes." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
  • Park, Jeong Joon, et al. "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
