Outline

Introduction

Background

Methods & Experiments

Results & Analysis

Conclusion

References

1. Introduction

It is widely accepted that overparameterization plays a central role in the remarkable generalization capabilities of modern deep neural networks. Specifically, increasing model capacity leads to loss landscapes characterized by numerous flat valleys or basins, each corresponding to similarly low loss. A central feature of these landscapes is mode connectivity: the phenomenon in which a path of low loss exists between two distinct, well-trained solutions (modes) of the network. Rather than being isolated points, these minima are linked, making it possible to traverse the loss landscape along a smooth path without encountering significant barriers in loss.

Most existing research has focused on the general existence of such low-loss paths. However, less attention has been paid to the detailed geometric properties, such as sharpness or curvature of these paths and how they evolve quantitatively with the degree of overparameterization. A quantitative analysis of this geometric evolution is crucial for identifying key factors that govern the loss landscape and generalization performance. This understanding can be instrumental to advancing both the theoretical understanding of deep learning and the practical design of more efficient and robust neural networks. The fundamental nature of this topic makes it broadly applicable across a wide range of deep learning architectures and tasks.

Mode connectivity refers to the existence of low-loss paths connecting different trained solutions in the parameter space of neural networks.

1.1 Novelty and Significance

This project aims to help bridge the gap between properties of the optimization landscape and the practical performance of deep learning models. This will be done by performing a comprehensive geometric analysis of the loss landscape for multiple deep neural networks. The novelty of our work comes from the quantitative investigation of barrier height, path curvature, and sharpness as a function of a single, controlled factor: network depth, a proxy for overparameterization. The results aim to provide intuitive and meaningful insights into how overparameterization influences the landscape and performance of deep learning models. The significance of these results is that they offer a clearer link between loss landscape geometry and empirical performance, enabling more principled design and training strategies of deep learning models.

The study controls for network depth while measuring geometric properties of the loss landscape.

1.2 Hypotheses

The hypotheses for our work include:

Lower Barrier Heights Accompany Deeper Networks
Increased network depth will lead to lower barrier heights (smaller maximum change in accuracy along the path) between connected minima. This suggests that the loss landscape becomes smoother and more connected as overparameterization increases.

Path Curvature Decreases with Overparameterization
The curvature of the low-loss paths connecting distinct solutions will decrease as network depth increases, indicating that the mode connectivity approaches a linear relationship.

Increased Overparameterization Correlates with Flatter Minima
As the network's depth increases, the local minima found by the optimizer will become flatter. Flatter minima are generally associated with better generalization.

Path Geometry Directly Predicts Generalization
A measurable correlation will exist between the geometric properties (low barrier height, low curvature, low sharpness) of the low-loss paths and the final test accuracy of the models, thereby providing a useful predictor for generalization performance beyond training loss alone.

2. Background

Empirical and theoretical studies have demonstrated that wider networks tend to exhibit mode connectivity, in which distinct minima are connected by a smooth, low-loss path [1, 2]. These networks tend to occupy regions of the loss landscape that are smoother, with shorter and less pronounced barriers between minima. This characteristic suggests that wider models, by virtue of their increased parameterization, might allow for greater flexibility in optimization, potentially leading to more robust solutions.

However, most prior work has focused on the existence of connectivity, rather than the geometry of these connections. In particular, even fewer studies have measured how landscape properties, such as barrier height and path sharpness, vary not just within one path, but across varying architectural factors like depth in practical deep learning settings on real datasets (e.g., ResNet or CNNs on CIFAR-10).

Moreover, while increased connectivity through overparameterization has been linked to improved generalization, this relationship is not without limits [3, 4]. Beyond a certain degree of overparameterization, increased capacity may no longer improve generalization. Larger models might exhibit wider but shallower basins in the loss landscape, potentially impairing test performance on out-of-distribution data. Furthermore, increasing depth and increasing width might have opposing effects [5]. The precise conditions under which the benefits of increased connectivity plateau, and how the loss landscape geometry contributes to this effect, remain underexplored.

The collective body of existing research confirms the central role of overparameterization in shaping the loss landscape [6, 7]. It has been established that the set of optimal solutions in networks where the number of parameters significantly exceeds the number of data points is not discrete but rather forms a high-dimensional submanifold [8, 9]. Awareness of this phenomenon fundamentally changes the optimization challenge. Finding a good solution requires finding not only a low-loss minimum, but also a minimum that generalizes well, a property not reflected in the loss itself. This has motivated work studying the effect of hyperparameters such as learning rate and batch size on the width of the minima that stochastic gradient descent (SGD) converges to [10]. Furthermore, dedicated flatness-aware optimization strategies, such as sharpness-aware minimization (SAM) [11] and stochastic weight averaging (SWA) [12], have been developed to actively seek out flat minima [13].

Furthermore, a substantial area of research has focused on the relationship between optimization and flat minima, hypothesizing that the flatness of a basin correlates directly with a model's superior generalization ability [14]. However, the connection between these macroscopic landscape properties, like the existence of flat minima, and the microscopic geometry of the connections between them, like path curvature and sharpness, remains loosely quantified, particularly in the context of controlled architectural variations like depth on modern architectures [6, 7].

Thus, a significant gap remains in rigorously and empirically linking the degree of overparameterization to the quantitative geometry of the optimization landscape. This motivates a systematic investigation of how overparameterization shapes the geometry of the loss landscape, and how this, in turn, influences key properties like generalization. By studying these aspects in a controlled and thorough manner, we aim to uncover insights that could inform the design of more efficient architectures and optimization strategies, ultimately leading to more reliable and interpretable models.

Overparameterized networks tend to form connected, flatter regions of the loss landscape, which has been linked to better optimization and generalization.

Flatness-aware optimization strategies like sharpness-aware minimization (SAM) and stochastic weight averaging (SWA) have been developed to actively seek out flat minima.

How these flat, connected regions change with model size remains poorly understood. Bridging this gap requires studying the geometry of paths between solutions across architectures.

3. Methods & Experiments

Before introducing the specific metrics, we briefly summarize the overall methodology used to generate the objects we evaluated.

For each network architecture, we trained two independent models from different random initializations. Each training run produced a set of parameters, denoted A and B, which corresponded to distinct minima of the loss landscape. Using these two solutions as endpoints, we then constructed a low-loss connecting curve in parameter space by introducing and optimizing a third point, C, which defined a nonlinear interpolation between A and B.

This resulted in a parameterized path passing through (A, C, B), along which we sampled and evaluated loss, curvature, and sharpness. For comparison, we also evaluated the straight-line interpolation between A and B, as well as geometric properties of the plane spanned by the three points.

The path was parametrized using established interpolation schemes: Bezier curve, and piecewise linear (PolyChain). In addition, we computed the loss on a grid of points across this plane in order to visualize local variations in loss away from the optimized path.

Along each path we evaluated: training and test loss, curvature metrics, maximum barrier height, and generalization gap. This procedure allowed us to compare different architectures not only in terms of whether a low-loss path existed, but also the geometry of it.

We used piecewise linear paths and Bezier curves to search for low-loss paths between independently trained models.

Along each path, loss, curvature, barrier height, and generalization gap were measured.
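The planar loss grid mentioned above can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the function names, the grid size n, the margin, and the toy loss in the usage note are all assumptions made for the example. It builds an orthonormal basis of the plane through A, B, and C with Gram-Schmidt and evaluates a user-supplied loss function on a regular grid in that plane.

```python
import numpy as np

def plane_basis(theta_a, theta_b, theta_c):
    """Orthonormal basis (u, v) of the plane through A, B, C (Gram-Schmidt)."""
    u = theta_b - theta_a
    u = u / np.linalg.norm(u)
    w = theta_c - theta_a
    v = w - np.dot(w, u) * u          # remove the component of w along u
    v = v / np.linalg.norm(v)
    return u, v

def loss_on_plane(loss_fn, theta_a, theta_b, theta_c, n=21, margin=0.2):
    """Evaluate loss_fn on an n x n grid covering A, B, C (hypothetical helper)."""
    u, v = plane_basis(theta_a, theta_b, theta_c)
    # 2D coordinates of the three points, with A as the origin of the plane.
    pts = np.array([[np.dot(p - theta_a, u), np.dot(p - theta_a, v)]
                    for p in (theta_a, theta_b, theta_c)])
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    pad = margin * (hi - lo)          # extend the grid slightly past the points
    xs = np.linspace(lo[0] - pad[0], hi[0] + pad[0], n)
    ys = np.linspace(lo[1] - pad[1], hi[1] + pad[1], n)
    grid = np.empty((n, n))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            grid[i, j] = loss_fn(theta_a + x * u + y * v)
    return xs, ys, grid
```

For a real network, loss_fn would load the flattened parameter vector into the model and return the test error; each grid cell therefore costs one full evaluation pass, which is why the grid resolution has to stay modest.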

3.1 Network Architecture

To investigate how mode connectivity and loss-landscape geometry vary as a function of architectural complexity, we compared several variants of the same neural network family. Specifically, we evaluated five residual networks of increasing depth: ResNet-8, ResNet-26, ResNet-38, ResNet-65, and ResNet-119.

Residual networks (ResNets) introduce skip connections that allow information to bypass nonlinear layers, enabling efficient optimization of deep architectures and mitigating vanishing gradients [15]. Beyond their practical advantages, ResNets serve as a useful test case for loss-landscape analysis. Their residual block structure with skip connections and ease of scalability leads to loss landscapes with nontrivial geometry [9, 16], while still being compact enough to train extensively under controlled conditions [1]. We restrict our study to a single architectural family rather than comparing radically different model types for two reasons:

Varying depth within the same architecture allows us to isolate the effect of overparameterization without confounding differences in inductive biases or other factors.

Exploring mode connectivity requires repeated training, interpolation, and evaluation across networks, which scales unfavorably with architecture complexity. Restricting to a single family keeps the computational cost feasible.

We chose ResNets in particular because they were covered in our course and are known to outperform "plain" CNNs on image recognition tasks [15]. By spanning more than an order of magnitude in model size, this architectural sweep allowed us to examine how path geometry, barrier height, and generalization evolve as networks transition from moderately to heavily overparameterized regimes.

The number of parameters in the chosen models can be seen in the table below.

Table 1: Parameter Counts for Selected Architectures
Architecture	Parameters
ResNet-8	~80k
ResNet-26	~370k
ResNet-38	~570k
ResNet-65	~660k
ResNet-119	~1.2M

Using one architecture isolates depth effects and keeps computation manageable. ResNets were chosen because they outperform plain CNNs on image recognition tasks.

3.2 Datasets

We conducted experiments on two widely used image-classification benchmarks: CIFAR-10 and CIFAR-100. Each dataset consists of 60,000 RGB images of size 32×32, split into 50,000 training and 10,000 test images, covering everyday object categories.

The two datasets differ primarily in granularity: CIFAR-10 has 10 classes with 6,000 images per class, while CIFAR-100 has 100 classes with 600 images per class. CIFAR-100 poses a more challenging classification problem due to the fact that it has fewer images per category. This typically results in lower test accuracy, and will potentially also affect the loss landscape.

The CIFAR-10 and CIFAR-100 datasets were chosen for their strong baselines and computational feasibility for exhaustive experimentation.

3.3 Low-Loss Path Search

To identify a low-loss path between two solutions, we optimize a parameterized curve that connects the independently trained endpoints. The procedure for finding this curve is based on the work of Garipov et al. [1]. Let

θ_{A}, θ_{B} \in ℝ^{| net |}

be the two sets of weights and biases for the two independently trained neural networks with the same architecture. |net| is here the number of parameters of that specific architecture. Our aim is finding a curve in

ℝ^{| net |}

connecting those two parameter-endpoints. This translates to finding a continuous function

φ_{θ} : [0,1] \to ℝ^{| net |}

with the property that

φ_{θ} (0) = θ_{A} and φ_{θ} (1) = θ_{B} .

Of course, we want the path to be of "low loss", so we need to impose some more conditions. Following Garipov et al. [1], we restrict ourselves to a fixed parametric family, so that the problem reduces to that of finding some set of parameters

θ_{C}

that minimizes

𝔼_{t \sim q_{θ} (t)} [L (φ_{θ} (t))]

where

L

is the loss function used to train the endpoints and

q_{θ} (t) := ∥ φ'_{θ} (t) ∥ \cdot {(\int_{0}^{1} ∥ φ'_{θ} (t) ∥ d t)}^{- 1} .

As it is generally intractable to compute this in high dimensions, at each iteration we instead sample t uniformly on [0,1] and take a gradient step so as to minimize

𝔼_{t \sim U [0,1]} [L (φ_{θ} (t))]

The two types of curves considered are the (quadratic) Bezier curve

φ_{θ_{C}} (t) = {(1 - t)}^{2} θ_{A} + 2 t (1 - t) θ_{C} + t^{2} θ_{B}, t \in [0,1]

and the piecewise linear "PolyChain"

φ_{θ_{C}} (t) = \begin{cases} 2 (t θ_{C} + (0.5 - t) θ_{A}) & \text{for } t \in [0,0.5] \\ 2 ((t - 0.5) θ_{B} + (1 - t) θ_{C}) & \text{for } t \in [0.5,1] \end{cases}
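The two curve families can be written directly from the formulas above. The sketch below is illustrative only; the function names are ours, and parameter vectors are assumed to be flattened NumPy arrays of equal length.

```python
import numpy as np

def bezier(t, theta_a, theta_c, theta_b):
    """Quadratic Bezier curve through theta_a (t=0) and theta_b (t=1)."""
    return (1 - t) ** 2 * theta_a + 2 * t * (1 - t) * theta_c + t ** 2 * theta_b

def polychain(t, theta_a, theta_c, theta_b):
    """Piecewise linear chain: A -> C on [0, 0.5], then C -> B on [0.5, 1]."""
    if t <= 0.5:
        return 2 * (t * theta_c + (0.5 - t) * theta_a)
    return 2 * ((t - 0.5) * theta_b + (1 - t) * theta_c)
```

Both curves satisfy the endpoint conditions φ(0) = θ_{A} and φ(1) = θ_{B}. The PolyChain passes through θ_{C} at t = 0.5, whereas the Bezier curve is only pulled toward it: at t = 0.5 it evaluates to 0.25 θ_{A} + 0.5 θ_{C} + 0.25 θ_{B}.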

Low-loss paths are found by optimizing Bezier or PolyChain curves between trained models.
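To make the optimization of the midpoint concrete, the following sketch runs the uniform-t sampling procedure on a toy problem. Everything about the toy setup is an assumption made for illustration: the two-dimensional "ring" loss (zero on the unit circle), its analytic gradient, the step count, the learning rate, and the small offset used to initialize θ_{C} away from the origin. In the real experiments the gradient comes from backpropagation through the network evaluated at φ(t).

```python
import numpy as np

def ring_loss(theta):
    """Toy loss (assumption): zero on the unit circle, positive elsewhere."""
    return (np.linalg.norm(theta) - 1.0) ** 2

def ring_grad(theta):
    """Analytic gradient of ring_loss; the epsilon guards against r = 0."""
    r = np.linalg.norm(theta) + 1e-12
    return 2.0 * (r - 1.0) * theta / r

def bezier(t, theta_a, theta_c, theta_b):
    return (1 - t) ** 2 * theta_a + 2 * t * (1 - t) * theta_c + t ** 2 * theta_b

def fit_midpoint(theta_a, theta_b, theta_c_init, steps=2000, lr=0.1, seed=0):
    """SGD on theta_c with t ~ U[0, 1], one sample per step."""
    rng = np.random.default_rng(seed)
    theta_c = theta_c_init.copy()
    for _ in range(steps):
        t = rng.uniform(0.0, 1.0)
        point = bezier(t, theta_a, theta_c, theta_b)
        # Chain rule: d(phi(t)) / d(theta_c) = 2 t (1 - t) for the Bezier curve.
        theta_c = theta_c - lr * 2 * t * (1 - t) * ring_grad(point)
    return theta_c

def mean_path_loss(theta_a, theta_c, theta_b, n=101):
    ts = np.linspace(0.0, 1.0, n)
    return float(np.mean([ring_loss(bezier(t, theta_a, theta_c, theta_b)) for t in ts]))

theta_a = np.array([-1.0, 0.0])
theta_b = np.array([1.0, 0.0])
midpoint = 0.5 * (theta_a + theta_b)
theta_c = fit_midpoint(theta_a, theta_b, midpoint + np.array([0.0, 0.1]))

# A Bezier curve whose control point is the midpoint is the straight line.
linear_loss = mean_path_loss(theta_a, midpoint, theta_b)
curved_loss = mean_path_loss(theta_a, theta_c, theta_b)
```

In this toy setting the straight line crosses the high-loss center of the ring, while the optimized control point moves outward so that the curve follows the low-loss circle, which mirrors the qualitative behavior reported later for the linear and Bezier paths.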

3.4 Training Configuration

All networks were trained from scratch using identical optimization settings to maintain a consistent baseline across architectures. Minimal effort was devoted to hyperparameter tuning to avoid introducing biases toward any particular model.

Each model was trained for 80 epochs using stochastic gradient descent with momentum (0.9), an initial learning rate of 0.1, and weight decay of 3×10^-4.

A piecewise learning rate schedule was applied in which the learning rate remained constant during the first half of training and decayed linearly during the subsequent 40% of epochs. Standard data augmentation was used in the form of random cropping and horizontal flipping, and the batch size was kept constant across architectures. No architecture-specific regularization or optimization heuristics were employed beyond weight decay and the learning rate schedule.
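The schedule described above can be sketched as a simple function of the epoch index. Two details are assumptions that the text does not specify: the value the learning rate decays to (here final_ratio = 0.01 of the initial rate) and the behavior in the last 10% of epochs (here held constant at the final value). The function name and signature are illustrative.

```python
def learning_rate(epoch, total_epochs=80, base_lr=0.1, final_ratio=0.01):
    """Piecewise schedule: constant, then linear decay, then constant.

    final_ratio and the constant tail are assumptions, not stated settings.
    """
    progress = epoch / total_epochs
    if progress <= 0.5:
        return base_lr                      # first half: constant
    if progress <= 0.9:
        frac = (progress - 0.5) / 0.4       # 0 -> 1 over the next 40% of epochs
        return base_lr * (1.0 - frac * (1.0 - final_ratio))
    return base_lr * final_ratio            # last 10%: assumed constant
```

With the defaults, the rate stays at 0.1 through epoch 40, falls linearly to 0.001 by epoch 72, and remains there until training ends.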

3.5 Compute Hardware

All models were trained on a single L4 or A100 GPU using Google Colab Pro. Training, path search, and path analysis required a total of 1 to 7 hours for each model. Including experiments that did not pan out, the total number of compute units used for this project was between 320 and 380.

3.6 Metrics

To evaluate how network architecture influences the geometry of low-loss paths, we quantify several landscape properties that reflect smoothness, curvature, and robustness. While many potential metrics exist, we focus on three measures that capture complementary geometric characteristics: barrier height, angle-based curvature, and sharpness. Together, these metrics provide insight into how easily solutions can be connected, how much deviation from linearity is required, and how locally stable the solutions are along the path.

3.6.1 Barrier Height

Given a parameterized path between two independently trained models, sampled at N points, we define the barrier height as the difference between the maximum and minimum accuracy observed along the path. Intuitively, this measures the “cost” of traveling between two minima: a high barrier implies that the path crosses a region of poor performance, while a nearly flat barrier indicates a smooth and well-connected landscape. We compute barrier height for two different interpolation regimes: the computed low-loss path (Bezier and PolyChain), and the straight-line interpolation between the two endpoints. Barrier height can be evaluated on the training accuracy, test accuracy, or the generalization gap. From the sampled accuracy along the path we also compute the Area Under the Curve (AUC), which captures the total loss accumulated along the path. Instead of integrating over the entire interpolation range, we restrict the calculation to the segment between the points where accuracy reaches its minimum and maximum values. This helps distinguish models that share the same barrier height but differ in how much of the path exhibits elevated loss.
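A minimal sketch of these two quantities is given below. The barrier height follows the definition above directly. For the restricted AUC, the exact integrand and units are not spelled out in the text, so the sketch assumes that the integrated quantity is the accuracy drop relative to the best point on the path, integrated over t with the trapezoidal rule; the function names are illustrative.

```python
import numpy as np

def barrier_height(acc):
    """Difference between maximum and minimum accuracy along the path."""
    acc = np.asarray(acc, dtype=float)
    return float(acc.max() - acc.min())

def restricted_auc(t, acc):
    """Area of the accuracy drop between the argmin and argmax samples.

    Assumption: the integrand is (max accuracy - accuracy).
    """
    t = np.asarray(t, dtype=float)
    acc = np.asarray(acc, dtype=float)
    i, j = sorted((int(acc.argmin()), int(acc.argmax())))
    drop = acc.max() - acc[i:j + 1]
    seg_t = t[i:j + 1]
    # Trapezoidal rule written out explicitly.
    return float(np.sum(0.5 * (drop[1:] + drop[:-1]) * np.diff(seg_t)))

t = np.linspace(0.0, 1.0, 5)
acc = np.array([90.0, 88.0, 84.0, 88.0, 90.0])
height = barrier_height(acc)     # 6.0 accuracy points
area = restricted_auc(t, acc)    # 1.25 (accuracy points times path fraction)
```

Two paths with the same dip of 6 points can have very different areas: a narrow notch gives a small area, while a long plateau of degraded accuracy gives a large one.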

3.6.2 Angle

The curve-based interpolation methods that were used (Bezier curves or PolyChains) introduce a middle point that is optimized to minimize loss along the path. The relative position of this middle point provides a proxy for how “curved” the low-loss path must be in order to connect the two solutions. Conceptually, if the middle point lies close to the straight segment connecting the endpoints, the landscape is close to linearly connected. Conversely, if the middle point lies far from the segment, the optimization process had to search farther from a linear trajectory to avoid high-loss regions, suggesting a higher degree of curvature or ruggedness in the landscape. To quantify this, we treat the three points (the two minima and the optimized middle point) as a triangle in parameter space, and compute the angles of the triangle, and the area of the triangle given that the distance between endpoints is fixed.
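The triangle quantities can be computed from inner products of the flattened parameter vectors. The sketch below is illustrative; the function name and the returned dictionary keys are ours, and dividing the area by the squared endpoint distance is one possible normalization that we assume here for comparing triangles with different base lengths.

```python
import numpy as np

def triangle_metrics(theta_a, theta_b, theta_c):
    """Side lengths, interior angle at C (degrees), and area of triangle ABC."""
    ca = theta_a - theta_c
    cb = theta_b - theta_c
    len_ca = np.linalg.norm(ca)
    len_cb = np.linalg.norm(cb)
    len_ab = np.linalg.norm(theta_b - theta_a)
    # Clip to guard against rounding slightly outside [-1, 1].
    cos_c = np.clip(np.dot(ca, cb) / (len_ca * len_cb), -1.0, 1.0)
    area = 0.5 * len_ca * len_cb * np.sqrt(1.0 - cos_c ** 2)
    return {
        "dist_ac": float(len_ca),
        "dist_bc": float(len_cb),
        "angle_acb_deg": float(np.degrees(np.arccos(cos_c))),
        "area": float(area),
        "area_normalized": float(area / len_ab ** 2),  # assumed normalization
    }

m = triangle_metrics(np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 1.0]))
```

For this right isosceles example the angle at C is 90 degrees and the area is 1. If C lay on the segment AB the angle would be 180 degrees and the area zero, which is the linearly connected limit.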

3.6.3 Sharpness

While barrier height and curvature reflect global properties of the path, we also desire to characterize the local geometry of the landscape around the low-loss solutions. Specifically, we are interested in whether the path corresponds to flat, stable basins or sharp, narrow valleys. Flat regions are typically associated with improved generalization, robustness, and lower sensitivity to perturbations. A natural way to quantify local geometry is through the Hessian matrix, which captures second-order curvature of the loss. However, computing the full Hessian is intractable for modern neural networks due to its size. Instead, we approximate curvature by estimating the top-k eigenvalues of the Hessian at sampled points along the path. These dominant eigenvalues represent the directions of largest curvature, and therefore serve as a proxy for the “sharpness” of the loss landscape [17]. Large top eigenvalues indicate steep, narrow regions (high sharpness), while smaller values indicate broader, flatter basins. Practical estimation of top eigenvalues can be achieved via iterative methods such as power iteration [18].
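Power iteration only needs Hessian-vector products, never the full Hessian. The sketch below illustrates the idea on a toy quadratic loss whose Hessian is known; it is not the implementation used in our experiments. The finite-difference Hessian-vector product, the function names, and the iteration count are assumptions for the example (in a deep learning framework, the product would normally be obtained by automatic differentiation instead).

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def top_eigenvalue(grad_fn, theta, iters=100, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        eig = float(np.dot(v, hv))        # Rayleigh quotient for the current v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eig

# Toy check: loss = 0.5 * theta^T H theta, so the gradient is H @ theta.
H = np.diag([5.0, 2.0, 1.0])
sharpness = top_eigenvalue(lambda th: H @ th, np.ones(3))
```

Plain power iteration returns only the dominant eigenvalue. Obtaining the top-k values requires an extension such as deflation or a Lanczos-type method, which is omitted here.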

4. Results & Analysis

We present a detailed analysis of the loss landscape and connectivity between minima for ResNet models of varying capacity on the CIFAR-10 and CIFAR-100 datasets. We primarily focus on Bezier paths, as PolyChain paths exhibited similar trends. Across all experiments, we examine how these metrics vary with model capacity. The greatest observed differences in results occur between ResNet-8 and ResNet-26, which correlates with the biggest relative jump in capacity, a ~4.5× increase in parameters.

4.1 Convex Combination

A convex combination of two minima corresponds to a linear interpolation between their parameter vectors. The path can be represented as

θ (t) = (1 - t) θ_{1} + t θ_{2}, t \in [0, 1]

Studying such linear paths is useful because it provides a simple baseline for understanding the geometry of the loss landscape: any drop in performance along this line indicates a barrier between the minima. It is important to note that the observed barrier height along a convex combination can be large simply because the path is linear, and not necessarily because the minima are inherently separated. Higher-order paths, presented in Section 4.2, may circumvent these high-loss regions and reveal more connected low-loss trajectories.
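The point that a linear path can show a barrier even when the minima are connected can be illustrated on a toy loss. The "ring" loss below (zero on the unit circle) and the two chosen minima are assumptions made purely for illustration, and the example measures loss rather than accuracy.

```python
import numpy as np

def linear_path(t, theta_1, theta_2):
    """Convex combination of two parameter vectors."""
    return (1 - t) * theta_1 + t * theta_2

def ring_loss(theta):
    """Toy loss (assumption): every point on the unit circle is a minimum."""
    return (np.linalg.norm(theta) - 1.0) ** 2

theta_1 = np.array([-1.0, 0.0])
theta_2 = np.array([1.0, 0.0])
ts = np.linspace(0.0, 1.0, 11)

# Straight line: passes through the origin, where the toy loss equals 1.
line_losses = [ring_loss(linear_path(t, theta_1, theta_2)) for t in ts]

# Curved alternative: follow the circle from theta_1 to theta_2.
arc_losses = [ring_loss(np.array([-np.cos(np.pi * t), np.sin(np.pi * t)])) for t in ts]

line_barrier = max(line_losses) - min(line_losses)
arc_barrier = max(arc_losses) - min(arc_losses)
```

Here the linear barrier is 1 while the barrier along the arc is numerically zero, even though both paths join the same two minima. A large linear barrier is therefore evidence about the straight line, not about whether the minima are connected.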

As can be seen in the plot below, straight-line interpolation caused similarly large accuracy drops across architectures, showing that increased model complexity does not reduce barrier height along linear paths. However, larger models showed a narrower low-accuracy region, indicating they occupy wider basins of good performance, while smaller models traverse longer high-loss regions. Thus, overparameterization affects the width but not the depth of the barrier on linear paths. This does not support the idea that deeper networks reduce barrier height, though it remains possible that barrier height decreases along curved paths, which will be examined next.

Figure 1: Test accuracy along the line-segment between modes.

Linear paths show large barriers for all models; larger networks just shrink the low-accuracy region.

4.2 Barrier Height

The table below shows that smaller networks experienced large dips in accuracy when moving along the low-loss path, whereas larger networks were much more stable. For example, ResNet-8 lost more than 7% test accuracy at its worst point, while ResNet-65 and ResNet-119 dropped only ~2%. This indicates that deeper/wider models are easier to connect smoothly: their loss basins are flatter and more consistent. However, gains seemed to level off beyond ResNet-65, suggesting that very large models may not benefit further in terms of barrier height.

Table 2: Area Under Curve (AUC) and Barrier Height for the CIFAR-100 test set.
Model	(Convex) AUC	(Convex) Peak	(Bezier) AUC	(Bezier) Peak
ResNet-8	35.0173	63.74%	4.1395	7.02%
ResNet-26	39.5827	85.68%	2.7302	5.19%
ResNet-38	41.9170	91.16%	2.2612	4.22%
ResNet-65	38.1347	94.06%	1.1581	2.29%
ResNet-119	34.8687	96.60%	1.1634	2.08%

Larger networks show much smaller accuracy dips along low-loss paths.

4.3 Angle

The tables below report the interior angle ∠ACB for each model on the two datasets, as well as the distances between the path midpoint C and the endpoints A and B.

Table 3: Path curvature metrics for CIFAR-10
Model	||A-C||₂	||B-C||₂	∠ACB (deg)
ResNet-8	38.75	38.69	69.39
ResNet-26	49.16	48.49	62.31
ResNet-38	49.78	53.37	62.78
ResNet-65	50.40	49.96	56.91
ResNet-119	51.89	54.48	55.16

Table 4: Path curvature metrics for CIFAR-100
Model	||A-C||₂	||B-C||₂	∠ACB (deg)
ResNet-8	58.72	59.81	70.75
ResNet-26	76.25	78.02	68.76
ResNet-38	85.79	84.68	70.18
ResNet-65	70.44	72.95	65.91
ResNet-119	77.92	78.33	64.14

Since the two norms ||A - C|| and ||B - C|| are close to each other, the low-loss path is roughly symmetric, and therefore we can reliably use the interior angle ∠ACB as a proxy for the path's degree of nonlinearity. Smaller angles correspond to a larger outward displacement of C and therefore to a more strongly curved low-loss path.
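The link between the angle and the outward displacement of C can be made explicit under the symmetry assumption. The short derivation below is a sketch: the symbols ℓ (common leg length), γ (the angle ∠ACB), M (midpoint of the segment AB), and h (distance from C to M) are introduced here for illustration, and the numerical values are approximate because the measured legs are only roughly equal.

```latex
% Symmetric triangle: \|A - C\| = \|B - C\| = \ell, apex angle \gamma = \angle ACB
\|A - B\| = 2\,\ell \sin(\gamma/2), \qquad
h = \|C - M\| = \ell \cos(\gamma/2)

% Displacement of C relative to half the endpoint distance:
\frac{h}{\|A - B\| / 2} = \cot(\gamma/2)

% \cot(\gamma/2) is decreasing on (0^\circ, 180^\circ) and equals 0 at \gamma = 180^\circ.
% Examples with the CIFAR-10 angles:
%   \gamma = 69.39^\circ \;\Rightarrow\; \cot(\gamma/2) \approx 1.44
%   \gamma = 55.16^\circ \;\Rightarrow\; \cot(\gamma/2) \approx 1.91
```

In other words, relative to the distance between the endpoints, the midpoint sits farther from the straight segment when the angle is smaller, and a perfectly linear connection corresponds to an angle of 180°.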

On CIFAR-10, the angle decreased with model size, from approximately 69° for ResNet-8 to approximately 55° for ResNet-119 (ResNet-26 and ResNet-38 were nearly identical at about 62° to 63°), indicating that larger models relied on more strongly curved paths to maintain low loss. This goes against our original expectation that the curvature of the low-loss paths would decrease as model capacity increases. A similar, although less pronounced, trend was observed on CIFAR-100.

As a supplemental note, while ||A − C|| and ||B − C|| are useful for validating the assumption that the low-loss path is symmetric, two reasons make it difficult to draw a conclusive statement about the trend of either metric alone across model capacity. First, while there is an increase in the norm from ResNet-8 to ResNet-26 on the CIFAR-10 dataset, the differences among the other neighboring models on CIFAR-10 are small, and the CIFAR-100 dataset shows no clear trend. Second, Euclidean distance in high-dimensional space can be misleading, as differences in noisy dimensions can accumulate and dominate the distance in the relevant dimensions. Practically, this makes it difficult to confidently identify a relationship between model capacity and the distance between minima.

Counter-intuitively, larger networks require more curved paths to connect minima, despite having flatter loss landscapes overall.

4.4 Sharpness

The sharpness of the low-loss path provides insight into the robustness and generalization capabilities of a trained model. Our analysis reveals that, for the same optimizer hyperparameters, smaller models like ResNet-8 tend to find solutions in sharper minima compared to larger models like ResNet-119 which opt instead for flatter, wider basins. This supports our hypothesis.

Figure 2: Sharpness along the connected Bezier paths for CIFAR-10 and CIFAR-100.

However, despite the low barrier height for a large model like ResNet-119, our results show that the sharpness still changes by a factor of 2 between the endpoints and the midpoint of the low-loss path. To ensure that this behavior was a more global rather than local property, the average of the five largest eigenvalues of the Hessian was studied, which supported the behavior exhibited by the largest eigenvalue alone. The results can be seen below.

Figure 3: Mean of the five largest Hessian eigenvalues along the connected Bezier paths for CIFAR-10 and CIFAR-100.

The sharpness behavior highlights that a low loss alone might not be enough to classify a path as consisting of truly equivalent solutions. This can be relevant in scenarios such as when the low-loss path is being used to create an ensemble of neural networks. If truly similar performance is desired, it can be wise to also consider the sharpness when creating such an ensemble, otherwise, the varying sharpness can lead to unequal generalization capabilities amongst the ensemble.

Sharpness measurements used the top eigenvalues of the Hessian matrix as computed via power iteration. Results show that larger models find flatter minima, but sharpness still varies along the path.

4.5 Loss Landscape

Visualizations of the test error over a two-dimensional grid in the plane spanned by the two trained solutions and the optimized midpoint revealed how error varies not only along the low-loss path but also in its surrounding region. In all cases, the straight-line interpolation passed through a pronounced high-error ridge, while the optimized curve circumvented this region by bending into areas of lower error.

Notably, the extent and shape of the low-error basins differed systematically across architectures: ResNet-8 exhibited a relatively narrow and fragmented low-error region, with steep transitions into high-error zones, whereas deeper models such as ResNet-65 and ResNet-119 displayed substantially broader low-error basins, within which the optimized path remained confined.

This suggests that larger networks form wider and more coherent regions of good generalization performance, despite requiring a curved path to connect minima. The transition from narrow to broad basins is visible in the progressive flattening and widening of the low-error contours across models, indicating that overparameterization produces landscapes in which low-error regions are not only connected but also spatially expansive.

Figure 4: Planar slices of the high-dimensional loss landscape for ResNet models of varying size.

Deep networks reshape the landscape into wide, gently sloping valleys. Small models remain confined to steep, narrow basins.

4.6 Generalization Gap

The difference between the train and test accuracies, referred to as the generalization gap, can provide valuable insight into the quality of the solution which goes beyond simply looking at its loss. As previously discussed in sections 1 and 2, it is generally accepted that flatter minima result in improved generalization capabilities. We hypothesized that increased flatness (induced by increased model capacity) would align with improved generalization capabilities. To test this hypothesis, we studied the generalization gaps from the low-loss Bezier curves. However, the results from the CIFAR-10 dataset only mildly support our hypothesis while the results for CIFAR-100 actually go against it. We believe this inconsistency can be explained by the model capacities used in our study.

Figure 5: Generalization gaps for CIFAR-10 and CIFAR-100.

As can be seen in the plot above, for the CIFAR-10 dataset, the generalization gap follows the double descent phenomenon, and the results in the plot below support our hypothesis that low sharpness (induced by overparameterization) leads to improved generalization along the entire length of the low-loss path. However, the evidence is only mildly supportive. We believe the interpolation threshold occurs around 0.5M parameters, but most of our model capacities fall above this threshold.

Additionally, we do not have a fine range of data in the “classical” regime showing the lead-up to the interpolation threshold, and we believe a model larger than ResNet-119 would better illustrate the behavior in the “modern” interpolating regime. Therefore, to produce stronger support for our claims and better characterize the relationship between overparameterization, sharpness, and generalization, our future work with the CIFAR-10 dataset intends to study models with new sizes such as 0.01M, 0.3M, 0.5M, and 2M.

For CIFAR-100, the results in the right-hand side of the plot below do not support our hypothesis that links overparameterization, sharpness, and generalization. We believe this is because we have not yet reached the interpolation threshold. This is supported by the right-hand side of the figure above, which shows that while the training error decreases each time model capacity is increased, the generalization gap only increases. We believe the interpolation threshold is near 1.2M parameters, as that is the first time the training error approaches 0%.

Due to the large computational requirements to train a ResNet-119 model, it was not within the scope of our timeline to train a larger model. Given additional time, optimization hyperparameters could be adjusted in an attempt to reach the interpolation threshold earlier. Alternatively, compute resources aside, models of larger sizes such as 2M and 5M could also provide valuable data. We believe these larger models would go above the interpolation threshold and generate support for our hypothesis.

Figure 6: Error and generalization gap for CIFAR-10 and CIFAR-100.

Generalization improved only mildly with flatness on CIFAR-10 and reversed on CIFAR-100. The likely cause is that most models lie above the interpolation threshold for CIFAR-10 but below it for CIFAR-100.

5. Conclusion

This project performed a systematic, quantitative geometric analysis of the loss landscape for ResNet architectures of varying depth, helping bridge the gap between theoretical properties of the optimization landscape and empirical model performance. Our findings support many core tenets of existing literature on mode connectivity, provide new information about the geometric properties of the connecting low-loss paths, and highlight novel, counter-intuitive insights that warrant further research.

5.1 Key Findings

Our analysis of the Bezier-optimized low-loss paths revealed a clear evolution of the loss landscape with increased network depth:

Mode Connectivity and Barrier Height:
We confirmed that increased network depth reduces the barrier height along the curved paths between distinct solutions. Test accuracy loss along the path dropped from over 7% for the shallowest model (ResNet-8) to ~2% for the deepest models, indicating that overparameterization creates a smoother, more easily traversable loss manifold.

Sharpness and Robustness:
We verified the hypothesis that deeper networks tend to converge to flatter minima, which the literature associates with improved generalization. We also showed that sharpness is not uniform along the low-loss path: it varies by a factor of 2 even when the loss and barrier height remain minimal.
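To make the idea of sharpness varying along a zero-loss path concrete, here is a small sketch that estimates the largest Hessian eigenvalue with power iteration on finite-difference Hessian-vector products. Using the top eigenvalue as the sharpness measure is an assumption made for this example and may differ from the measure used in our experiments; the toy loss and all names are hypothetical.

```python
import numpy as np


def top_hessian_eigenvalue(grad_fn, w, num_iters=50, eps=1e-4, seed=0):
    """Estimate the dominant Hessian eigenvalue at w by power iteration.

    Hessian-vector products use central differences of the gradient, so the
    Hessian is never formed. Power iteration converges to the eigenvalue of
    largest magnitude, which can be negative away from a minimum.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the unit vector v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eigenvalue


def toy_grad(w):
    """Gradient of the toy loss 0.5 * (1 + w[0]**2) * w[1]**2 (NOT a network loss)."""
    return np.array([w[0] * w[1] ** 2, (1.0 + w[0] ** 2) * w[1]])


# The line w[1] = 0 is a valley of zero loss; sharpness still changes along it.
xs = np.linspace(-1.0, 1.0, 5)
sharpness_along_valley = [
    top_hessian_eigenvalue(toy_grad, np.array([x, 0.0])) for x in xs
]
```

Every point on the sampled valley has the same (zero) loss, yet the estimated sharpness differs between the ends and the middle, which is the kind of variation described above.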

Path Curvature:
Contrary to our initial hypothesis, the low-loss paths connecting minima in deeper networks were found to be more strongly curved. This suggests that while overparameterization creates broad, connected basins, the optimizer must exploit higher-order non-linearities in parameter space to navigate the lowest-loss routes between them.
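One way to quantify how strongly a path bends is the geometric curvature of the curve. The sketch below evaluates it for a quadratic Bezier path; this is an illustrative choice, and the curvature metric, function name, and example points are assumptions rather than the exact definition used in our experiments.

```python
import numpy as np


def bezier_curvature(w1, theta, w2, t):
    """Geometric curvature at position t of a quadratic Bezier curve.

    Uses kappa = sqrt(|d1|^2 |d2|^2 - (d1 . d2)^2) / |d1|^3, where d1 and d2
    are the first and second derivatives of the curve with respect to t.
    """
    d1 = 2 * (1 - t) * (theta - w1) + 2 * t * (w2 - theta)
    d2 = 2 * (w2 - 2 * theta + w1)
    speed_sq = float(np.dot(d1, d1))
    if speed_sq == 0.0:
        return float("nan")  # curvature is undefined where the velocity vanishes
    area_sq = speed_sq * float(np.dot(d2, d2)) - float(np.dot(d1, d2)) ** 2
    return float(np.sqrt(max(area_sq, 0.0)) / speed_sq ** 1.5)


# Hypothetical endpoints and control points in a 2-D toy parameter space.
w1 = np.array([1.0, 0.0])
w2 = np.array([-1.0, 0.0])
straight = bezier_curvature(w1, 0.5 * (w1 + w2), w2, 0.5)
bent = bezier_curvature(w1, np.array([0.0, 2.0]), w2, 0.5)
```

A control point at the midpoint of the endpoints gives a straight segment with zero curvature, and moving the control point away from the chord increases the curvature, so the maximum or mean of this quantity over t can serve as a scalar summary of how non-linear a path is.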

Generalization Gap:
The inconsistency in the generalization gap results between CIFAR-10 and CIFAR-100 suggests that decreased sharpness alone is insufficient to guarantee improved generalization, and that the benefits may only manifest past the interpolation threshold. This requires further investigation with larger models.

5.2 Significance

The primary significance of this work lies in its quantitative findings, which move beyond the mere existence of low-loss paths to characterize their geometry.

Theoretical Advancement:
The observed variation in sharpness along the path has direct implications for ensemble methods. It shows that solutions sampled along the low-loss path are not equivalent in terms of robustness, which suggests that ensemble generation strategies should use sharpness as a selection criterion in addition to low loss.
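A selection rule of this kind could look like the sketch below: keep only path points whose loss is close to the best one, then prefer the flattest among them. The data structure, the tolerance, and the numbers are hypothetical and serve only to illustrate the idea; we did not evaluate this procedure.

```python
from typing import List, NamedTuple


class PathPoint(NamedTuple):
    t: float          # position along the path, in [0, 1]
    loss: float       # loss of the model at this position
    sharpness: float  # any scalar sharpness measure, lower is flatter


def select_ensemble_members(
    points: List[PathPoint], k: int, loss_tolerance: float
) -> List[PathPoint]:
    """Pick the k flattest points whose loss is within tolerance of the best."""
    best_loss = min(p.loss for p in points)
    eligible = [p for p in points if p.loss <= best_loss + loss_tolerance]
    return sorted(eligible, key=lambda p: p.sharpness)[:k]


# Placeholder values for illustration only; NOT measurements from this report.
candidates = [
    PathPoint(t=0.00, loss=0.30, sharpness=4.0),
    PathPoint(t=0.25, loss=0.31, sharpness=2.5),
    PathPoint(t=0.50, loss=0.32, sharpness=2.0),
    PathPoint(t=0.75, loss=0.45, sharpness=1.5),  # flat, but loss is too high
    PathPoint(t=1.00, loss=0.30, sharpness=3.5),
]
members = select_ensemble_members(candidates, k=2, loss_tolerance=0.05)
selected_positions = [p.t for p in members]
```

Filtering on loss first keeps the rule from trading accuracy for flatness; whether such a rule improves ensemble robustness in practice remains to be tested.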

Principled Model Design:
By establishing a clear, quantitative link between network depth and geometric metrics (barrier height, sharpness), our results enable more principled model design. Researchers can better anticipate the geometric characteristics of the loss landscape induced by architectural choices and make better-informed decisions about model architecture and optimizer selection.

5.3 Future Work

Given the fundamental nature of our findings and the computational constraints of this study, several avenues for future research are warranted:

Architectural Universality Testing:
It would be valuable to apply these techniques to more complex and fundamentally different architectures, such as Transformers. The primary objective is to investigate whether the observed geometric phenomena are universal properties of deep neural networks or are tied to specific model types, and to explore whether modern complex architectures, where theory is harder to apply [19], present new phenomena.

Efficiency of Path Discovery:
A deeper exploration is needed into the conditions under which the low-loss path transitions to being linear or near-linear. A computationally powerful result would involve predicting the existence and geometry of low-loss paths in complex architectures, similar to recent work on predicting optimizer dynamics [20], which could lead to novel insights into the loss landscape itself.

Scaling to the Interpolating Regime:
Future work should extend the CIFAR-100 analysis into the "modern interpolating regime". This is necessary to fully characterize the generalization gap inconsistency and verify whether the benefits of increased capacity only manifest past the interpolation threshold.

Optimizer Influence on Path Geometry:
The counter-intuitive observation of increasing path curvature warrants investigation into how different optimizer regularization techniques and hyperparameter settings might affect the non-linearity of the low-loss paths, aiming to identify methods that promote straighter, lower-curvature trajectories.

References

[1] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, "Loss surfaces, mode connectivity, and fast ensembling of DNNs", Advances in Neural Information Processing Systems, 2018.

[2] F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht, "Essentially no barriers in neural network energy landscape", Proceedings of the 35th International Conference on Machine Learning, 2018.

[3] S. Sagawa, A. Raghunathan, P. W. Koh, and P. Liang, "An investigation of why overparameterization exacerbates spurious correlations", Proceedings of the 37th International Conference on Machine Learning, 2020.

[4] H. Hassani and A. Javanmard, "The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression", The Annals of Statistics, vol. 52, no. 2, pp. 441-465, 2024.

[5] Z. Zhu, F. Liu, G. Chrysos, and V. Cevher, "Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization)", Advances in Neural Information Processing Systems, 2022.

[6] R. Sun, D. Li, S. Liang, T. Ding, and R. Srikant, "The global landscape of neural networks: An overview", IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 95-108, 2020.

[7] Q. Nguyen, P. Bréchet, and M. Mondelli, "When are solutions connected in deep networks?", Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.

[8] Y. Cooper, "Global minima of overparameterized neural networks", SIAM Journal on Mathematics of Data Science, vol. 3, no. 2, pp. 676-691, 2021.

[9] B. Simsek, F. Ged, A. Jacot, F. Spadaro, C. Hongler, W. Gerstner, and J. Brea, "Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances", Proceedings of the 38th International Conference on Machine Learning, 2021.

[10] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, A. Storkey, and Y. Bengio, "Three factors influencing minima in SGD", 2018.

[11] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, "Sharpness-aware minimization for efficiently improving generalization", International Conference on Learning Representations, 2021.

[12] P. Izmailov, D. Podoprikhin, T. Garipov, D. P. Vetrov, and A. G. Wilson, "Averaging weights leads to wider optima and better generalization", Conference on Uncertainty in Artificial Intelligence, 2018.

[13] J. Kaddour, L. Liu, R. Silva, and M. J. Kusner, "When do flat minima optimizers work?", Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.

[14] D. Caldarola, B. Caputo, and M. Ciccone, "Improving generalization in federated learning by seeking flat minima", Computer Vision – ECCV 2022: 17th European Conference, 2022.

[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.

[16] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, "Visualizing the loss landscape of neural nets", Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.

[17] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima", CoRR, 2016.

[18] J. Martens and I. Sutskever, "Training deep and recurrent networks with Hessian-free optimization", in Neural Networks: Tricks of the Trade, pp. 479-535, Springer Berlin Heidelberg, 2012.

[19] L. Oneto, S. Ridella, and D. Anguita, "Do we really need a new theory to understand over-parameterization?", Neurocomputing, vol. 543, p. 126227, 2023.

[20] J. Cohen, A. Damian, A. Talwalkar, J. Z. Kolter, and J. D. Lee, "Understanding optimization in deep learning with central flows", The Thirteenth International Conference on Learning Representations, 2025.

[21] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, "timgaripov/dnn-mode-connectivity", GitHub repository, 2018.

AI Declaration

In conducting this project, we used the generative artificial intelligence tools ChatGPT and GitHub Copilot. These tools were used sparingly to assist with non-technical coding, such as formatting plots and generating HTML from LaTeX. We consider these applications outside the scope of this course and therefore deemed them acceptable to complete with limited AI assistance.
In summary, the usage of Gen-AI has been restricted to:

Proofreading of written text and suggestions of alternative formulations.
Suggestion of bug-fixes for erroneous code.
Generation of code for plots.
"Translation" of LaTeX code to HTML.

This means that we have not used Gen-AI for:

Generation of "new" text for this report.
Generation of completely "new" code for any of the analyses.
Generation of ideas or conception of the project's methodological approach.
Interpretation of the results.

Code Declaration

Parts of the code for finding the low-loss path were taken from the repository in [21]. We modified this code to fit our project.