Formally, let \( D = \{(x_i, y_i)\}_{i=1}^{n} \) be your training pool and
\( C \) be a context of size \( m \) sampled from \( D \).
A resampler defines a distribution \( q(i) \) over indices
(or a constrained sampling procedure). Training or inference-time conditioning then depends on:
\[
C \sim q, \quad \text{not } \mathrm{Uniform}(D)
\]
On imbalanced data, a uniform \( q \) yields majority-class-dominated gradients and conservative decision boundaries.
Resampling intervenes at the data selection layer to increase signal in underrepresented regions
without adding architectural complexity.
Fine-tuning vs resampling (pragmatic view)
- Fine-tuning can make sense when you have “clean” datasets (large, well-labeled, stable distribution) and you can afford gradient updates and careful calibration.
- Resampling is the compute-efficient alternative when you want adaptation without retraining: you change \( q \), not \( \theta \). In TabTune this is wired as inference-time tuning (tuning_strategy="inference") so you can keep a strong zero-shot baseline and still adapt by context selection.
Implementation surface (one switch, multiple samplers)
Resampling is configured in tuning_params via context_sampling_strategy, plus sampler-specific knobs (e.g., hybrid_ratio, kmeans_centers, min_pos, oversample_weight).
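As a concrete illustration, the configuration might look like the sketch below. The key names (context_sampling_strategy, hybrid_ratio, kmeans_centers) and the tuning_strategy="inference" flag come from this section; the strategy string value and the commented pipeline call are assumptions, not TabTune's verified API.

```python
# Illustrative sketch only: the dict keys are the knobs named above; the
# commented call is a hypothetical stand-in, not TabTune's verified API.
tuning_params = {
    "context_sampling_strategy": "hybrid_balanced_diverse",  # sampler choice
    "hybrid_ratio": 0.7,    # share of the context drawn by the "signal" sampler
    "kmeans_centers": 32,   # clusters for the diversity component
}
# model.fit(..., tuning_strategy="inference", tuning_params=tuning_params)
```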
Mathematical definitions for each sampling method
Let \( m = \) context_size.
For classification with \( K \) classes, define class sets \( S_c = \{ i : y_i = c \} \)
and counts \( n_c = |S_c| \).
1) Uniform sampling
Preserves original distribution.
\[
q_{\text{uni}}(i) = \frac{1}{n}
\]
Sample \( m \) indices i.i.d. from \( q_{\text{uni}} \) (or without replacement if
allow_replacement=False).
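A minimal NumPy sketch of this baseline (the helper name and seed are mine, not TabTune's):

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_context(n: int, m: int, allow_replacement: bool = True) -> np.ndarray:
    """Draw m context indices under q_uni(i) = 1/n."""
    return rng.choice(n, size=m, replace=allow_replacement)
```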
2) Stratified sampling
Stabilizes representation without rebalancing; classification keeps class proportions, regression uses quantile bins.
Classification (fixed proportions):
\[
m_c = \left\lfloor m \cdot \frac{n_c}{n} \right\rfloor
\]
with the remaining \( m - \sum_c m_c \) slots distributed across classes (e.g., by largest fractional remainder) so that \( \sum_{c=1}^{K} m_c = m \). Then sample \( m_c \) indices uniformly from each \( S_c \).
Regression (binned targets):
Bin \( y \) into \( B \) bins \( b(i) \in \{1, \dots, B\} \) using quantiles (pandas qcut, with cut as a fallback).
\[
m_b = \left\lfloor m \cdot \frac{n_b}{n} \right\rfloor
\]
Sample \( m_b \) uniformly from each bin, distributing any remainder as in the classification case.
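A sketch of the classification quota logic under the definitions above (the helper name and remainder rule are illustrative; assumes \( m \le n \)):

```python
import numpy as np

rng = np.random.default_rng(0)

def stratified_context(y: np.ndarray, m: int) -> np.ndarray:
    """Quotas m_c = floor(m * n_c / n); leftover slots go to the largest classes."""
    classes, counts = np.unique(y, return_counts=True)
    quotas = (m * counts) // len(y)
    for c in np.argsort(-counts)[: m - quotas.sum()]:  # distribute the remainder
        quotas[c] += 1
    picks = [rng.choice(np.flatnonzero(y == cls), size=q, replace=False)
             for cls, q in zip(classes, quotas)]
    return np.concatenate(picks)
```

For regression, replace y with quantile-bin labels, e.g. pd.qcut(y, B, labels=False, duplicates="drop"), and reuse the same quota logic.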
3) Balanced sampling
Forces equal representation across classes (or bins); strong recall, but shifts training vs inference distribution
and increases threshold sensitivity.
Classification (equal mass per class):
\[
q_{\text{bal}}(i) = \frac{1}{K} \cdot \frac{1}{n_{y_i}}
\]
Equivalently, set \( m_c \approx m / K \) and sample within each class.
Regression (equal mass per bin):
\[
q_{\text{bal-reg}}(i) = \frac{1}{B} \cdot \frac{1}{n_{b(i)}}
\]
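The same idea expressed as sampling weights rather than quotas; a sketch for the classification case (names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_context(y: np.ndarray, m: int) -> np.ndarray:
    """q_bal(i) = 1/(K * n_{y_i}): equal expected mass per class."""
    classes, counts = np.unique(y, return_counts=True)
    n_of = dict(zip(classes, counts))
    q = np.array([1.0 / (len(classes) * n_of[yi]) for yi in y])
    q /= q.sum()  # guard against floating-point drift
    return rng.choice(len(y), size=m, replace=True, p=q)
```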
4) Weighted minority oversampling
Inverse-frequency weighted sampling with replacement plus a boost multiplier
(oversample_weight) and a minimum enforced minority count
(min_pos).
Binary case \( y \in \{0,1\} \), minority \( y = 1 \). Let \( \beta = \)
oversample_weight.
\[
w(i) =
\begin{cases}
\beta \cdot \frac{1}{n_1} & y_i = 1 \\
\frac{1}{n_0} & y_i = 0
\end{cases}
\qquad
q_{\text{wos}}(i) = \frac{w(i)}{\sum_{j=1}^{n} w(j)}
\]
Min-pos constraint
Enforce at least \( m_1 = \) min_pos positives by construction:
- Sample \( m_1 \) indices from \( S_1 \)
- Sample \( m - m_1 \) remaining indices from \( q_{\text{wos}} \) (or a background sampler)
This is not naïve duplication of the minority set; it is deliberate, tunable signal amplification.
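A sketch of the binary case, combining the weighted draw with the min_pos floor (the defaults are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_oversample_context(y: np.ndarray, m: int,
                                beta: float = 2.0, min_pos: int = 8) -> np.ndarray:
    """Inverse-frequency weights with a beta boost on the minority, plus a min_pos floor."""
    n1, n0 = int((y == 1).sum()), int((y == 0).sum())
    w = np.where(y == 1, beta / n1, 1.0 / n0)  # w(i) from the definition above
    q = w / w.sum()                            # q_wos(i)
    forced = rng.choice(np.flatnonzero(y == 1), size=min_pos, replace=min_pos > n1)
    rest = rng.choice(len(y), size=m - min_pos, replace=True, p=q)
    return np.concatenate([forced, rest])
```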
5) SMOTE / SMOTENC (synthetic minority)
Used when imblearn is available.
Pipeline: temporary NaN imputation → SMOTE (all-numerical features) / SMOTENC (mixed numerical and categorical features) → subsample back to context size.
Numerical SMOTE generation
Pick minority anchor \( i \in S_1 \), neighbor \( j \in \mathcal{N}_k(i) \), sample \( \lambda \sim U(0,1) \):
\[
\tilde{x} = x_i + \lambda (x_j - x_i),
\qquad
\tilde{y} = 1
\]
SMOTENC abstraction
Interpolate numeric features as above; set categorical features via a discrete operator over the neighbors (e.g., per-feature mode):
\[
\tilde{x}^{(c)}_t = \mathrm{mode}\left\{ x^{(c)}_{r,t} : r \in \{i\} \cup \mathcal{N}_k(i) \right\}
\]
Build an augmented pool and sample \( m \) points from it.
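A sketch of that pipeline using imbalanced-learn directly; the wrapper and the mean-imputation choice are mine, and on mixed data you would swap SMOTE for SMOTENC(categorical_features=...):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

rng = np.random.default_rng(0)

def smote_context(X, y, m):
    """Impute NaNs, synthesize minority points, then subsample back to context size m."""
    X_imp = SimpleImputer(strategy="mean").fit_transform(X)   # temporary NaN imputation
    X_aug, y_aug = SMOTE(random_state=0).fit_resample(X_imp, y)
    idx = rng.choice(len(y_aug), size=m, replace=False)       # assumes m <= augmented pool
    return X_aug[idx], y_aug[idx]
```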
6) Diversity-based sampling (MiniBatch KMeans)
Focus is coverage, not balance. Workflow: impute → one-hot encode → MiniBatch KMeans → pick one representative per cluster.
Let \( \phi(x) \) be the impute + one-hot map. Fit KMeans with
\( K_c = \) kmeans_centers producing centroids \( \mu_1, \dots, \mu_{K_c} \).
\[
c(i) = \arg\min_{k} \left\| \phi(x_i) - \mu_k \right\|_2^2
\]
Representative per cluster
\[
i_k = \arg\min_{i : c(i)=k} \left\| \phi(x_i) - \mu_k \right\|_2^2
\]
The context is \( \{ i_1, \dots, i_{K_c} \} \), trimmed or padded to size \( m \) as needed.
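A sketch of the representative-per-cluster step with scikit-learn; phi_X stands for the imputed, one-hot matrix \( \phi(x) \), and for simplicity I set the number of clusters to \( m \) directly rather than trimming from kmeans_centers:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def diverse_context(phi_X: np.ndarray, m: int) -> np.ndarray:
    """Pick, per cluster, the point closest to its centroid (skips empty clusters)."""
    km = MiniBatchKMeans(n_clusters=m, random_state=0).fit(phi_X)
    dists = np.linalg.norm(phi_X - km.cluster_centers_[km.labels_], axis=1)
    reps = [np.flatnonzero(km.labels_ == k)[np.argmin(dists[km.labels_ == k])]
            for k in range(m) if np.any(km.labels_ == k)]
    return np.array(reps)
```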
7) Hybrid strategies (signal + coverage)
TabTune ships hybrid_balanced_diverse (classification) and
hybrid_stratified_diverse (regression), mixing balanced/stratified sampling
with diversity using hybrid_ratio.
Let \( \rho = \) hybrid_ratio.
\[
m_{\text{sig}} = \lfloor \rho m \rfloor,
\qquad
m_{\text{cov}} = m - m_{\text{sig}}
\]
\[
C = C_{\text{sig}} \cup C_{\text{cov}}
\]
where \( C_{\text{sig}} \) is sampled via balanced (classification) or stratified (regression),
and \( C_{\text{cov}} \) via KMeans diversity.
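Composing the earlier sketches (balanced_context and diverse_context from above), a classification hybrid in the spirit of hybrid_balanced_diverse might look like:

```python
import numpy as np

def hybrid_context(phi_X: np.ndarray, y: np.ndarray,
                   m: int, rho: float = 0.5) -> np.ndarray:
    """m_sig = floor(rho * m) from the balanced sampler; the rest from KMeans diversity."""
    m_sig = int(rho * m)
    sig = balanced_context(y, m_sig)           # signal: class-balanced draw
    cov = diverse_context(phi_X, m - m_sig)    # coverage: one representative per cluster
    return np.concatenate([sig, cov])
```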