Foundations refresher, day 1 — re-walking deep-learning basics before thesis kickoff
Before reading deeper into the learned-compression literature for thesis work, I am running a structured re-walk through the foundations of deep learning — slowly enough to actually look at each piece. The PyTorch use in the sEMG project below sat on top of architectural priors I had taken on trust; the goal of this refresher is to put those priors back on first-principles ground before they become invisible scaffolding under thesis-level work.
Day 1 covered the basics in one pass: what deep learning does mechanically (rule-finding from examples rather than rule-writing), how a neural network is structured (neuron → layer → depth as a feature ladder), how training works (loss as a scalar, gradient descent as blindfolded descent, learning rate as step size), and the overfitting / underfitting distinction with the four standard remedies (more data, L2 regularisation, dropout, early stopping). The textbook material I won’t rehash. What is worth recording from day 1 is what I caught myself getting wrong:
Negative weights. I had been carrying the implicit picture that a “more important” feature gets a larger weight. Working through a toy “should I go to the beach” example forced the point that suppressing a decision is just as legitimate as supporting it, and the way a network represents “this feature pushes against the answer” is a weight with negative sign and large magnitude. Trivial in hindsight; not how I had been visualising it.
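A minimal sketch of that point as a single sigmoid neuron. The feature names, weights, and bias are invented for illustration; only the sign structure matters: one large negative weight is enough to overturn two positive ones.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a raw score into a 0..1 'probability of going'."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for the toy example (all values invented).
# "thunderstorm" matters a lot, and it matters *against* the decision.
weights = {"sunny": 2.0, "free_afternoon": 1.5, "thunderstorm": -4.0}
bias = -1.0

def go_to_beach(features: dict) -> float:
    """Weighted sum of the features plus bias, passed through a sigmoid."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)

clear_day = go_to_beach({"sunny": 1, "free_afternoon": 1, "thunderstorm": 0})
stormy_day = go_to_beach({"sunny": 1, "free_afternoon": 1, "thunderstorm": 1})
print(f"clear: {clear_day:.2f}, stormy: {stormy_day:.2f}")
```

The thunderstorm feature has the largest magnitude of the three, so in the "importance" sense it is the most important input, yet its weight is negative.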
Depth as a strict ladder, not a soft metaphor. I knew “deep” meant many layers, and that early layers learn simple features. What I had not internalised is that each layer’s input vocabulary is literally the previous layer’s output. Edges → shapes → eye → face isn’t a slogan but the actual data flow, and “eye” is just a stable activation in some middle layer that the next layer uses as a primitive. That makes depth a different kind of design choice than I had treated it as — it controls the maximum composition height of the features the network can express, not just the parameter count.
Underfitting ≠ a worse case of overfitting. Asked to classify a hypothetical model with 70% train / 68% test accuracy, I called it overfitting because both numbers looked bad. The correct label is underfitting: both numbers being low and close to each other points to insufficient capacity, not memorisation. The two diseases need opposite treatments — underfitting wants more capacity or longer training; overfitting wants regularisation, dropout, or more data. Catching this confusion now is much cheaper than catching it later, embedded in a real experiment where the data and the architecture are both moving.
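The decision rule I should have applied can be written down as a rough heuristic. The function name and both thresholds (`target_acc`, `max_gap`) are my own placeholders, not standard values; the point is only that the train/test gap is checked separately from the absolute level.

```python
def diagnose_fit(train_acc: float, test_acc: float,
                 target_acc: float = 0.90, max_gap: float = 0.05) -> str:
    """Rule-of-thumb label from two accuracies (thresholds are assumptions)."""
    gap = train_acc - test_acc
    if gap > max_gap:
        # Train much better than test: the model memorised the training set.
        return "overfitting"
    if train_acc < target_acc:
        # Both numbers low and close together: not enough capacity or training.
        return "underfitting"
    return "ok"

print(diagnose_fit(0.70, 0.68))  # the case I mislabelled
print(diagnose_fit(0.99, 0.75))
```

With these placeholder thresholds, the 70% / 68% model comes out as underfitting because the gap is small; low absolute numbers alone never trigger the overfitting branch.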
Day 2 will move into hand-writing a small model in PyTorch to make the learning-rate and overfitting points concrete; from there into CNNs, then into the learned-compression specifics (autoencoders, quantisation, entropy coding, hyperprior) that the thesis area runs on. Substantive updates to this site will probably come when those last pieces start to load.
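As a placeholder until the PyTorch version exists, the learning-rate point can already be seen in plain Python on a one-parameter toy loss. The loss (w − 3)², the starting point, and the two step sizes are arbitrary choices for illustration.

```python
def descend(learning_rate: float, start: float = 0.0, n_steps: int = 50) -> float:
    """Gradient descent on the toy loss L(w) = (w - 3)^2, minimum at w = 3."""
    w = start
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)       # dL/dw
        w -= learning_rate * grad    # step against the gradient
    return w

careful = descend(learning_rate=0.1)   # each step shrinks the error by a factor 0.8
reckless = descend(learning_rate=1.1)  # each step overshoots and grows the error by 1.2
print(careful, reckless)
```

On this loss the error after each step is multiplied by (1 − 2·learning_rate), so any step size above 1.0 makes the blindfolded walker jump further past the minimum every time.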
Robustness check: does SINDy still win when the generator isn’t polynomial?
After writing the SINDy follow-up below I went back to a concern that had been sitting at the edge of my own thinking. The synthetic generator I used maps stiffness to sEMG through a constant linear mixing matrix and adds noise whose standard deviation scales with sEMG amplitude. That data-generating process sits structurally inside SINDy’s degree-2 polynomial hypothesis class. A reader who has worked with sparse-regression methods would reasonably ask: did SINDy win because it discovered the generator, or because the generator happened to be in its model class? Until I checked, I couldn’t tell.
So I ran an ablation. Trajectories unchanged; the sEMG observation model gains three pieces of structure SINDy cannot fit cleanly: (1) tanh saturation on the linear mixing (emg = tanh(2·W·K)), modelling motor-unit recruitment plateaus; (2) slow fatigue drift on W (W_eff(t) = W₀ + 0.25·ΔW·sin(0.2π·t)), making the observation map time-varying; (3) a state-dependent noise floor (σ ∝ σ₀ + 0.08·‖K‖) on top of the existing amplitude-proportional component. Same 6/2 split, same horizons, same SINDy hyperparameters. Ridge, SINDy, ESN, and an MLP baseline (MLPRegressor with hidden_layer_sizes=(48, 24), substituting for the PyTorch LSTM in the easy run so this ablation runs in a torchless environment; both play the same “high-capacity neural baseline” role) all run on both generators.
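For readers who want the three modifications in one place, here is a NumPy sketch of the hard observation model. It is not the code in sindy_robustness.py. The function name, array shapes, the values of `sigma0` and `amp_noise`, the assumption that t is in seconds, and the way the three pieces are combined (drift applied inside the tanh, noise terms summed) are my assumptions for illustration.

```python
import numpy as np

def observe_semg_hard(K, t, W0, dW, sigma0=0.02, amp_noise=0.05, rng=None):
    """Sketch of the 'hard' sEMG observation model.

    K  : (T, n_k) stiffness trajectory
    t  : (T,) time stamps, assumed to be in seconds
    W0 : (n_emg, n_k) baseline mixing matrix
    dW : (n_emg, n_k) drift direction for the fatigue term
    """
    rng = np.random.default_rng() if rng is None else rng
    # (2) slow fatigue drift: W_eff(t) = W0 + 0.25 * dW * sin(0.2 * pi * t)
    drift = 0.25 * np.sin(0.2 * np.pi * t)                  # (T,)
    W_eff = W0[None] + drift[:, None, None] * dW[None]      # (T, n_emg, n_k)
    # (1) tanh saturation on the (now time-varying) linear mixing
    clean = np.tanh(2.0 * np.einsum("tek,tk->te", W_eff, K))
    # (3) state-dependent noise floor plus an amplitude-proportional part
    k_norm = np.linalg.norm(K, axis=1, keepdims=True)       # (T, 1)
    sigma = sigma0 + 0.08 * k_norm + amp_noise * np.abs(clean)
    return clean + sigma * rng.standard_normal(clean.shape)

# Toy usage with made-up sizes: 8 sEMG channels, 3 stiffness components.
rng = np.random.default_rng(0)
t = np.arange(200) * 0.01
K = (0.5 + 0.3 * np.sin(2 * np.pi * t))[:, None] * np.ones((1, 3))
W0, dW = rng.random((8, 3)), rng.random((8, 3))
emg = observe_semg_hard(K, t, W0, dW, rng=rng)
print(emg.shape)
```

None of the three pieces is a degree-2 polynomial in the state: tanh saturates, the mixing matrix depends on time, and the noise scale depends on ‖K‖.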
The good news. SINDy as a predictor holds up. Position error is essentially unchanged across the two generators (easy / hard: 3.9 / 3.9 mm at 50 ms, 7.0 / 7.0 mm at 100 ms, 12.4 / 12.4 mm at 200 ms). Stiffness error degrades only marginally (8.5→8.9, 16.2→17.1, 29.2→31.3 across the same three horizons). The other three methods take a clear hit on stiffness: Ridge +49%, ESN +42%, MLP +82% at 100 ms. The mechanism, I think, is that SINDy’s discovered position dynamics rely heavily on state-based extrapolation (dx/dt ends up mostly a function of state, not sEMG), so corrupting the sEMG observation barely touches the prediction.
The qualification I owe the SINDy follow-up below. The discovered equations grow. On the easy generator SINDy kept [1, 5, 3, 12, 12, 11] nonzero terms across the six state components (44 total). On the hard generator it keeps [5, 11, 7, 22, 22, 16] (83 total), nearly twice as dense. The “sparse polynomial a human can read” framing I leaned on in the post below becomes much weaker once the sEMG isn’t linear: a 22-term degree-2 polynomial is still much smaller than the full feature library, but it is no longer the kind of equation you would print on a slide and reason about by eye. So that artefact-as-interpretation claim was specific to the linear sEMG–K map; under realistic sEMG nonlinearity, SINDy still works as a predictor, but it works as a denser curve-fit, not as a transparent equation.
What this updates in my own thinking. Two things. First, on prediction robustness I owe SINDy more credit than I would have given it yesterday: the accuracy genuinely doesn’t care about the kind of observation-model violations I expected to break it. Second, the broader point I had been pulling toward, that the value of a “discoverable” dynamics method is the readable artefact rather than the headline number, comes with a condition. The artefact is only readable when the data-generating process itself is close to sparse-polynomial. Real human sEMG presumably isn’t, and readability is exactly the test the hard-generator result above does not pass. Code: sindy_robustness.py.
Discovering equations vs fitting them: a SINDy follow-up
I went back to the same simulated peg-in-hole data and put
SINDy in the comparison, this time the way it
is actually meant to be used: learn the differential equations
d(state)/dt = f(state, sEMG) directly from data,
keep only the sparse terms via Lasso, and integrate the current
state forward to the prediction horizon. Not static regression.
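The pipeline in that paragraph (build a polynomial library, fit sparse coefficients, integrate forward) can be sketched in NumPy. Two things are swapped in and should not be read as the repository's code. First, the sparse fit here is sequentially thresholded least squares, the classic SINDy solver, in place of Lasso, so that the sketch needs nothing beyond NumPy. Second, the two-state toy system, its coefficients, the threshold, and holding the input constant over the horizon are all invented for illustration.

```python
import itertools
import numpy as np

def poly_library(Z):
    """Degree-2 polynomial features of Z (T, n): [1, z_i, z_i * z_j for i <= j]."""
    n = Z.shape[1]
    cols = [np.ones(len(Z))] + [Z[:, i] for i in range(n)]
    cols += [Z[:, i] * Z[:, j]
             for i, j in itertools.combinations_with_replacement(range(n), 2)]
    return np.column_stack(cols)

def stlsq(Theta, dXdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares (stand-in for the Lasso step)."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):          # refit each equation on kept terms
            keep = ~small[:, k]
            if keep.any():
                Xi[keep, k] = np.linalg.lstsq(Theta[:, keep], dXdt[:, k],
                                              rcond=None)[0]
    return Xi

def predict_ahead(x0, u, Xi, dt, n_steps):
    """Euler-integrate the learned equations, holding the input u constant."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = np.concatenate([x, u])[None, :]
        x = x + dt * (poly_library(z) @ Xi)[0]
    return x

# Toy data: states (p, v), one input u standing in for sEMG.
# Invented dynamics: dp/dt = v, dv/dt = -2 p - 0.5 v + 1.5 u.
rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, size=(500, 3))            # columns: p, v, u
p, v, u = Z.T
dXdt = np.column_stack([v, -2.0 * p - 0.5 * v + 1.5 * u])
dXdt += 1e-3 * rng.standard_normal(dXdt.shape)   # stand-in for derivative noise

Xi = stlsq(poly_library(Z), dXdt)
terms_per_equation = (Xi != 0).sum(axis=0)
x_100ms = predict_ahead([0.5, 0.0], np.array([0.2]), Xi, dt=0.001, n_steps=100)
print(terms_per_equation, x_100ms)
```

The library columns are ordered [1, p, v, u, p², pv, pu, v², vu, u²], so each column of `Xi` can be read off as one differential equation; counting its nonzero entries gives the per-equation term counts discussed in this post.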
The result inverted my expectation. On position error, SINDy gives 3.9 mm at 50 ms and 7.0 mm at 100 ms — roughly twice as good as the next method (Ridge) at both windows. ESN takes over at 200 ms with 10.6 mm. On stiffness, SINDy is best at every horizon. LSTM, with ~32 k parameters against ~3 k training samples, finishes last across the board (22–24 mm) — the textbook overfitting regime, and simulation does not produce the kind of long-tail nonlinearity that would force a network to earn its capacity.
What I keep coming back to is not the numbers. It is the
artefact: SINDy keeps 1 nonzero term in
dx/dt, 5 in dy/dt, 3 in
dz/dt, and 11–12 in each stiffness component.
These are sparse polynomials a human can read.
I can print the equation, change one coefficient, see the
effect. That is a different research object than the hidden
state of an LSTM.
Where this leaves me: when data is consistent with a compact set of equations, finding those equations is more honest than fitting an input-output map. Whether real human-in-the-loop sEMG behaves that way is the next thing I want to find out. Code and full numbers: semg-impedance-prediction on GitHub.
Predicting What the Operator Means: A Design Sketch for Physics-Constrained Tele-Impedance Delay Compensation
This is a research proposal I have been sketching on my own while applying for thesis topics in this area. It is not an ongoing project — there is no trained model yet. I worked it through end-to-end as a way of stress-testing my own understanding before submitting applications that ask exactly this kind of question.
In teleoperation, communication delays of 50–200 ms are unavoidable, and they make the remote robot react late to the operator's intent. The question I keep coming back to: can deep learning predict the operator's future trajectory and joint stiffness from their surface EMG, far enough in advance to mask that delay — without breaking safety guarantees?
The answer, as I have read into the literature, looks like a layered system rather than a single black box. Surface EMG already leads force output by 30–80 ms (well documented in the sEMG-force literature), and energy-observer safety nets from the teleoperation literature catch the worst case. What seems to be missing is the middle layer: a learned model that explicitly predicts intent 100–500 ms into the future, slotted in between the natural sEMG lead and the safety controller.
Architecture I sketched. A pre-trained encoder (NinaPro, 40 subjects) consumes 8-channel sEMG plus position and velocity (14-dim input). Channel attention reweights the muscle channels; a 2-layer LSTM (hidden = 48) tracks dynamics; temporal attention summarises the recent window; output heads produce the next trajectory and, through a softplus, an elementwise-positive stiffness vector. A physics-informed loss penalises the rate-of-change of stiffness, so the model can't cheat by predicting wild swings.
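The two physics pieces of that sketch are small enough to write out. This is a NumPy illustration, not a trained component: the function names, the hinge form of the penalty, and the `max_rate` bound are my assumptions about one reasonable way to express "positive stiffness, bounded rate-of-change".

```python
import numpy as np

def softplus(x):
    """log(1 + exp(x)), computed stably; maps raw head outputs to positive values."""
    return np.logaddexp(0.0, x)

def stiffness_rate_penalty(K_pred, dt, max_rate):
    """Mean squared excess of |dK/dt| over max_rate for a sequence K_pred (T, n)."""
    rate = np.diff(K_pred, axis=0) / dt
    excess = np.maximum(np.abs(rate) - max_rate, 0.0)
    return float(np.mean(excess ** 2))

# Raw (unconstrained) head outputs become positive stiffness values.
raw = np.array([[-5.0, 0.0], [5.0, 1.0]])
K_positive = softplus(raw)

# A slow ramp stays inside the rate bound; an alternating sequence does not.
dt = 0.01
smooth = np.linspace(100.0, 110.0, 11)[:, None]
jumpy = np.where(np.arange(11) % 2 == 0, 100.0, 200.0)[:, None]
penalty_smooth = stiffness_rate_penalty(smooth, dt, max_rate=500.0)
penalty_jumpy = stiffness_rate_penalty(jumpy, dt, max_rate=500.0)
print(K_positive.min(), penalty_smooth, penalty_jumpy)
```

In training this penalty would be added to the prediction loss with some weight, which is one more hyperparameter for the ablation plan.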
Ablation plan. Before fixing the architecture, the right move is a systematic comparison of five sequence models — Linear, 1D-CNN, GRU, LSTM, TCN — and then ablations over hidden size, depth, attention placement, and input modality (raw vs. filtered sEMG, with vs. without position and velocity). Pre-training would be followed by leave-one-subject-out fine-tuning so that any cross-user numbers stay honest.
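Leave-one-subject-out is simple to state but easy to get subtly wrong (for example by splitting windows instead of subjects), so here is a plain-Python sketch of the split. The function name and the subject labels are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), one fold per distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# One entry per recording window, labelled by the subject it came from.
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(subject_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Every window from the held-out subject lands in the test fold, so no fine-tuning data from that person can leak into the number reported for them.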
Trust the model, but verify. Uncertainty would be estimated with MC-Dropout. When confidence drops, the system falls back to a classical energy-based safety controller — the learned prediction is only used when it has earned it.
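A toy version of that gating logic, in NumPy. The two-layer network with random weights, the dropout rate, the number of samples, and the `max_std` threshold are all placeholders; the real model would be the LSTM described above, and the fallback would be the energy-based controller's command.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p_drop=0.2, n_samples=50, rng=None):
    """Run the same input through a tiny MLP n_samples times with dropout on."""
    rng = np.random.default_rng() if rng is None else rng
    outputs = []
    for _ in range(n_samples):
        h = np.maximum(W1 @ x, 0.0)                # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop       # fresh dropout mask each pass
        h = h * mask / (1.0 - p_drop)              # inverted-dropout scaling
        outputs.append(W2 @ h)
    outputs = np.stack(outputs)
    return outputs.mean(axis=0), outputs.std(axis=0)

def choose_command(learned_mean, learned_std, fallback_command, max_std):
    """Use the learned prediction only if every output is confident enough."""
    if np.all(learned_std <= max_std):
        return learned_mean, "learned"
    return fallback_command, "fallback"

# Toy usage: 14-dim input as in the architecture sketch, random weights.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((16, 14)), rng.standard_normal((3, 16))
x = rng.standard_normal(14)
mean, std = mc_dropout_predict(x, W1, W2, rng=rng)
command, source = choose_command(mean, std, fallback_command=np.zeros(3), max_std=0.05)
print(std, source)
```

The spread across dropout passes is used as the confidence signal; whether that spread is well calibrated for sEMG is itself something the ablation would have to check.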
What I like about this problem is the cleanliness of the separation: physics provides a hard prior (positive stiffness, bounded rate-of-change), a classical controller provides a safety floor, and deep learning fills a well-defined gap (a multi-step-ahead horizon that adaptive filters can't reach). The point isn't "deep learning everywhere" — it is deciding precisely where in the loop a learned component earns its place. That kind of decision-making, more than any single architecture, is what I want to keep working on.
Toward Multimodal Predictive Systems for Action-Time Prediction
The SINDy follow-up on the sEMG impedance design-space study (see
Discovering equations vs fitting them above) left me with a
sharper version of a question I had only been gesturing at before. The
follow-up replaced LSTM-with-architectural-priors as the protagonist
with SINDy used as a dynamics learner — learn
d(state)/dt = f(state, sEMG) from data, keep only sparse
terms via Lasso, integrate forward to the target horizon. On the same
synthetic peg-in-hole data, SINDy gave 3.9 mm at 50 ms and
7.0 mm at 100 ms — roughly twice as good as the next
method — while LSTM, with ~32 k parameters against ~3 k
samples, finished last across the board.
The numbers matter, but what stuck with me is the artefact.
SINDy keeps 1 nonzero term in dx/dt, 5 in
dy/dt, 3 in dz/dt. The full prediction model
fits on one page; each equation is a sparse polynomial a human can read,
falsify, and retrain in seconds. When the prediction is wrong at
200 ms, you can look at the equation and tell where the assumption
broke. That is a qualitatively different research object from the hidden
state of a recurrent network.
This is what reframes the original “safety fallback” problem for me. The earlier instinct was that even a low-error learned model has to defer to a classical, energy-based safety controller as a backup, because no one knows how to be accountable for what a black box would do in a situation no one has thought through. Causal ML is the obvious candidate for closing that accountability gap, and I do not want to dismiss it. In my reading so far I have not found a clean way to fit a causal-graph formulation into the kind of inner control loop this problem lives in, though that read is provisional. But the SINDy result points to a different path I had not considered before: don’t bolt an interpretability layer onto a black-box prediction, don’t try to constrain it from the outside; make the prediction object itself something you can inspect.
The direction that pulls me — and I want to be upfront, it is an area I am only beginning to read into — is multimodal predictive systems in which the model’s belief about the future is itself a physically grounded, observable, checkable artefact. Sparse-polynomial discovery from sensor data is one minimal example: a 200 ms prediction is a few lines of algebra you can step through. Learned physical simulators are another: a predicted half-second unfolds in 3D and you watch it. They share the property that “is this prediction safe to act on?” can be answered by inspecting the prediction itself, not by adding an external filter beside it.
That reframing changes the problem from “make the AI prediction more interpretable” to “make the AI’s future the thing we inspect”. I do not pretend to know which architectural family (diffusion priors, video transformers, learned simulators, neural physics) does this best, or whether the framing survives contact with real human sEMG, which has cross-talk, fatigue drift, and motion artifacts the simulator does not produce. Finding out is the next phase, and that open question is exactly why this is the direction I want to spend it on.