Tianhao Wu

Not quite a blog. Just a place to leave the occasional write-up of where my research thinking is, so I can point to it next time someone asks.

Foundations refresher, day 1 — re-walking deep-learning basics before thesis kickoff

2026-05-26 — self-paced study, in parallel with thesis topic selection in learned point-cloud / 3D Gaussian Splatting compression

Before reading deeper into the learned-compression literature for thesis work, I am running a structured re-walk through the foundations of deep learning — slowly enough to actually look at each piece. The PyTorch use in the sEMG project below sat on top of architectural priors I had taken on trust; the goal of this refresher is to put those priors back on first-principles ground before they become invisible scaffolding under thesis-level work.

Day 1 covered the basics in one pass: what deep learning does mechanically (rule-finding from examples rather than rule-writing), how a neural network is structured (neuron → layer → depth as a feature ladder), how training works (loss as a scalar, gradient descent as blindfolded descent, learning rate as step size), and the overfitting / underfitting distinction with the four standard remedies (more data, L2 regularisation, dropout, early stopping). The textbook material I won’t rehash. What is worth recording from day 1 is what I caught myself getting wrong:

Negative weights. I had been carrying the implicit picture that a “more important” feature gets a larger weight. Working through a toy “should I go to the beach” example forced the point that suppressing a decision is just as legitimate as supporting it, and the way a network represents “this feature pushes against the answer” is a weight with negative sign and large magnitude. Trivial in hindsight; not how I had been visualising it.
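A minimal sketch of that toy example as a single sigmoid neuron. The feature names, weights, and bias are hand-picked for illustration and do not come from any trained model; the point is only that the strongest weight in the neuron is the negative one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, hand-picked for illustration only.
WEIGHTS = {"warm": 2.0, "free_afternoon": 1.5, "raining": -4.0}
BIAS = -1.0

def go_to_beach(features):
    """Return the neuron's output in (0, 1) for a dict of 0/1 features."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return sigmoid(z)

# Same supporting evidence both times; only the suppressing feature changes.
nice_day = go_to_beach({"warm": 1, "free_afternoon": 1, "raining": 0})
wet_day = go_to_beach({"warm": 1, "free_afternoon": 1, "raining": 1})
```

Here "raining" has the largest magnitude of the three weights and flips the decision on its own, which is what "important" means for a feature that argues against the answer.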

Depth as a strict ladder, not a soft metaphor. I knew “deep” meant many layers, and that early layers learn simple features. What I had not internalised is that each layer’s input vocabulary is literally the previous layer’s output. Edges → shapes → eye → face isn’t a slogan but the actual data flow, and “eye” is just a stable activation in some middle layer that the next layer uses as a primitive. That makes depth a different kind of design choice than I had treated it as — it controls the maximum composition height of the features the network can express, not just the parameter count.

Underfitting ≠ a worse case of overfitting. Asked to classify a hypothetical model with 70% train / 68% test accuracy, I called it overfitting because both numbers looked bad. The correct label is underfitting: both numbers being low and close to each other points to insufficient capacity, not memorisation. The two diseases need opposite treatments — underfitting wants more capacity or longer training; overfitting wants regularisation, dropout, or more data. Catching this confusion now is much cheaper than catching it later, embedded in a real experiment where the data and the architecture are both moving.
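The distinction can be written down as a rough rule of thumb. This is a sketch only: the two thresholds (`target_acc` and `max_gap`) are illustrative assumptions, not standard values, and a real diagnosis would look at learning curves rather than two numbers.

```python
def diagnose(train_acc, test_acc, target_acc=0.90, max_gap=0.05):
    """Rough fit diagnosis from two accuracies in [0, 1].

    target_acc and max_gap are illustrative thresholds, not standard values.
    """
    gap = train_acc - test_acc
    if gap > max_gap:
        return "overfitting"    # training far ahead of test: memorisation
    if train_acc < target_acc:
        return "underfitting"   # both low and close together: too little capacity
    return "reasonable fit"

case_from_the_quiz = diagnose(0.70, 0.68)
```

The 70% / 68% case lands in the second branch: the gap is small, so the problem is not memorisation, and the low training accuracy points at capacity or training time.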

Day 2 will move into hand-writing a small model in PyTorch to make the learning-rate and overfitting points concrete; from there into CNNs, then into the learned-compression specifics (autoencoders, quantisation, entropy coding, hyperprior) that the thesis area runs on. Substantive updates to this site will probably come when those last pieces start to load.

Robustness check: does SINDy still win when the generator isn’t polynomial?

2026-05-19 — ablation on the SINDy follow-up below

After writing the SINDy follow-up below I went back to a concern that had been sitting at the edge of my own thinking. The synthetic generator I used mixes a constant linear matrix from stiffness to sEMG and adds additive heteroscedastic-in-sEMG noise. That data-generating process sits structurally inside SINDy’s degree-2 polynomial hypothesis class. A reader who has worked with sparse-regression methods would reasonably ask: did SINDy win because it discovered the generator, or because the generator happened to be in its model class? Until I checked, I couldn’t tell.

So I ran an ablation. Trajectories unchanged; the sEMG observation model gains three pieces of structure SINDy cannot fit cleanly: (1) tanh saturation on the linear mixing (emg = tanh(2·W·K)), modelling motor-unit recruitment plateaus; (2) slow fatigue drift on W (W_eff(t) = W₀ + 0.25·ΔW·sin(0.2π·t)), making the observation map time-varying; (3) a state-dependent noise floor (σ ∝ σ₀ + 0.08·‖K‖) on top of the existing amplitude-proportional component. Same 6/2 split, same horizons, same SINDy hyperparameters. Ridge, SINDy, ESN, and an MLP baseline (MLPRegressor with hidden_layer_sizes=(48, 24), substituting for the PyTorch LSTM in the easy run so this ablation runs in a torchless environment; both play the same “high-capacity neural baseline” role) all run on both generators.
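A minimal sketch of the three modifications, in plain Python so it runs without NumPy. This is not the code in sindy_robustness.py. The channel count, the stiffness dimension, the entries of `W0` and `dW`, and the values of `sigma0` and `amp_coeff` are all placeholders, and the proportionality in (3) is taken with constant 1; only the tanh gain of 2, the 0.25 and 0.2π drift factors, and the 0.08 noise-floor factor come from the description above.

```python
import math
import random

N_CHANNELS, N_STIFF = 8, 3  # assumed sizes, not stated in the post

rng = random.Random(0)
# Hypothetical mixing matrix W0 and drift direction dW (N_CHANNELS x N_STIFF).
W0 = [[rng.uniform(0.0, 1.0) for _ in range(N_STIFF)] for _ in range(N_CHANNELS)]
dW = [[rng.uniform(-1.0, 1.0) for _ in range(N_STIFF)] for _ in range(N_CHANNELS)]

def w_eff(t):
    """(2) Slow fatigue drift: W_eff(t) = W0 + 0.25 * dW * sin(0.2*pi*t)."""
    s = 0.25 * math.sin(0.2 * math.pi * t)
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, dW)]

def clean_emg(K, t):
    """(1) tanh saturation on the linear mixing: tanh(2 * W_eff(t) @ K)."""
    return [math.tanh(2.0 * sum(w * k for w, k in zip(row, K)))
            for row in w_eff(t)]

def noise_sigma(K, amplitude, sigma0=0.02, amp_coeff=0.1):
    """(3) Noise floor sigma0 + 0.08*||K|| plus an amplitude-proportional term.

    sigma0 and amp_coeff are placeholders; only the 0.08 factor is from the post.
    """
    norm_k = math.sqrt(sum(k * k for k in K))
    return sigma0 + 0.08 * norm_k + amp_coeff * abs(amplitude)

def observe(K, t, rng=rng):
    """One noisy sEMG sample for stiffness vector K at time t (seconds)."""
    return [a + rng.gauss(0.0, noise_sigma(K, a)) for a in clean_emg(K, t)]

sample = observe([0.5, 0.5, 0.5], t=1.0)
```

None of the three pieces is a low-degree polynomial in the state: tanh saturates, the drift makes the map depend on time, and the noise scale depends on ‖K‖.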

Stiffness error at the 100 ms horizon. SINDy is the only method whose error does not change materially between the two generators (+6%); Ridge, ESN, and MLP all degrade by roughly 40–80% (+49%, +42%, and +82%). Full table (four methods × three horizons × two generators, for both position and stiffness) on GitHub.

The good news. SINDy as a predictor holds up. Position error is essentially unchanged across the two generators (3.9 / 3.9 mm at 50 ms, 7.0 / 7.0 mm at 100 ms, 12.4 / 12.4 mm at 200 ms). Stiffness degrades only marginally (8.5→8.9 at 50 ms, 16.2→17.1 at 100 ms, 29.2→31.3 at 200 ms). The other three methods take a clear hit on stiffness: Ridge +49%, ESN +42%, MLP +82% at 100 ms. The mechanism, I think, is that SINDy’s discovered position dynamics rely heavily on state-based extrapolation (dx/dt ends up mostly a function of state, not sEMG), so corrupting the sEMG observation barely touches the prediction.
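The "marginal" stiffness degradation can be checked with a few lines of arithmetic. The sketch below uses only the three SINDy stiffness pairs quoted above and assumes they are listed in horizon order (50, 100, 200 ms); units are whatever the original table uses.

```python
# Stiffness error of SINDy on the easy and hard generators, as quoted above.
sindy_stiffness = {
    "50 ms": (8.5, 8.9),
    "100 ms": (16.2, 17.1),
    "200 ms": (29.2, 31.3),
}

def pct_change(easy, hard):
    """Relative change from the easy to the hard generator, in percent."""
    return 100.0 * (hard - easy) / easy

changes = {h: pct_change(*pair) for h, pair in sindy_stiffness.items()}
for horizon, change in changes.items():
    print(f"{horizon}: {change:+.1f}%")
```

This prints +4.7%, +5.6%, and +7.2%. The middle value is the +6% quoted for the 100 ms horizon, and all three stay well below the +42% to +82% range of the other methods.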

The qualification I owe the SINDy follow-up below. The discovered equations grow. On the easy generator SINDy keeps [1, 5, 3, 12, 12, 11] nonzero terms across the six state components (44 total). On the hard generator it keeps [5, 11, 7, 22, 22, 16] (83 total), nearly twice as dense. The “sparse polynomial a human can read” framing I leaned on in the post below becomes much weaker once the sEMG observation model isn’t linear: a 22-term degree-2 polynomial is still much smaller than the full feature library, but it is no longer the kind of equation you would print on a slide and reason about by eye. So that artefact-as-interpretation claim was specific to the linear sEMG–K map; under realistic sEMG nonlinearity, SINDy still works as a predictor, but it works as a denser curve-fit, not as a transparent equation.
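To put the term counts in proportion, the sketch below sums them and compares the densest equation with the size of a full degree-2 library. The number of library inputs is an assumption (6 state components plus 8 sEMG channels); the post does not state the channel count used in this study, so the library size is indicative only.

```python
from math import comb

easy_terms = [1, 5, 3, 12, 12, 11]
hard_terms = [5, 11, 7, 22, 22, 16]

def library_size(n_inputs, degree=2):
    """Number of monomials of total degree <= degree in n_inputs variables."""
    return comb(n_inputs + degree, degree)

N_INPUTS = 6 + 8  # assumption: 6 state components plus 8 sEMG channels
full = library_size(N_INPUTS)

density_ratio = sum(hard_terms) / sum(easy_terms)  # about 1.9
largest_fraction = max(hard_terms) / full          # share of the library used
```

Under that assumption the full library has 120 candidate terms per equation, so a 22-term equation uses under a fifth of it: still sparse in the regression sense, but well past what can be read at a glance.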

What this updates in my own thinking. Two things. First, on prediction robustness I owe SINDy more credit than I would have given it yesterday — the accuracy genuinely doesn’t care about the kind of observation-model violations I expected to break it. Second, the broader point I had been pulling toward — that the value of a “discoverable” dynamics method is the readable artefact, not the headline number — is not free. The artefact is only readable when the data-generating process itself is close to sparse-polynomial. Real human sEMG presumably isn’t, and that is the test the result above does not pass. Code: sindy_robustness.py.

Discovering equations vs fitting them: a SINDy follow-up

2026-05-18 — sEMG impedance prediction, update (see the robustness check above for an important qualification on the “sparse, readable equation” claim)

I went back to the same simulated peg-in-hole data and put SINDy in the comparison, this time the way it is actually meant to be used: learn the differential equations d(state)/dt = f(state, sEMG) directly from data, keep only the sparse terms via Lasso, and integrate the current state forward to the prediction horizon. Not static regression.
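The three steps (build a polynomial library, fit the derivative with Lasso, integrate forward) can be sketched on a toy problem. This is not the code from the repository and does not use PySINDy: it is a hand-rolled coordinate-descent Lasso in plain Python on a one-dimensional state with known dynamics dx/dt = -2x and no sEMG input. In the real setting the sEMG channels would enter as extra library columns. The value of `alpha`, the step sizes, and the toy trajectories are all illustrative choices.

```python
import math

DT = 0.01  # sampling interval of the toy data, in seconds

def library(x):
    """Degree-2 polynomial library for a 1-D state: [1, x, x^2]."""
    return [1.0, x, x * x]

def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; the source of exact zeros in Lasso."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_cd(X, y, alpha, sweeps=50):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            z = sum(row[j] ** 2 for row in X) / n
            rho = 0.0
            for row, target in zip(X, y):
                partial = sum(row[k] * w[k] for k in range(p) if k != j)
                rho += row[j] * (target - partial)
            w[j] = soft_threshold(rho / n, alpha) / z
    return w

# Toy data: two trajectories of dx/dt = -2x, started at +2 and -2.
X, dxdt = [], []
for x0 in (2.0, -2.0):
    traj = [x0 * math.exp(-2.0 * DT * k) for k in range(200)]
    for k in range(1, len(traj) - 1):
        X.append(library(traj[k]))
        dxdt.append((traj[k + 1] - traj[k - 1]) / (2 * DT))  # central difference

coeffs = lasso_cd(X, dxdt, alpha=0.01)

def predict(x_now, horizon, step=0.001):
    """Integrate the discovered equation forward with explicit Euler."""
    x = x_now
    for _ in range(round(horizon / step)):
        x += step * sum(c * f for c, f in zip(coeffs, library(x)))
    return x

x_100ms = predict(1.0, horizon=0.1)
```

On this toy data the fit keeps only the linear term, with a coefficient slightly shrunk from -2 by the L1 penalty, and the 100 ms prediction comes from stepping that one-term equation forward rather than from a direct input-output map.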

The result inverted my expectation. On position error, SINDy gives 3.9 mm at 50 ms and 7.0 mm at 100 ms — roughly twice as good as the next method (Ridge) at both windows. ESN takes over at 200 ms with 10.6 mm. On stiffness, SINDy is best at every horizon. LSTM, with ~32 k parameters against ~3 k training samples, finishes last across the board (22–24 mm) — the textbook overfitting regime, and simulation does not produce the kind of long-tail nonlinearity that would force a network to earn its capacity.
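For a sense of where a figure like ~32 k can come from, the sketch below counts LSTM parameters using the layout PyTorch's `nn.LSTM` uses (four gates per layer, each with input weights, recurrent weights, and two bias vectors). The configuration is an assumption: it borrows the 14-dim input, hidden size 48, and 2 layers from the design-sketch post, and the post does not confirm that the LSTM in this comparison used those settings.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Trainable parameters of a unidirectional LSTM stack, PyTorch layout.

    Each layer holds four gates, each with input weights, recurrent weights
    and two bias vectors (bias_ih and bias_hh).
    """
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += 4 * hidden_size * (layer_input + hidden_size + 2)
        layer_input = hidden_size  # deeper layers read the layer below
    return total

# Assumed configuration, borrowed from the design-sketch post.
n_params = lstm_param_count(input_size=14, hidden_size=48, num_layers=2)
params_per_sample = n_params / 3000  # ~3 k training samples, as quoted above
```

Under these assumptions the recurrent stack alone has 31,104 parameters, about ten per training sample before any output heads are added, which is the regime where memorisation is cheaper for the network than generalisation.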

What I keep coming back to is not the numbers. It is the artefact: SINDy keeps 1 nonzero term in dx/dt, 5 in dy/dt, 3 in dz/dt, and 11–12 in each stiffness component. These are sparse polynomials a human can read. I can print the equation, change one coefficient, see the effect. That is a different research object than the hidden state of an LSTM.

Where this leaves me: when data is consistent with a compact set of equations, finding those equations is more honest than fitting an input-output map. Whether real human-in-the-loop sEMG behaves that way is the next thing I want to find out. Code and full numbers: semg-impedance-prediction on GitHub.

Predicting What the Operator Means: A Design Sketch for Physics-Constrained Tele-Impedance Delay Compensation

2026-05-15 — self-directed study, written while preparing thesis applications

This is a research proposal I have been sketching on my own while applying for thesis topics in this area. It is not an ongoing project — there is no trained model yet. I worked it through end-to-end as a way of stress-testing my own understanding before submitting applications that ask exactly this kind of question.

In teleoperation, communication delays of 50–200 ms are unavoidable, and they make the remote robot react late to the operator's intent. The question I keep coming back to: can deep learning predict the operator's future trajectory and joint stiffness from their surface EMG, far enough in advance to mask that delay — without breaking safety guarantees?

The answer, as far as I have read into the literature, looks like a layered system rather than a single black box. Surface EMG already leads force output by 30–80 ms (well documented in the sEMG-force literature), and energy-observer safety nets from the teleoperation literature catch the worst case. What seems to be missing is the middle layer: a learned model that explicitly predicts intent 100–500 ms into the future, slotted in between the natural sEMG lead and the safety controller.

Proposed end-to-end architecture (no model trained yet): a learned component would occupy a precise gap between the natural sEMG lead and the classical safety controller.

Architecture I sketched. A pre-trained encoder (NinaPro, 40 subjects) consumes 8-channel sEMG plus position and velocity (14-dim input). Channel attention reweights the muscle channels; a 2-layer LSTM (hidden = 48) tracks dynamics; temporal attention summarises the recent window; output heads produce the next trajectory and, through a softplus, a strictly positive stiffness vector. A physics-informed loss penalises the rate-of-change of stiffness, so the model can't cheat by predicting wild swings.
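The two physics pieces (positivity through softplus, and a penalty on the rate of change of stiffness) can be sketched without a deep-learning framework. The exact form of the penalty is an assumption: a mean squared finite-difference rate, added to a plain MSE with a placeholder weight `lam`. The proposal does not fix either choice.

```python
import math

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def stiffness_from_raw(raw_outputs):
    """Map unconstrained head outputs to positive stiffness values."""
    return [softplus(r) for r in raw_outputs]

def rate_penalty(stiffness_seq, dt):
    """Mean squared rate of change of stiffness over a predicted sequence.

    stiffness_seq is a list of stiffness vectors at consecutive time steps.
    """
    total, count = 0.0, 0
    for prev, nxt in zip(stiffness_seq, stiffness_seq[1:]):
        for k0, k1 in zip(prev, nxt):
            total += ((k1 - k0) / dt) ** 2
            count += 1
    return total / count if count else 0.0

def total_loss(pred_seq, target_seq, dt, lam=1e-3):
    """MSE on stiffness plus lam times the rate penalty (lam is a placeholder)."""
    n = sum(len(p) for p in pred_seq)
    mse = sum((p - t) ** 2
              for ps, ts in zip(pred_seq, target_seq)
              for p, t in zip(ps, ts)) / n
    return mse + lam * rate_penalty(pred_seq, dt)

smooth = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.0]]
swingy = [[1.0, 1.0], [2.0, 0.2], [0.5, 1.8]]
```

With these two sequences, `rate_penalty(swingy, dt)` is several orders of magnitude larger than `rate_penalty(smooth, dt)`, so a prediction that matches the targets on average but oscillates between steps is still penalised.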

Ablation plan. Before fixing the architecture, the right move is a systematic comparison of five sequence models — Linear, 1D-CNN, GRU, LSTM, TCN — and then ablations over hidden size, depth, attention placement, and input modality (raw vs. filtered sEMG, with vs. without position and velocity). Pre-training would be followed by leave-one-subject-out fine-tuning so that any cross-user numbers stay honest.

Trust the model, but verify. Uncertainty would be estimated with MC-Dropout. When confidence drops, the system falls back to a classical energy-based safety controller — the learned prediction is only used when it has earned it.
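A minimal sketch of that gating logic. The "model" here is a toy linear map with dropout applied to its weights, standing in for a real network with dropout layers left active at inference; `fallback_command` stands in for whatever the classical safety controller would output, and `max_std` is a placeholder threshold that would have to be calibrated on held-out data.

```python
import random
import statistics

def mc_dropout_samples(weights, inputs, p_drop, n_samples, rng):
    """Repeat a stochastic forward pass of a toy linear model.

    Each weight is dropped with probability p_drop and the survivors are
    rescaled by 1/(1 - p_drop), i.e. inverted dropout kept on at test time.
    """
    samples = []
    for _ in range(n_samples):
        out = 0.0
        for w, x in zip(weights, inputs):
            if rng.random() >= p_drop:
                out += w * x / (1.0 - p_drop)
        samples.append(out)
    return samples

def choose_command(samples, fallback_command, max_std):
    """Use the mean prediction only if the spread across passes is small."""
    spread = statistics.pstdev(samples)
    if spread <= max_std:
        return "learned", statistics.fmean(samples)
    return "fallback", fallback_command

rng = random.Random(0)
samples = mc_dropout_samples([0.5, -0.2, 0.1], [1.0, 2.0, 3.0],
                             p_drop=0.5, n_samples=20, rng=rng)
source, command = choose_command(samples, fallback_command=0.0, max_std=0.05)
```

The design choice worth noting is that the gate never blends the two outputs: either the spread is small enough to trust the learned prediction, or the classical controller takes over entirely.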

What I like about this problem is the cleanliness of the separation: physics provides a hard prior (positive stiffness, bounded rate-of-change), a classical controller provides a safety floor, and deep learning fills a well-defined gap (a multi-step-ahead horizon that adaptive filters can't reach). The point isn't "deep learning everywhere" — it is deciding precisely where in the loop a learned component earns its place. That kind of decision-making, more than any single architecture, is what I want to keep working on.

Toward Multimodal Predictive Systems for Action-Time Prediction

2026-05-17 (updated 2026-05-18) — self-directed reflection, building on the SINDy follow-up

The SINDy follow-up on the sEMG impedance design-space study (see Discovering equations vs fitting them above) left me with a sharper version of a question I had only been gesturing at before. The follow-up replaced LSTM-with-architectural-priors as the protagonist with SINDy used as a dynamics learner — learn d(state)/dt = f(state, sEMG) from data, keep only sparse terms via Lasso, integrate forward to the target horizon. On the same synthetic peg-in-hole data, SINDy gave 3.9 mm at 50 ms and 7.0 mm at 100 ms — roughly twice as good as the next method — while LSTM, with ~32 k parameters against ~3 k samples, finished last across the board.

Position error on a fixed 6/2 train/test split (users 0–5 train, users 6–7 test), shared across all four methods. Note this differs from the original 5-method study, which used leave-one-subject-out; the LSTM number here (24.3 mm at 200 ms) and the LSTM number in the earlier study (28.5 mm at 200 ms) are not directly comparable.
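The difference between the two evaluation protocols is easy to see when both are written out. This is a sketch over user indices only; it assumes the 8 users implied by the split above, and the helper names are made up for illustration.

```python
def fixed_split(users, n_train):
    """Fixed split: the first n_train users train, the rest test."""
    return users[:n_train], users[n_train:]

def leave_one_subject_out(users):
    """Yield one (train, test) fold per user, holding that user out."""
    for held_out in users:
        yield [u for u in users if u != held_out], [held_out]

users = list(range(8))
train, test = fixed_split(users, 6)          # users 0-5 train, users 6-7 test
folds = list(leave_one_subject_out(users))   # 8 folds, 7 train users each
```

The fixed split trains once on six users and reports error on the same two test users, while leave-one-subject-out trains eight times on seven users and averages over every user as a test subject. Different training-set sizes and different test users are why the two LSTM numbers cannot be compared directly.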

The numbers matter, but what stuck with me is the artefact. SINDy keeps 1 nonzero term in dx/dt, 5 in dy/dt, 3 in dz/dt. The full prediction model fits on one page; each equation is a sparse polynomial a human can read, falsify, and retrain in seconds. When the prediction is wrong at 200 ms, you can look at the equation and tell where the assumption broke. That is a qualitatively different research object from the hidden state of a recurrent network.

This is what reframes the original “safety fallback” problem for me. The earlier instinct was that even a low-error learned model has to defer to a classical, energy-based safety controller as a backup, because no one knows how to be accountable for what a black box would do in a situation no one has thought through. Causal ML is the obvious candidate for closing that accountability gap, and I do not want to dismiss it. I just have not yet found, in my reading so far, a clean way to fit a causal-graph formulation into the kind of inner control loop this problem lives in; that read is provisional. But the SINDy result points to a different path I had not considered before: don’t bolt an interpretability layer onto a black-box prediction, and don’t try to constrain it from the outside. Make the prediction object itself something you can inspect.

The direction that pulls me — and I want to be upfront, it is an area I am only beginning to read into — is multimodal predictive systems in which the model’s belief about the future is itself a physically grounded, observable, checkable artefact. Sparse-polynomial discovery from sensor data is one minimal example: a 200 ms prediction is a few lines of algebra you can step through. Learned physical simulators are another: a predicted half-second unfolds in 3D and you watch it. They share the property that “is this prediction safe to act on?” can be answered by inspecting the prediction itself, not by adding an external filter beside it.

That reframing changes the problem from “make the AI prediction more interpretable” to “make the AI’s future the thing we inspect”. I do not pretend to know which architectural family (diffusion priors, video transformers, learned simulators, neural physics) does this best, or whether the framing survives contact with real human sEMG, which has cross-talk, fatigue drift, and motion artifacts the simulator does not produce. Finding out is the next phase, and that open question is exactly why this is the direction I want to spend it on.
