
Tianhao Wu

M.Sc. Mechatronics, FAU Erlangen–Nürnberg

Tianhao.Wu.Mechatronik@outlook.com  ·  GitHub

About

I am an M.Sc. Mechatronics student at FAU Erlangen–Nürnberg, currently preparing thesis applications. My interests sit at the intersection of deep learning, sequence modelling, and physics-constrained systems: how to make learned models earn their place inside a control loop without breaking the safety guarantees a classical controller already provides. I work primarily in PyTorch.

Lately, my reading and thinking have been pulled toward multimodal predictive systems as a direction I want to explore — see Research Statement below.

Recent activity
  • Started a self-paced deep-learning foundations refresher in parallel with thesis topic selection (target area: learned point-cloud / 3D Gaussian Splatting compression). Day 1 notes on what I caught myself getting wrong. notes →
  • Ran a robustness ablation against the SINDy follow-up under non-polynomial sEMG (tanh saturation + fatigue drift + heteroscedastic noise). Prediction accuracy holds, but discovered equations nearly double in size, weakening the “sparse polynomial a human can read” framing from the day before. notes →
  • Reframed SINDy as dynamics discovery + forward integration in the sEMG impedance project; sparse polynomial predictions beat LSTM at short horizons. repo →
  • Posted a self-directed reflection on multimodal predictive systems for action-time prediction (below).
  • Published an open-source design-space study on sEMG-driven impedance prediction (5 models × 3 horizons, ablations + MC-Dropout safety). repo →

Research Statement

Working through a self-directed design-space study on sEMG-driven impedance prediction taught me something I did not expect to find. The architectural priors I tried — attention, physics-informed losses, multi-modal inputs — genuinely improved prediction quality, and the resulting models forecast trajectory and stiffness 100–200 ms ahead with millimeter-level error. The methods worked. But the experience also exposed a deeper limitation: even a low-error learned model, deployed in a real teleoperation loop, still has to defer to a classical energy-based safety controller as a fallback — not because the prediction is wrong, but because no one knows how to be accountable for its decisions when the model itself is a black box.

That interpretability gap is what I now find most interesting. Causal machine learning is one natural candidate, but in the reading I have done so far I have not been able to map it onto a workable interface with a teleoperation control loop; this impression is provisional, and I expect it to sharpen as I read more. The direction I am currently drawn toward (and I want to be upfront that it is an area I am only beginning to read into) is multimodal predictive systems in which the model’s predicted future is itself physically grounded, observable, simulatable, and checkable. If a prediction unfolds in 3D against a learned physics, the question “is this prediction safe to act on?” becomes answerable from the prediction itself, rather than delegated to a classical filter bolted on beside it. I am not claiming I have answers there yet; it is the direction I want to spend the next phase exploring.

Notes

Not quite a blog. Just a place to leave the occasional write-up of where my research thinking is, so I can point to it next time someone asks.

Foundations refresher, day 1 — re-walking deep-learning basics before thesis kickoff

Before reading deeper into the learned-compression literature for thesis work, I am running a structured re-walk through the foundations of deep learning — slowly enough to actually look at each piece. The PyTorch use in the sEMG project below sat on top of architectural priors I had taken on trust; the goal of this refresher is to put those priors back on first-principles ground before they become invisible scaffolding under thesis-level work.

Day 1 covered the basics in one pass: what deep learning does mechanically (rule-finding from examples rather than rule-writing), how a neural network is structured (neuron → layer → depth as a feature ladder), how training works (loss as a scalar, gradient descent as blindfolded descent, learning rate as step size), and the overfitting / underfitting distinction with the four standard remedies (more data, L2 regularisation, dropout, early stopping). The textbook material I won’t rehash. What is worth recording from day 1 is what I caught myself getting wrong:

Negative weights. I had been carrying the implicit picture that a “more important” feature gets a larger weight. Working through a toy “should I go to the beach” example forced the point that suppressing a decision is just as legitimate as supporting it, and the way a network represents “this feature pushes against the answer” is a weight with negative sign and large magnitude. Trivial in hindsight; not how I had been visualising it.
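A minimal sketch of that point, using a single sigmoid neuron. The feature names, weights, and bias are made up for illustration and are not taken from the original exercise; the only thing the sketch is meant to show is that the largest-magnitude weight can be the negative one.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for a toy "go to the beach?" neuron.
# "raining" has the largest magnitude, and its sign is negative.
WEIGHTS = {"sunny": 2.0, "free_day": 1.5, "raining": -4.0}
BIAS = -1.0

def beach_probability(features: dict) -> float:
    """Weighted sum of the active features, squashed to (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return sigmoid(z)

nice_day = beach_probability({"sunny": 1, "free_day": 1, "raining": 0})
rainy_day = beach_probability({"sunny": 1, "free_day": 1, "raining": 1})
print(f"nice day: {nice_day:.2f}, rainy day: {rainy_day:.2f}")
```

The two inputs differ only in the rain feature, yet the decision flips: the strongest single influence on the output is a feature that argues against it.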

Depth as a strict ladder, not a soft metaphor. I knew “deep” meant many layers, and that early layers learn simple features. What I had not internalised is that each layer’s input vocabulary is literally the previous layer’s output. Edges → shapes → eye → face isn’t a slogan but the actual data flow, and “eye” is just a stable activation in some middle layer that the next layer uses as a primitive. That makes depth a different kind of design choice than I had treated it as — it controls the maximum composition height of the features the network can express, not just the parameter count.

Underfitting ≠ a worse case of overfitting. Asked to classify a hypothetical model with 70% train / 68% test accuracy, I called it overfitting because both numbers looked bad. The correct label is underfitting: both numbers being low and close to each other points to insufficient capacity, not memorisation. The two diseases need opposite treatments — underfitting wants more capacity or longer training; overfitting wants regularisation, dropout, or more data. Catching this confusion now is much cheaper than catching it later, embedded in a real experiment where the data and the architecture are both moving.
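The reading of the two numbers can be written down as a rule of thumb. This is only a sketch: the function name and both thresholds (`target_acc`, `max_gap`) are illustrative assumptions, and sensible values depend on the task.

```python
def diagnose(train_acc: float, test_acc: float,
             target_acc: float = 0.90, max_gap: float = 0.05) -> str:
    """Rule-of-thumb label from train/test accuracy (thresholds are illustrative)."""
    gap = train_acc - test_acc
    if gap > max_gap:
        return "overfitting"     # training accuracy far above test accuracy
    if train_acc < target_acc:
        return "underfitting"    # both numbers low and close together
    return "ok"

print(diagnose(0.70, 0.68))  # -> underfitting
print(diagnose(0.99, 0.75))  # -> overfitting
```

The check on the gap comes first on purpose: a large train/test gap signals memorisation regardless of the absolute level, while low accuracy with a small gap points to missing capacity.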

Day 2 will move into hand-writing a small model in PyTorch to make the learning-rate and overfitting points concrete; from there into CNNs, then into the learned-compression specifics (autoencoders, quantisation, entropy coding, hyperprior) that the thesis area runs on. Substantive updates to this site will probably come when those last pieces start to load.
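Ahead of any PyTorch version, the "learning rate as step size" point from day 1 can already be checked on a one-parameter toy loss. A minimal sketch in plain Python; the quadratic loss and the two learning rates are illustrative choices, not part of the refresher material.

```python
def gradient_descent(lr: float, steps: int = 50, w0: float = 0.0) -> float:
    """Minimise the toy loss L(w) = (w - 3)^2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)   # dL/dw
        w -= lr * grad           # the learning rate scales the step
    return w

small = gradient_descent(lr=0.1)   # distance to the minimum shrinks by 0.8 per step
large = gradient_descent(lr=1.1)   # distance grows by 1.2 per step and flips sign
print(abs(small - 3.0) < 1e-3, abs(large - 3.0) > 1e3)  # -> True True
```

For this loss each update multiplies the distance to the minimum by (1 - 2·lr), so any learning rate above 1.0 overshoots by more than it corrects and the iterate diverges.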


Robustness check: does SINDy still win when the generator isn’t polynomial?

After writing the SINDy follow-up below I went back to a concern that had been sitting at the edge of my own thinking. The synthetic generator I used maps stiffness to sEMG through a constant linear mixing matrix and adds noise whose scale grows with the sEMG amplitude (heteroscedastic in sEMG). That data-generating process sits structurally inside SINDy’s degree-2 polynomial hypothesis class. A reader who has worked with sparse-regression methods would reasonably ask: did SINDy win because it discovered the generator, or because the generator happened to be in its model class? Until I checked, I couldn’t tell.

So I ran an ablation. Trajectories unchanged; the sEMG observation model gains three pieces of structure SINDy cannot fit cleanly: (1) tanh saturation on the linear mixing (emg = tanh(2·W·K)), modelling motor-unit recruitment plateaus; (2) slow fatigue drift on W (W_eff(t) = W₀ + 0.25·ΔW·sin(0.2π·t)), making the observation map time-varying; (3) a state-dependent noise floor (σ ∝ σ₀ + 0.08·‖K‖) on top of the existing amplitude-proportional component. Same 6/2 split, same horizons, same SINDy hyperparameters. Ridge, SINDy, ESN, and an MLP baseline (MLPRegressor with hidden_layer_sizes=(48, 24), substituting for the PyTorch LSTM in the easy run so this ablation runs in a torchless environment; both play the same “high-capacity neural baseline” role) all run on both generators.
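The three modifications can be sketched as a single observation function. This is not the code in sindy_robustness.py: the function name, the matrix shapes (8 channels, 3 stiffness components), the value of `sigma0`, the Gaussian noise shape, and the normalised stiffness values are all assumptions, and the pre-existing amplitude-proportional noise component is left out. Plain Python is used to keep the sketch dependency-free.

```python
import math
import random

def hard_emg_sample(K, t, W0, dW, sigma0=0.02, rng=random):
    """One sEMG observation under the three 'hard' modifications.

    K: stiffness vector; W0, dW: mixing matrix and drift direction (lists of rows).
    sigma0 and the Gaussian noise are assumptions made for this sketch.
    """
    drift = 0.25 * math.sin(0.2 * math.pi * t)       # (2) slow fatigue drift on W
    k_norm = math.sqrt(sum(k * k for k in K))
    sigma = sigma0 + 0.08 * k_norm                   # (3) state-dependent noise floor
    clean, noisy = [], []
    for w0_row, dw_row in zip(W0, dW):
        mix = sum((w0 + drift * dw) * k for w0, dw, k in zip(w0_row, dw_row, K))
        c = math.tanh(2.0 * mix)                     # (1) tanh saturation
        clean.append(c)
        noisy.append(c + rng.gauss(0.0, sigma))
    return clean, noisy

rng = random.Random(0)
W0 = [[rng.uniform(0.0, 1.0) for _ in range(3)] for _ in range(8)]   # illustrative
dW = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(8)]  # illustrative
clean, noisy = hard_emg_sample([0.3, 0.5, 0.2], t=1.0, W0=W0, dW=dW, rng=rng)
print(len(clean), max(abs(c) for c in clean) <= 1.0)
```

Because the tanh sits outside the mixing, no finite polynomial in K reproduces the map exactly, and the sine term makes the same K produce different sEMG at different times. Both effects take the generator outside a fixed degree-2 library.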

[Figure: Stiffness error at 100 ms horizon, robustness ablation. Easy (linear W, additive noise) vs Hard (tanh + fatigue drift + heteroscedastic noise); lower is better. Mean stiffness error, easy → hard: Ridge 34.9 → 51.9 (+49%); SINDy 16.2 → 17.1 (+6%); ESN 26.6 → 37.9 (+42%); MLP 21.9 → 39.8 (+82%).]
Stiffness error at the 100 ms horizon. SINDy is the only method whose error does not change materially between the two generators (+6%); Ridge, ESN, and MLP all degrade by 40–80%. Full 4×3×2 table (position + stiffness, three horizons, four methods, both generators) on GitHub.
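The percentages are plain relative increases from the easy to the hard generator. A short sketch that recomputes them from the reported errors (the dictionary layout and function name are illustrative):

```python
# Mean stiffness error at the 100 ms horizon: (easy, hard), as reported above.
errors = {"Ridge": (34.9, 51.9), "SINDy": (16.2, 17.1),
          "ESN": (26.6, 37.9), "MLP": (21.9, 39.8)}

def relative_increase(easy: float, hard: float) -> float:
    """Percentage increase in error going from the easy to the hard generator."""
    return 100.0 * (hard - easy) / easy

for method, (easy, hard) in errors.items():
    print(f"{method}: +{relative_increase(easy, hard):.0f}%")
```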

The good news. SINDy as a predictor holds up. Position error is essentially unchanged across the two generators (easy / hard: 3.9 / 3.9 mm at 50 ms, 7.0 / 7.0 mm at 100 ms, 12.4 / 12.4 mm at 200 ms). Stiffness degrades only marginally (8.5 → 8.9 at 50 ms, 16.2 → 17.1 at 100 ms, 29.2 → 31.3 at 200 ms). The other three methods take a clean hit on stiffness: Ridge +49%, ESN +42%, MLP +82% at 100 ms. The mechanism, I think, is that SINDy’s discovered position dynamics rely heavily on state-based extrapolation (dx/dt ends up mostly a function of state, not sEMG), so corrupting the sEMG observation barely touches the prediction.

The qualification I owe the SINDy follow-up below. The discovered equations grow. On the easy generator SINDy kept [1, 5, 3, 12, 12, 11] nonzero terms across the six state components (44 total). On the hard generator it keeps [5, 11, 7, 22, 22, 16] (83 total), nearly twice as dense. The “sparse polynomial a human can read” framing I leaned on in the post below becomes much weaker once the sEMG isn’t linear: a 22-term degree-2 polynomial is still much smaller than the full feature library, but it is no longer the kind of equation you would print on a slide and reason about by eye. So that artefact-as-interpretation claim was specific to the linear sEMG–K map; under realistic sEMG nonlinearity, SINDy still works as a predictor, but it works as a denser curve-fit, not as a transparent equation.
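The term counts can be checked directly, and the per-component ratios show where the growth concentrates. A small sketch using only the counts quoted above; the variable names are illustrative, and the component order is assumed to be position (x, y, z) followed by the three stiffness components.

```python
easy_terms = [1, 5, 3, 12, 12, 11]    # nonzero terms per state component, easy generator
hard_terms = [5, 11, 7, 22, 22, 16]   # same components, hard generator

total_easy, total_hard = sum(easy_terms), sum(hard_terms)
growth = [round(h / e, 2) for e, h in zip(easy_terms, hard_terms)]

print(total_easy, total_hard, round(total_hard / total_easy, 2))  # -> 44 83 1.89
print(growth)
```

Under that ordering the largest relative growth is in the first component, which goes from a single term to five.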

What this updates in my own thinking. Two things. First, on prediction robustness I owe SINDy more credit than I would have given it yesterday — the accuracy genuinely doesn’t care about the kind of observation-model violations I expected to break it. Second, the broader point I had been pulling toward — that the value of a “discoverable” dynamics method is the readable artefact, not the headline number — is not free. The artefact is only readable when the data-generating process itself is close to sparse-polynomial. Real human sEMG presumably isn’t, and that is the test the result above does not pass. Code: sindy_robustness.py.


Discovering equations vs fitting them: a SINDy follow-up

I went back to the same simulated peg-in-hole data and put SINDy in the comparison, this time the way it is actually meant to be used: learn the differential equations d(state)/dt = f(state, sEMG) directly from data, keep only the sparse terms via Lasso, and integrate the current state forward to the prediction horizon. Not static regression.
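The three steps (build a candidate library, sparsify with Lasso, integrate forward) can be sketched on a toy problem. This is not the repository code: the one-dimensional system, the sine input standing in for sEMG, the hand-written coordinate-descent Lasso (used in place of a library solver to keep the sketch dependency-free), and every name and constant below are assumptions for illustration.

```python
import math

def library(x, u):
    """Degree-2 polynomial candidate terms in state x and input u."""
    return [1.0, x, u, x * x, x * u, u * u]

def lasso_cd(theta, y, alpha, sweeps=100):
    """Lasso by cyclic coordinate descent: min (1/2n)||y - theta xi||^2 + alpha ||xi||_1."""
    n, p = len(y), len(theta[0])
    xi, resid = [0.0] * p, list(y)
    col_sq = [sum(row[j] ** 2 for row in theta) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(row[j] * (r + row[j] * xi[j]) for row, r in zip(theta, resid)) / n
            new = math.copysign(max(abs(rho) - alpha, 0.0), rho) / col_sq[j]
            if new != xi[j]:
                delta = new - xi[j]
                resid = [r - row[j] * delta for row, r in zip(theta, resid)]
                xi[j] = new
    return xi

def true_f(x, u):
    return -0.5 * x + 1.0 * u          # toy ground truth, not the sEMG system

dt, n = 0.02, 1250
xs, us = [0.0], []
for k in range(n):                      # simulate the toy system with explicit Euler
    us.append(math.sin(k * dt))
    xs.append(xs[-1] + dt * true_f(xs[-1], us[-1]))
dxdt = [(xs[k + 1] - xs[k]) / dt for k in range(n)]   # finite-difference derivative
theta = [library(xs[k], us[k]) for k in range(n)]
xi = lasso_cd(theta, dxdt, alpha=1e-3)

def predict(x0, k0, steps):
    """Integrate the discovered model forward from x0 with explicit Euler."""
    x = x0
    for k in range(k0, k0 + steps):
        x += dt * sum(c * f for c, f in zip(xi, library(x, us[k])))
    return x

x_pred = predict(xs[500], 500, steps=10)   # 10 steps of 20 ms = 200 ms ahead
print([round(c, 3) for c in xi], round(abs(x_pred - xs[510]), 4))
```

The sketch evaluates on the same trajectory it was fitted to, which is enough to show the mechanics but not a substitute for a held-out split. The contrast with static regression is in `predict`: the model is asked for a derivative at each step and the horizon is reached by integration, rather than by regressing the future state directly.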

The result inverted my expectation. On position error, SINDy gives 3.9 mm at 50 ms and 7.0 mm at 100 ms: less than half the error of the next method (Ridge, 8.7 mm) at 50 ms, and roughly 28% below it (9.7 mm) at 100 ms. ESN takes over at 200 ms with 10.6 mm. On stiffness, SINDy is best at every horizon. LSTM, with ~32 k parameters against ~3 k training samples, finishes last across the board (22–24 mm). That is the textbook overfitting regime, and simulation does not produce the kind of long-tail nonlinearity that would force a network to earn its capacity.
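To give a feel for the parameter-to-sample ratio, here is a back-of-the-envelope count. It assumes, purely for illustration, that the LSTM in this comparison matches the 2-layer, hidden = 48, 14-input configuration from the design sketch further down this page; the post does not state the exact configuration, and output heads and attention layers are ignored. The formula follows the parameter layout of PyTorch's `nn.LSTM` (four gates, two bias vectors per layer).

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM layer: 4 gates x (input weights + recurrent weights + 2 biases)."""
    return 4 * hidden_size * (input_size + hidden_size + 2)

layer1 = lstm_layer_params(14, 48)   # first layer sees the 14-dim input
layer2 = lstm_layer_params(48, 48)   # second layer sees the first layer's hidden state
total = layer1 + layer2

print(layer1, layer2, total)          # -> 12288 18816 31104
print(round(total / 3000, 1))         # parameters per training sample, taking ~3 k samples
```

Under that assumption the recurrent layers alone land near the quoted ~32 k, which is on the order of ten parameters per training sample.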

What I keep coming back to is not the numbers. It is the artefact: SINDy keeps 1 nonzero term in dx/dt, 5 in dy/dt, 3 in dz/dt, and 11–12 in each stiffness component. These are sparse polynomials a human can read. I can print the equation, change one coefficient, see the effect. That is a different research object than the hidden state of an LSTM.

Where this leaves me: when data is consistent with a compact set of equations, finding those equations is more honest than fitting an input-output map. Whether real human-in-the-loop sEMG behaves that way is the next thing I want to find out. Code and full numbers: semg-impedance-prediction on GitHub.


Predicting What the Operator Means: A Design Sketch for Physics-Constrained Tele-Impedance Delay Compensation

This is a research proposal I have been sketching on my own while applying for thesis topics in this area. It is not an ongoing project — there is no trained model yet. I worked it through end-to-end as a way of stress-testing my own understanding before submitting applications that ask exactly this kind of question.

In teleoperation, communication delays of 50–200 ms are unavoidable, and they make the remote robot react late to the operator's intent. The question I keep coming back to: can deep learning predict the operator's future trajectory and joint stiffness from their surface EMG, far enough in advance to mask that delay — without breaking safety guarantees?

The answer, as I have read into the literature, looks like a layered system rather than a single black box. Surface EMG already leads force output by 30–80 ms (well documented in the sEMG-force literature), and energy-observer safety nets from the teleoperation literature catch the worst case. What seems to be missing is the middle layer: a learned model that explicitly predicts intent 100–500 ms into the future, slotted in between the natural sEMG lead and the safety controller.

[Figure: Physics-Constrained Prediction Network, sEMG → future trajectory and stiffness (proposed architecture; no model has been trained yet). Pipeline: input (14-dim: sEMG ×8, position ×3, velocity ×3) → pre-trained encoder (NinaPro, planned) → channel attention → 2-layer LSTM (hidden = 48) → temporal attention → softplus heads (K ≥ 0) → future trajectory (3) + stiffness (3), 100–500 ms ahead (Δt = teleoperation latency). Low-confidence predictions (MC-Dropout) fall back to the classical energy-based safety controller, which is used when the network is not confident. Physics-informed loss: MSE + λ · ‖∂K/∂t‖² (bounded stiffness rate).]
Proposed end-to-end architecture (no model trained yet): a learned component would occupy a precise gap between the natural sEMG lead and the classical safety controller.

Architecture I sketched. A pre-trained encoder (NinaPro, 40 subjects) consumes 8-channel sEMG plus position and velocity (14-dim input). Channel attention reweights the muscle channels; a 2-layer LSTM (hidden = 48) tracks dynamics; temporal attention summarises the recent window; softplus heads produce the future trajectory and an elementwise non-negative stiffness vector (K ≥ 0). A physics-informed loss penalises the rate-of-change of stiffness, so the model can't cheat by predicting wild swings.
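The two physics constraints can be illustrated without a deep-learning framework. The sketch below is plain Python rather than the PyTorch the proposal would use, and it makes several assumptions: the stiffness rate is approximated by a finite difference over consecutive time steps, the penalty is averaged rather than summed, and `lam`, `dt`, and the function names are placeholders.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z)); never negative."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def physics_informed_loss(pred_traj, true_traj, raw_K, true_K, dt, lam=0.1):
    """MSE on trajectory and stiffness plus lam * mean squared stiffness rate.

    raw_K holds unconstrained head outputs, one 3-vector per time step
    (at least two steps are needed for the rate term).
    """
    K = [[softplus(z) for z in row] for row in raw_K]
    sq_err = [(p - t) ** 2 for P, T in zip(pred_traj, true_traj) for p, t in zip(P, T)]
    sq_err += [(k - t) ** 2 for Kr, Tr in zip(K, true_K) for k, t in zip(Kr, Tr)]
    mse = sum(sq_err) / len(sq_err)
    rates = [(b - a) / dt for prev, cur in zip(K, K[1:]) for a, b in zip(prev, cur)]
    rate_penalty = sum(r * r for r in rates) / len(rates)
    return mse + lam * rate_penalty, K

traj = [[0.0, 0.0, 0.0]] * 4
target_K = [[1.0, 1.0, 1.0]] * 4
smooth = [[0.5, 0.5, 0.5]] * 4
jumpy = [[0.5] * 3, [3.0] * 3, [-2.0] * 3, [0.5] * 3]
loss_smooth, _ = physics_informed_loss(traj, traj, smooth, target_K, dt=0.01)
loss_jumpy, K_jumpy = physics_informed_loss(traj, traj, jumpy, target_K, dt=0.01)
print(loss_jumpy > loss_smooth, min(min(row) for row in K_jumpy) > 0.0)  # -> True True
```

The softplus handles the sign constraint by construction, so the loss never has to police it; the rate term is the only part that needs a weight, and choosing λ would be part of the ablation.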

Ablation plan. Before fixing the architecture, the right move is a systematic comparison of five sequence models — Linear, 1D-CNN, GRU, LSTM, TCN — and then ablations over hidden size, depth, attention placement, and input modality (raw vs. filtered sEMG, with vs. without position and velocity). Pre-training would be followed by leave-one-subject-out fine-tuning so that any cross-user numbers stay honest.
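Leave-one-subject-out is simple to state in code. A minimal sketch, assuming subjects are identified by integer IDs and using eight subjects only as an example; the function name is hypothetical.

```python
from typing import Iterator, List, Tuple

def leave_one_subject_out(subjects: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_subjects, held_out_subject) once per subject."""
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

folds = list(leave_one_subject_out(list(range(8))))
for train, held_out in folds[:2]:
    print(f"fine-tune on {train}, evaluate on subject {held_out}")
```

Every subject is held out exactly once, so a cross-user number is an average over folds in which the test subject contributed no fine-tuning data.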

Trust the model, but verify. Uncertainty would be estimated with MC-Dropout. When confidence drops, the system falls back to a classical energy-based safety controller — the learned prediction is only used when it has earned it.
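The gating logic can be sketched independently of any trained network. In the sketch below the "model" is a toy one-layer predictor with dropout applied at inference time, the fallback is a placeholder that returns zero rather than a real energy-based controller, and the number of passes and the `max_std` threshold are illustrative assumptions.

```python
import random
import statistics

def mc_dropout_predict(stochastic_forward, x, passes=30):
    """Mean and standard deviation over repeated forward passes with dropout left on."""
    samples = [stochastic_forward(x) for _ in range(passes)]
    return statistics.fmean(samples), statistics.pstdev(samples)

def choose_command(stochastic_forward, x, fallback, max_std):
    """Use the learned prediction only when its spread is at most max_std."""
    mean, std = mc_dropout_predict(stochastic_forward, x)
    if std <= max_std:
        return mean, "learned"
    return fallback(x), "fallback"

rng = random.Random(0)
WEIGHTS = [0.4, -0.2, 0.7, 0.1]   # hypothetical weights of a toy one-layer model

def toy_forward(x, p_drop=0.2):
    # Inverted dropout: each unit is dropped with probability p_drop, survivors rescaled.
    kept = [w / (1 - p_drop) if rng.random() > p_drop else 0.0 for w in WEIGHTS]
    return sum(k * x for k in kept)

def safe_fallback(x):
    return 0.0                     # stand-in for the classical safety controller

print(choose_command(toy_forward, 0.1, safe_fallback, max_std=0.1)[1])
print(choose_command(toy_forward, 10.0, safe_fallback, max_std=0.1)[1])
```

In this toy the spread grows with the input magnitude, so the small input is served by the learned path and the large one is handed to the fallback. How well MC-Dropout spread tracks actual error on real sEMG is an empirical question the sketch does not answer.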

What I like about this problem is the cleanliness of the separation: physics provides a hard prior (positive stiffness, bounded rate-of-change), a classical controller provides a safety floor, and deep learning fills a well-defined gap (a multi-step-ahead horizon that adaptive filters can't reach). The point isn't "deep learning everywhere" — it is deciding precisely where in the loop a learned component earns its place. That kind of decision-making, more than any single architecture, is what I want to keep working on.


Toward Multimodal Predictive Systems for Action-Time Prediction

The SINDy follow-up on the sEMG impedance design-space study (see Discovering equations vs fitting them above) left me with a sharper version of a question I had only been gesturing at before. The follow-up replaced LSTM-with-architectural-priors as the protagonist with SINDy used as a dynamics learner: learn d(state)/dt = f(state, sEMG) from data, keep only sparse terms via Lasso, integrate forward to the target horizon. On the same synthetic peg-in-hole data, SINDy gave 3.9 mm at 50 ms and 7.0 mm at 100 ms, against 8.7 mm and 9.7 mm for the next method (Ridge), while LSTM, with ~32 k parameters against ~3 k samples, finished last across the board.

[Figure: Position prediction error (mm), design-space follow-up. Synthetic peg-in-hole, fixed 6/2 train/test split shared across the 4 methods; lower is better. Mean position error in mm by prediction horizon (SINDy / Ridge / ESN / LSTM): 50 ms: 3.9 / 8.7 / 9.7 / 22.0; 100 ms: 7.0 / 9.7 / 10.7 / 23.4; 200 ms: 12.4 / 12.1 / 10.6 / 24.3.]
Position error on a fixed 6/2 train/test split (users 0–5 train, users 6–7 test), shared across all four methods. Note this differs from the original 5-method study, which used leave-one-subject-out; the LSTM number here (24.3 mm at 200 ms) and the LSTM number in the earlier study (28.5 mm at 200 ms) are not directly comparable.

The numbers matter, but what stuck with me is the artefact. SINDy keeps 1 nonzero term in dx/dt, 5 in dy/dt, 3 in dz/dt. The full prediction model fits on one page; each equation is a sparse polynomial a human can read, falsify, and retrain in seconds. When the prediction is wrong at 200 ms, you can look at the equation and tell where the assumption broke. That is a qualitatively different research object from the hidden state of a recurrent network.

This is what reframes the original “safety fallback” problem for me. The earlier instinct was that even a low-error learned model has to defer to a classical, energy-based safety controller as a backup, because no one knows how to be accountable for what a black box would do in a situation no one has thought through. Causal ML is the obvious candidate, and I do not want to dismiss it — I just have not yet, in my reading so far, found a clean way to fit a causal-graph formulation into the kind of inner control loop this problem lives in; that read is provisional. But the SINDy result points to a different path I had not considered before: don’t bolt an interpretability layer onto a black-box prediction, don’t try to constrain it from the outside — make the prediction object itself something you can inspect.

The direction that pulls me — and I want to be upfront, it is an area I am only beginning to read into — is multimodal predictive systems in which the model’s belief about the future is itself a physically grounded, observable, checkable artefact. Sparse-polynomial discovery from sensor data is one minimal example: a 200 ms prediction is a few lines of algebra you can step through. Learned physical simulators are another: a predicted half-second unfolds in 3D and you watch it. They share the property that “is this prediction safe to act on?” can be answered by inspecting the prediction itself, not by adding an external filter beside it.

That reframing changes the problem from “make the AI prediction more interpretable” to “make the AI’s future the thing we inspect”. I do not pretend to know which architectural family (diffusion priors, video transformers, learned simulators, neural physics) does this best, or whether the framing survives contact with real human sEMG, which has cross-talk, fatigue drift, and motion artifacts the simulator does not produce. Finding that out is the next phase, and that open question is exactly why this is the direction I want to spend it on.