    Learning by Doing: a Minimal Sentiment Classifier

    Tutorial · Post-mortem

    Published · Tags: nlp transformers tutorial post-mortem

    TL;DR

    • I built a compact sentiment-classifier project (training + predict) as a short learning exercise using Hugging Face Transformers, Datasets, and PyTorch.
    • This post documents what we built, why, the errors we hit, how we fixed them, and a frank critique of the project with immediate next steps.

    Motivation

    We wanted a concise, reproducible exercise to practice fine-tuning transformer models and to document the common pitfalls newcomers (and sometimes veterans) face when building ML tooling. The goals were simple:

    • Build a tiny pipeline that trains a binary sentiment classifier on IMDB (or a tiny sampled subset) and saves a best model.
    • Make it easy to reproduce locally (Windows, small GPU), run smoke tests, and share learnings in a short blog post.

    This repo is deliberately small and opinionated — it’s a learning artifact, not production ready. The value is in the problems encountered and how they were solved.

    What we built

    • train.py — config-driven training script built around transformers.Trainer.
    • predict.py — loads the saved best model and predicts the sentiment of a single text (a sketch follows this list).
    • config.yaml / dev_config.yaml — runtime configs; dev_config.yaml is minimized for fast smoke runs.
    • tests/test_smoke.py — tiny pytest forward-pass test using from_config() models, so no downloads are required (also sketched after this list).
    • .gitignore and project-level docs (this post).
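
    A minimal sketch of the predict step, assuming the best checkpoint was saved to a local directory by train.py; the directory name and label order are illustrative, and the real predict.py may differ in detail:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    def predict(text: str, model_dir: str = "results/best_model") -> str:
        # Load the fine-tuned checkpoint saved by training (directory name is illustrative).
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
        model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        model.eval()
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return ["negative", "positive"][int(logits.argmax(dim=-1))]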
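
    And a sketch of the smoke-test idea: build a tiny, randomly initialized model from a config object so nothing is downloaded, then assert that a forward pass produces logits of the expected shape (the sizes below are illustrative):

    import torch
    from transformers import AutoModelForSequenceClassification, DistilBertConfig

    def test_forward_pass_smoke():
        # Deliberately tiny config: random weights, no network access needed.
        config = DistilBertConfig(vocab_size=1000, dim=32, hidden_dim=64,
                                  n_layers=1, n_heads=2, num_labels=2)
        model = AutoModelForSequenceClassification.from_config(config)
        input_ids = torch.randint(0, 1000, (2, 8))
        attention_mask = torch.ones_like(input_ids)
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        assert logits.shape == (2, 2)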

    Design decisions

    • Use configs (YAML) for hyperparameters so we can run fast dev experiments and larger runs without code edits (a sketch follows this list).
    • Keep training code simple and readable rather than abstracted into many modules — easier for a small learning project.
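
    Concretely, the config-driven pattern boils down to something like this at the top of train.py; a sketch under the assumption that the config path is the single CLI argument (as in the repro commands below), not the script's exact code:

    import sys
    import yaml

    def main() -> None:
        # The config path comes from the command line, e.g. `python train.py dev_config.yaml`.
        config_path = sys.argv[1] if len(sys.argv) > 1 else "config.yaml"
        with open(config_path) as f:
            cfg = yaml.safe_load(f)
        # Everything downstream (model name, epochs, strategies, ...) reads from cfg,
        # so switching between dev and full runs is just a different YAML file.
        ...

    if __name__ == "__main__":
        main()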

    Repro (quick)

    Dev smoke run (PowerShell)
    & "D:/Sentiment Classifier/.venv/Scripts/python.exe" "D:/Sentiment Classifier/sentiment-classifier/train.py" "D:/Sentiment Classifier/sentiment-classifier/dev_config.yaml"
    Run tests
    cd "D:/Sentiment Classifier"
    & ".venv/Scripts/python.exe" -m pytest -q

    What went wrong (real problems encountered)

    1. Missing evaluation dependency: the evaluate library expects scikit-learn for some metrics, so metric imports failed at runtime.
    2. Transformers API mismatch: different versions of TrainingArguments expect evaluation_strategy vs eval_strategy — passing the wrong kwarg crashed construction.
    3. Save/eval strategy mismatch: load_best_model_at_end=True throws a ValueError unless save_strategy equals the evaluation strategy.
    4. Deprecated Trainer argument: older Trainer usages set tokenizer= directly; docs recommend processing_class + data_collator=DataCollatorWithPadding(tokenizer).
    5. YAML parsing quirks: bare no/yes become booleans, which silently broke a save_strategy field in the dev configs (a short demo follows this list).
    6. Gigantic model files accidentally committed: pushing failed due to large results/ artifacts.
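
    To make pitfall 5 concrete: PyYAML follows YAML 1.1, where bare yes/no/on/off parse as booleans, so a bare no only survives as a string if it is quoted:

    import yaml

    print(yaml.safe_load("save_strategy: no"))     # {'save_strategy': False} -- bare no becomes a boolean
    print(yaml.safe_load('save_strategy: "no"'))   # {'save_strategy': 'no'}  -- quoting keeps the string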

    How we fixed them

    1. Install missing packages (scikit-learn) so evaluate metrics work.
    2. Add robust code in train.py to detect whether TrainingArguments.__init__ accepts evaluation_strategy or eval_strategy and pass the correct kwarg accordingly (fixes 2–4 are sketched together after this list).
    3. When load_best_model_at_end is true, programmatically align save_strategy with the chosen evaluation strategy.
    4. Replace deprecated tokenizer= usage with processing_class=tokenizer and DataCollatorWithPadding.
    5. Make small config values explicit strings (e.g., save_strategy: "no") to avoid YAML boolean parsing.
    6. Remove large artifacts from git history: untrack results/, add .gitignore, create a backup branch, then filter history and force-push the cleaned repo (the exact commands are in the appendix).
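
    For reference, here is a condensed sketch of how fixes 2–4 can fit together. cfg, train_ds and eval_ds are placeholders for the loaded config and tokenized datasets, the config keys are illustrative, and it assumes a transformers release recent enough to accept processing_class; the real train.py differs in detail.

    import inspect
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    def build_training_args(cfg: dict) -> TrainingArguments:
        # Fix 2: pass whichever strategy kwarg this transformers version understands.
        params = inspect.signature(TrainingArguments.__init__).parameters
        eval_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"
        kwargs = {
            "output_dir": cfg.get("output_dir", "./results"),
            eval_key: cfg.get("eval_strategy", "epoch"),
            "save_strategy": cfg.get("save_strategy", "epoch"),
            "load_best_model_at_end": cfg.get("load_best_model_at_end", True),
        }
        # Fix 3: load_best_model_at_end requires the save and eval strategies to match.
        if kwargs["load_best_model_at_end"]:
            kwargs["save_strategy"] = kwargs[eval_key]
        return TrainingArguments(**kwargs)

    def build_trainer(cfg: dict, train_ds, eval_ds) -> Trainer:
        tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
        model = AutoModelForSequenceClassification.from_pretrained(cfg["model_name"], num_labels=2)
        # Fix 4: processing_class replaces the deprecated tokenizer= argument;
        # DataCollatorWithPadding pads each batch dynamically.
        return Trainer(
            model=model,
            args=build_training_args(cfg),
            train_dataset=train_ds,
            eval_dataset=eval_ds,
            processing_class=tokenizer,
            data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        )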

    Aggressive critique (honest, sharp)

    • Storing model artifacts in the repo — use Git LFS or object storage + download script.
    • Monolithic train.py — split into data/model/training/utils; add unit tests.
    • Weak config validation — enforce a schema via Pydantic or JSON Schema and add a --validate-config flag (a sketch follows this list).
    • Sparse logging/handling — add structured logs and guards around external calls.
    • Minimal CI — GH Actions for pytest + lint (black/isort/flake8).
    • No model packaging/versioning — add a tiny registry step + manifest.
    • Security/privacy omitted — add a data-intake checklist; pin and scan dependencies.
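
    On the config-validation point, a minimal sketch of what a Pydantic schema plus a --validate-config entry point could look like; the field names and defaults are illustrative, not the project's actual schema:

    import argparse
    import yaml
    from pydantic import BaseModel, Field, ValidationError

    class TrainConfig(BaseModel):
        # Illustrative fields only; the real keys live in config.yaml / dev_config.yaml.
        model_name: str = "distilbert-base-uncased"
        learning_rate: float = Field(2e-5, gt=0)
        num_train_epochs: int = Field(1, ge=1)
        save_strategy: str = "epoch"   # keep strategies as strings to dodge the YAML no/yes trap
        load_best_model_at_end: bool = False

    def load_config(path: str) -> TrainConfig:
        with open(path) as f:
            raw = yaml.safe_load(f) or {}
        return TrainConfig(**raw)      # raises ValidationError with a readable report

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("config_path")
        parser.add_argument("--validate-config", action="store_true", help="check the config and exit")
        args = parser.parse_args()
        try:
            cfg = load_config(args.config_path)
        except ValidationError as err:
            raise SystemExit(f"Invalid config: {err}")
        if args.validate_config:
            print("Config OK:", cfg)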

    Lessons learned

    • Small smoke tests catch integration regressions fast.
    • Prefer small dev configs; run full experiments separately.
    • Transformer APIs evolve; add lightweight compatibility layers (or pin).
    • Never commit large model artifacts to a Git repo.
    • YAML quirks are real — validate configs.

    Immediate next steps

    1. Add Git LFS or cloud storage for models.
    2. Add GitHub Actions for CI (pytest + linting).
    3. Refactor train.py into modules with unit tests.
    4. Add config validation and a contributor README.

    Appendix — exact commands used (selected)

    Setup & deps
    # create venv (if needed)
    python -m venv .venv
    & ".venv/Scripts/pip.exe" install -r requirements.txt
    Dev smoke run
    & ".venv/Scripts/python.exe" "train.py" "dev_config.yaml"
    Run tests
    & ".venv/Scripts/python.exe" -m pytest -q
    Clean git history
    git rm -r --cached results
    git add .gitignore
    git commit -m "chore: remove model artifacts from repo (keep locally) and respect .gitignore"
    git branch backup-with-results
    git filter-branch --force --index-filter 'git rm -r --cached --ignore-unmatch results' --prune-empty --tag-name-filter cat -- --all
    git reflog expire --expire=now --all; git gc --prune=now --aggressive
    git push origin --force main

    Credits: Neils Haldane-Lutterodt — project owner and experimenter.
