Learning by Doing: a Minimal Sentiment Classifier

Tutorial · Post-mortem

Published · Tags: nlp transformers tutorial post-mortem

TL;DR

  • We built a compact sentiment-classifier project (training + predict) as a short learning exercise using Hugging Face Transformers, Datasets, and PyTorch.
  • This post documents what we built, why, the errors we hit, how we fixed them, and a frank critique of the project with immediate next steps.

Motivation

We wanted a concise, reproducible exercise to practice fine-tuning transformer models and to document the common pitfalls newcomers (and sometimes veterans) face when building ML tooling. The goals were simple:

  • Build a tiny pipeline that trains a binary sentiment classifier on IMDB (or a tiny sampled subset) and saves a best model.
  • Make it easy to reproduce locally (Windows, small GPU), run smoke tests, and share learnings in a short blog post.

This repo is deliberately small and opinionated — it’s a learning artifact, not production ready. The value is in the problems encountered and how they were solved.

What we built

  • train.py — config-driven training script built around transformers.Trainer.
  • predict.py — loads the saved best model and predicts a single text.
  • config.yaml / dev_config.yaml — runtime configs; dev_config.yaml is minimized for fast smoke runs.
  • tests/test_smoke.py — tiny pytest forward-pass test that builds models with from_config() so no downloads are required; a minimal sketch follows this list.
  • .gitignore and project-level docs (this post).
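
To make the smoke-test idea concrete, here is a minimal sketch of what such a test can look like. It is not a copy of tests/test_smoke.py; the tiny DistilBertConfig and the test name are assumptions, chosen so that nothing needs to be downloaded.

# sketch of a forward-pass smoke test: random-weight model, no downloads
import torch
from transformers import DistilBertConfig, AutoModelForSequenceClassification

def test_forward_pass_shapes():
    # Tiny, randomly initialised model built purely from a config object.
    config = DistilBertConfig(vocab_size=100, dim=64, n_layers=2, n_heads=2,
                              hidden_dim=128, num_labels=2)
    model = AutoModelForSequenceClassification.from_config(config)
    model.eval()

    # A batch of 2 fake "sentences", 8 token ids each.
    input_ids = torch.randint(0, config.vocab_size, (2, 8))
    attention_mask = torch.ones_like(input_ids)

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

    # Binary sentiment head, so logits should be (batch_size, 2).
    assert outputs.logits.shape == (2, 2)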

Design decisions

  • Use YAML configs for hyperparameters so we can run fast dev experiments and larger runs without code edits (a loading sketch follows this list).
  • Keep training code simple and readable rather than abstracted into many modules — easier for a small learning project.
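
As a sketch of the config-driven side, this is roughly how a YAML file can be turned into TrainingArguments. The field names and defaults are illustrative assumptions, not copied from config.yaml, and only version-stable kwargs appear here; the eval/save-strategy handling is covered further down.

# sketch: load a YAML config and build TrainingArguments from it
import yaml
from transformers import TrainingArguments

def load_training_args(config_path: str) -> TrainingArguments:
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    return TrainingArguments(
        output_dir=cfg["output_dir"],
        num_train_epochs=cfg.get("num_train_epochs", 1),
        per_device_train_batch_size=cfg.get("per_device_train_batch_size", 8),
        learning_rate=float(cfg.get("learning_rate", 5e-5)),
        seed=cfg.get("seed", 42),
    )

With this shape, switching between config.yaml and dev_config.yaml is the only difference between a full run and a smoke run.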

Repro (quick)

Dev smoke run (PowerShell)
& "D:/Sentiment Classifier/.venv/Scripts/python.exe" "D:/Sentiment Classifier/sentiment-classifier/train.py" "D:/Sentiment Classifier/sentiment-classifier/dev_config.yaml"
Run tests
cd "D:/Sentiment Classifier"
& ".venv/Scripts/python.exe" -m pytest -q

What went wrong (real problems encountered)

  1. Missing evaluation dependency: evaluate expected scikit-learn for some metrics. Result: metrics import errors.
  2. Transformers API mismatch: different versions of TrainingArguments expect evaluation_strategy vs eval_strategy — passing the wrong kwarg crashed construction.
  3. Save/eval strategy mismatch: load_best_model_at_end=True throws a ValueError unless save_strategy equals the evaluation strategy.
  4. Deprecated Trainer argument: older Trainer usages set tokenizer= directly; docs recommend processing_class + data_collator=DataCollatorWithPadding(tokenizer).
  5. YAML parsing quirks: bare no/yes become booleans; this broke a save_strategy field in dev configs.
  6. Gigantic model files accidentally committed: pushing failed due to large results/ artifacts.

How we fixed them

  1. Install missing packages (scikit-learn) so evaluate metrics work.
  2. Add robust code in train.py to detect whether TrainingArguments.__init__ accepts evaluation_strategy or eval_strategy and pass the correct kwarg accordingly (see the sketch after this list).
  3. When load_best_model_at_end is true, programmatically align save_strategy with the chosen evaluation strategy.
  4. Replace deprecated tokenizer= usage with processing_class=tokenizer and DataCollatorWithPadding.
  5. Make ambiguous config values explicit strings (e.g., save_strategy: "no") so YAML doesn't parse them as booleans.
  6. Remove large artifacts from git history: untrack results/, add .gitignore, create a backup branch, then filter history and force-push the cleaned repo.
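
Fixes 2, 3 and 4 combine naturally into one Trainer-construction helper. The sketch below is hedged: the config keys are assumptions, and processing_class only exists in reasonably recent transformers releases, so treat it as an outline rather than the actual train.py code.

# sketch: version-tolerant TrainingArguments plus non-deprecated Trainer wiring
import inspect
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding

def build_trainer(model, tokenizer, train_ds, eval_ds, cfg, compute_metrics=None):
    # Fix 2: newer transformers renamed evaluation_strategy to eval_strategy.
    init_params = inspect.signature(TrainingArguments.__init__).parameters
    strategy_key = "eval_strategy" if "eval_strategy" in init_params else "evaluation_strategy"

    load_best = bool(cfg.get("load_best_model_at_end", False))
    eval_strategy = cfg.get("eval_strategy", "epoch")
    save_strategy = cfg.get("save_strategy", "epoch")
    if load_best:
        # Fix 3: Trainer raises a ValueError unless save and eval strategies match.
        save_strategy = eval_strategy

    training_args = TrainingArguments(**{
        "output_dir": cfg["output_dir"],
        "load_best_model_at_end": load_best,
        "save_strategy": save_strategy,
        strategy_key: eval_strategy,
    })

    # Fix 4: pass the tokenizer as processing_class (tokenizer= is deprecated)
    # and let DataCollatorWithPadding pad each batch dynamically.
    return Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        processing_class=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics,
    )

Pinning transformers in requirements.txt would make the inspect check unnecessary; the shim just keeps the repo runnable across machines with different installed versions.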

Aggressive critique (honest, sharp)

  • Storing model artifacts in the repo — use Git LFS or object storage + download script.
  • Monolithic train.py — split into data/model/training/utils; add unit tests.
  • Weak config validation — enforce a schema via Pydantic/JSON Schema and add a --validate-config flag (a sketch follows this list).
  • Sparse logging and error handling — add structured logs and guards around external calls.
  • Minimal CI — GH Actions for pytest + lint (black/isort/flake8).
  • No model packaging/versioning — add a tiny registry step + manifest.
  • Security/privacy omitted — add a data-intake checklist and pinned, scanned dependencies.
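
On the config-validation point, here is a small Pydantic sketch of what a schema could look like. Pydantic is not currently a project dependency and the fields are assumptions, so this is an outline of the idea rather than a drop-in module.

# sketch: fail fast on a bad config instead of deep inside training
from typing import Literal
import yaml
from pydantic import BaseModel, Field

class TrainConfig(BaseModel):
    model_name: str = "distilbert-base-uncased"
    output_dir: str = "results"
    num_train_epochs: int = Field(1, ge=1)
    learning_rate: float = Field(5e-5, gt=0)
    # Also catches the YAML quirk: a bare `no` arrives as False and fails this check.
    save_strategy: Literal["no", "epoch", "steps"] = "epoch"
    load_best_model_at_end: bool = False

def validate_config(path: str) -> TrainConfig:
    with open(path, "r", encoding="utf-8") as f:
        return TrainConfig(**yaml.safe_load(f))

A --validate-config flag would then just call validate_config() and exit, which is cheap to wire into train.py.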

Lessons learned

  • Small smoke tests catch integration regressions fast.
  • Prefer small dev configs; run full experiments separately.
  • Transformer APIs evolve; add lightweight compatibility layers (or pin).
  • Never commit large model artifacts to a Git repo.
  • YAML quirks are real — validate configs (the snippet below reproduces the save_strategy gotcha).
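
The YAML point is easy to reproduce in a REPL (this assumes PyYAML, the usual provider of safe_load):

import yaml

yaml.safe_load("save_strategy: no")     # -> {'save_strategy': False}, bare no parsed as a boolean
yaml.safe_load('save_strategy: "no"')   # -> {'save_strategy': 'no'}, quoted value stays a string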

Immediate next steps

  1. Add Git LFS or cloud storage for models.
  2. Add GitHub Actions for CI (pytest + linting).
  3. Refactor train.py into modules with unit tests.
  4. Add config validation and a contributor README.

Appendix — exact commands used (select)

Setup & deps
# create venv (if needed)
python -m venv .venv
& ".venv/Scripts/pip.exe" install -r requirements.txt
Dev smoke run
& ".venv/Scripts/python.exe" "train.py" "dev_config.yaml"
Run tests
& ".venv/Scripts/python.exe" -m pytest -q
Clean git history
git rm -r --cached results
git add .gitignore
git commit -m "chore: remove model artifacts from repo (keep locally) and respect .gitignore"
git branch backup-with-results
git filter-branch --force --index-filter 'git rm -r --cached --ignore-unmatch results' --prune-empty --tag-name-filter cat -- --all
git reflog expire --expire=now --all; git gc --prune=now --aggressive
git push origin --force main

Credits: Neils Haldane-Lutterodt — project owner and experimenter.
