    Learning by Doing: a Minimal Sentiment Classifier

    Tutorial · Post-mortem

    Published · Tags: nlp transformers tutorial post-mortem

    TL;DR

    • I built a compact sentiment-classifier project (training + predict) as a short learning exercise using Hugging Face Transformers, Datasets, and PyTorch.
    • This post documents what we built, why, the errors we hit, how we fixed them, and a frank critique of the project with immediate next steps.

    Motivation

    We wanted a concise, reproducible exercise to practice fine-tuning transformer models and to document the common pitfalls newcomers (and sometimes veterans) face when building ML tooling. The goals were simple:

    • Build a tiny pipeline that trains a binary sentiment classifier on IMDB (or a tiny sampled subset) and saves a best model.
    • Make it easy to reproduce locally (Windows, small GPU), run smoke tests, and share learnings in a short blog post.

    This repo is deliberately small and opinionated — it’s a learning artifact, not production ready. The value is in the problems encountered and how they were solved.

    What we built

    • train.py — config-driven training script built around transformers.Trainer.
    • predict.py — loads the saved best model and predicts the sentiment of a single text (a sketch follows this list).
    • config.yaml / dev_config.yaml — runtime configs; dev_config.yaml is minimized for fast smoke runs.
    • tests/test_smoke.py — tiny pytest forward-pass test using from_config() models, so no downloads are required (also sketched after this list).
    • .gitignore and project-level docs (this post).
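
    A minimal sketch of the predict step, assuming the best checkpoint was saved to a local directory by train.py; the directory name and label order are illustrative, and the real predict.py may differ in detail:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    def predict(text: str, model_dir: str = "results/best_model") -> str:
        # Load the fine-tuned checkpoint saved by training (directory name is illustrative).
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
        model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        model.eval()
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return ["negative", "positive"][int(logits.argmax(dim=-1))]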
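
    And a sketch of the smoke-test idea: build a tiny, randomly initialized model from a config object so nothing is downloaded, then assert that a forward pass produces logits of the expected shape (the sizes below are illustrative):

    import torch
    from transformers import AutoModelForSequenceClassification, DistilBertConfig

    def test_forward_pass_smoke():
        # Deliberately tiny config: random weights, no network access needed.
        config = DistilBertConfig(vocab_size=1000, dim=32, hidden_dim=64,
                                  n_layers=1, n_heads=2, num_labels=2)
        model = AutoModelForSequenceClassification.from_config(config)
        input_ids = torch.randint(0, 1000, (2, 8))
        attention_mask = torch.ones_like(input_ids)
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        assert logits.shape == (2, 2)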

    Design decisions

    • Use configs (YAML) for hyperparameters so we can run fast dev experiments and larger runs without code edits (a sketch follows this list).
    • Keep training code simple and readable rather than abstracted into many modules — easier for a small learning project.
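
    Concretely, the config-driven pattern boils down to something like this at the top of train.py; a sketch under the assumption that the config path is the single CLI argument (as in the repro commands below), not the script's exact code:

    import sys
    import yaml

    def main() -> None:
        # The config path comes from the command line, e.g. `python train.py dev_config.yaml`.
        config_path = sys.argv[1] if len(sys.argv) > 1 else "config.yaml"
        with open(config_path) as f:
            cfg = yaml.safe_load(f)
        # Everything downstream (model name, epochs, strategies, ...) reads from cfg,
        # so switching between dev and full runs is just a different YAML file.
        ...

    if __name__ == "__main__":
        main()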

    Repro (quick)

    Dev smoke run (PowerShell)
    & "D:/Sentiment Classifier/.venv/Scripts/python.exe" "D:/Sentiment Classifier/sentiment-classifier/train.py" "D:/Sentiment Classifier/sentiment-classifier/dev_config.yaml"
    Run tests
    cd "D:/Sentiment Classifier"
    & ".venv/Scripts/python.exe" -m pytest -q

    What went wrong (real problems encountered)

    1. Missing evaluation dependency: the evaluate library expects scikit-learn for some metrics, so metric imports failed at runtime.
    2. Transformers API mismatch: different versions of TrainingArguments expect evaluation_strategy vs eval_strategy — passing the wrong kwarg crashed construction.
    3. Save/eval strategy mismatch: load_best_model_at_end=True throws a ValueError unless save_strategy equals the evaluation strategy.
    4. Deprecated Trainer argument: older Trainer usages set tokenizer= directly; docs recommend processing_class + data_collator=DataCollatorWithPadding(tokenizer).
    5. YAML parsing quirks: bare no/yes become booleans, which silently broke a save_strategy field in the dev configs (a short demo follows this list).
    6. Gigantic model files accidentally committed: pushing failed due to large results/ artifacts.
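
    To make pitfall 5 concrete: PyYAML follows YAML 1.1, where bare yes/no/on/off parse as booleans, so a bare no only survives as a string if it is quoted:

    import yaml

    print(yaml.safe_load("save_strategy: no"))     # {'save_strategy': False} -- bare no becomes a boolean
    print(yaml.safe_load('save_strategy: "no"'))   # {'save_strategy': 'no'}  -- quoting keeps the string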

    How we fixed them

    1. Install missing packages (scikit-learn) so evaluate metrics work.
    2. Add robust code in train.py to detect whether TrainingArguments.__init__ accepts evaluation_strategy or eval_strategy and pass the correct kwarg accordingly (fixes 2–4 are sketched together after this list).
    3. When load_best_model_at_end is true, programmatically align save_strategy with the chosen evaluation strategy.
    4. Replace deprecated tokenizer= usage with processing_class=tokenizer and DataCollatorWithPadding.
    5. Make small config values explicit strings (e.g., save_strategy: "no") to avoid YAML boolean parsing.
    6. Remove large artifacts from git history: untrack results/, add .gitignore, create a backup branch, then filter history and force-push the cleaned repo (the exact commands are in the appendix).
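
    For reference, here is a condensed sketch of how fixes 2–4 can fit together. cfg, train_ds and eval_ds are placeholders for the loaded config and tokenized datasets, the config keys are illustrative, and it assumes a transformers release recent enough to accept processing_class; the real train.py differs in detail.

    import inspect
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    def build_training_args(cfg: dict) -> TrainingArguments:
        # Fix 2: pass whichever strategy kwarg this transformers version understands.
        params = inspect.signature(TrainingArguments.__init__).parameters
        eval_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"
        kwargs = {
            "output_dir": cfg.get("output_dir", "./results"),
            eval_key: cfg.get("eval_strategy", "epoch"),
            "save_strategy": cfg.get("save_strategy", "epoch"),
            "load_best_model_at_end": cfg.get("load_best_model_at_end", True),
        }
        # Fix 3: load_best_model_at_end requires the save and eval strategies to match.
        if kwargs["load_best_model_at_end"]:
            kwargs["save_strategy"] = kwargs[eval_key]
        return TrainingArguments(**kwargs)

    def build_trainer(cfg: dict, train_ds, eval_ds) -> Trainer:
        tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
        model = AutoModelForSequenceClassification.from_pretrained(cfg["model_name"], num_labels=2)
        # Fix 4: processing_class replaces the deprecated tokenizer= argument;
        # DataCollatorWithPadding pads each batch dynamically.
        return Trainer(
            model=model,
            args=build_training_args(cfg),
            train_dataset=train_ds,
            eval_dataset=eval_ds,
            processing_class=tokenizer,
            data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        )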

    Aggressive critique (honest, sharp)

    • Storing model artifacts in the repo — use Git LFS or object storage + download script.
    • Monolithic train.py — split into data/model/training/utils; add unit tests.
    • Weak config validation — enforce a schema via Pydantic or JSON Schema and add a --validate-config flag (a sketch follows this list).
    • Sparse logging/handling — add structured logs and guards around external calls.
    • Minimal CI — GH Actions for pytest + lint (black/isort/flake8).
    • No model packaging/versioning — add a tiny registry step + manifest.
    • Security/privacy omitted — add a data-intake checklist; pin and scan dependencies.
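
    On the config-validation point, a minimal sketch of what a Pydantic schema plus a --validate-config entry point could look like; the field names and defaults are illustrative, not the project's actual schema:

    import argparse
    import yaml
    from pydantic import BaseModel, Field, ValidationError

    class TrainConfig(BaseModel):
        # Illustrative fields only; the real keys live in config.yaml / dev_config.yaml.
        model_name: str = "distilbert-base-uncased"
        learning_rate: float = Field(2e-5, gt=0)
        num_train_epochs: int = Field(1, ge=1)
        save_strategy: str = "epoch"   # keep strategies as strings to dodge the YAML no/yes trap
        load_best_model_at_end: bool = False

    def load_config(path: str) -> TrainConfig:
        with open(path) as f:
            raw = yaml.safe_load(f) or {}
        return TrainConfig(**raw)      # raises ValidationError with a readable report

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("config_path")
        parser.add_argument("--validate-config", action="store_true", help="check the config and exit")
        args = parser.parse_args()
        try:
            cfg = load_config(args.config_path)
        except ValidationError as err:
            raise SystemExit(f"Invalid config: {err}")
        if args.validate_config:
            print("Config OK:", cfg)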

    Lessons learned

    • Small smoke tests catch integration regressions fast.
    • Prefer small dev configs; run full experiments separately.
    • Transformer APIs evolve; add lightweight compatibility layers (or pin).
    • Never commit large model artifacts to a Git repo.
    • YAML quirks are real — validate configs.

    Immediate next steps

    1. Add Git LFS or cloud storage for models.
    2. Add GitHub Actions for CI (pytest + linting).
    3. Refactor train.py into modules with unit tests.
    4. Add config validation and a contributor README.

    Appendix — exact commands used (selected)

    Setup & deps
    # create venv (if needed)
    python -m venv .venv
    & ".venv/Scripts/pip.exe" install -r requirements.txt
    Dev smoke run
    & ".venv/Scripts/python.exe" "train.py" "dev_config.yaml"
    Run tests
    & ".venv/Scripts/python.exe" -m pytest -q
    Clean git history
    git rm -r --cached results
    git add .gitignore
    git commit -m "chore: remove model artifacts from repo (keep locally) and respect .gitignore"
    git branch backup-with-results
    git filter-branch --force --index-filter 'git rm -r --cached --ignore-unmatch results' --prune-empty --tag-name-filter cat -- --all
    git reflog expire --expire=now --all; git gc --prune=now --aggressive
    git push origin --force main

    Credits: Neils Haldane-Lutterodt — project owner and experimenter.
