Project Vision
JoeyLLM is a hands-on project focused on building language-model workflows from end to end, with a strong emphasis on Australian English and other regional and domain-specific language use.
The goal is not just to produce models, but to deeply understand how data quality, infrastructure, and training choices shape the behaviour of modern LLM systems.
Our Pipeline
1. Clean & Filter: process large web datasets (the 60 TB FineWeb corpus), then filter and normalise the text.
2. Classify: build classifiers to identify region, domain, and language patterns.
3. Fine-Tune: use curated, high-quality datasets to fine-tune specialised language models.
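As a rough illustration of the Clean & Filter step, the sketch below normalises text and drops low-quality or duplicate documents. It is a minimal example, not the project's actual cleaning code: the function names and both thresholds (`MIN_WORDS`, `MIN_ALPHA_RATIO`) are hypothetical and would need tuning against real FineWeb samples.

```python
import re
import unicodedata

# Hypothetical thresholds, chosen only for illustration.
MIN_WORDS = 5
MIN_ALPHA_RATIO = 0.6

_WHITESPACE = re.compile(r"\s+")


def normalise(text: str) -> str:
    """Apply Unicode NFKC normalisation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return _WHITESPACE.sub(" ", text).strip()


def keep(text: str) -> bool:
    """Return True if the text passes two simple quality heuristics."""
    # Very short documents are rejected first (this also covers empty text).
    if len(text.split()) < MIN_WORDS:
        return False
    # Share of characters that are letters or spaces.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= MIN_ALPHA_RATIO


def clean_and_filter(docs):
    """Normalise each document, then drop low-quality ones and exact duplicates."""
    seen = set()
    for doc in docs:
        text = normalise(doc)
        if not keep(text) or text in seen:
            continue
        seen.add(text)
        yield text
```

Because `clean_and_filter` is a generator, it can be applied to a streamed dataset without loading everything into memory. The in-memory `seen` set is only practical for small samples; deduplication at corpus scale would need a different approach.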
Semester Roadmap
S1: Data & Infrastructure
- Explore & clean FineWeb dataset
- Filter content and normalise text
- Build text classifiers (region, domain, metadata)
- Produce high-quality filtered datasets for training
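One possible starting point for the region classifiers is a weak label taken from each page's URL, which can later seed or sanity-check a trained text classifier. The sketch below assumes each record is a dict with a `url` field; the `TLD_TO_REGION` mapping and both function names are hypothetical. A country-code domain is only a noisy hint about the language variety of a page, not ground truth.

```python
from urllib.parse import urlparse

# Hypothetical mapping from country-code TLD to a region label.
TLD_TO_REGION = {"au": "AU", "ca": "CA", "nz": "NZ", "uk": "UK"}


def weak_region_label(url: str):
    """Return a region label based on the URL's country-code TLD, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_TO_REGION.get(tld)


def label_records(records):
    """Yield only the records with a known TLD, each with a 'region' key added."""
    for record in records:
        region = weak_region_label(record.get("url", ""))
        if region is not None:
            yield {**record, "region": region}
```

Records on generic domains such as `.com` are skipped rather than guessed, so the labelled subset stays small but comparatively clean.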
S2: Model Training
- Fine-tune existing models on curated datasets
- Regional models: Australian English, Canadian English, etc.
- Domain models: banking, defence, science, hobbies
- Understand how datasets shape model behaviour
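Comparing how different datasets shape model behaviour is easier when every run holds out the same documents for evaluation. The sketch below shows one way to get a reproducible split by hashing a document ID. The function name, the `val_fraction` default, and the assumption that each document has a stable string ID are all illustrative, not part of the project's workflow.

```python
import hashlib


def split_of(doc_id: str, val_fraction: float = 0.05) -> str:
    """Assign a document to 'train' or 'validation' from a hash of its ID."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < val_fraction else "train"
```

Since the assignment depends only on the ID, the same document lands in the same split on every machine and in every run, without storing a separate index file.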
Project Goals
Goal Deliverables
- Tools for cleaning large web datasets
- Text classification models
- Training workflows and pipelines
- Fine-tuned language models for specialised contexts
Infra Compute Environment
- Remote GPU servers with L4 GPUs
- JupyterHub environment at 10.55.0.245
- WireGuard VPN for secure access
- A100 GPU clusters for large training jobs
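To confirm which GPU a notebook session actually sees (for example an L4 or an A100), a small check like the one below can help. It is a sketch that assumes the NVIDIA driver's `nvidia-smi` tool is on the PATH of the server; the helper names are hypothetical, and the function returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text: str):
    """Parse 'name, memory' lines from nvidia-smi CSV output into tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def visible_gpus():
    """Return the GPUs nvidia-smi reports, or an empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_list(result.stdout) if result.returncode == 0 else []
```

Running `visible_gpus()` in the first cell of a notebook gives a quick record of the hardware a given experiment ran on.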
Repository Structure
| Folder | Purpose |
|---|---|
| knowledge-base/ | Combined compute infrastructure, data, models, learning resources, Q&A, papers, and platform references |
| roadmap/ | Project goals, planning docs, semester overviews |
| team-members/ | Combined member profiles and notebook progress tracking |
| management/ | Weekly reports, weekly TODOs, and coordination tracking |
Resources & Links
- 💾 FineWeb Dataset: 60 TB web corpus for data cleaning work
- 🤖 Transformers Docs: model training and fine-tuning reference
- 📊 HuggingFace Datasets: dataset loading, inspection, and preprocessing
- 📓 Jupyter Documentation: notebook-based workflow on remote compute
- 🔒 WireGuard VPN: secure tunnel to access compute infrastructure
- 📝 GitHub Markdown Guide: writing docs that render cleanly on GitHub




