Data input and preprocessing
Efficient data handling is crucial for LLM training, as we discussed in Part 1 of this book. Here, let’s explore advanced techniques for data input and preprocessing:
- Import the required Python packages:
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
import numpy as np
- Load and combine multiple datasets:
wiki_dataset = load_dataset("wikipedia", "20220301.en", split="train")
books_dataset = load_dataset("bookcorpus", split="train")

# Combine datasets
combined_dataset = concatenate_datasets([wiki_dataset, books_dataset])
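Note that concatenate_datasets simply appends one corpus after the other, so the Wikipedia and BookCorpus examples sit in two contiguous runs. For training you generally want the sources mixed; a minimal sketch using the datasets library's standard shuffle method (the seed value here is illustrative, not from the original):

# Mix the two corpora so training batches draw from both sources
combined_dataset = combined_dataset.shuffle(seed=42)
print(f"Combined dataset size: {len(combined_dataset):,} examples")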
- Initialize the tokenizer and define the preprocessing function:

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 ships without a pad token; reusing EOS lets us pad batches later
# (this line is an assumed addition, not part of the original snippet)
tokenizer.pad_token = tokenizer.eos_token

def preprocess_function(examples):
    # Tokenize the texts; the original snippet breaks off here, so the
    # arguments below are a typical completion, not necessarily the book's
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    return tokenized
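To close the loop on the DataLoader import above, here is a sketch of how the preprocessing function would typically be applied and batched. The batched mapping, column removal, and batch size are assumptions rather than the book's exact code:

# Apply preprocessing across the whole dataset (assumed completion)
tokenized_dataset = combined_dataset.map(
    preprocess_function,
    batched=True,  # pass many texts per call for tokenizer throughput
    remove_columns=combined_dataset.column_names,  # keep only model inputs
)
tokenized_dataset.set_format("torch")  # return PyTorch tensors

# Wrap in a DataLoader for training; batch_size is illustrative
train_loader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True)

Because preprocess_function pads every example to the same max_length, the default collate function can stack them directly; with dynamic padding you would pass a data collator instead.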