Data Chunking for Large Datasets
Process large datasets efficiently by breaking them into manageable chunks to reduce memory consumption and improve performance.
The Problem: Memory Exhaustion
// BAD: Loading all records into memory $users = User::all(); // Could be millions of records!
foreach ($users as $user) { $user->sendNewsletter(); }
// BAD: Even with select, still loads everything $emails = User::pluck('email'); // Array of millions of emails
foreach ($emails as $email) { Mail::to($email)->send(new Newsletter()); }
Solution: Chunking Methods
- Basic Chunking with chunk()
// Process 100 records at a time User::chunk(100, function ($users) { foreach ($users as $user) { $user->calculateStatistics(); $user->save(); } });
// With conditions User::where('active', true) ->chunk(200, function ($users) { foreach ($users as $user) { ProcessUserJob::dispatch($user); } });
- Chunk By ID for Safer Updates
// Prevents issues when modifying records during iteration User::where('newsletter_sent', false) ->chunkById(100, function ($users) { foreach ($users as $user) { $user->update(['newsletter_sent' => true]); Mail::to($user)->send(new Newsletter()); } });
// With custom column Payment::where('processed', false) ->chunkById(100, function ($payments) { foreach ($payments as $payment) { $payment->process(); } }, 'payment_id'); // Custom ID column
- Lazy Collections for Memory Efficiency
// Uses PHP generators, minimal memory footprint User::where('created_at', '>=', now()->subDays(30)) ->lazy() ->each(function ($user) { $user->recalculateScore(); });
// With chunking size control User::lazy(100)->each(function ($user) { ProcessRecentUser::dispatch($user); });
// Filter and map with lazy collections $results = User::lazy() ->filter(fn($user) => $user->hasActiveSubscription()) ->map(fn($user) => [ 'id' => $user->id, 'revenue' => $user->calculateRevenue(), ]) ->take(1000);
- Cursor for Forward-Only Iteration
// Most memory-efficient for simple forward iteration foreach (User::where('active', true)->cursor() as $user) { $user->updateLastSeen(); }
// With lazy() for additional collection methods User::where('verified', true) ->cursor() ->filter(fn($user) => $user->hasCompletedProfile()) ->each(fn($user) => SendWelcomeEmail::dispatch($user));
Real-World Examples
Export Large CSV
class ExportUsersCommand extends Command { public function handle() { $file = fopen('users.csv', 'w');
// Write headers
fputcsv($file, ['ID', 'Name', 'Email', 'Created At']);
// Process in chunks to avoid memory issues
User::select('id', 'name', 'email', 'created_at')
->chunkById(500, function ($users) use ($file) {
foreach ($users as $user) {
fputcsv($file, [
$user->id,
$user->name,
$user->email,
$user->created_at->toDateTimeString(),
]);
}
// Optional: Show progress
$this->info("Processed up to ID: {$users->last()->id}");
});
fclose($file);
$this->info('Export completed!');
}
}
Batch Email Campaign
class SendCampaignJob implements ShouldQueue { public function handle() { $campaign = Campaign::find($this->campaignId);
// Process subscribers in chunks
$campaign->subscribers()
->where('unsubscribed', false)
->chunkById(50, function ($subscribers) use ($campaign) {
foreach ($subscribers as $subscriber) {
SendCampaignEmail::dispatch($campaign, $subscriber)
->onQueue('emails')
->delay(now()->addSeconds(rand(1, 10)));
}
// Prevent rate limiting
sleep(2);
});
}
}
Data Migration/Transformation
class MigrateUserData extends Command { public function handle() { $bar = $this->output->createProgressBar(User::count());
User::with(['profile', 'settings'])
->chunkById(100, function ($users) use ($bar) {
DB::transaction(function () use ($users, $bar) {
foreach ($users as $user) {
// Complex transformation
$newData = $this->transformUserData($user);
NewUserModel::create($newData);
$bar->advance();
}
});
});
$bar->finish();
$this->newLine();
$this->info('Migration completed!');
}
}
Cleanup Old Records
class CleanupOldLogs extends Command { public function handle() { $deletedCount = 0;
ActivityLog::where('created_at', '<', now()->subMonths(6))
->chunkById(1000, function ($logs) use (&$deletedCount) {
$ids = $logs->pluck('id')->toArray();
// Batch delete for efficiency
ActivityLog::whereIn('id', $ids)->delete();
$deletedCount += count($ids);
$this->info("Deleted {$deletedCount} records so far...");
// Give database a breather
usleep(100000); // 100ms
});
$this->info("Total deleted: {$deletedCount}");
}
}
Choosing the Right Method
Method Use Case Memory Usage Notes
chunk()
General processing Moderate May skip/duplicate if modifying filter columns
chunkById()
Updates during iteration Moderate Safer for modifications
lazy()
Large result processing Low Returns LazyCollection
cursor()
Simple forward iteration Lowest Returns Generator
each()
Simple operations High (loads all) Avoid for large datasets
Performance Optimization Tips
- Select Only Needed Columns
User::select('id', 'email', 'name') ->chunkById(100, function ($users) { // Process with minimal data });
- Use Indexes
// Ensure indexed columns in where clauses User::where('status', 'active') // status should be indexed ->where('created_at', '>', $date) // created_at should be indexed ->chunkById(200, function ($users) { // Process efficiently });
- Disable Eloquent Events When Appropriate
User::withoutEvents(function () { User::chunkById(500, function ($users) { foreach ($users as $user) { $user->update(['processed' => true]); } }); });
- Use Raw Queries for Bulk Updates
// Instead of updating each record User::chunkById(100, function ($users) { $ids = $users->pluck('id')->toArray();
// Bulk update with raw query
DB::table('users')
->whereIn('id', $ids)
->update([
'last_processed_at' => now(),
'processing_count' => DB::raw('processing_count + 1'),
]);
});
- Queue Large Operations
class ProcessLargeDataset extends Command { public function handle() { User::chunkById(100, function ($users) { ProcessUserBatch::dispatch($users->pluck('id')) ->onQueue('heavy-processing'); }); } }
class ProcessUserBatch implements ShouldQueue { public function __construct( public Collection $userIds ) {}
public function handle()
{
User::whereIn('id', $this->userIds)
->get()
->each(fn($user) => $user->process());
}
}
Testing Chunked Operations
test('processes all active users in chunks', function () { // Create test data User::factory()->count(150)->create(['active' => true]); User::factory()->count(50)->create(['active' => false]);
$processed = [];
User::where('active', true)
->chunkById(50, function ($users) use (&$processed) {
foreach ($users as $user) {
$processed[] = $user->id;
}
});
expect($processed)->toHaveCount(150);
expect(count(array_unique($processed)))->toBe(150);
});
test('handles empty datasets gracefully', function () { $callCount = 0;
User::where('id', '<', 0) // No results
->chunk(100, function ($users) use (&$callCount) {
$callCount++;
});
expect($callCount)->toBe(0);
});
Common Pitfalls
Modifying filter columns during chunk()
// WRONG: May skip records User::where('processed', false) ->chunk(100, function ($users) { foreach ($users as $user) { $user->update(['processed' => true]); // Changes the WHERE condition! } });
// CORRECT: Use chunkById() User::where('processed', false) ->chunkById(100, function ($users) { foreach ($users as $user) { $user->update(['processed' => true]); } });
Not handling chunk callback returns
// Return false to stop chunking User::chunk(100, function ($users) { foreach ($users as $user) { if ($user->hasIssue()) { return false; // Stop processing } $user->process(); } });
Ignoring database connection limits
// Consider connection timeouts for long operations DB::connection()->getPdo()->setAttribute(PDO::ATTR_TIMEOUT, 3600);
User::chunkById(100, function ($users) { // Long running process });
Remember: When dealing with large datasets, always think about memory usage, query efficiency, and processing time. Chunk your data appropriately!