Randomizing Large Text Files in PHP with Low RAM Usage and No Database

Randomizing a text file is normally an easy feat in PHP. The first example below is the standard way to load and randomize a text file in PHP with no database, and it is rather simple to grasp:

$lines = array();
 
//Open text file for reading
$fileopen = fopen("myfile.txt", "r");
 
//Load lines into an array
//Note: rtrim() strips the trailing carriage return (chr(13)) that fgets() leaves on each line when reading Windows-based text files.
while (!feof ($fileopen))
{
	$current = rtrim(fgets($fileopen, 4096));
	if ($current != "") { //Do not load any blank lines.
		$lines[] = $current;
	}
}
 
//Close text file
fclose($fileopen);
 
//Shuffle/Randomize lines (seeding is optional on PHP 4.2.0+, where it happens automatically).
srand((float)microtime() * 10000);
shuffle($lines);
 
//Open text file for writing. Will create if does not exist.
$filewrite = fopen("myfile-random.txt", "w");
 
//Write to file
foreach ($lines as $line) {
	fwrite($filewrite, $line . "\r\n");
}
 
//Close text file
fclose($filewrite);

The method above works great for small to medium text files. But what if you have a large file with 500k lines that is 250-500 MB in size? You simply cannot load and randomize a large text file in a PHP array like this; it eats up your RAM and you will get a Fatal error: Allowed memory size of xxxxxxxxx bytes exhausted. However, I did find a pretty good method that can work through a large amount of data with a minimal amount of RAM.

$pos_lines = array();
 
//Open text file for processing
$fileopen = fopen("myfile.txt", "r");
 
//Gather the byte offset (via ftell) of the start of each line.
//The final entry recorded is the end-of-file position; the blank-line check below skips it.
//Note: fgets() is called with a length of 1024, so lines longer than about 1 KB would need a larger length argument.
do {
	$pos_lines[] = ftell($fileopen);
} while (false !== fgets($fileopen, 1024));
 
//Shuffle/Randomize the array of line positions (again, seeding is optional on PHP 4.2.0+).
srand((float)microtime() * 10000);
shuffle($pos_lines);
 
//Open text file for writing
$filewrite = fopen("myfile-random.txt", "w");
 
//Seek to each randomized line position with fseek(), read that line with fgets(), and write it to the new file until none are left.
foreach ($pos_lines as $pos) {
	fseek($fileopen, $pos, SEEK_SET); // Seek to a line
	$line = rtrim(fgets($fileopen, 1024)); // Read line
	if ($line != "") {
		fwrite($filewrite, $line . "\r\n"); //Write line to new file
	}
}
 
//Close read/write text files
fclose($fileopen);
fclose($filewrite);

Let me briefly explain what is going on here. You find the byte position at which each line starts and store those position numbers in an array. This means that instead of storing long lines of text, you are storing small integers. The array of stored ‘starting line positions’ is then shuffled. You can then use fseek() and fgets() to jump to each randomized starting position and read out one line at a time, in random order, until there are no lines left to grab.

If you use the first method, you can load about 50-60k lines in around 25 MB of RAM if the lines are about 400-500 characters each. Using the second method (storing line positions), I was able to handle 500-550k lines in around 25 MB of RAM, which is roughly 10x better on RAM usage. A lot better!
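If you want to verify numbers like these on your own files, PHP can report its own memory high-water mark. The line below is a small sketch (not part of the original scripts) that you could drop in at the end of either method; memory_get_peak_usage() returns the peak number of bytes PHP has allocated so far.

//Report peak memory usage in megabytes after the randomization has finished.
echo "Peak memory used: " . round(memory_get_peak_usage() / 1048576, 2) . " MB\n";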

I do not know of any other methods, and do not think there are other ways of randomizing in PHP that avoid using a large amount of RAM. There is, however, a way to do it with MySQL/PHP by storing the data in a table first, randomizing the output with RAND(), writing out the results, and then truncating the table, which is a simple method as well. I would expect a MySQL/PHP approach would probably be the best solution for the lowest RAM usage… The file-position method above, however, is cleaner and involves no databases!
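For reference, below is a minimal sketch of that MySQL/PHP approach, assuming a PDO connection to MySQL is available; the table name tmp_lines, the column name line, and the connection details are placeholders for illustration rather than anything from this post.

//Connect with PDO (connection details are placeholders)
$pdo = new PDO("mysql:host=localhost;dbname=test", "user", "password");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
 
//Create a throwaway table to hold the lines
$pdo->exec("CREATE TABLE IF NOT EXISTS tmp_lines (line TEXT)");
 
//Insert each non-blank line of the source file
$insert = $pdo->prepare("INSERT INTO tmp_lines (line) VALUES (?)");
$fileopen = fopen("myfile.txt", "r");
while (false !== ($current = fgets($fileopen, 4096))) {
	$current = rtrim($current);
	if ($current != "") {
		$insert->execute(array($current));
	}
}
fclose($fileopen);
 
//Read the rows back in random order and write them to the new file
$filewrite = fopen("myfile-random.txt", "w");
foreach ($pdo->query("SELECT line FROM tmp_lines ORDER BY RAND()") as $row) {
	fwrite($filewrite, $row['line'] . "\r\n");
}
fclose($filewrite);
 
//Clean up the temporary table
$pdo->exec("TRUNCATE TABLE tmp_lines");

Because each row is inserted one at a time and MySQL does the shuffling with ORDER BY RAND(), the PHP side never has to hold more than one line in memory at once.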

Hopefully this helps you out a great deal!


Related posts:

  1. PHP: Randomizing a text file using PHP + MYSQL, quickly & efficient
  2. VB.NET/CSV File – How to add ‘Double quote text qualifiers’ quickly and easily!

About the Author

I mainly focus on Javascript/PHP/C++/.NET applications for everyday use and work. I am also working with a few fellow programmers on a remake of Stellar Frontier, an old 2D top-down space battle game.