Randomizing a text file is normally an easy feat in PHP. The first example is a standard way to load and randomize text files with PHP, with no database, and is rather simple to grasp:
$lines = array();

//Open text file for reading
$fileopen = fopen("myfile.txt", "r");

//Load lines into an array.
//Note: rtrim is used because chr(13) is ignored and needs to be cleaned
//per line if using Windows-based text files.
while (!feof($fileopen)) {
    $current = rtrim(fgets($fileopen, 4096));
    if ($current != "") { //Do not load any blank lines.
        $lines[] = $current;
    }
}

//Close text file
fclose($fileopen);

//Shuffle/Randomize lines.
srand((float)microtime() * 10000);
shuffle($lines);

//Open text file for writing. Will create it if it does not exist.
$filewrite = fopen("myfile-random.txt", "w");

//Write to file
foreach ($lines as $line) {
    fwrite($filewrite, $line . "\r\n");
}

//Close text file
fclose($filewrite);
The method above works great for small to medium text files. But what if you have a large file with 500k lines that takes up 250-500 MB on disk? You simply cannot load and randomize large text files in a PHP array like this; it eats up your RAM and you will get a Fatal error: Allowed memory size of xxxxxxxxx bytes exhausted.
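Before choosing an approach, it can help to check how much headroom you actually have. A minimal sketch using PHP's built-in `ini_get` and `memory_get_peak_usage` (the file name is illustrative):

```php
<?php
// Report PHP's configured memory ceiling and current peak usage,
// so you can estimate whether an in-memory shuffle will fit.
$limit = ini_get('memory_limit');     // e.g. "128M", or "-1" for unlimited
$peak  = memory_get_peak_usage(true); // peak bytes actually allocated

printf("memory_limit: %s, peak so far: %.1f MB\n", $limit, $peak / 1048576);
```

If the file size is anywhere near the configured limit, the array-based approach is likely to hit the fatal error above.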
However, I did find a pretty good method that can load a large amount of data with a minimal amount of RAM.
$pos_lines = array();

//Open text file for processing
$fileopen = fopen("myfile.txt", "r");

//Gather the position of the start of each line.
do {
    $pos_lines[] = ftell($fileopen);
} while (false !== fgets($fileopen, 1024));

//Drop the trailing EOF offset recorded on the last pass.
array_pop($pos_lines);

//Shuffle the array of line positions.
srand((float)microtime() * 10000);
shuffle($pos_lines);

//Open text file for writing
$filewrite = fopen("myfile-random.txt", "w");

//Seek to each line with fseek using the randomized line-position array
//and grab each line until there are no lines left.
foreach ($pos_lines as $pos) {
    fseek($fileopen, $pos, SEEK_SET); //Seek to a line
    $line = rtrim(fgets($fileopen, 1024)); //Read line
    if ($line != "") {
        fwrite($filewrite, $line . "\r\n"); //Write line to new file
    }
}

//Close read/write text files
fclose($fileopen);
fclose($filewrite);
Let me briefly explain what is going on here. You find the byte position of the start of each line and store those positions in an array, which means that instead of storing long lines of text, you are storing small numbers. The array of starting line positions is then shuffled. Finally, fseek and fgets jump to each randomized position and read the line there, picking out one line at a time in random order until there are no lines left to grab.
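The same technique can be demonstrated end to end on a small in-memory stream (php://temp), so it runs without touching the filesystem; the four words are just sample data:

```php
<?php
// Demonstration of the line-position technique on an in-memory stream.
$fh = fopen('php://temp', 'r+');
fwrite($fh, "alpha\nbravo\ncharlie\ndelta\n");
rewind($fh);

// Record the byte offset of the start of each line.
$positions = array();
do {
    $positions[] = ftell($fh);
} while (fgets($fh) !== false);
array_pop($positions); // drop the trailing EOF offset

// Randomize the order of the offsets, not the lines themselves.
shuffle($positions);

// Read the lines back in shuffled order by seeking to each offset.
$shuffled = array();
foreach ($positions as $pos) {
    fseek($fh, $pos, SEEK_SET);
    $shuffled[] = rtrim(fgets($fh));
}
fclose($fh);

print_r($shuffled); // the same four words, in random order
```

The memory cost is one integer per line, regardless of how long the lines are.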
If you use the first method, you can load about 50-60k lines in around 25 MB of RAM, if the lines are about 400-500 characters each. But using the second method (storing line positions), I was able to store 500-550k lines in around 25 MB of RAM, which is about 10x better on RAM usage. A lot better!
I do not know of any other methods, and do not think there are other methods of randomization in PHP that do not use a large amount of RAM. There are, however, ways to do it with MySQL/PHP by storing the data in a table first, randomizing it with RAND(), outputting the results, and then truncating the table, which is a simple method as well. I would expect MySQL/PHP would probably be the best solution for the lowest RAM usage. This way, however, is cleaner and involves no databases!
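The database approach could be sketched with PDO along these lines. An in-memory SQLite database stands in here so the example is self-contained; on MySQL you would connect with a mysql: DSN, order with RAND() instead of SQLite's RANDOM(), and use TRUNCATE rather than DELETE. The table and column names are made up for illustration:

```php
<?php
// Sketch of the "load into a table, randomize, truncate" approach.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE lines (txt TEXT)');

// Load the lines into the table first...
$insert = $db->prepare('INSERT INTO lines (txt) VALUES (?)');
foreach (array('one', 'two', 'three', 'four') as $line) {
    $insert->execute(array($line));
}

// ...then let the database hand them back in random order.
// (On MySQL this would be ORDER BY RAND().)
$rows = $db->query('SELECT txt FROM lines ORDER BY RANDOM()')
           ->fetchAll(PDO::FETCH_COLUMN);

// Clean up the scratch table afterwards.
$db->exec('DELETE FROM lines');

print_r($rows);
```

The sorting work then happens in the database rather than in PHP's memory, at the cost of a round trip through a table.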
Hopefully this helps you out a great deal!
It’s too bad PHP has such low memory limits. I like your first solution better.
Yeah, they do have issues with such low memory limits. Normally I’ll use the first solution, but it won’t be able to handle the stuff I do at work, since I’m using 250 MB files. The second solution works well for large text files, and is slightly slower to process.
The first solution at 50k lines executed in 0.1511 seconds.
The second solution at 50k lines executed in 0.3721 seconds.
So you are mainly sacrificing time for quantity: at around 2.5x slower, you gain about 10x the processing quantity. It would also be better to use the first solution if each line is only around 1-5 characters, because the stored line positions will reach 4-9 digits, which would take more space than the lines themselves. Though that case is pretty rare, I would think.
P.S. Also fixed the coding issues on this page, and fixed comments so they don’t need to be approved for now, unless spam gets bad. I’ll probably write up the MySQL idea I’ve been thinking over. Mainly just posting these as I’ve been working on small things here and there for work; might as well share since I have the code.