PHP: Randomizing a text file using PHP + MySQL, quickly & efficiently

I had one person approach me recently about my previous blog entry, stating that you cannot process randomization quickly in MySQL because SELECT * FROM tablename ORDER BY RAND(); takes horribly long to process when you have a large number of records stored in a database. This is very true! RAND() is very slow in this position, and calling it in your SELECT statement will run for minutes if you have a large database and want multiple records returned. So with that method, I agree; it is definitely not a good idea. However, there is always more than one way to solve a problem.

There is a solution to the slow speeds. First of all, calling SELECT with ORDER BY RAND() is a very bad idea on its own, because even returning a single record this way takes a while on a large table. (Note: There are ways to grab just one random record very quickly, but I will not bother explaining that here, as we are talking about multiple records!) So if you remove RAND() from the SELECT statement, what is the alternative? Adding a random ID in the INSERT statement! Using RAND() with INSERT is an entirely different story: it is not trying to sort every row before the SELECT returns, it only generates a random number for each row as it is inserted.

Take this for example: $query = "INSERT INTO `" . $db_table . "` (id, data) VALUES (RAND()*1000000, '" . $datahere . "')"; Using this works a lot quicker, however it still is not fast enough for my tastes. Randomizing 50k records took about 79 seconds to execute on my laptop with default MySQL settings. This is mainly because running 50k individual INSERT statements is slow. So what else is there?
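As a side note, one common way to speed up the INSERT route is to batch many rows into a single multi-row INSERT. The helper below is only a sketch of that idea (the function name, $db_table and $lines parameters are my own, not from the script in this post), and it only builds the query string:

```php
<?php
// Sketch: build ONE multi-row INSERT with a random id per row.
// build_batch_insert, $db_table and $lines are assumed names for
// illustration, not part of the original script.
function build_batch_insert($db_table, array $lines, $randlines = 1000000)
{
    $values = array();
    foreach ($lines as $line) {
        // mt_rand() stands in for MySQL's RAND()*N; escape the data first.
        $id = mt_rand(1, $randlines);
        $values[] = "(" . $id . ", '" . addslashes($line) . "')";
    }
    return "INSERT INTO `" . $db_table . "` (id, data) VALUES " . implode(", ", $values);
}
```

Sending one statement with thousands of VALUES groups avoids most of the per-statement overhead, though as the rest of this post shows, LOAD DATA INFILE is quicker still.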

LOAD DATA INFILE is our answer. LOAD DATA can process a text file or CSV file up to 1000x quicker than using INSERT statements. What this does is load a file directly into the database, and it is very clean coding-wise because you do not need to open the file in PHP and read every line, which is also a plus for speed. The only downfalls are that you NEED to supply a full path when using this statement, the MySQL user needs the FILE privilege (and on some servers the secure_file_priv setting restricts which directories LOAD DATA may read from), and understanding how LOAD DATA works can be rather strange if you are new to MySQL. But that is a small price to pay for the speed bonus you gain. Let us take a look at what is required first, and then the code!
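Since the LOAD DATA syntax is the strange part, here is the general shape of the statement the script below builds, as a sketch (the path, table name and line terminator here are assumptions for illustration; adjust them to your file):

```php
<?php
// General shape of a LOAD DATA INFILE statement (a sketch; the path,
// table name and line terminator are assumptions for illustration).
$db_table = 'temp_randomdb';
$path     = '/longpath/tofile/myfile.txt';  // MUST be a full path

$query = "LOAD DATA INFILE '" . $path . "' " .
         "INTO TABLE `" . $db_table . "` " .
         "LINES TERMINATED BY '\\n' " .              // one record per line
         "(data) " .                                 // each line fills the data column
         "SET id = FLOOR(1000 + RAND()*100000000)";  // random id per row
```

The column list in parentheses maps the fields of each line onto table columns, and the SET clause fills in any column that does not come from the file, which is where the random ID comes in.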

First of all, you need a MySQL database and a table. After you create a database, use this query to create the table required for this example: CREATE TABLE `temp_randomdb` (id INT NOT NULL, data VARCHAR(255)); You can also do it manually; simply create a table with an INT field named id and a VARCHAR field named data. (Adding an index on id will also speed up the ORDER BY at the end.) Note: You may want to increase the number of characters the data field can hold if you require more than 255 characters per record! If you still need help, please look in the comments for a PHP solution after you have created the database required. After you have the table set up, proceed to the code:

//MYSQL information -- Change these!
$db_name = 'yourdb'; //Database Name
$db_user = 'root'; //Database User
$db_password = 'yourpass'; //Database Password
$db_table = 'temp_randomdb'; //Database Table
$randlines = 100000000; //100 mil ID numbers to pick from randomly. Go higher if needed.
 
//Connect to MYSQL and select DB, output error if invalid.
$con = mysql_connect("localhost",$db_user,$db_password);
if (!$con) die('Could not connect: ' . mysql_error());
$result = mysql_select_db($db_name, $con);
if (!$result) die('Error selecting database: ' . mysql_error());
 
//Truncate table to empty it.
$query = "TRUNCATE TABLE `" . $db_table . "`";
$result = mysql_query($query);
if (!$result) die('Error truncating: ' . mysql_error());
 
//Load the text file into MYSQL -- CHANGE PATH HERE! --- MUST BE THE FULL PATH TO THE FILE. You can use echo getcwd(); to get the full path!
$query = "LOAD DATA INFILE '/longpath/tofile/myfile.txt' INTO TABLE `" . $db_table . "` (data) SET id = FLOOR(1000 + RAND()*" . $randlines . ")"; //Insert lines into MYSQL with a random ID
$result = mysql_query($query);
if (!$result) die('Error load data infile query: ' . mysql_error());
 
//Open text file for writing. Will create if does not exist.
$filewrite = fopen("myfile-random.txt", "w");
 
//Grab and output to file.
$query = "SELECT data FROM `" . $db_table . "` ORDER BY id";
$result = mysql_query($query);
if (!$result) die('Error select query: ' . mysql_error());
while($row = mysql_fetch_array($result, MYSQL_ASSOC))
{
	fwrite($filewrite, $row['data'] . "\r\n");
}
 
//Close text file
fclose($filewrite);
 
echo "Done";

Make sure to change the MySQL values for connecting to your database first. You will also need to change the PATH to the text file that you want randomized, which is on this line: $query = "LOAD DATA INFILE '/longpath/tofile/myfile.txt' INTO TABLE `" . $db_table . "` (data) SET id = FLOOR(1000 + RAND()*" . $randlines . ")"; Also note that the table is truncated each time, to prevent any data from the last run from being used. This does mean the script cannot safely be run more than once at the same time. There ARE ways around this, but it is a bit complex for an example, so we will leave it for another day.

To explain this: first you connect to the database, then select and truncate the table. You call LOAD DATA INFILE, which takes the data from the file and places it into the database with a randomized ID on every row. You then open the file you are writing to, run a simple SELECT, and write each row out to the text file. Because every row received a random ID, selecting with ORDER BY id returns the rows in a random order, from lowest to highest number, which works perfectly and quickly!

Using this method, I managed to reduce the execution time to 1.3 seconds for 50k records! That is not as fast as the previous example I provided using just PHP (about 0.35 seconds per 50k records), but there should be NO memory issues or bogging down after a certain number of lines when processing it this way. This means you should be able to process an effectively unlimited number of records with minimal memory, while still getting great execution speeds.

And there you have it: processing large files quickly and efficiently! If you run into any issues, I have added MySQL error handling just in case, so you should be able to solve any problem you come across!

P.S. In your face.


Related posts:

  1. Randomizing Large Text Files in PHP with low RAM usage and no database
  2. VB.NET/CSV File – How to add ‘Double quote text qualifiers’ quickly and easily!

About the Author

I mainly focus on Javascript/PHP/C++/.NET applications for everyday and work. I also am working on a remake of Stellar Frontier, an old 2D top down space battle game with a few fellow programmers.