How to compare two folders and find missing files

    $ tree
    .
    |-- dir1
    |   |-- file1
    |   |-- file2
    |   |-- file3
    |   |-- file4
    |   `-- file5
    `-- dir2
        |-- file2
        |-- file4
        `-- file5

    2 directories, 8 files

    $ for f1 in dir1/*; do f2="dir2/${f1#dir1/}"; [ ! -e "$f2" ] && printf '%s\n' "$f2"; done
    dir2/file1
    dir2/file3

This loops through all the names in the first directory, and for each creates the corresponding name of a file expected to exist in the second directory. If that file does not exist, its name is printed.

The loop, written out more verbosely (and using basename rather than a parameter substitution to delete the directory name from the pathname of the files in the first directory):

    for f1 in dir1/*; do
        f2="dir2/$( basename "$f1" )"
        if [ ! -e "$f2" ]; then
            printf '%s\n' "$f2"
        fi
    done
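The loop can also be wrapped in a small function so that the directory names are not hard-coded. This is only a sketch: the name `list_missing` is made up for illustration, and it assumes a POSIX shell. It also skips the literal `dir1/*` pattern that the shell leaves behind when the first directory is empty.

```shell
#!/bin/sh
# list_missing SRC DST (hypothetical helper): print the expected DST path for
# every entry in SRC that has no counterpart in DST. Top level only; names
# starting with a dot are not matched by the glob.
list_missing() {
    src=$1
    dst=$2
    for f1 in "$src"/*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$f1" ] || [ -L "$f1" ] || continue
        f2="$dst/${f1##*/}"
        if [ ! -e "$f2" ]; then
            printf '%s\n' "$f2"
        fi
    done
}
```

Calling `list_missing dir1 dir2` on the example tree above should list the same two paths as the original loop.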

If the files in the two directories not only have the same names, but also the same contents, you may use diff (note: BSD diff used here, GNU diff may possibly say something else):

    $ diff dir1 dir2
    Only in dir1: file1
    Only in dir1: file3

If the file contents of files with identical names differ, then this would obviously output quite a lot of additional data that may not be of interest. diff -q may quiet it down a bit in that case.
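If only the missing names are of interest, the `Only in` lines can be filtered out of the `diff -q` output. The following is a rough sketch, not a definitive recipe: the function name `only_in` is made up, it assumes your diff prints the `Only in DIR: NAME` format shown above, and the directory arguments should be given without a trailing slash.

```shell
#!/bin/sh
# only_in DIR OTHER (hypothetical helper): print the names that diff reports
# as existing in DIR but not in OTHER.
only_in() {
    prefix="Only in $1: "
    # LC_ALL=C keeps diff's messages untranslated so the prefix can match.
    LC_ALL=C diff -q "$1" "$2" | while IFS= read -r line; do
        case $line in
            # Keep only the lines for DIR and strip the prefix from them.
            "$prefix"*) printf '%s\n' "${line#"$prefix"}" ;;
        esac
    done
}
```

Lines about files that differ, or about entries that exist only in the other directory, are dropped by the `case` filter.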

See also the diff manual on your system.

For comparing deeper hierarchies, you may want to use rsync:

    $ rsync -r --ignore-existing -i -n dir1/ dir2
    >f+++++++++ file1
    >f+++++++++ file3

The above will output a line for each file anywhere under dir1 that does not have a corresponding file under dir2. The -n option (--dry-run) makes sure that no file is actually transferred to dir2.

The -r option (--recursive) makes the operation recursive and -i (--itemize-changes) selects the particular output format (the >f and the pluses indicate that the file is a new file on the receiving end).

See also the rsync manual.
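Where rsync is not installed, a similar recursive listing can be approximated with find. This is a different technique from the rsync dry run above, shown only as a sketch: the function name `list_missing_recursive` is made up, it only checks whether a path exists on the other side, and it assumes file names contain no newline characters.

```shell
#!/bin/sh
# list_missing_recursive SRC DST (hypothetical helper): print the relative path
# of every regular file under SRC that has nothing at the same path under DST.
list_missing_recursive() {
    src=$1
    dst=$2
    # find runs from inside SRC so that the paths it prints are relative.
    ( cd "$src" && find . -type f ) | LC_ALL=C sort | while IFS= read -r rel; do
        rel=${rel#./}
        [ -e "$dst/$rel" ] || printf '%s\n' "$rel"
    done
}
```

Like the rsync command, this does not modify either directory.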

Detecting the missing files when comparing two directories can be a tricky job. Here is my scenario:

I am trying to pass some images through a small piece of software for batch processing, and I always get a timeout because of the large number of images. There are around 80,000 images to process, and the software gets stuck after (let's say) 10,000 images, so I have to start all over again with no idea which files are missing.

So my solution is to take the images that have been processed out of the folder and feed the program only the images that have not been processed.

To do this, I have to compare the two directories for duplicated files. In other words, “detect the missing files”.


Comparing two directories for missing files is really an easy task. You just iterate through the first directory and check whether the same file exists in the second. If the file exists, it has been processed, so I move it to a third folder (I prefer this, just in case) or simply delete it. This way I can pick out the files that have been processed and leave untouched the files that have not yet been processed.
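For comparison with the shell loop at the top of the page, the same move-what-is-already-processed idea can be sketched in shell. The function name `move_processed` and its three arguments are made up for illustration. Unlike a plain move, this sketch creates the third folder if it does not exist yet.

```shell
#!/bin/sh
# move_processed SRC DONE ARCHIVE (hypothetical helper): every regular file in
# SRC that also exists in DONE is moved to ARCHIVE. Prints each moved name.
move_processed() {
    src=$1
    done_dir=$2
    archive=$3
    mkdir -p "$archive"
    for f in "$src"/*; do
        # Skip subdirectories and the literal pattern left by an empty SRC.
        [ -f "$f" ] || continue
        name=${f##*/}
        if [ -e "$done_dir/$name" ]; then
            mv -- "$f" "$archive/$name" && printf '%s\n' "$name"
        fi
    done
}
```

After a call such as `move_processed dir1 dir2 dir1_processed`, whatever is left in `dir1` has not been processed yet.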

Detect missing files while comparing 2 folders – the PHP function

So to get the job done, I came up with this function:

    /**
     * @param array   $dir:    this is an array with your custom paths
     * @param string  $ext:    File extension
     * @param boolean $move:   If false, it will delete the file instead of moving it.
     * @param boolean $output: If false, no message will be output to screen.
     * @return string
     */
    function compare_two_directories($dir, $ext=".jpg", $move=true, $output=true){
        $files = glob( $dir[1]."/*".$ext );
        $count = 0;
        if($output) echo "<pre>I found these duplicate files:<br />";
        foreach ($files as $file) {
            $file_name = basename($file);
            // check if file exists in the second directory
            if(file_exists($dir[2]."/".$file_name)){
                if($output) echo "$file_name";
                if($move) {
                    rename($file, $dir[3]."/".$file_name); // move the image to folder 3.
                    if($output) echo " <span style='color:green'>moved</span> to ".basename($dir[3])."<br />";
                } else {
                    unlink($file); // just delete the image
                    if($output) echo " <span style='color:red'>deleted</span><br />";
                }
                $count++;
            }
        }
        if($output) echo "</pre>";
        return "Done processing and found <span style='color:green'>$count</span> duplicated <span style='color:red; font-weight:bold;'>$ext</span> files ";
    }

Use function to compare files inside directories like this

You can call it like this. I also like to include the time taken, just for statistics, but you can omit it.

    // this is the path to your script file and your directories are relative to it.
    define("MY_PATH", dirname(__FILE__));
    // define("MY_PATH", "/var/www/my/custom/path/to/my/directories");

    // set your custom paths
    $dir[1] = MY_PATH."/dir1";
    $dir[2] = MY_PATH."/dir2";
    $dir[3] = $dir[1]."_processed"; // please note the folder "dir1_processed" must exist if you want to move files to it

    // call the function and get the job done
    echo compare_two_directories($dir);

    // this next line is an example for .png images: files are DELETED and output messages are sent to screen
    // echo compare_two_directories($dir, ".png", false, true);

But if you only need to list the files, just comment out the rename call like this, and the files will only be displayed on screen.

// rename($file, $dir[3]."/".$file_name);

Get more stats when comparing directories for duplicated files

I also like to know the processing time, just for statistics. To get it, wrap the code above like this:

    $time = microtime(true); // current Unix timestamp in seconds, as a float

    // the code here

    echo "<br />Processing took <span style='color:blue'>".round( (microtime(true) - $time), 2).'</span> seconds';

Note that all directories must be on the same level as the script file if you want your script to work out of the box. But if this is not your case, feel free to edit it so it adapts to your specific file structure or your server configuration.

See memory usage

Speaking of server configuration, you will normally need a lot of memory if you have to compare lots of files. I normally do this kind of job on a local machine using XAMPP, but you can also do it on your normal server. You can take a peek at your memory usage with this little function:

    function get_memory() {
        $size = memory_get_peak_usage(true);
        $unit = array('b','kb','mb','gb','tb','pb');
        // pick the largest unit that fits, then scale the size to it
        $i = (int) floor(log($size, 1024));
        return @round($size/pow(1024, $i), 2).' '.$unit[$i];
    }

    echo "<br />".get_memory()." of memory was used while processing";

Just place this at the end of your file to see your memory usage when comparing the two folders.

Get “compare directories for missing files” script

You can download a full working copy of this script from the GitHub repository and compare your directories for missing files. Here is a screenshot of it working. I agree with you that it needs some more style 😉

There is also a second option: store the contents of both directories in two distinct arrays and then just compare the two arrays. If there is a match, move the file to a third folder or delete it. I have not really tested the two alternatives against each other, but I think the first one is faster than the second because it doesn’t need to iterate over the second directory. But I could be wrong, since it has to do lots of single-file checks.
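The list-comparison idea can be tried quickly from the shell with two sorted listings and `comm`. This is only a sketch of that approach, not a benchmark of it: the function names `list_names` and `compare_listings` are made up, `mktemp` is assumed to be available, and file names are assumed to contain no newline characters.

```shell
#!/bin/sh
# list_names DIR (hypothetical helper): sorted list of the entries in DIR.
list_names() {
    ( cd "$1" && ls -A ) | LC_ALL=C sort
}

# compare_listings MODE A B (hypothetical helper): MODE is passed to comm.
#   -12  prints the names present in both A and B
#   -23  prints the names present only in A
compare_listings() {
    mode=$1
    list_a=$(mktemp)
    list_b=$(mktemp)
    list_names "$2" > "$list_a"
    list_names "$3" > "$list_b"
    # comm needs both inputs sorted the same way, hence LC_ALL=C on both sides.
    LC_ALL=C comm "$mode" "$list_a" "$list_b"
    rm -f "$list_a" "$list_b"
}
```

Here `compare_listings -12 dir1 dir2` gives the already processed names, and `compare_listings -23 dir1 dir2` gives the ones still missing from `dir2`. Each directory is read once, instead of doing one existence check per file.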

Please let me know if you try this second approach and which one worked best for you.