Ruby script: find duplicate files

A quick Google search for a script that finds duplicate files by name in a directory tree turned up two promising techniques: a Ruby script posted to OnJava by Bill Siggelkow, and a bash script using common Unix tools.

Here’s my attempt to reproduce the bash results in Ruby:

    #!/usr/bin/env ruby
    require 'find'

    files = {}
    found = {}

    # read root directories from the command line
    ARGV.each do |arg|
      Find.find(arg) do |f|
        if File.file?(f)
          # accumulate the file names, keyed by full path
          files[f] = File.basename(f)
        end
      end
    end

    # count up the number of each file name
    files.each_value do |base|
      # Ruby doesn't allow this Perl idiom: found[base]++
      found[base] = 0 if !found[base]
      found[base] += 1
    end

    # print the path of each file found more than once,
    # prepended with rm command commented out
    found.each do |name, count|
      if count > 1
        files.each do |path, filename|
          if name == filename
            puts "# rm #{path}"
          end
        end
      end
    end

Given a directory structure containing files with duplicate names in different directories, the output looks something like this:

# rm /market/fruits/tomato.txt
# rm /market/vegetables/tomato.txt
# rm /market/fruits/pea.txt
# rm /market/vegetables/pea.txt

The output could be redirected to a file and used as a shell script: uncomment the “rm” lines for the files that should be deleted (if that’s what you want), then run it.

This is all a bit clunky. If you’ve found a better or more Rubyesque way to do this, let me know!
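
For what it's worth, here is one possibly more compact take, offered as a sketch rather than a definitive answer. It assumes a Ruby recent enough (1.9 or later) that Find.find returns an enumerator when called without a block, and it uses group_by to collect paths by base name instead of counting in a second hash. The method name duplicate_names is made up for this example.

```ruby
#!/usr/bin/env ruby
require 'find'

# Hypothetical helper: returns a hash mapping each duplicated base name
# to the list of paths that share it.
def duplicate_names(roots)
  # Without a block, Find.find returns an enumerator over every entry.
  paths = roots.flat_map do |root|
    Find.find(root).select { |f| File.file?(f) }
  end

  # Group the paths by base name and keep only names seen more than once.
  paths.group_by { |f| File.basename(f) }
       .select { |_name, group| group.size > 1 }
end

if __FILE__ == $PROGRAM_NAME
  # Same output format as the script above: rm commands, commented out.
  duplicate_names(ARGV).each_value do |group|
    group.each { |path| puts "# rm #{path}" }
  end
end
```

The grouping replaces both the counting pass and the nested loop, so each path is visited once after the directory walk.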

2 Responses to “Ruby script: find duplicate files”

  1. Script to locate duplicate files : Ruby Brasil Says:

    [...] Philip Steiner wrote Ruby code to search for files whose name [...]

  2. George Says:

    I wonder how to add an ignore/exclude list with such things as: .svn, Makefile, build.xml, etc.

    By the way, it seems to run quite well.
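
One way the exclude list George asks about might be handled is with Find.prune, which skips the current entry (and, for a directory, everything beneath it). This is only a sketch: the EXCLUDED constant and the files_without_excluded method are hypothetical names, and the list simply reuses the examples from the comment.

```ruby
require 'find'

# Hypothetical exclude list, taken from the names mentioned in the comment.
EXCLUDED = %w[.svn Makefile build.xml].freeze

# Hypothetical helper: collects regular files under root, skipping any
# entry whose base name appears in the exclude list.
def files_without_excluded(root, excluded = EXCLUDED)
  paths = []
  Find.find(root) do |f|
    if excluded.include?(File.basename(f))
      # Skip this entry; for a directory, the whole subtree is skipped.
      Find.prune
    elsif File.file?(f)
      paths << f
    end
  end
  paths
end
```

The resulting list of paths could then feed the same duplicate check as before, in place of the unfiltered Find.find walk.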
