To be continued

The project is closed! You can look at a new scripting language. It is available on GitHub.
Also, try our open source cross-platform automation software.


Installer and installation software
Commercial and Freeware installers.


This lesson focuses on file processing. Let's find the duplicate files on your computer. Moreover, we try to make this task enjoyable and useful, i.e. we will find the duplicate files using their contents, but their names are not required. Well, I suppose you will be astonished by the results, when these operations have been completed.

Example 1

Find all files which have the same contents either in the required folder or in a drive.

Actually, task performance takes much time. Let's think of the algorithm that will make this task easy to perform. If the files differ in size, they are not duplicate. So, from this assumption, first we can get names and sizes of all compared files, which will be sorted by size, after that it will be possible to compare them by size.

Let's declare the structure for data storage with help of the type command. We will store only a file name instead of its whole path in order to save memory. The index of the parent directory in the directory array will be stored in the field owner instead of full name.

type  finf
   str   name
   uint  size
   uint  owner

We need the following global variables:

   arr dirs  of finf
   arr files of finf
   arr sizes of uint
   str output

dirs - processed directory array.
files - files array.
sizes - array that contains files indices will be sorted.
output - string for result output.

The functions written below are responsible for appending directories and files to the appropriate arrays.

func uint newdir( str name, uint owner )
func uint newfile( str name, uint size owner )

To find out more about these functions, read the source code. Functions append an element to the array and fill the element fields.

The scanfolder function is used to find all directories and files by the specified path. If the directory has been found, the element will be appended to the dirs array; then this element is considered to be parent and the scanfolder function calls itself. If the file has been found, the element will be appended to the files array. To make this task easier, we don't take files which have size more than 4GB, condition !cur.sizehiserves this purpose particularly.

func scanfolder( str wildcard, uint owner )
      if cur.attrib & $FILE_ATTRIBUTE_DIRECTORY
         scanfolder( cur.fullname + "\\*.*", newdir( cur.name, owner ))

The scaninit function prepares the first call scanfolder using the starting path. Modifying these functions, you can use various masks and specify size limits of the compared files for file searching.

func scaninit( str folder )
   str wildcard

   @"Scanning \( folder )\n"
   scanfolder( (wildcard = folder ).faddname( "*.*" ), newdir( folder, 0 ))

After file scanning we will sort obtained data. So, instead of sorting the files array, we offer you the better way for process speed-up: to create a new array that contains indices of the files and sort it. Actually, while sorting elements get moved faster, thus elements of small size will be a better choice.

func int sortsize( uint left right )
   return int( files[ left->uint ].size ) - int( files[ right->uint ].size )

func sortfiles
   uint i
   sizes.expand( *files )
   fornum i, *sizes : sizes[ i ] = i

   sizes.sort( &sortsize ) 

The sortfiles function fills the sizes array with indices of the files. First, an index equals to the sequence number. Then, the sizes array will be sorted with help of the sortsize function. Such parameters as left and right are pointers to data. If elements of the array were structures, they would be used as objects; however, the element of the sizes array is uint, so we use the ->uint operation. This expression: files[ index ].size returns the size of a specified file. The function returns a positive number if the size of the left file is greater than the size of the right one, or a negative number if the size of the left file is less than the size of the right one, or zero if the sizes of both files are equal.

The getdir and getfile functions retrieve the full path of a file using the value of the owner field. getdir passes recursively through the first parent directory and makes up the full path as it comes back.

func str getdir( uint id, str ret )
func str getfile( uint id, str ret )

Let's jump to discussion of the main comparison function. In the loop we look through all sorted files, where the file of the least size is the first one.

func compare
   fornum i, *sizes - 1
      id = sizes[ i ]

      if !*files[ id ].name : continue

The files that are considered to be duplicate are ignored in this example. Names of duplicate files will be nulled.

found = 0            
      next = sizes[ j = i + 1 ]
      while files[ id ].size == files[ next ].size
In the given loop the current file is compared with the files of the same size that come next. Furthermore, we miss the obtained duplicate files. Comparison is made using the isequalfiles function from the standard library. In case of duplicate files, we output a message to the string output.

if *files[ next ].name &&
             isequalfiles( getfile( id, idname ), getfile( next, nextname ))
            if !found
               output @ "\lSize: \(files[ id ].size) ========\l\( idname )\l" 
            ( output @ nextname ) @"\l"
            found = 1
            files[ next ].name.clear()
         if ++j == *sizes : break
         next = sizes[ j ]

This fragment outputs partial results. i & 0x3F defines the output of the result after each 64th file.

if i && !( i & 0x3F ) 
         @ "\rApproved files: \(i) Found the same files: \(count)"

Using these functions

func init
func search
func main<main

is not a difficult task. There can be a great number of files and directories, so we reserve a place for some elements in the init function in advance. Moreover, we append one empty parent element to the dirs array in order to start directory numbering at 1. We consider that the owner field equals to zero either if the directory is root. In other words, the directory has no zero index.

Exercise 2

Write a program for searching duplicate files on all local hard drives.