ftpsync synchronizes a local directory to a directory on an FTP server.
More precisely, ftpsync does bi-directional syncing between client and server with simple conflict resolution. Syncing is also not limited to one client and one server: ftpsync supports multiple nodes syncing to the same server and/or syncing to multiple servers.
This text explains some ideas behind the script and how it's configured and used.
Requirements
The following things are required for ftpsync:
The server side requirements are pretty common these days. You can expect your server to fulfill them.
On the client side you should first check with "gawk --version" whether gawk-3.1.x is already installed. If not, download it from www.gnu.org and install it.
The other two prerequisites can be found on www.awk-scripting.de. They are simple to install. After you have compiled awk.file.so, copy it to /usr/local/lib to install it. If you decide on another location you'll have to edit the ftpsync script. connect is compiled and installed under /usr/local/bin by "make install".
Let's look at the basic "theory of operation". There is a directory on the local machine and another on a remote server. In the first run all local files are copied to the remote directory. In each later run only the files that have been modified since the previous run should be copied to the remote system. That is, the syncer copies only those files that have to be copied to update the remote end.
Obviously our syncer has to keep track of each file's size and modification date after it has been copied to the remote server. If the syncer finds in a later run that a file's size or modification date has changed it was modified and has to be copied again.
If we store our current status information in a text file, with one line per file in the format "filename <tab> size <blank> mtime", the following function readsyncinfo reads the previous (!) file information into an associative array.
    function readsyncinfo(filename, list,    line, x) {
        # Parenthesize getline so that "> 0" is a comparison,
        # not an output redirection.
        while ((getline line < filename) > 0) {
            split(line, x, /\t+/);
            list[x[1]] = x[2];
        }

        close(filename);
        return (0);
    }
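For readers who do not speak awk, the same bookkeeping can be sketched in Python. This is only an illustration of the status file format described above, not part of ftpsync; the function names parse_sync_info and format_sync_info are made up for this example.

```python
def parse_sync_info(lines):
    """Build {filename: "size mtime"} from status lines of the form
    "filename<TAB>size mtime" (hypothetical helper, mirrors readsyncinfo)."""
    info = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        name, sep, data = line.partition("\t")
        if not sep:
            continue  # malformed line without a tab: skip it
        info[name] = data.lstrip("\t")
    return info


def format_sync_info(info):
    """Inverse of parse_sync_info: one status line per file, sorted by name."""
    return [f"{name}\t{info[name]}\n" for name in sorted(info)]
```

Because the tab separates the name from the rest, a blank inside the filename does not confuse the parser.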
Computing the current file information means getting the file's information from its directory entry. The following code reads this information for all files in the current directory into the currentnode array.
    dirlist[""] = sbuf[""] = "";
    n = scandir(".", dirlist);
    for (i=1; i<=n; i++) {
        if (stat(dirlist[i], sbuf) != 0)
            dirlist[i] = "";
        else if (sbuf["type"] != "file")
            continue;

        if (excludefile(dirlist[i]))
            continue;

        currentnode[dirlist[i]] = sbuf["size"] " " sbuf["mtime"];
    }
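The directory scan can be sketched the same way with Python's standard library. This is an assumption-laden illustration: scan_local is a made-up name, and the default exclude rule (skip dot files) only stands in for ftpsync's excludefile function, whose exact rules are not shown here.

```python
import os


def scan_local(directory, exclude=lambda name: name.startswith(".")):
    """Return {filename: "size mtime"} for the regular files in `directory`.

    `exclude` is a hypothetical stand-in for ftpsync's excludefile().
    """
    current = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # directories, symlinks, devices are not synced
            if exclude(entry.name):
                continue
            st = entry.stat(follow_symlinks=False)
            current[entry.name] = f"{st.st_size} {int(st.st_mtime)}"
    return current
```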
It's perhaps worth noting here that we do not have to compare the size and modification time values individually. We can directly compare a file's currentnode value against the previous file information we obtained with the readsyncinfo function.
Assume we used "readsyncinfo(..., prevnode)" (ignore the filename parameter for the moment) to read the stored file information. Then we can easily compute each file's sync status:
    for (file in prevnode) {
        if (! (file in currentnode))
            status[file] = "deleted";
        else if (prevnode[file] == currentnode[file])
            status[file] = "unchanged";
        else
            status[file] = "changed";
    }
The comparison "prevnode[file] == currentnode[file]" decides whether file needs to be updated or not. If file's size and/or modification time is different, file's values in the prevnode and currentnode arrays differ and therefore its status is set to changed.
Another thing the above code does is determine whether a file was deleted. In this case we have the file's entry in our prevnode array but not in currentnode. And, to be complete, we also have to do
    for (file in currentnode) {
        if (! (file in prevnode))
            status[file] = "changed";
    }
to assign the changed status to new files.
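Put together, the two loops amount to the following small function. It is a Python sketch of the logic just described; the name compute_status is invented for this example.

```python
def compute_status(prev, current):
    """Classify every known file as "deleted", "unchanged" or "changed".

    `prev` and `current` map filenames to "size mtime" strings.
    """
    status = {}
    for name in prev:
        if name not in current:
            status[name] = "deleted"
        elif prev[name] == current[name]:
            status[name] = "unchanged"
        else:
            status[name] = "changed"
    for name in current:
        if name not in prev:
            status[name] = "changed"  # new files count as changed
    return status
```

A file that appears in neither map gets no entry at all, which is the "doesn't exist" case of the action table.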
Ok, let's pause for a moment and think about it. What we now have is something for mirroring. We can compute what happened to our files and whether we have to update or delete them on our FTP server. This is interesting but it's not syncing.
For true synchronization we have to consider the other end as well. Files might also be modified or deleted there. In this case we have to update our local files by either getting the updated file from the server or deleting our local copy.
Basically we do the same thing for the remote server that we did for our local files. That is, we keep track of the files' sizes and modification times, and we retrieve the current file information from the server to compute each remote file's status. Only the way we get the current file information is different, since we cannot simply stat() the files.
Since I didn't want to deal with the FTP server's LIST format I implemented this using NLST, SIZE and MDTM:
    function readserverinfo(ftpd, dir, list,    file, line, data, dirlen) {
        delete list;

        #
        # Retrieve the list of all files and directories ...
        #

        portcmd = doport(ftpd, "");
        cfputc(ftpd, "NLST", ".", 150);
        portcmd |& getline line;

        dirlen = 0;
        if (dir != "") {
            dir = dir "/";
            dirlen = length(dir);
        }

        # Compare against 0 so that a getline error (-1) ends the loop.
        while ((portcmd |& getline file) > 0) {
            file = noctrl(file);
            if (dirlen > 0 && substr(file, 1, dirlen) == dir)
                file = substr(file, dirlen + 1);

            if (excludefile(file))
                continue;

            list[file] = "";
        }

        close(portcmd);
        cfputc(ftpd, "", "", 226);

        #
        # ... and collect SIZE and MDTM for each file.
        #

        for (file in list) {
            if ((line = cfputc(ftpd, "SIZE", file, -213))+0 != 213) {
                delete list[file];
                continue;
            }
            else {
                sub(/^[^ \t]+[ \t]+/, "", line);
                data = line;
            }

            if ((line = cfputc(ftpd, "MDTM", file, -213))+0 != 213) {
                delete list[file];
                continue;
            }
            else {
                sub(/^[^ \t]+[ \t]+/, "", line);
                data = data " " line;
            }

            list[file] = data;
        }

        return (0);
    }
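The NLST/SIZE/MDTM approach can also be tried out with Python's ftplib. The sketch below is not ftpsync's code: read_server_info is a made-up name, `ftp` is assumed to be a logged-in ftplib.FTP object already positioned in the sync directory, the server is assumed to return bare names for NLST, and the default exclude rule only stands in for excludefile.

```python
from ftplib import error_perm, error_temp


def read_server_info(ftp, exclude=lambda name: name.startswith(".")):
    """Return {filename: "size mtime"} for the server's current directory."""
    info = {}
    ftp.voidcmd("TYPE I")  # many servers answer SIZE only in binary mode
    for name in ftp.nlst():
        if exclude(name):
            continue
        try:
            size = ftp.size(name)                # sends "SIZE <name>"
            reply = ftp.sendcmd("MDTM " + name)  # e.g. "213 20240102030405"
        except (error_perm, error_temp):
            continue  # SIZE or MDTM refused: treat the name as "not a file"
        if size is None or not reply.startswith("213"):
            continue
        info[name] = f"{size} {reply[3:].strip()}"
    return info
```

As in the awk version, a name for which SIZE or MDTM fails is silently dropped from the list.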
Equipped with this information we can compute the status (changed, unchanged or deleted) for each of the remote files.
Now that we have all the information we need, we can define what to do depending on the local and remote file status.
local/remote | unchanged | changed | deleted | doesn't exist |
unchanged | nothing | get | remove | put |
changed | put | duplicate | put | put |
deleted | remove | get | ignore | ignore |
doesn't exist | get | get | ignore | ignore |
To explain this table: the rows show the local and the columns the remote file status; the values inside the table are the actions as seen by the node running ftpsync. E.g. get means that the file is retrieved from the FTP server. The remove action (used twice) is not exact: it doesn't say whether the file has to be deleted locally or on the server.
The "doesn't exist" status means that a certain file does not exist on one side, neither as a file in the current directory nor in the previous status file. This happens if a file is created on one side and the directories have not been synchronized since. The usual action is then to copy the file to the other side.
It's perhaps more difficult if the file does not exist on one side and is on the delete list of the other side. Under usual circumstances this cannot happen, but the synchronizer has to deal with it. Well, it's simple: we would have to delete a file that is already deleted on one end and does not exist on the other. The right action is to ignore it. The other ignores also refer to situations where nothing has to be done but (in contrast to the unchanged/unchanged nothing) it's not clear how the system entered this state.
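The action table lends itself to a plain lookup. The following Python sketch is an illustration, not ftpsync's implementation: the state name "missing" stands for "doesn't exist", and the action names "remove-local" and "remove-remote" are invented here to resolve the ambiguity of remove, assuming the file is removed on whichever side still has it.

```python
SYNC_ACTIONS = {
    # (local status, remote status): action
    ("unchanged", "unchanged"): "nothing",
    ("unchanged", "changed"): "get",
    ("unchanged", "deleted"): "remove",
    ("unchanged", "missing"): "put",
    ("changed", "unchanged"): "put",
    ("changed", "changed"): "duplicate",
    ("changed", "deleted"): "put",
    ("changed", "missing"): "put",
    ("deleted", "unchanged"): "remove",
    ("deleted", "changed"): "get",
    ("deleted", "deleted"): "ignore",
    ("deleted", "missing"): "ignore",
    ("missing", "unchanged"): "get",
    ("missing", "changed"): "get",
    ("missing", "deleted"): "ignore",
    ("missing", "missing"): "ignore",
}


def plan(local_status, remote_status):
    """Map every known filename to the action the sync table prescribes."""
    result = {}
    for name in sorted(set(local_status) | set(remote_status)):
        local = local_status.get(name, "missing")
        remote = remote_status.get(name, "missing")
        action = SYNC_ACTIONS[(local, remote)]
        if action == "remove":
            # Assumption: delete the copy on the side that still has the file.
            action = "remove-remote" if local == "deleted" else "remove-local"
        result[name] = action
    return result
```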
Version conflicts
More interesting is the duplicate action. In this case we have two changed copies, one local and one on the server. What now? How can this conflict be resolved, which copy wins? The answer is that both win. If a duplicate situation is recognized the server's file is retrieved but the server's node name is appended to the filename to show that this file is the server copy. The server receives the local file but again the name is modified. This time the peer's name is appended to the filename. In other words: both sides keep their copy and receive the other end's version with a different filename. It's then up to the user to decide which of the versions is better. These conflict resolution files are not versioned, they are overwritten on the next conflict situation.
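As a rough illustration of the duplicate action, the two transfers could be planned like this. Everything specific in the sketch is an assumption: the text above does not say how the name is appended, so the "." separator is made up, and the reading that the uploaded copy carries the local node's name is my interpretation of the paragraph. The function names are hypothetical as well.

```python
def conflict_name(filename, node, sep="."):
    """Append a node name to a filename (separator is an assumption)."""
    return f"{filename}{sep}{node}"


def duplicate_transfers(filename, nodename, peer):
    """Return (action, source name, target name) tuples for a conflict.

    Both sides keep their own copy under the original name and receive
    the other end's version under a name tagged with that end's node name.
    """
    return [
        ("get", filename, conflict_name(filename, peer)),
        ("put", filename, conflict_name(filename, nodename)),
    ]
```

With nodename "pc" and peer "server", a conflict on page.txt would leave page.txt.server on the local side and page.txt.pc on the server, under these assumptions.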
Symmetry
The action table above is symmetric. This means that neither side is preferred. Basically both sides could run the synchronizer, swapping the client and server roles. The conflict resolver is also symmetric, and more than this: it's "multi-symmetric". If you have a given number of nodes synchronizing with the same server, each node has its own conflict resolution which does not interfere with another node's resolution. The only additional requirement is that each node has its own unique name.
There are some possible modifications to the action table above. The unchanged/deleted/remove entry (abbreviated udR) could be changed to udP (put instead of remove) and ccD could become ccP. With these two changes the synchronizer becomes a simple backup program. Backup program because files that need to be stored on the server are uploaded to the server (files that are deleted or changed on the server are refreshed), and simple because we have no file versioning.
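The symmetry claim can be checked mechanically: swap the local and remote roles, exchange get and put, and the table should come out unchanged. The sketch below does exactly that for the sync table; the helper names and the state name "missing" (for "doesn't exist") are invented for this example.

```python
STATES = ("unchanged", "changed", "deleted", "missing")
SWAP = {"get": "put", "put": "get"}


def make_table(rows):
    """Turn four whitespace-separated rows into {(local, remote): action}."""
    return {
        (local, remote): action
        for local, row in zip(STATES, rows)
        for remote, action in zip(STATES, row.split())
    }


def mirrored(table):
    """The same table as seen from the other end: roles and get/put swapped."""
    return {
        (local, remote): SWAP.get(table[(remote, local)], table[(remote, local)])
        for local in STATES
        for remote in STATES
    }


SYNC = make_table([
    "nothing get       remove put",
    "put     duplicate put    put",
    "remove  get       ignore ignore",
    "get     get       ignore ignore",
])

# Holds only because the sync table is symmetric.
assert mirrored(SYNC) == SYNC
```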
local/remote | unchanged | changed | deleted | doesn't exist |
unchanged | nothing | get | put | put |
changed | put | put | put | put |
deleted | remove | get | ignore | ignore |
doesn't exist | get | get | ignore | ignore |
Changing the symmetric entries duR and ccD to duG and ccG would make the system running ftpsync the FTP server's simple backup system. I call these two modes "master" and "slave" mode.
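Seen as data, the master and slave modes are just the sync table with two entries overridden each. A possible Python sketch follows; the names make_table, TABLES and action are hypothetical, and "missing" again stands for "doesn't exist".

```python
STATES = ("unchanged", "changed", "deleted", "missing")


def make_table(rows):
    """Turn four whitespace-separated rows into {(local, remote): action}."""
    return {
        (local, remote): action
        for local, row in zip(STATES, rows)
        for remote, action in zip(STATES, row.split())
    }


SYNC = make_table([
    "nothing get       remove put",
    "put     duplicate put    put",
    "remove  get       ignore ignore",
    "get     get       ignore ignore",
])

# master: udR becomes udP, ccD becomes ccP
MASTER = {**SYNC, ("unchanged", "deleted"): "put", ("changed", "changed"): "put"}
# slave: duR becomes duG, ccD becomes ccG
SLAVE = {**SYNC, ("deleted", "unchanged"): "get", ("changed", "changed"): "get"}

TABLES = {"sync": SYNC, "master": MASTER, "slave": SLAVE}


def action(mode, local, remote):
    """Look up the action for one file in the given mode."""
    return TABLES[mode][(local, remote)]
```

Keeping the modes as overrides of one base table makes it easy to see that each mode differs from plain syncing in exactly two cells.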
The synchronizer can also be downgraded to a mirror (FTP server to local) program by changing the action table to the following.
local/remote | unchanged | changed | deleted | doesn't exist |
unchanged | nothing | get | remove | ignore |
changed | get | get | remove | ignore |
deleted | get | get | ignore | ignore |
doesn't exist | get | get | ignore | ignore |
I call this the "mirror" mode. If we apply the symmetric changes to the action table we get the "original" (mirror local to FTP server) mode.
Another thing that could be considered is how file removals are done. Instead of deleting them they could be moved into a .deleted folder, with or without versioning.
ftpsync needs a configuration file to synchronize a directory. This file is usually named .sync.conf and located in the directory that should be synced.
The file has the typical UN*X style: comments, starting with a "#", are allowed, and so are empty lines. The other lines are of the form "key value" with whitespace between key and value.
The configuration parameters are:
key | Description |
nodename | The name of the local host running ftpsync. This doesn't have to be the host's DNS hostname; it can be anything as long as it's unique among all nodes syncing to the same server location. You should use only letters, digits and dashes (minus signs) here. "name" and "node" are possible aliases for "nodename". |
peer | The FTP server's name. Again you don't have to enter the server's DNS name here (although you can). Choose any name you like as long as it contains only letters, digits and dashes. "peername" and "remote" are aliases for "peer". |
server | This is either the peer's fully qualified domain name or its IP address. |
login | The login on the FTP server. |
password | The password belonging to login. |
Optional Parameters | |
dir | The directory on the FTP server that you want to synchronize with. If unset, the login's home directory is used. |
includedots | Can be "yes" or "no". If set to "yes", files beginning with a dot are also subject to synchronization. Files beginning with ".sync." or ".sync_" are still excluded. |
allowblanks | Can be "yes" or "no". If set to "yes", files that have blanks in their names are also synchronized. |
mode | This value defines ftpsync's default operation mode. It can be one of "sync", "master", "slave", "original" or "mirror". The default value is "sync". |
symsync | Can be "yes" or "no". If set to "yes" the file states are copied and swapped to the server. With "symsync" set to "yes" client and server can swap roles in later ftpsync runs. |
An example for a configuration file is
    #
    # .sync.conf - ftpsync configuration file.
    #

    nodename     pc
    remote       server
    server       192.168.0.4
    login        my-ftp-account
    password     my-secret-password

    dir          sync
    includedots  yes
    allowblanks  no
    mode         sync
    symsync      yes
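To make the file format concrete, here is a small Python sketch of a parser for it. It is not how ftpsync itself reads the file. The alias map and the "sync" default come from the parameter table above; treating the five parameters listed before "Optional Parameters" as mandatory is an assumption, and parse_config is a made-up name.

```python
ALIASES = {
    "name": "nodename", "node": "nodename",
    "peername": "peer", "remote": "peer",
}
# Assumption: everything listed before "Optional Parameters" is mandatory.
REQUIRED = ("nodename", "peer", "server", "login", "password")


def parse_config(text):
    """Parse "key value" lines, skipping comments and empty lines."""
    conf = {"mode": "sync"}  # the only default the parameter table states
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # key, then the rest of the line
        key = ALIASES.get(parts[0], parts[0])
        conf[key] = parts[1] if len(parts) > 1 else ""
    missing = [key for key in REQUIRED if key not in conf]
    if missing:
        raise ValueError("missing configuration keys: " + ", ".join(missing))
    return conf
```

Fed the example file above, the "remote server" line ends up under the key "peer" because of the alias.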
Invocation
    ftpsync [options] directory [server]

synchronizes directory with the FTP server configured in directory/.sync.conf. If the optional server argument is given, the file directory/.sync-server.conf is used instead.
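The choice of configuration file can be written down in a few lines. This Python sketch assumes that "server" in ".sync-server.conf" is a placeholder for the value of the server argument; config_path is a hypothetical name.

```python
import os


def config_path(directory, server=None):
    """Return the configuration file ftpsync would read for this call.

    Assumption: the optional server argument replaces the word "server"
    in ".sync-server.conf".
    """
    name = ".sync.conf" if server is None else f".sync-{server}.conf"
    return os.path.join(directory, name)
```

This is what allows one directory to be synced against several servers, each with its own configuration file.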
ftpsync supports the following command line options:
ftpsync does not recurse into subdirectories. The reason for this (beyond the additional complexity, which could be implemented) is that ftpsync is not yet able to determine whether something on the FTP server is a regular file or a directory. The way ftpsync decides whether something is a normal file is simple: if the SIZE and MDTM calls succeed on a remote name it is a file, otherwise it's not. This assumes that either SIZE or MDTM fails for non-regular files, and this seems to be true for the normal FTP server (but this assumption might be the reason why ftpsync does not work in your particular setup).
Now let's look at the case when SIZE and/or MDTM fails. Is the remote object then a directory, a symbolic link, a device file or something else? This can't be determined with the current implementation. Ok, I know there are clever FTP clients that are able to parse the LIST format of a dozen different FTP server types, but I don't think that this is the right way to deal with this question. ftpsync should instead be rewritten to work with the (hopefully) upcoming MLST/MLSD standard commands, because with these the file type can be requested (together with other information) in a defined machine-readable format. But notice that this rewrite implies that your server implements and understands these commands, which may not be the case.
Originally I wrote ftpsync to synchronize files between two wiki servers, one running on my Linux computer at home and the other on my Internet server. Surprisingly it doesn't work out of the box. Not because ftpsync is not fully working: it's a problem of permissions and file ownership.
If I sync my local files to the remote server they are owned by me (or better: my remote FTP account). If the HTTP server then runs under a different user id, which is absolutely common, the HTTP server can't write the files. Or consider that the wiki script on the HTTP server creates a new file. When I log in with my regular FTP account I will not be able to overwrite or delete these files because of missing permissions. The only way to really solve this situation is either to log in to the FTP server under the HTTP server's account or to change the server's user id to my FTP account. But I don't expect that the average provider will configure either of these solutions.
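To give an idea of what such a rewrite could look like, here is a Python sketch built on ftplib's mlsd() method, which issues MLSD and yields (name, facts) pairs. It is only an illustration under assumptions: the server must support MLSD and report the type, size and modify facts; read_server_info_mlsd is a made-up name; and the default exclude rule stands in for ftpsync's excludefile.

```python
def read_server_info_mlsd(ftp, exclude=lambda name: name.startswith(".")):
    """Return {filename: "size mtime"} using MLSD instead of NLST/SIZE/MDTM.

    `ftp` is expected to be a logged-in ftplib.FTP (or compatible) object.
    """
    info = {}
    for name, facts in ftp.mlsd(facts=["type", "size", "modify"]):
        if facts.get("type") != "file":
            continue  # directories and other entries are told apart explicitly
        if exclude(name):
            continue
        if "size" not in facts or "modify" not in facts:
            continue  # server did not send the facts we need
        info[name] = f"{facts['size']} {facts['modify']}"
    return info
```

The gain over the SIZE/MDTM heuristic is that the server states the entry type itself, so nothing has to be guessed from failing commands.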