A reasonable way to store a large number of files on a file system is a hashed directory structure, with details about each file kept in a database table. One advantage of this method is that it keeps directory reads and file access fast by never putting too many files into a single directory. The requirements for a system like this are: spread files evenly across all possible directories, keep the hashed file names from colliding (i.e. two files ending up with the same name), and make file names hard to guess (especially useful if you want all your files to be publicly accessible but prefer to share them on your own terms).
To spread files evenly we can hash anything about the file: the full file itself, the file's name, the date we received it, and so on. Since the purpose of this hash is only to decide where to store the file, we don't need to worry about hash collisions, and the hash only needs to be long enough for the chosen directory depth. MD5 and SHA1 are both perfectly reasonable algorithms for hashing here, MD5 being the faster. As an example, the text I've used for hashing is the original file name + the current datetime + something else random or unique, like an id.
Keeping the file names unique falls to the database where the file details are kept. Each file on the drive gets a unique id in the database, along with other useful data about the file (new file name, original file name, mime type, datetime added, perhaps even who uploaded it, a checksum, a description, etc.). The id for that record is the most important item: it is appended to the file's new name to keep the file names unique. It can be appended as a plain number, or encoded into base16 or some other base so as to keep it consistent with the encoding of the hash.
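A table along these lines could hold the file details; the schema below is a minimal sketch using SQLite, and the column names are my assumptions, not prescribed by the system itself.

```python
import sqlite3

# Illustrative file-details table; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE files (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        new_name      TEXT NOT NULL UNIQUE,  -- hashed name with the id appended
        original_name TEXT NOT NULL,
        mime_type     TEXT,
        checksum      TEXT,
        added_at      TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
cur = conn.execute(
    "INSERT INTO files (new_name, original_name, mime_type) VALUES (?, ?, ?)",
    ("ab12cd34-1-x7f2k9qm.jpg", "holiday.jpg", "image/jpeg"),
)
print(cur.lastrowid)  # the unique id that gets appended to the new file name
```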
Making a file name harder to guess is simple: just add random (printable-character) noise to the end of the name. The longer the random string, the harder it is to get a lucky guess.
Personally I feel that two-character directory names and a depth of two are more than reasonable for any regular storage need. With the hash base16 encoded, directory names two characters long (AA, AB, AC, etc…) and a directory depth of two (AB/CD, AB/EF, etc…) give us 256 possible folders in the root directory. Each first-level folder can hold 256 second-level folders, for a total of 256 * 256 = 65,536 folders to place files in. Assuming we put at most 1,000 files into each directory, this structure can handle about 65 million files. The flavor of your file system (ext3, ext4, zfs, ntfs, etc…) will inform a reasonable maximum number of files per directory to keep the file system performant. If each file is about 2MB in size, a full directory tree will take up 125TB of space.
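The capacity arithmetic above can be checked in a few lines:

```python
# Capacity of a base16, two-char, depth-two directory tree.
dirs_per_level = 16 ** 2          # two base16 characters: 256 names per level
total_dirs = dirs_per_level ** 2  # depth of two: 65,536 leaf directories
files_per_dir = 1000
max_files = total_dirs * files_per_dir

print(total_dirs)                      # 65536
print(max_files)                       # 65,536,000: about 65 million files
print(max_files * 2 // 1024 // 1024)   # at ~2MB per file: 125 (TB)
```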
If you need, or think you'll fill up, even half of this file system structure, you should probably look into implementing something with higher complexity and resilience. Facebook's Haystack is where I pulled some of the basics for the above system.
Now for sample code to do the things I mentioned above.
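Generating the hash described earlier might look like this; a minimal sketch in Python, with names and details of my choosing rather than anything dictated by the approach:

```python
import hashlib
from datetime import datetime, timezone

def storage_hash(original_name: str, unique_id: int) -> str:
    """Hash of original name + current datetime + a unique id.

    MD5 is fine here: the hash only picks a storage location, so
    collisions and cryptographic strength don't matter.
    """
    seed = f"{original_name}{datetime.now(timezone.utc).isoformat()}{unique_id}"
    return hashlib.md5(seed.encode("utf-8")).hexdigest()

print(storage_hash("holiday.jpg", 42))  # 32 base16 characters
```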
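The random noise that makes names harder to guess could be produced like so; a sketch, assuming lowercase letters and digits are an acceptable alphabet:

```python
import secrets
import string

def random_noise(length: int = 8) -> str:
    """Random printable noise appended to a file name to foil guessing.

    secrets (rather than random) avoids a predictable generator.
    """
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(random_noise())  # e.g. 'k3j9x2mq'
```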
Figuring out the directory path from the hash or file name
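A sketch of that lookup, slicing two-character chunks off the front of the hash to build the two-level path described earlier:

```python
import os

def directory_path(hashed: str, depth: int = 2, width: int = 2) -> str:
    """Turn the leading characters of a hash into a nested directory path,
    e.g. 'ab12cd34…' -> 'ab/12' with the default depth and width."""
    parts = [hashed[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(*parts)

print(directory_path("ab12cd34ef"))  # ab/12 (ab\12 on Windows)
```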
Parse the original file name's extension to add it to our new file name
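One way to do that parse, leaning on the standard library rather than splitting the string by hand:

```python
import os

def file_extension(original_name: str) -> str:
    """Lower-cased extension of the original file name, including the
    dot, or '' when the name has no extension."""
    return os.path.splitext(original_name)[1].lower()

print(file_extension("Holiday Photo.JPG"))  # .jpg
print(repr(file_extension("README")))       # ''
```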
Generate the new file name from an id, a hash, and an extension. Notice that I'm only keeping the first 8 characters of the hash, which is really twice as much as I need.
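A sketch of that step, combining the truncated hash, the base16-encoded id, the random noise from earlier, and the extension; the separators are my own assumption for readability:

```python
import secrets
import string

def new_file_name(file_id: int, hashed: str, extension: str,
                  noise_len: int = 8) -> str:
    """First 8 chars of the hash + the database id in base16 + random
    noise + the original extension."""
    noise = "".join(secrets.choice(string.ascii_lowercase + string.digits)
                    for _ in range(noise_len))
    return f"{hashed[:8]}-{file_id:x}-{noise}{extension}"

print(new_file_name(255, "ab12cd34ef567890", ".jpg"))
# e.g. 'ab12cd34-ff-k3j9x2mq.jpg'
```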
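Putting the pieces together, an end-to-end sketch that goes from an original file name and database id to the full storage path; again the helper names and separators are mine:

```python
import hashlib
import os
import secrets
import string
from datetime import datetime, timezone

def storage_path(original_name: str, file_id: int, root: str = "files") -> str:
    """Hash the name + datetime + id, derive the two-level directory from
    the hash, and build a unique, hard-to-guess file name."""
    seed = f"{original_name}{datetime.now(timezone.utc).isoformat()}{file_id}"
    hashed = hashlib.md5(seed.encode("utf-8")).hexdigest()
    subdir = os.path.join(hashed[0:2], hashed[2:4])   # e.g. 'ab/12'
    ext = os.path.splitext(original_name)[1].lower()
    noise = "".join(secrets.choice(string.ascii_lowercase + string.digits)
                    for _ in range(8))
    name = f"{hashed[:8]}-{file_id:x}-{noise}{ext}"
    return os.path.join(root, subdir, name)

print(storage_path("holiday.jpg", 42))
# e.g. 'files/3f/a2/3fa2b1c9-2a-p8d4w1zr.jpg'
```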