Managing Disk Space in Linux

Data storage is a major expense for the SSCC, as the performance and reliability required for research data make "enterprise level" storage much more expensive than the hard drive of your average PC. Individual members can help reduce these costs by managing the files in their home and project directories. This article discusses Linux tools for managing your disk space.

Viewing Your Large Files

Most files are so small (or rather, disk storage today is so large) that even large numbers of them take up trivial amounts of space. We don't want our members to spend their valuable time deciding what small files they can delete. Thus the first task is to identify your large files.

You can use the good old ls -l command, aliased as ll for most people, but du (disk usage) is better suited to finding large files, especially when combined with other tools.

du -ha directory

where directory should be replaced by the name of the directory you want to examine, will give you a list of all files and subdirectories in that directory and their sizes. Sizes will be given in appropriate units for easy reading by humans. Note that this list includes all the contents of all the subdirectories of the directory you specify, so running this command on a high level directory will probably give you more text than you can use.
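When the full listing is too long to be useful, one option is to ask du for a single total per immediate subdirectory. The following is a minimal sketch, assuming GNU du (standard on Linux); summarize_dir is a hypothetical helper name introduced only for illustration.

```shell
# summarize_dir is a hypothetical helper name, not a standard command.
# Usage: summarize_dir [directory]
summarize_dir() {
    # --max-depth=1 (GNU du) prints one total per immediate subdirectory,
    # followed by a grand total for the directory itself. Without -a,
    # individual files are counted in the totals but not listed.
    du -h --max-depth=1 "${1:-.}"
}
```

Running summarize_dir on a high level directory shows which subdirectories are worth examining in more detail.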

To view just the biggest files, you can send these results to the sort program and then list only the top results using head. The disadvantage is that a numeric sort (sort -n) can't compare sizes given in different units, so tell du to list all the file sizes in megabytes instead.

du -ma directory | sort -n -r | head -n 20

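This will show the twenty biggest files and directories underneath the starting directory (you can choose how many to view by changing the number given to head). The starting directory itself always appears first, since its size is the total of everything beneath it. The large files in this list are the ones you should focus on.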
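Because du -a lists directories alongside files, the top of the list is often taken up by directory totals. If you only want individual files, one possible approach is to let find select regular files and pass them to du. This is a sketch assuming the GNU versions of find and du; biggest_files is a hypothetical helper name, and sizes are reported in kilobytes rather than megabytes so that smaller files are still distinguishable.

```shell
# biggest_files is a hypothetical helper name, not a standard command.
# Usage: biggest_files [directory] [count]
biggest_files() {
    dir="${1:-.}"      # directory to examine (default: current directory)
    count="${2:-20}"   # how many files to list (default: 20)
    # -type f limits the search to regular files; du -k reports each file's
    # disk usage in kilobytes so that sort -n can compare the numbers.
    find "$dir" -type f -exec du -k {} + | sort -n -r | head -n "$count"
}
```

For example, biggest_files ~/myproject 10 would list the ten largest files under a directory called myproject in your home directory (the directory name here is only an illustration).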

Options for Large Files

Once you've identified the files worth paying attention to, then the question becomes what to do with them:

  • Compress large files that are not in active use. The article "Using Compressed Data in Linux" has instructions.
  • Share large data files among researchers rather than having everyone make their own copy.
  • Delete intermediate data files that can be reproduced at will, keeping just the raw data and the version of the data you're currently working on (along with all the code that gets you from one to the other).
  • Delete data files that are no longer needed (but only if you're sure they're no longer needed).
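To help decide which large files are "not in active use," you could look for big files that have not been modified in a long time. The sketch below assumes GNU find; stale_big_files is a hypothetical helper name, and the default thresholds (100 megabytes, 180 days) are arbitrary examples rather than SSCC policy. An old modification time only shows that a file has not changed, not that nobody reads it, so treat the results as candidates to review rather than a list of files to delete.

```shell
# stale_big_files is a hypothetical helper name; the default thresholds
# (100 MB, 180 days) are arbitrary examples, not SSCC policy.
# Usage: stale_big_files [directory] [min_size_mb] [min_age_days]
stale_big_files() {
    dir="${1:-.}"
    min_mb="${2:-100}"
    min_days="${3:-180}"
    # -size +NM : larger than N megabytes
    # -mtime +D : last modified more than D days ago
    # ls -lh then shows the size, date, and owner of each match.
    find "$dir" -type f -size +"${min_mb}M" -mtime +"$min_days" -exec ls -lh {} +
}
```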

Using Temporary Space

One easy way to make sure you don't forget to delete a file when you're done with it is to put it in temporary space. In Linux, files stored in /temp30days are deleted after thirty days, but you are welcome to use as much space as you need during that time; just make yourself a directory there. If you store files you'll only need briefly in /temp30days, you'll never have to worry about going back to delete them. Keep in mind that /temp30days is not backed up.
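Making yourself a directory in temporary space might look something like the following sketch. Naming the directory after your username and restricting its permissions are assumptions made for illustration, not SSCC requirements, and make_temp_workdir is a hypothetical helper name.

```shell
# make_temp_workdir is a hypothetical helper name, not a standard command.
# Usage: make_temp_workdir [base_directory]
make_temp_workdir() {
    base="${1:-/temp30days}"    # temporary space (default: /temp30days)
    dir="$base/$(id -un)"       # assumption: name the directory after your username
    # Create the directory if needed, make it private to you (optional),
    # and print its path so you know where to put your files.
    mkdir -p "$dir" && chmod 700 "$dir" && printf '%s\n' "$dir"
}
```

Calling make_temp_workdir with no arguments would create (or reuse) a directory named after you under /temp30days and print its location.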

Last Revised: 4/24/2019