Using Midway

What/why?

Midway is the cluster that we use for large computational jobs. The main Midway documentation is quite excellent, and it should be your primary resource. Nevertheless, in my experience some basic definitions remain a bit nebulous throughout that documentation. This page is where we can add bits of information and definitions that are useful when using Midway.

Extra bits of knowledge

  • A core is a single processing unit that runs a task. A processor (a physical CPU chip, or socket) may have multiple cores, and a node (one machine in the cluster) may contain multiple processors. All the cores in a node can see the same memory.
  • Using parallel is different from using job arrays (as SLURM defines them). With parallel, a single job can run 10 parallel commands, or 100, or 1000. Regardless of how many commands your job runs, when you submit it you are allocated the same number of nodes/cores, set by the #SBATCH --ntasks=100 line at the top of the sbatch script. It is your responsibility to ask for more cores if you want to run more parallel tasks and finish them quickly. When you submit a job and ask for 30 cores, for example, you will see only one job in the system no matter how many tasks run within it (use squeue -u $USER to list your current jobs). The job's status remains RUNNING as long as it still has tasks left to run. Asking for more cores makes your job finish sooner, BUT beware of requesting many cores when some of the commands run by parallel take a long time while others finish quickly. If you request 30 cores for 30 commands and 29 finish in 1 second while 1 takes an hour, we will be charged for 30 CPU-hours (30 SUs), or 45 SUs if the job ran on Midway2. A sketch of such a script follows this list.
  • The default memory per core on Midway1 is 2000MB, so if you only specify --ntasks=41 you will get 2000MB per core. If you need more memory per core, ask for it with --mem-per-cpu=xxx, where xxx is the amount of memory in MB (see the commented --mem-per-cpu line in the first sketch below). If you ask for more than 2000MB per core, some of the cores on that node cannot be allocated and will sit idle until your job finishes.
  • Job arrays (https://rcc.uchicago.edu/docs/running-jobs/array/index.html) offer another mechanism for submitting a large number of jobs; a minimal example follows this list.
  • The example sbatch script here (https://rcc.uchicago.edu/docs/running-jobs/srun-parallel/index.html) has comments explaining --exclusive, -N1, and -n1. Together, these flags run one parallel command per core; the first sketch below is modeled on that script.
  • Shared memory: multiple cores use the same memory.
  • Distributed memory: different processors each hold a piece of the memory; they are connected by a network and can request data from, and send data to, the others in the network.
  • On SandyB, each node has two processors with 10 cores each, and the two processors share memory (shared, not distributed).
  • For massively parallel jobs, consider using MPI. MPI is a message-passing library specification; it is not itself a library, and it is not a compiler. To use MPI you write code in C, C++, or Fortran against an MPI implementation. OpenMP parallelizes only within a node (shared memory), while MPI can be used both within a node and between nodes. A sketch of an sbatch script that launches an MPI program follows this list.
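
Here is a minimal sketch of a parallel-style sbatch script, modeled on the RCC srun-parallel example linked above. The partition name, time limit, and ./runtask executable are placeholders, and whether parallel needs a module load depends on the Midway software stack:

    #!/bin/bash
    #SBATCH --job-name=parallel-demo
    #SBATCH --partition=sandyb         # placeholder; use the partition you normally use
    #SBATCH --ntasks=30                # total cores allocated to the job
    #SBATCH --time=01:00:00
    ##SBATCH --mem-per-cpu=4000        # uncomment to request more than the 2000MB default

    module load parallel               # assumed module name; check with `module avail`

    # --exclusive keeps srun task slots from overlapping;
    # -N1 -n1 runs each command on one core of one node.
    srun="srun --exclusive -N1 -n1"

    # Run ./runtask 30 times with arguments 1..30, one command per core;
    # ./runtask is a hypothetical executable standing in for your command.
    parallel -j "$SLURM_NTASKS" "$srun ./runtask {}" ::: {1..30}

Note that if 29 of the 30 commands finish quickly and one runs for an hour, the allocation (and the charge) lasts until the last command ends, which is the caveat from the bullet above.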
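
For comparison, a minimal job-array sketch in the style of the RCC array documentation linked above. Each array index becomes an independent job with its own core; the input_${SLURM_ARRAY_TASK_ID}.txt naming scheme and ./myprogram are hypothetical:

    #!/bin/bash
    #SBATCH --job-name=array-demo
    #SBATCH --array=1-100              # 100 independent jobs, indexed 1..100
    #SBATCH --ntasks=1                 # each array element gets one core
    #SBATCH --time=00:30:00

    # SLURM sets SLURM_ARRAY_TASK_ID to this element's index.
    ./myprogram input_${SLURM_ARRAY_TASK_ID}.txt

Unlike the tasks inside a single parallel job, each array element is scheduled and charged independently.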
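
Finally, a sketch of launching an MPI program, assuming a binary mpi_hello has already been compiled (for example with mpicc mpi_hello.c -o mpi_hello). The openmpi module name and the task count are assumptions; check module avail on Midway:

    #!/bin/bash
    #SBATCH --job-name=mpi-demo
    #SBATCH --ntasks=64                # MPI ranks; SLURM may spread them across nodes
    #SBATCH --time=01:00:00

    module load openmpi                # assumed module name

    # srun starts one MPI rank per task, within and across nodes;
    # mpirun -np $SLURM_NTASKS ./mpi_hello also works with many MPI builds.
    srun ./mpi_hello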