Unlike most R objects, tensors created in torch have their memory managed by LibTorch allocators. This means that functions like `object.size()` or `lobstr::mem_used()` do not correctly report the memory they use.
The R garbage collector is lazy: it only runs when R needs more memory from the OS. Since R is not aware of the large chunks of memory that might be in use by torch tensors, it might not call the garbage collector as often as it would if it knew how much memory tensors are using. It's therefore common for tensors that are no longer in use by the R session to remain alive (and thus keep using memory) simply because they haven't been garbage collected yet.
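As a quick illustration, here is a minimal sketch (the tensor size is arbitrary and the reported numbers will vary by platform):

```r
library(torch)

x <- torch_randn(1000, 1000)  # ~4MB of data held by the LibTorch CPU allocator
object.size(x)                # reports only the size of the small R handle
# lobstr::mem_used() is similarly unaware of the LibTorch allocation

rm(x)  # the underlying memory is released only when the R GC actually runs
gc()   # an explicit collection frees the tensor's memory right away
```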
To solve this problem, the torch package implements strategies to automatically call the R garbage collector when LibTorch is allocating more memory. The strategies differ depending on where the memory is being allocated: on the CPU, on the GPU (CUDA devices), or on MPS (Apple Silicon machines or machines equipped with AMD GPUs).
CPU
On the CPU, torch may call the R garbage collector at two moments:

- Every 4GB of memory allocated by LibTorch, we make a call to the R garbage collector so it can clean up dangling tensors. The 4GB threshold can be controlled by setting the option `torch.threshold_call_gc`, for example using `options(torch.threshold_call_gc = 4000)`. This option must be set before calling `library(torch)` or calling any torch function for the first time, as the setting is applied when torch starts up.
- If torch fails to allocate enough memory to create a new tensor, the garbage collector is called and the allocation is retried. Note: on some operating systems (especially UNIX-based ones) it is very hard for an allocation to fail unless it is very large, because the system tries to use swap. If too much swapping occurs, the system may hang completely.
If your R session is hanging and you are convinced that it should have enough memory for the operations, try setting a lower value for the `torch.threshold_call_gc` option; this makes torch call the GC more often, so tensors are released from memory more quickly. Note, though, that calling the GC too often adds a lot of overhead, so this will probably slow down program execution.
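For example, a minimal sketch of lowering the threshold (512 is an arbitrary illustrative value, assuming the same unit as the `4000 ≈ 4GB` default shown above):

```r
# The option is applied when torch starts up, so set it before
# library(torch) or the first call to any torch function.
options(torch.threshold_call_gc = 512)  # arbitrary illustrative value

library(torch)
```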
CUDA
CUDA memory tends to be scarcer than CPU memory, and allocation must also be fast, otherwise the allocation overhead can cancel out the speedup of running on the GPU. To make allocations very fast and to avoid fragmentation, LibTorch uses a caching allocator to manage GPU memory, i.e., once LibTorch has allocated CUDA memory it won't give it back to the operating system; instead it reuses that memory for future allocations. This means that `nvidia-smi` or `nvtop` will not report the amount of memory used by tensors, but rather the memory LibTorch has reserved from the OS. You can use `torch::cuda_memory_summary()` to query exactly the memory used by LibTorch.
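For example, a minimal sketch of inspecting CUDA memory from R (assumes a CUDA-enabled installation; the tensor size is arbitrary):

```r
library(torch)

if (cuda_is_available()) {
  x <- torch_randn(1000, 1000, device = "cuda")  # allocates ~4MB on the GPU
  # Memory allocated by live tensors vs. memory reserved (cached) by LibTorch;
  # nvidia-smi / nvtop only see the reserved total.
  cuda_memory_summary()
}
```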
Like the CPU allocator, torch's CUDA allocator will also call the R garbage collector in some situations to clean up tensors that might be dangling. In torch's implementation, the R garbage collector is called whenever reusing a cached block fails: GC is called and we retry getting a new block. However, unlike allocation failures on the CPU, which are very rare, failing to reuse a cached block is common in programs where LibTorch has not yet reserved a large chunk of memory, which would cause the GC to be called on almost every allocation (and calling the GC is time consuming).
To mitigate this, the torch allocator performs a lighter, faster collection in some cases and a full collection in others.
- We don't make any collection if the current reserved memory (cached memory) divided by the total GPU memory is smaller than 20%. This can be controlled by the `torch.cuda_allocator_reserved_rate` option; the default is 0.2.
- We make a full collection if the current allocated memory (memory used by tensors) divided by the total device memory is larger than 80%. This can be controlled via the `torch.cuda_allocator_allocated_rate` option; the default is 0.8.
- We make a full collection if the current allocated memory divided by the current reserved memory is larger than 80%. This is controlled by the `torch.cuda_allocator_allocated_reserved_rate` option; the default is 0.8.
- In all other cases a light collection is made, equivalent to calling `gc(full = FALSE)` in R.
These options can help tune allocation performance depending on the program you are running.
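For instance, a hedged sketch of adjusting these thresholds (the values are arbitrary illustrations, not recommendations):

```r
# Trigger collections earlier than the defaults (0.2 / 0.8 / 0.8 described
# above), e.g. on a memory-constrained GPU. Set before library(torch) to be
# safe, mirroring how torch.threshold_call_gc is applied at startup.
options(
  torch.cuda_allocator_reserved_rate           = 0.1,
  torch.cuda_allocator_allocated_rate          = 0.6,
  torch.cuda_allocator_allocated_reserved_rate = 0.6
)

library(torch)
```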
Besides the R-specific options, you can set LibTorch options via environment variables, as described below. The behavior of the caching allocator can be controlled via the environment variable `PYTORCH_CUDA_ALLOC_CONF`. The format is `PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>...`.

Available options:
- `max_split_size_mb` prevents the allocator from splitting blocks larger than this size (in MB). This can help prevent fragmentation and may allow some borderline workloads to complete without running out of memory. The performance cost can range from 'zero' to 'substantial' depending on allocation patterns. The default value is unlimited, i.e. all blocks can be split. The `cuda_memory_stats()` and `cuda_memory_summary()` functions are useful for tuning. This option should be used as a last resort for a workload that is aborting due to 'out of memory' and showing a large amount of inactive split blocks.
- `roundup_power2_divisions` helps with rounding the requested allocation size to the nearest power-of-2 division, making better use of the blocks. In the current CUDACachingAllocator, sizes are rounded up to multiples of a block size of 512, which works fine for smaller sizes. However, this can be inefficient for large, nearby allocation sizes, as each goes to a different size of block and reuse of those blocks is minimized; this can create lots of unused blocks and waste GPU memory capacity. This option enables rounding the allocation size to the nearest power-of-2 division. For example, if we need to round up a size of 1200 and the number of divisions is 4: 1200 lies between 1024 and 2048, and 4 divisions between them give the values 1024, 1280, 1536, and 1792. So an allocation size of 1200 is rounded to 1280, the nearest ceiling among the power-of-2 divisions.
- `garbage_collection_threshold` helps actively reclaim unused GPU memory to avoid triggering an expensive sync-and-reclaim-all operation (release_cached_blocks), which can be unfavorable to latency-critical GPU applications (e.g., servers). Upon setting this threshold (e.g., 0.8), the allocator will start reclaiming GPU memory blocks if GPU memory usage exceeds the threshold (i.e., 80% of the total memory allocated to the GPU application). The algorithm prefers to free old and unused blocks first, to avoid freeing blocks that are actively being reused. The threshold value should be greater than 0.0 and less than 1.0.
Notice that the garbage collector referred to in the `garbage_collection_threshold` option above is not the R garbage collector but LibTorch's own collector, which releases memory from the cache back to the OS.
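For example, a sketch of setting this variable from R (the values are illustrative only; the variable is typically read when the CUDA allocator is initialized, so set it before the first CUDA allocation):

```r
# Illustrative values: cap split blocks at 128MB and let LibTorch's own
# collector start reclaiming cached blocks once usage exceeds 80%.
Sys.setenv(
  PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:128,garbage_collection_threshold:0.8"
)

library(torch)
```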
MPS
Memory management on MPS devices is very similar to the strategy used on CUDA devices, except that there is currently no configuration or tuning possible. The R garbage collector is called whenever there is no more memory available for the GPU, possibly deleting some tensors. The allocation is then retried and, if it still fails, an OOM error is raised.