Question about a model training crash in a small lab
I was running a new vision model on a local server in Austin last Thursday, using about 40 GB of VRAM. Halfway through a 72-hour training cycle, the whole system froze. The only thing the logs showed was a memory leak in a custom data loader I wrote. I had to hard reboot and lost nearly a day of progress. I fixed it by adding better garbage collection and cutting the batch size in half. Has anyone else hit a similar wall with long training jobs on limited hardware?
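For anyone comparing notes, here is a minimal sketch of the general shape of that kind of fix, using only the Python standard library. It is not the actual loader: `halve_batch_size`, `run_steps`, `load_batch`, and `gc_every` are hypothetical names made up for illustration, and `tracemalloc` only tracks Python-level host allocations, not VRAM.

```python
import gc
import tracemalloc


def halve_batch_size(batch_size: int, minimum: int = 1) -> int:
    """Return half of batch_size, never going below `minimum`."""
    return max(minimum, batch_size // 2)


def run_steps(load_batch, num_steps: int, batch_size: int, gc_every: int = 100):
    """Run a loop that forces a GC pass every `gc_every` steps.

    `load_batch` is a hypothetical callable standing in for a custom data
    loader; it takes a batch size and returns a batch. Returns a list of
    (step, traced_bytes) samples so steady growth (a likely leak) is visible.
    """
    tracemalloc.start()
    samples = []
    for step in range(1, num_steps + 1):
        batch = load_batch(batch_size)
        # ... the real training step would consume `batch` here ...
        del batch  # drop the reference so the batch can be reclaimed
        if step % gc_every == 0:
            gc.collect()  # also clears reference cycles the loader may create
            current, _peak = tracemalloc.get_traced_memory()
            samples.append((step, current))
    tracemalloc.stop()
    return samples


if __name__ == "__main__":
    # Assumed starting batch size of 64, purely for illustration.
    smaller = halve_batch_size(64)
    for step, traced in run_steps(lambda n: [0.0] * n, 1000, smaller):
        print(f"step {step}: {traced} bytes traced")
```

If the traced byte count keeps climbing between samples even after the forced collections, something is probably still holding references to old batches, which would match the leak I saw.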