At Meta, I focused on enhancing GPU cluster performance through the development of monitoring frameworks and incident management processes. My role involved leading cross-functional team projects to debug complex NCCL failures, which significantly reduced training disruptions and improved system resilience.
🎯 Focusing
Infrastructure Engineer and Python Enthusiast interested in infrastructure design, automation and security.
- AI Systems at Meta
- Sunnyvale
- https://www.linkedin.com/in/mittalmak
- @mittalmak
Popular repositories
- automation_scripts (Public): A repo to keep some of the automation scripts that I have created.
Python 1
- python-practice (Public): A repository I created to push my code to GitHub while I was learning Python.
Python