
Hugging Face acquires XetHub from ex-Apple researchers for large AI model hosting



Hugging Face today announced it has acquired Seattle-based XetHub, a collaborative development platform founded by former Apple researchers to help machine learning teams work more efficiently with large datasets and models. 

While the exact value of the deal remains undisclosed, CEO Clem Delangue said in an interview with Forbes that this is the largest acquisition the company has made thus far.

The HF team plans to integrate XetHub’s technology with its platform and upgrade its storage backend, enabling developers to host more large models and datasets than currently possible — with minimal effort.

“The XetHub team will help us unlock the next 5 years of growth of HF datasets and models by switching to our own, better version of LFS as a storage backend for the Hub’s repos,” Julien Chaumond, the CTO of the company, wrote in a blog post.

What does XetHub bring to Hugging Face?

Founded in 2021 by Yucheng Low, Ajit Banerjee and Rajat Arya, who worked on Apple’s internal ML infrastructure, XetHub made a name for itself by providing enterprises with a platform to explore, understand and work with large models and datasets.

The offering enabled Git-like version control for repositories going up to TBs in size, allowing teams to track changes, collaborate and maintain reproducibility in their ML workflows.

During these three years, XetHub drew a sizeable customer base, including major names like Tableau and Gather AI, thanks to its ability to handle the scalability challenges posed by constantly growing files, models and artifacts. It improved storage and transfer processes using advanced techniques like content-defined chunking, deduplication, instant repository mounting and file streaming.

Now, with this acquisition, the XetHub platform will cease to exist and its data and model handling capabilities will come to the Hugging Face Hub, upgrading the model and dataset sharing platform with a more optimized storage and versioning backend.

On the storage front, the HF Hub currently uses Git LFS (Large File Storage) as its backend. It launched in 2020, but Chaumond says the company has long known that the storage system would not suffice beyond a point, given the constantly growing volume of large files in the AI ecosystem. It was a good starting point, but the company needed an upgrade, which will now come with XetHub.

Currently, the XetHub platform supports individual files larger than 1TB, with total repository sizes going well above 100TB, making it a major upgrade over Git LFS, which only supports a maximum file size of 5GB and a maximum repository size of 10GB. This will enable the HF Hub to host even larger datasets, models and files than is currently possible.

On top of this, XetHub’s additional storage and transfer features will make the package even more lucrative.

For instance, the platform’s content-defined chunking and deduplication capabilities will let users upload only the chunks containing new rows when a dataset is updated, rather than re-uploading the entire set of files (which takes a lot of time). The same will apply to model repositories. 
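To make the idea concrete, here is a minimal, hypothetical Python sketch of content-defined chunking with hash-based deduplication. This is not XetHub’s actual implementation: the Gear-style rolling hash, the chunk-size parameters and the in-memory “store” are all illustrative assumptions.

```python
import hashlib
import random

# Hypothetical illustration only -- not XetHub's actual implementation.
# A Gear-style rolling hash picks chunk boundaries from the content
# itself, so appending data only changes the chunks near the end.
random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random table
MASK = (1 << 10) - 1   # ~1KB average chunk size (illustrative)
MIN_CHUNK = 64         # avoid degenerate tiny chunks

def chunk(data: bytes) -> list[bytes]:
    """Split data into chunks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if (h & MASK) == 0 and i + 1 - start >= MIN_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup_upload(data: bytes, store: dict[str, bytes]) -> int:
    """'Upload' only chunks not already in the store; return bytes sent."""
    sent = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        if key not in store:
            store[key] = c
            sent += len(c)
    return sent
```

Because the boundaries depend only on the bytes that precede them, appending rows to a file leaves the earlier chunks untouched, so a second “upload” of the grown file sends only the changed tail chunks rather than the whole file.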

“As the field moves to trillion parameters models in the coming months (thanks Maxime Labonne for the new BigLlama-3.1-1T) our hope is that this new tech will unlock new scale both in the community and inside of enterprise companies,” the CTO noted. He added that the companies will work closely to launch solutions aimed at helping teams collaborate on their HF Hub assets and track how they are evolving. 

Currently, the Hugging Face Hub hosts 1.3 million models, 450,000 datasets and 680,000 spaces, totaling as much as 12PB in LFS.

It will be interesting to see how these numbers grow once the enhanced storage backend, with its support for larger models and datasets, comes into play. The timeline for the integration and the launch of other supporting features remains unclear at this stage.
