A quick announcement
On July 20th, I'll be hosting a digital class on Optimizing Large Language Models (LLMs). I'll be detailing cutting-edge techniques for faster and more efficient fine-tuning of LLMs, best practices for managing large datasets, and inference optimizations.
The half-day online conference will be hosted on the O'Reilly Media platform on July 20th from noon to 3pm ET.
As someone who has spent the past couple of years at Nebula.io neck-deep in fine-tuning and deploying LLMs, I am thrilled to share the tips, tricks, and cutting-edge best practices I've gathered.
If it sounds like something you’d be interested in, you can register here.
Vector search
As a newsletter that aims to help you become a better NLP practitioner, I would be doing you a disservice if I only covered the latest modeling developments or algorithmic advancements. Tools that enable us to train and deploy ML models are just as important. This week in Let’s Talk Text we’re building on previously covered topics relating to DuckDB, Parquet files, and vector search. You should read up on those if you’re not familiar.
Vector search: provides the ability to store vectors and an efficient way to compute similarity between them, so you can determine which vectors are related to a query.
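At its simplest, that means "find the stored vectors most similar to a query vector." A brute-force sketch in plain NumPy (cosine similarity, no index — the data here is random and purely illustrative):

```python
import numpy as np

def brute_force_search(vectors, query, k=3):
    # Cosine similarity between the query and every stored vector
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    # Indices of the k most similar vectors, best match first
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(10_000, 128))  # 10k stored vectors of dimension 128
q = vecs[42]                           # query whose nearest neighbor is itself
top = brute_force_search(vecs, q, k=3)
```

Dedicated vector databases exist because this exhaustive scan stops scaling: approximate indexes (e.g. HNSW, IVF) trade a little recall for sublinear search time.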
Check out this link for a comprehensive review of the state of vector search options.
Vector search is on fire
Pinecone raised a $100 million Series B at a $750 million post-money valuation on April 27, 2023
Weaviate.io raised $50 million on April 21, 2023
Qdrant raised $7.5 million in seed financing on April 19, 2023
If you need to use one of these hosted vector databases, you already know who you are. Otherwise, I’d wager that the size of your data or the complexity of your use case isn’t large enough to spend tons of money on a hosted vector database.
Introducing LanceDB
LanceDB is a serverless, production-scale, and open-source vector search database that simplifies the retrieval, filtering, and management of vectors.
There’s a ton of nuance to unpack in that sentence:
Serverless
It’s serverless, which means you can build and run applications without worrying about the underlying servers and infrastructure. Serverless doesn’t mean there are no servers; it means you pay for the resources you use rather than for the size of a pre-provisioned cluster. You don’t need to monitor the database’s load or capacity because a serverless database can scale elastically. These are the reasons you would want to use a serverless database:
No one needs to worry about scaling or monitoring the database
You don’t need to pre-provision a database that may be under- or over-sized; with a serverless database, you pay for what you use.
Production-scale
Fast
LanceDB is backed by Lance, a modern columnar file format adapted from Parquet but up to 100x faster for random access. It was created to store vectors, a data type that is often useful for ML workloads and absolutely necessary for vector search. LanceDB is also written entirely in Rust with Python API wrappers, which, combined with this blazingly quick file format, makes search extremely fast.
It’s so quick that their documentation notes that unless you have at least 100k data points with vectors of dimension 1,000, you can perform a search in under 20 ms without even building an index.
Feature-rich
With LanceDB you can combine filters with your vector searches: you can store metadata columns directly from Pandas DataFrames in the database and filter search results against those fields.
Ecosystem compatibility
Along with Pandas, LanceDB integrates directly with LangChain, DuckDB, and Apache Arrow.
Open-source
You can check out all the source code and documentation.