Lantern CLI
Product Quantization
With the Lantern CLI's pq-table routine, you can apply product quantization (PQ) to a specified table column. The quantized column can later be used to save memory and storage for indexed vector search with a PQ index.
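As a rough illustration of where the savings come from, the sketch below estimates the per-vector size before and after PQ for the --clusters 256 and --splits 32 values used in the commands in this guide. It assumes 128-dimensional vectors stored as 4-byte floats and one centroid index per subvector. These are assumptions for the arithmetic only, and the actual storage layout of the quantized column may differ.

```shell
#!/bin/sh
# Assumptions (not stated in this guide): 128-dimensional vectors stored as
# 4-byte floats, and one centroid index per subvector.
DIM=128
CLUSTERS=256
SPLITS=32

# Smallest number of bits that can address CLUSTERS distinct centroids.
bits=0
n=1
while [ "$n" -lt "$CLUSTERS" ]; do
  n=$((n * 2))
  bits=$((bits + 1))
done

subvector_dim=$((DIM / SPLITS))            # dimensions covered by one code
raw_bytes=$((DIM * 4))                     # uncompressed size per vector
pq_bytes=$(( (SPLITS * bits + 7) / 8 ))    # compressed size, rounded up to bytes

echo "dimensions per subvector: $subvector_dim"
echo "raw bytes per vector:     $raw_bytes"
echo "PQ bytes per vector:      $pq_bytes"
echo "compression ratio:        $((raw_bytes / pq_bytes))x"
```

Under these assumptions each vector shrinks from 512 bytes to 32 bytes. The estimate ignores the codebook itself, which is stored once per table rather than once per row.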
Prerequisites
- Lantern CLI
- Postgres database with the Lantern extension installed
Run PQ
lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32
The job can run either on a local instance or as GCP batch jobs, which parallelize the workload over hundreds of VMs to speed up clustering.
To run locally use:
lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32
The job will be run on the current machine utilizing all available cores.
For large datasets (over 1M rows) it is more convenient to run the job using GCP batch jobs. Make sure your GCP credentials are set up before running this command:
lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --run-on-gcp
If you prefer to orchestrate the tasks on your own on-prem servers, follow these 3 steps:
- Run the setup job. This creates the necessary tables and adds the pqvec column to the target table:
  lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --skip-codebook-creation --skip-vector-compression
- Run the clustering job. This creates the codebook for the table and exports it to a Postgres table:
  lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --skip-table-setup --skip-vector-compression --parallel-task-count 10 --subvector-id 0
  In this case the command should be run 32 times, once for each subvector ID in the range [0-31]. --parallel-task-count means that at most 10 tasks will run in parallel, which keeps the job from exceeding the max connection limit on Postgres.
- Run the compression job. This compresses the vectors using the generated codebook and exports the results to the pqvec column:
  lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --skip-table-setup --skip-codebook-creation --parallel-task-count 10 --total-task-count 10 --compression-task-id 0
  In this case the command should be run 10 times, once for each task ID in the range [0-9] (passed as --compression-task-id). --parallel-task-count means that at most 10 tasks will run in parallel, which keeps the job from exceeding the max connection limit on Postgres.
The table must have a primary key for this job to work. If the primary key column is not named id, provide it using the --pk argument.
The progress indicator is divided into 4 parts: 5% loading the data, 70% codebook creation, 15% vector quantization, 10% exporting the results to the target column.
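The three on-prem steps above can be sketched as one shell script. This is a minimal sketch, not an official orchestration tool: the DRY_RUN switch and the function names are hypothetical, the connection string and table values are the placeholders from the examples, and the flag names follow the commands shown above (check them against lantern-cli pq-table --help for your installed version). By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Placeholder values copied from the examples above; adjust for your setup.
URI='postgresql://[username]:[password]@localhost:5432/[db]'
TABLE='sift1m'
COLUMN='v'
CLUSTERS=256
SPLITS=32        # one clustering task per subvector ID
TOTAL_TASKS=10   # number of compression tasks
PARALLEL=10      # upper bound on tasks running at the same time

# Hypothetical switch: with DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# Build the full pq-table command from the shared options plus "$@".
pq_table() {
  set -- lantern-cli pq-table --uri "$URI" --table "$TABLE" --column "$COLUMN" \
    --clusters "$CLUSTERS" --splits "$SPLITS" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"   # display only; shell quoting is not reproduced
  else
    "$@"
  fi
}

# Step 1: create the codebook table and add the pqvec column.
run_setup() {
  pq_table --skip-codebook-creation --skip-vector-compression
}

# Step 2: one clustering task per subvector ID, in batches of PARALLEL.
run_clustering() {
  id=0
  while [ "$id" -lt "$SPLITS" ]; do
    pq_table --skip-table-setup --skip-vector-compression \
      --parallel-task-count "$PARALLEL" --subvector-id "$id" &
    id=$((id + 1))
    # After each full batch, wait for it to finish before starting the next.
    if [ $((id % PARALLEL)) -eq 0 ]; then wait; fi
  done
  wait
}

# Step 3: one compression task per task ID, in batches of PARALLEL.
run_compression() {
  id=0
  while [ "$id" -lt "$TOTAL_TASKS" ]; do
    pq_table --skip-table-setup --skip-codebook-creation \
      --parallel-task-count "$PARALLEL" --total-task-count "$TOTAL_TASKS" \
      --compression-task-id "$id" &
    id=$((id + 1))
    if [ $((id % PARALLEL)) -eq 0 ]; then wait; fi
  done
  wait
}

run_setup
run_clustering
run_compression
```

Setting DRY_RUN=0 in the environment makes the script execute the commands instead of printing them. The sketch does not check the exit status of background tasks, so a production script would need to collect failures before moving on to the next step.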
CLI parameters
Run lantern-cli pq-table --help to get the available CLI parameters:
Quantize table
Usage: lantern-cli pq-table [OPTIONS] --uri <URI> --table <TABLE> --column <COLUMN>
Options:
-u, --uri <URI>
Fully associated database connection string including db name
-t, --table <TABLE>
Table name
-s, --schema <SCHEMA>
Schema name [default: public]
-c, --column <COLUMN>
Column name to quantize
--codebook-table-name <CODEBOOK_TABLE_NAME>
Name for codebook table
--dataset-limit <DATASET_LIMIT>
Dataset limit. Limit should be greater or equal to cluster count
--clusters <CLUSTERS>
Cluster count for kmeans [default: 256]
--splits <SPLITS>
Subvector count to split vector [default: 1]
--subvector-id <SUBVECTOR_ID>
Subvector part to process
--skip-table-setup
If true, codebook table will not be created and pq column will not be added to table. So they should be set up externally
--skip-vector-quantization
If true vectors will not be quantized and exported to the table
--skip-codebook-creation
If true codebook will not be created
--pk <PK>
Primary key of the table, needed for quantization job [default: id]
--total-task-count <TOTAL_TASK_COUNT>
Number of total tasks running (used in gcp batch jobs)
--parallel-task-count <PARALLEL_TASK_COUNT>
Number of tasks running in parallel (used in gcp batch jobs)
--quantization-task-id <QUANTIZATION_TASK_ID>
Task id of currently running quantization job (used in gcp batch jobs)
--run-on-gcp
If true job will be submitted to gcp
--gcp-cli-image-tag <GCP_CLI_IMAGE_TAG>
Image tag to use for GCR. example: 0.0.38-cpu
--gcp-project <GCP_PROJECT>
GCP project ID
--gcp-region <GCP_REGION>
GCP region. Default: us-central1
--gcp-image <GCP_IMAGE>
Full GCR image name. default: {gcp_region}-docker.pkg.dev/{gcp_project_id}/lanterndata/lantern-cli:{gcp_cli_image_tag}
--gcp-quantization-task-count <GCP_QUANTIZATION_TASK_COUNT>
Task count for quantization. default: calculated automatically based on dataset size
--gcp-quantization-task-parallelism <GCP_QUANTIZATION_TASK_PARALLELISM>
Parallel tasks for quantization. default: calculated automatically based on max connections
--gcp-clustering-task-parallelism <GCP_CLUSTERING_TASK_PARALLELISM>
Parallel tasks for quantization. default: calculated automatically based on max connections and dataset size
--gcp-enable-image-streaming
If image is hosted on GCR this will speed up the VM startup time
--gcp-clustering-cpu <GCP_CLUSTERING_CPU>
CPU count for one VM in clustering task. default: calculated based on dataset size
--gcp-clustering-memory-gb <GCP_CLUSTERING_MEMORY_GB>
Memory GB for one VM in clustering task. default: calculated based on CPU count
--gcp-quantization-cpu <GCP_QUANTIZATION_CPU>
CPU count for one VM in quantization task. default: calculated based on dataset size
--gcp-quantization-memory-gb <GCP_QUANTIZATION_MEMORY_GB>
Memory GB for one VM in quantization task. default: calculated based on CPU count
-h, --help
Print help