Product Quantization

With the Lantern CLI's pq-table routine, you can apply product quantization (PQ) to a specified table column. The quantized column can later be used with a PQ index to save memory and storage for indexed vector search.
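
To get a sense of the savings, here is a back-of-envelope calculation for the example settings used on this page. It assumes 128-dimensional float4 vectors (as in the SIFT1M dataset) and relies on the fact that 256 clusters fit into a single byte per subvector; the exact on-disk representation may differ.

bash

# Assumptions: 128-dimensional float4 vectors, --splits 32, --clusters 256.
# With 256 clusters, each subvector is encoded as a one-byte centroid id.
DIM=128; SPLITS=32
ORIGINAL_BYTES=$((DIM * 4))   # 4 bytes per float4 element -> 512 bytes per vector
CODE_BYTES=$((SPLITS * 1))    # 1 byte per subvector -> 32 bytes per vector
echo "original: ${ORIGINAL_BYTES} B, quantized: ${CODE_BYTES} B, ratio: $((ORIGINAL_BYTES / CODE_BYTES))x"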

Prerequisites

Run PQ

bash

lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32

The job can be run on a local instance, or as GCP Batch jobs that parallelize the workload over hundreds of VMs to speed up clustering.

To run the job locally, use:

bash

lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32

The job will run on the current machine, utilizing all available cores.
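
Once the local run finishes, you can sanity-check the result. The sketch below assumes psql is available and that the quantized codes were written to the pqvec column described in the manual steps further down; adjust the names to your setup.

bash

# Inspect the table definition and peek at one quantized vector.
psql 'postgresql://[username]:[password]@localhost:5432/[db]' \
  -c '\d sift1m' \
  -c 'SELECT pqvec FROM sift1m LIMIT 1;'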

For large datasets (over 1M rows), it is convenient to run the job using GCP Batch. Make sure you have GCP credentials set up before running this command:

bash

lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --run-on-gcp
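
If the machine you are running from is not yet authenticated against GCP, a typical setup with the gcloud CLI looks like the sketch below. Exactly which APIs and IAM permissions the batch jobs need is an assumption here, so check your project configuration; the project id is a placeholder.

bash

# Authenticate with Application Default Credentials and select a project.
gcloud auth application-default login
gcloud config set project my-project-id        # placeholder project id
# The GCP Batch API is assumed to be required for submitting the jobs.
gcloud services enable batch.googleapis.com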

If you prefer to orchestrate the tasks yourself on your own on-prem servers, follow these 3 steps:

  1. Run the setup job. This will create the necessary tables and add a pqvec column to the target table

    bash

    lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --skip-codebook-creation --skip-vector-quantization
    
  2. Run the clustering job. This will create the codebook for the table and export it to a Postgres table

    bash

    lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --skip-table-setup --skip-vector-quantization --parallel-task-count 10 --subvector-id 0
    

    In this case the command should be run 32 times, once for each subvector in the range [0-31]. --parallel-task-count means that at most 10 tasks will run in parallel, which keeps the job from exceeding the max connection limit on Postgres. One way to drive this loop is shown in the sketch after this list.

  3. Run the compression job. This will compress the vectors using the generated codebook and export the results to the pqvec column

    bash

    lantern-cli pq-table --uri 'postgresql://[username]:[password]@localhost:5432/[db]' --table "sift1m" --column "v" --clusters 256 --splits 32 --skip-table-setup --skip-codebook-creation --parallel-task-count 10 --total-task-count 10 --quantization-task-id 0
    

    In this case the command should be run 10 times, once for each part of the codebook in the range [0-9]. --parallel-task-count means that at most 10 tasks will run in parallel, which keeps the job from exceeding the max connection limit on Postgres. The orchestration sketch after this list covers this step as well.

    The table must have a primary key for this job to work. If the primary key column is not named id, provide it using the --pk argument

    The progress indicator is divided into 4 parts: 5% for loading the data, 70% for codebook creation, 15% for vector quantization, and 10% for exporting the results to the target column
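
The following is one way to drive steps 2 and 3 on a single machine, shown here as a sketch: it uses seq and xargs to cap the number of concurrent lantern-cli invocations at 10, and reuses the connection string and flag values from the examples above. If you spread the work across several servers, split the id ranges between them instead.

bash

DB_URI='postgresql://[username]:[password]@localhost:5432/[db]'

# Step 2: run the clustering job for each of the 32 subvectors, at most 10 at a time.
seq 0 31 | xargs -P 10 -I{} lantern-cli pq-table --uri "$DB_URI" \
  --table "sift1m" --column "v" --clusters 256 --splits 32 \
  --skip-table-setup --skip-vector-quantization \
  --parallel-task-count 10 --subvector-id {}

# Step 3: run the compression (quantization) job in 10 parts, at most 10 at a time.
seq 0 9 | xargs -P 10 -I{} lantern-cli pq-table --uri "$DB_URI" \
  --table "sift1m" --column "v" --clusters 256 --splits 32 \
  --skip-table-setup --skip-codebook-creation \
  --parallel-task-count 10 --total-task-count 10 --quantization-task-id {}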

CLI parameters

Run lantern-cli pq-table --help to see the available CLI parameters:

bash

Quantize table

Usage: lantern-cli pq-table [OPTIONS] --uri <URI> --table <TABLE> --column <COLUMN>

Options:
  -u, --uri <URI>
          Fully associated database connection string including db name
  -t, --table <TABLE>
          Table name
  -s, --schema <SCHEMA>
          Schema name [default: public]
  -c, --column <COLUMN>
          Column name to quantize
      --codebook-table-name <CODEBOOK_TABLE_NAME>
          Name for codebook table
      --dataset-limit <DATASET_LIMIT>
          Dataset limit. Limit should be greater or equal to cluster count
      --clusters <CLUSTERS>
          Cluster count for kmeans [default: 256]
      --splits <SPLITS>
          Subvector count to split vector [default: 1]
      --subvector-id <SUBVECTOR_ID>
          Subvector part to process
      --skip-table-setup
          If true, codebook table will not be created and pq column will not be added to table. So they should be set up externally
      --skip-vector-quantization
          If true vectors will not be quantized and exported to the table
      --skip-codebook-creation
          If true codebook will not be created
      --pk <PK>
          Primary key of the table, needed for quantization job [default: id]
      --total-task-count <TOTAL_TASK_COUNT>
          Number of total tasks running (used in gcp batch jobs)
      --parallel-task-count <PARALLEL_TASK_COUNT>
          Number of tasks running in parallel (used in gcp batch jobs)
      --quantization-task-id <QUANTIZATION_TASK_ID>
          Task id of currently running quantization job (used in gcp batch jobs)
      --run-on-gcp
          If true job will be submitted to gcp
      --gcp-cli-image-tag <GCP_CLI_IMAGE_TAG>
          Image tag to use for GCR. example: 0.0.38-cpu
      --gcp-project <GCP_PROJECT>
          GCP project ID
      --gcp-region <GCP_REGION>
          GCP region. Default: us-central1
      --gcp-image <GCP_IMAGE>
          Full GCR image name. default: {gcp_region}-docker.pkg.dev/{gcp_project_id}/lanterndata/lantern-cli:{gcp_cli_image_tag}
      --gcp-quantization-task-count <GCP_QUANTIZATION_TASK_COUNT>
          Task count for quantization. default: calculated automatically based on dataset size
      --gcp-quantization-task-parallelism <GCP_QUANTIZATION_TASK_PARALLELISM>
          Parallel tasks for quantization. default: calculated automatically based on max connections
      --gcp-clustering-task-parallelism <GCP_CLUSTERING_TASK_PARALLELISM>
          Parallel tasks for clustering. default: calculated automatically based on max connections and dataset size
      --gcp-enable-image-streaming
          If image is hosted on GCR this will speed up the VM startup time
      --gcp-clustering-cpu <GCP_CLUSTERING_CPU>
          CPU count for one VM in clustering task. default: calculated based on dataset size
      --gcp-clustering-memory-gb <GCP_CLUSTERING_MEMORY_GB>
          Memory GB for one VM in clustering task. default: calculated based on CPU count
      --gcp-quantization-cpu <GCP_QUANTIZATION_CPU>
          CPU count for one VM in quantization task. default: calculated based on dataset size
      --gcp-quantization-memory-gb <GCP_QUANTIZATION_MEMORY_GB>
          Memory GB for one VM in quantization task. default: calculated based on CPU count
  -h, --help
          Print help