Retrieval#
1. Pipeline#
The retrieval method could solve the search or extreme multiclass classification problem.
generate data -> train -> eval
pretrained encoding -> build hard negative -> train -> eval -> indexing -> retrieval
pretrain -> fine tuning -> distill
2. Offline indexing#
QUERY_ENCODE_DIR=nq-queries
OUT_DIR=temp
MODEL_DIR="BAAI/bge-base-zh-v1.5"
QUERY=nq-test-queries.json
mkdir $QUERY_ENCODE_DIR
python -m retrievals.pipelines.embed \
--model_name_or_path $MODEL_DIR \
--output_dir $OUT_DIR \
--do_encode \
--fp16 \
--per_device_eval_batch_size 256 \
--data_name_or_path $QUERY \
--is_query true
3. Retrieval#
Faiss retrieval#
BM25 retrieval#
Elastic search retrieval#
Ensemble retrieval#
we can use RRF_fusion to ensemble multiple retrievals to improve the retrieval performance.