Text cleaning:
Strip HTML tags, remove special characters, normalize whitespace.
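The cleaning steps above can be sketched as a small function (a minimal illustration; the exact regexes and the set of characters kept are assumptions, not a prescribed ruleset):

```python
import re
from html import unescape

def clean_text(raw: str) -> str:
    """Strip HTML tags, drop special characters, normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)              # strip HTML tags
    text = unescape(text)                            # decode entities like &amp;
    text = re.sub(r"[^\w\s.,;:!?'\"-]", " ", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()         # normalize whitespace
```

For example, `clean_text("<p>Hello &amp; world</p>")` yields `"Hello world"` (the ampersand falls under "special characters" here).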
Chunking:
Divide texts into coherent chunks of approximately 100-400 words. Overlap slightly (~50 words) for context continuity.
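A minimal word-based chunker along these lines (the 300-word window and 50-word overlap are illustrative defaults within the ranges above):

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into ~size-word chunks, each sharing ~overlap words with the previous one."""
    words = text.split()
    step = size - overlap          # advance by less than the window to create overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(words):
            break                  # last window already reached the end of the text
    return chunks
```

Splitting on sentence or paragraph boundaries instead of raw word counts keeps chunks more coherent, at the cost of a little extra logic.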
Embedding:
Convert each chunk into a dense vector, either with a local model (e.g., sentence-transformers all-MiniLM-L6-v2) or a hosted API.

# Example: OpenAI embeddings API call
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    input=["chunk of text"],
    model="text-embedding-3-large",
)
embedding_vector = response.data[0].embedding  # list of floats
Storage:
Store each chunk as a record that pairs the vector with its metadata and the source text:

{
  "id": "unique_chunk_id",
  "metadata": {
    "title": "Article Title",
    "url": "https://...",
    "date": "YYYY-MM-DD",
    "chunk_number": 3
  },
  "vector": [0.123, -0.456, ...],
  "text": "The actual chunk text for reference"
}
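Assembling such a record from a chunk and its embedding is straightforward; a sketch, where the field names follow the schema above and the id scheme is an assumption (anything unique per chunk works):

```python
from datetime import date

def make_record(chunk_text: str, vector: list[float],
                title: str, url: str, chunk_number: int) -> dict:
    """Pair a chunk's embedding with the metadata needed to trace it to its source."""
    return {
        "id": f"{url}#chunk-{chunk_number}",  # hypothetical id scheme: url + chunk index
        "metadata": {
            "title": title,
            "url": url,
            "date": date.today().isoformat(),
            "chunk_number": chunk_number,
        },
        "vector": vector,
        "text": chunk_text,
    }
```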
Retrieval:
Embed the query with the same model used to embed the chunks, then search the store by vector similarity.

# Example: embedding a query with sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode("demographic trends 2025")
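With records in the storage format above, the similarity search itself can be sketched as a brute-force cosine ranking (a stand-in for a real vector database, which would index the vectors instead of scanning them):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], records: list[dict], k: int = 5) -> list[dict]:
    """Rank stored records by cosine similarity to the query embedding."""
    return sorted(records, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)[:k]
```

The top-ranked records' `text` fields are what gets handed to the language model as context.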