Excellent.
From OpenAI’s prompting guide, I’ve been using their suggested “inner monologue” approach as a means of building up a thought-through response over multiple steps. (It appears to work better with GPT-4.) There are some other useful techniques in here too.
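The core of the trick is to ask the model to do its reasoning inside a delimiter, then strip that section out before showing the user. A minimal sketch of the idea, assuming the openai Python client and an OPENAI_API_KEY in the environment (the prompt wording is my own paraphrase, not verbatim from the guide):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask the model to reason privately inside triple quotes, then answer after.
system = (
    'Work through the problem step by step inside triple quotes ("""..."""), '
    "then give your final answer to the user after the closing quotes."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "Is 1,001 prime?"},
    ],
)

full = response.choices[0].message.content
# Keep only what follows the last delimiter, hiding the working-out.
answer = full.split('"""')[-1].strip()
print(answer)
```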
… is exceedingly useful as it lets you dump arbitrary Python objects to disk and then retrieve them easily, without thinking at all about wrangling them to/from another data format like JSON.
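For anyone who hasn’t used it, the round trip really is this short (standard library, nothing assumed beyond a writable disk):

```python
import pickle

# Dump an arbitrary Python object (nested structures included) to disk...
results = {"model": "gpt-4", "scores": [0.91, 0.87], "notes": {"run": 3}}

with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# ...and get it back later, structure intact, no format wrangling.
with open("results.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == results
```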
Perplexity suggests there are potential issues with pickle, though (it is fairly vintage, and unpickling untrusted data can execute arbitrary code), and points to some alternatives here:
I can’t attest to the veracity of its claims, but it’s definitely worth a poke about.
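I haven’t dug into its suggestions, but the usual first port of call for simple data is plain JSON, since json.load never executes code the way pickle.load can on untrusted input. A minimal sketch:

```python
import json

# JSON only covers plain data (dicts, lists, strings, numbers, bools, None),
# but loading it is safe regardless of where the file came from.
results = {"model": "gpt-4", "scores": [0.91, 0.87]}

with open("results.json", "w") as f:
    json.dump(results, f)

with open("results.json") as f:
    restored = json.load(f)

assert restored == results
```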
… is becoming more of a thing, as a route to meaningful comparison across different prompts, datasets, and so on. If you were looking for a rigorous way of establishing a baseline for a particular dataset-and-pipeline combo, then this points the way towards how you might start thinking about it:
Optimizing LLMs: Tools and Techniques for Peak Performance Testing - Semaphore
It’s a bit of a lightweight article (and geared towards CI), but there are some helpful links in there, including to LangChain Evaluators, which gives you a few different tools for LLM-powered evaluation: answer comparison, criteria comparison, and so on.
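A minimal sketch of the criteria evaluator, assuming LangChain’s load_evaluator entry point and an OPENAI_API_KEY in the environment (this API has moved around between versions, so treat it as indicative rather than definitive):

```python
from langchain.evaluation import load_evaluator

# An LLM-as-judge evaluator: grades an output against a named criterion.
# load_evaluator defaults to an OpenAI chat model unless you pass llm=...
evaluator = load_evaluator("criteria", criteria="conciseness")

result = evaluator.evaluate_strings(
    prediction="Paris is the capital of France, and has been for centuries.",
    input="What is the capital of France?",
)

# Typically a dict with "reasoning", "value" ("Y"/"N") and "score" (1/0).
print(result)
```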
<aside> ☀️ If we look busy then maybe the robots will spare us.
</aside>