For this project, we are trying to enhance and evaluate performance on a mathematical reasoning task consisting of word problems that is commonly used across the literature: GSM8k. After a set of permutations is generated from an initial sequence of prompt phrases, each permutation is evaluated against the ~1300 question-answer pairs of the GSM8k test set. First, the `{QUESTION HERE}` placeholder in the prompt is filled with a specific GSM8k question, and the resulting string is passed to an LLM to generate a response. The response is then compared against the reference answer: the prompt receives a score of 1 if the LLM answered correctly, and 0 otherwise. The accuracy of the prompt is the average score across all question-answer pairs in the GSM8k dataset.
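The loop below is a minimal sketch of this scoring procedure, assuming a simple substring exact-match check and a `query_llm` helper; the project's actual template substitution and answer-extraction logic live in its evaluation code and may differ.

```python
# A minimal sketch of the scoring loop described above. The exact-match
# check and the query_llm() helper are assumptions, not the project's
# actual code.
def evaluate_prompt(prompt_template: str, dataset: list[dict], query_llm) -> float:
    """Score a prompt template against GSM8k-style question-answer pairs."""
    score = 0
    for pair in dataset:
        # Substitute the concrete question into the prompt template.
        prompt = prompt_template.replace("{QUESTION HERE}", pair["question"])
        response = query_llm(prompt)
        # Score 1 if the reference answer appears in the response, else 0.
        score += 1 if pair["answer"] in response else 0
    # Accuracy is the average score over all question-answer pairs.
    return score / len(dataset)
```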
Finally, an output file is generated with the score for every prompt, so that we can identify the optimal prompt format for GSM8k-like mathematical reasoning tasks and compare trends between well-performing and poorly performing prompts.
The LLM used in this project is llama3, served by Ollama as the backend; however, any other Ollama-hosted model can be used with minimal changes. To use a different LLM API, replace the `OllamaClient` class in `prompt_utils.py` with your own client implementation.
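As a rough illustration, a drop-in replacement might look like the sketch below. It assumes `OllamaClient` exposes a single `generate(prompt) -> str` method (check `prompt_utils.py` for the real interface) and swaps in the OpenAI API as an example backend:

```python
# A hedged sketch of a replacement client, not the project's actual code.
# The generate() interface is an assumption taken from how OllamaClient is
# described above; match whatever prompt_utils.py really expects.
import openai


class OpenAIClient:
    def __init__(self, model: str = "gpt-4o-mini", temperature: float = 0.0):
        self.client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model
        self.temperature = temperature

    def generate(self, prompt: str) -> str:
        # Same role as OllamaClient: take a prompt string, return the completion text.
        response = self.client.chat.completions.create(
            model=self.model,
            temperature=self.temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```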
Note that the log and output files should probably be adjusted for your use case.
#### Evaluation config
See the contents of `config.json`. The evaluation `temperature` and `top_p` parameters are passed directly to the LLM when it generates an output from a prompt. See the [Ollama parameter docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#PARAMETER) for details on what each parameter means.
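As a rough sketch of how these parameters reach the model, the snippet below loads `config.json` (assuming top-level `temperature` and `top_p` keys, which may not match the actual file) and forwards them as generation options to Ollama's local REST endpoint:

```python
# A minimal sketch of forwarding config.json sampling parameters to Ollama.
# The config keys shown here are assumptions; check config.json for the
# actual names used by this project.
import json

import requests

with open("config.json") as f:
    config = json.load(f)

response = requests.post(
    "http://localhost:11434/api/generate",  # default local Ollama endpoint
    json={
        "model": "llama3",
        "prompt": "What is 12 * 7?",
        "stream": False,
        # Sampling options applied at generation time.
        "options": {
            "temperature": config["temperature"],
            "top_p": config["top_p"],
        },
    },
)
print(response.json()["response"])
```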