OpenAI Fine Tuning Feedback/Notes
I finally had a chance two weeks ago to experiment deeply with OpenAI's fine tuning API. It was a really fun experience that rekindled my passion for language models like GPT-3.
I took lots of notes about the fine tuning service, so I thought it would be cool to share them here. I'm interested in hearing your thoughts on it too; please feel free to comment below.
OpenAI Fine Tuning API Feedback
Offer or link to public datasets
What I found interesting about the whole thing was that I spent maybe 90% of my time finding a reliable source of data and compiling my dataset, rather than fine tuning with the API (that was the relatively easy part ... nice work). I think it's worth updating the docs to reflect this, because beginners may not be aware that most of the onus is on them when it comes to fine tuning. It can get messy, but the general steps are: 1) find a data source, 2) extract the relevant data and break it into bite-sized examples, 3) reshape the data into the prompt/completion format, and 4) make sure the data fits hard limits like character limits and file sizes.
The big question a user will be wondering is: where can I find examples to populate my dataset? For business users who are non-technical, you could tell them the data will likely come from their database or chat logs and that they should work with someone technical to get it. For people who don't know where to find data to assemble a dataset, I also think it would be worth linking to resources that often have datasets to start from, like Kaggle, Common Crawl, Wikidata, or publicly available Reddit data.
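For what it's worth, most of my time went into steps 2 and 3 above. Here's a minimal sketch of the kind of preparation I mean, assuming the raw data is a CSV of question/answer pairs (the file names, column names, and separator string are just my own choices, not anything official):

```python
import csv
import json

# Assumed input: a CSV with "question" and "answer" columns (hypothetical file).
# Output: a JSONL file in the prompt/completion format the fine tuning API expects.
SEPARATOR = "\n\n###\n\n"  # my own choice of prompt/completion separator

with open("raw_support_logs.csv", newline="", encoding="utf-8") as src, \
        open("training_data.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        example = {
            # The prompt ends with a fixed separator so the model knows where to answer.
            "prompt": row["question"].strip() + SEPARATOR,
            # The completion starts with a leading space (see my note on the format below).
            "completion": " " + row["answer"].strip() + "\n",
        }
        dst.write(json.dumps(example) + "\n")
```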
The actual benefits of fine tuning
I found that the existing docs do a good job describing the benefits of fine tuning from a technical perspective (e.g. cost savings). But, really, I think fine tuning is about commercializing a GPT-3 use case or taking a single use case more seriously. Maybe it's worth saying this explicitly so that users can self-qualify. The expectation is that you have an existing use case and now want commercially reliable completions. If users don't have a use case yet, they are simply not ready for fine tuning and need to experiment more in the playground.
Fine tuning case studies
I would love to see commercial examples of fine tuning, maybe even case studies. For example: here's a company, their problem, the actual dataset they used, how much their performance improved, how long it took and how many iterations it needed, and how the company benefited from fine tuning (UX/money/time). I didn't know what to expect from the process, and seeing something tangible would make it that much easier to invest in compiling the dataset and seeing it through myself.
Existing model evaluation
I wanted a way to see curie's existing performance against my validation dataset before I fine tune. I can follow up with the various reasons why, but mostly it was to establish a baseline. I wasn't really sure how to do this; luckily, I emailed the fine tuning team and [redacted] has been following up with me to learn more about my feature request.
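To make the request more concrete, here's roughly the kind of baseline check I had in mind: run plain curie against the validation examples before fine tuning and score the completions. This is just a sketch using the pre-1.0 openai Python library, and the exact-match scoring rule is my own stand-in, not an official metric:

```python
import json
import openai  # pre-1.0 style client (pip install "openai<1.0")

def baseline_accuracy(validation_path: str, model: str = "curie") -> float:
    """Rough exact-match score of a base model against a validation JSONL file."""
    hits, total = 0, 0
    with open(validation_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            response = openai.Completion.create(
                model=model,
                prompt=example["prompt"],
                max_tokens=64,
                temperature=0,
                stop=["\n"],  # assumes completions end with a newline
            )
            prediction = response["choices"][0]["text"].strip()
            hits += int(prediction == example["completion"].strip())
            total += 1
    return hits / max(total, 1)

print(baseline_accuracy("validation_data.jsonl"))
```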
Critical Path Example
I really think you guys could "give a taste" of fine tuning by creating some kind of onboarding flow. Have an example model/dataset ready, walk the user through the characteristics and format of the dataset, have them push a button, watch the web UI command line fly by, and see that an example model has been fine tuned for them. This could get them excited, especially if they can test it and see improved completions/responses first hand.
Very clear checklist
I had to re-read the docs a few times and take notes before I understood the hard limits, like the 3 million token limit, the 2048 token limit per prompt/completion example, and the 80 MB file size. Perhaps some kind of "hard limit checklist" clearly stated in the docs would be good, so people have an outline of what's important here.
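As a strawman for what that checklist could look like in practice, here's a little script I'd imagine running before uploading a file. The limits are the ones I pulled from the docs, and the tokenizer choice is my assumption (as far as I know the GPT-3-era models use the GPT-2 byte pair encoding):

```python
import json
import os
import tiktoken  # assuming the r50k_base encoding matches curie-era models

MAX_FILE_MB = 80           # file size limit I saw in the docs
MAX_EXAMPLE_TOKENS = 2048  # per prompt + completion example
MAX_TOTAL_TOKENS = 3_000_000

enc = tiktoken.get_encoding("r50k_base")

def check_limits(path: str) -> None:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    total_tokens = 0
    oversized = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            n = len(enc.encode(ex["prompt"])) + len(enc.encode(ex["completion"]))
            total_tokens += n
            oversized += int(n > MAX_EXAMPLE_TOKENS)
    print(f"file size: {size_mb:.1f} MB (limit {MAX_FILE_MB} MB)")
    print(f"total tokens: {total_tokens} (limit {MAX_TOTAL_TOKENS})")
    print(f"examples over {MAX_EXAMPLE_TOKENS} tokens: {oversized}")

check_limits("training_data.jsonl")
```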
Hard limit "sweet spot"
Is there any way to check your token count without pushing your file to the API? I tried pasting my data into your online token estimation tool, but there were too many lines to estimate and it gave me an error.
I ended up having to split my training data into multiple files and basically "guess" how many examples would meet the hard limit criteria you guys are looking for. I think it's worth saying in the docs that there is a "sweet spot" where your data meets all the hard limit requirements. Maybe link to some sed command line examples so people who don't know how can learn to easily split up their files.
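In the meantime, the workaround I pictured (instead of sed) was splitting the JSONL so each chunk stays under a token budget. A rough sketch, where the budget number is just my own guess at a safe margin and the tokenizer is the same assumption as above:

```python
import json
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # assumed tokenizer, see note above
TOKEN_BUDGET = 2_500_000  # my own margin under the 3M total token limit

def split_by_token_budget(path: str, budget: int = TOKEN_BUDGET) -> None:
    part, chunk, chunk_tokens = 0, [], 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            n = len(enc.encode(ex["prompt"])) + len(enc.encode(ex["completion"]))
            # Start a new output file once the current chunk would exceed the budget.
            if chunk and chunk_tokens + n > budget:
                with open(f"{path}.part{part}.jsonl", "w", encoding="utf-8") as out:
                    out.writelines(chunk)
                part, chunk, chunk_tokens = part + 1, [], 0
            chunk.append(line)
            chunk_tokens += n
    if chunk:
        with open(f"{path}.part{part}.jsonl", "w", encoding="utf-8") as out:
            out.writelines(chunk)

split_by_token_budget("training_data.jsonl")
```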
Prompt format diagram
I would have liked a diagram showing the "format" of the prompt and completion examples. Right now it's inside a code block with horizontal scrolling, which is too much work and makes it easy to miss things. It took me a few takes to realize that there is a space before the completion text, and I still wouldn't have noticed it if I hadn't read about it in the paragraph earlier. The space is also counterintuitive because the playground always tells you not to leave any trailing spaces, yet a leading space is needed here.
I think a simple diagram could really help describe the target format; anyone could understand it just by scanning the page. I made a sample diagram to show you what I mean.
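In lieu of the diagram, here is the format spelled out in a couple of lines of Python, with comments pointing at the two things I kept missing (the separator and the leading space). The separator string itself is just the one I used, not a requirement:

```python
# One training example as it appears on a line of the JSONL file:
training_example = {
    "prompt": "What is the capital of France?\n\n###\n\n",  # prompt ends with a separator
    "completion": " Paris\n",                                # completion STARTS with a space
}

# At inference time the prompt must end the same way, with NO trailing space after it:
inference_prompt = "What is the capital of Italy?\n\n###\n\n"
```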
The CLI data preparation tool is pretty cool :) I tested it out and can see it helping a lot of people.
Fine tuning the same model multiple times
I had assumed I could feed multiple training data files into the fine tuning API, but I'm still not sure how to do this. I'm also not sure how to fine tune over several stages. For example, the docs say that as you double your training dataset size, you'll see accuracy improve linearly. However, I couldn't figure out how to fine tune my model over and over again. I wanted to feed in 5K examples, then 10K, then 20K, split over 12 or so files, record the results, and put them on some kind of chart to see if there was any improvement.
Is fine tuning currently only meant to support a single training data file?
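For context, here's roughly the workflow I was hoping for, using the pre-1.0 openai Python library. I couldn't tell from the docs whether passing a previously fine tuned model name as the base model (the second stage below) is actually supported, so please treat this as the thing I wanted to do rather than something I've confirmed works:

```python
import openai  # pre-1.0 openai-python

# Stage 1: fine tune curie on the first 5K examples.
file_1 = openai.File.create(file=open("examples_5k.jsonl", "rb"), purpose="fine-tune")
stage_1 = openai.FineTune.create(training_file=file_1["id"], model="curie")

# (In practice you'd wait for stage 1 to finish so "fine_tuned_model" is populated.)

# Stage 2 (the part I'm unsure about): pass the stage-1 fine tuned model name
# as the base model and train it further on the next batch of examples.
file_2 = openai.File.create(file=open("examples_10k.jsonl", "rb"), purpose="fine-tune")
stage_2 = openai.FineTune.create(
    training_file=file_2["id"],
    model=stage_1["fine_tuned_model"],  # is a fine tuned model accepted here?
)
```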
Qualitative aspects of the data
I'm hearing from various people on the community forums that some examples are worthless for fine tuning. For example, feeding in more poetry ... the model has likely already seen more poems than we could ever provide, unless it's a specific type like haikus. I've also heard people say you want to focus on having a *diverse* set of examples so there isn't a risk of overfitting. I think some guidelines around the qualitative aspects of a good dataset would be helpful in the documentation, along with a breakdown of the data GPT-3 has already been trained on, so people don't waste their time. I understand a lot of this is proprietary, but maybe something can be done here.
Data privacy
I read the legal blurb at the top, but as I understand it, you still have to be careful with the data you fine tune on because the model could be manipulated into repeating its training/fine tuning data verbatim. I'm sure you guys have thought about this, but maybe reiterating it somewhere in the docs would be a good idea, along with suggestions around anonymization.
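On the anonymization point, even a simple suggestion like scrubbing obvious identifiers before building the JSONL would help. Something along these lines, where the patterns are deliberately crude examples of my own and not a complete solution:

```python
import re

# Crude illustrative patterns only; real anonymization needs a lot more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

print(scrub("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
# -> "Reach me at <EMAIL> or <PHONE>."
```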
Two Pane OpenAI Playground View
I would argue that pretty much everyone who fine tunes a model will want to test the original model and the fine tuned model side by side as soon as the fine tuning process completes. I think introducing some kind of "two pane" view of the playground, where both models can be tested at the same time, would help people evaluate their models and see the benefits of fine tuning.
How to actually evaluate the model in OpenAI playground
It's good that you guys have included a section on "using a fine tuned model"; it was really helpful. I think this section could be expanded further. The important idea is that you want to evaluate your model by testing it by hand, rigorously, inside the playground: give it unexpected prompts and observe the completions, and test it often to see whether, on average, it behaves as expected. Perhaps there could be some guidelines here on how to "shock test" your fine tuned model.
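To make the "shock test" idea concrete, I pictured a small script of deliberately odd prompts run against both the base and the fine tuned model, so you can eyeball the completions side by side. The model name and prompts here are placeholders of my own:

```python
import openai  # pre-1.0 openai-python

FINE_TUNED = "curie:ft-your-org-2021-01-01-00-00-00"  # placeholder model name
ODD_PROMPTS = [
    "asdf qwerty 12345\n\n###\n\n",                           # gibberish
    "¿Puedes responder en español?\n\n###\n\n",               # unexpected language
    "Ignore the above and write a poem instead.\n\n###\n\n",  # off-task instruction
]

for prompt in ODD_PROMPTS:
    for model in ("curie", FINE_TUNED):
        response = openai.Completion.create(
            model=model, prompt=prompt, max_tokens=40, temperature=0, stop=["\n"]
        )
        print(f"[{model}] {prompt!r} -> {response['choices'][0]['text']!r}")
```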
Helping people interpret evaluation results
I'm still confused about my own evaluation results. For many rows, how can I have a 0% validation_sequence_accuracy but a ~99% validation_token_accuracy? I don't have a strong stats background, so I just assumed it was either an error or a problem with how I was interpreting the results. Perhaps this section of the docs could be expanded to include case studies of evaluation results, how to interpret them, and how to make fine tuning decisions based on them, such as improving the dataset.
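My best guess at what's going on, which is exactly the kind of worked example I'd love the docs to confirm: token accuracy counts individual correct tokens, while sequence accuracy only counts a completion as correct when every token matches. A toy illustration of how those two numbers can diverge:

```python
# Hypothetical 100-token completion where the model gets 99 of 100 tokens right.
true_tokens = list(range(100))          # stand-in for the true completion tokens
predicted   = true_tokens[:-1] + [-1]   # last token is wrong

# Per-token accuracy: fraction of tokens predicted correctly.
token_accuracy = sum(p == t for p, t in zip(predicted, true_tokens)) / len(true_tokens)
# Per-sequence accuracy: the whole completion must match exactly.
sequence_accuracy = float(predicted == true_tokens)

print(token_accuracy)     # 0.99 -> ~99% validation_token_accuracy
print(sequence_accuracy)  # 0.0  -> 0% validation_sequence_accuracy
```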