Today, I was playing around with GitHub Copilot to see if I could get it to spit back my own open-source code verbatim.
A simple test was to target the README file of a GitHub repository I had open-sourced called Vue Style Transfer. The file includes a "philosophy" section, where I describe the project's underlying philosophy. I wanted to see how much of that text GitHub Copilot would autocomplete exactly.
So, I created a new .md (Markdown) README file inside VS Code and pasted in the opening description from the repository's README file:
The white part is the text I pasted in, and the dark grey part is Copilot's suggested continuation.
But as you can see, the actual README.md file has slightly different text:
Lately, Style Transfer models have improved in their results and also decreased in size (down to a few KB). At the same time, users' machines and devices have also become more powerful, and for the first time, able to perform style transfer computations on-the-fly.
Vue Style Transfer was created with the goal of allowing you to seamlessly integrate state of the art style transfer neural network models in your website application.
I tried to autocomplete further:
But the actual text says:
Vue Style Transfer was created with the goal of allowing you to seamlessly integrate state of the art style transfer neural network models in your website application. This creates an opportunity to delight your users with new kinds of, refreshing, image graphics which are uniquely generated client-side every page visit.
In short, I was not able to get it to plagiarize the open-source code I had written early last year, even though it was trained on it.
It's a lot like GPT-3 in many ways. I often feel like GPT-3 "restates" things in its own unique way. I have found that if you google outputs GPT-3 comes up with, many of them are genuinely unique, with no other instances of them currently on the internet.
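One rough way to put a number on "verbatim" versus "restated" is to measure the longest run of words a suggestion shares with the original text. This is an illustrative sketch using Python's standard library, not a method used in the experiment above; the example strings are stand-ins for a Copilot suggestion and the real README text.

```python
# Quantify how "verbatim" a model suggestion is by finding the
# longest run of consecutive words it shares with the original.
from difflib import SequenceMatcher

def longest_shared_run(original: str, suggestion: str) -> int:
    """Length (in words) of the longest verbatim word run shared by both texts."""
    a, b = original.split(), suggestion.split()
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size

original = ("Vue Style Transfer was created with the goal of allowing you "
            "to seamlessly integrate state of the art style transfer "
            "neural network models in your website application.")

# A hypothetical paraphrased suggestion, like the ones Copilot produced.
suggestion = ("Vue Style Transfer was created to let you integrate "
              "style transfer models in your web application.")

print(longest_shared_run(original, original))    # identical text: the full word count
print(longest_shared_run(original, suggestion))  # paraphrase: only a short shared run
```

A long shared run relative to the text's length suggests memorization; short scattered runs, like Copilot produced here, look more like restatement.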
This has always been one of the most surprising things about GPT-3. Is it learning in ways similar to how humans learn? Or are we just going to keep calling it a form of "compression"? Is compression what has always been going on inside the mind, human, machine, or otherwise? These are all questions worth thinking about.
I have seen other people successfully get GitHub Copilot to spit back training data verbatim, including sensitive information and contact info, so it has been done before. However, from what I'm seeing so far, you have to force it, ignoring many of its suggestions along the way, before it will repeat the exact training data it was given.
Many developers were deeply upset that GitHub Copilot was trained on repositories they had released as open source. But it's worth asking: if it's mostly learning from the code rather than repeating it exactly, how is that different from how those developers learned to code? You can't copyright facts, nor can you sue someone for learning from the environment and examples around them. Is hostility the answer? If you see yourself as a teacher, with human or AI students, I would argue it should feel quite rewarding, but it takes a perspective shift to see it this way.
But perhaps this is far too optimistic. People's livelihoods as programmers (including my own) are at stake, so it may be hard to look at the situation differently no matter what.