A Data Scientist’s Guide to Open Source Licensing

[Cross Post from Authors Blog


First, as with any good article dealing with law, a disclaimer. I do have a law degree, but I did not sit for the bar exam and instead pursued a career in data science. So, while this can be thought of as an informed opinion, it should not be taken as legal advice. If you have a real concern involving licensing, contact an attorney for legal advice. On the other hand, if you want an overview of all these ‘license’ things you encounter when doing your work, read on.

I promise that was the most legalese you’ll see in this article.

If you’re reading this, then you’ve probably encountered a license (likely in the context of GitHub) and thought ‘huh, I should look up what that actually means.’ It may be your first time thinking that, or more likely your several hundredth time thinking it.

As data scientists, the choice of which open source license to use as we prepare to share a project can be . Not only are there a plethora of open source licensing standards (Wikipedia lists about three dozen distinct ones, of which my favorite is probably the Do What the F*ck You Want To Public License), but many of these deal with areas that don’t always map well or easily to the work done by data scientists.

Image for post

So if you looked at the dropdown list that Github offers you when you’re building a repo (see the image to the left) and gone “uh…what?”, me too.

Luckily, because I did the law school thing, the “uh…what?” became a non-rhetorical question, and I dove into the world of open source licensing.

It is…not as simple as you might hope. This article provides only the broadest strokes overview so that you can feel more confident picking and/or using a license.

If you take one thing from this article, it should be that the biggest distinction between open source licenses is whether they are ‘’ or ‘’.

Copyleft licenses require that all modification to the original software or software based, even in part, on the original must to be released under the same open source license as the original and must therefore be open source themselves. So if I write program B based on program A and program A has a copyleft-type license, then I must also release the source code of my program B as open source under the same license. This ensures that resources remain open source, but it can discourage adoption of your code.

On the other hand, permissive licenses allow someone to build whatever they want, including closed source commercial software, on top of the licensed code. This means that I can take program X, write program Y on top of it, make it closed source, and make a lot of money off of it without giving you any. While this might sound scary, there are often other incentives to have downstream software remain open source. Additionally, more people are going to feel comfortable using software with this type of license, and as a result, most projects are built on permissive licenses.

Among Github users, 72% use MIT, Apache 2.0 or BSD (the three big permissive licenses). That being said, some massive open-source projects use GPL, which is copyleft, most notably Linux, WordPress, and WikiMedia (the underlying software to Wikipedia).

Does it matter? Yes it does. Google the “BSD + Patents” debate that happened last year over Facebook’s use of a more restrictive license for products like React and you’ll get plenty of discussion on why it matters. There’s a lot of philosophy behind these debates, and it’s probably worth having an opinion on which you prefer, or at least when you might want to use one versus the other.

A Brief Overview of Intellectual Property

I’m going to briefly review the three major types of intellectual property — i.e. copyrights, trademarks, and patents — before we dive in because these are the three things that licenses actually cover. Intellectual property (IP) is why these licenses exists.

You’ve probably heard the term IP used. It gets referenced in start-up culture an awful lot. In common parlance, it simply means property that is a “creation of the mind.” In law, it gets a bit more specific. The former law student in me is legally required to quote Cornell’s law dictionary here:

IP is intangible and the results of someone’s brain. You can hold a purse, but you can’t hold a purse’s “.” Because people and companies spend a lot of time, energy, and resources bringing ideas to life, we allow them to “own” these ideas to reward them for sharing what they’ve created, to incentivize them to create more, and to protect them so they feel safe sharing these ideas in the .

In the case of copyright, we give these ownership rights for a very long time. Typically, a copyright is for the life of the author/sculptor/coder plus a number of years. In the case of patents, not so long, and in the case of trademark, potentially forever, but only if you’re using it. Again, this is  broad strokes and law is very much a game of details, but for the purpose of imparting a high-level conceptual understanding in a single article, this is good enough.

One final note, in law ‘-or’ designates the person doing something and ‘-ee’ designates the person it was done to. Licensor — giving a license. Licensee — receiving the license.

Copyright subsists at the moment of creation. What that means is that the copyright of a work comes into existence at the moment of creation. If what you’ve created it able to be protected by copyright and is original, then you don’t need to do anything else for the copyright to exist. The stories you’ve heard about mailing something to yourself are just stories. So long as you can prove you made it first, it’s copyrighted.

There are definite advantages to registering a copyright federally, such as a public record of ownership and the ability to sue in federal court, but it isn’t strictly speaking necessary. Once you decide to release something under an open source license, you are effectively granting someone the right to use your copyrighted material (your code).

Final note, copyright  protect facts or ideas, only the expression of ideas. A story about a wizard isn’t subject to copyright, but a story about a wizard boy named Harry with a lightning scar on his forehead probably is.

Patents on the other hand deal with functionality and are There’s an entire separate bar to be considered a patent attorney, for good reason. For our purposes, we’ll stick with utility patents. If you come up with new, ‘non-obvious’ functionality, you can potentially receive a patent on it.

Patents, unlike copyrights, must be applied for. And with patents, it’s whoever applies first, . So even if you create an entirely new piece of code, if it utilizes patented functionality, you’d be infringing on a patent. Patents are complex, and for a better overview of how they operate in the context of litigation, I recommend this article by a patent attorney, which also touches on the React controversy:

4 Lessons from the ‘React Patent License’ Controversy

Making sense out of the confusion…


Software patents have a convoluted history (which you can read more about here). Because of the uncertainty surrounding them and the and difficulty of getting a patent, copyright has emerged as the dominant way that a majority of the code you encounter will be protected. Software patents tend to exists only in the context of large or commercial-scale projects.

Additionally, in the U.S., the courts do not allow ‘laws of nature’ to be subject to patent. You can’t patent a mathematical concept or a discovery in physics, but you might be able to patent a particular, unique, novel usage of that law of nature. Think: the chemical formula for making a drug is patentable but not basic molecular structure itself.

The granting of trademarks is a topic beyond the scope of this article. Generally, trademarks refer to something that denotes a product’s . That’s why we think of trademarks as brand names, logos, and designs. They tell us where the product is coming from. Some licenses you’ll encounter may explicitly grant or forbid the use of associated trademarks.

Let’s say you’re going to buy a new briefcase or purse. Your friends have been raving about its amazing features, cool new design, and great quality. It features a new type of handle that makes it make easier to hold (patent), a distinctive color and tasteful pattern (copyright), and bears a big ol’ Gucci symbol right in the middle (trademark).

Goldilocks and the Three Licenses

Image for post

I’m going to explain the three major licenses you will encounter. There are more than three licenses, but there are only three bears in the Goldilocks story, so that’s all you’re getting.

You’ll notice that Apache gets the coveted spot of “Just Right.” This doesn’t mean Apache is always best. It’s not. This is a helpful memory technique, and if there’s a legitimate chance that your project may encounter IP concerns, you should think carefully about what’s best for your project.

Additionally, if you’re working with a company or group that you think might have specific rules, ask. It’ll make you look well informed, and most lawyers — despite how they’re played on TV — are happy to talk. A company’s lawyer will prefer talking to you before a problem occurs and will probably consider you more of a professional for having the forethought to ask.

The MIT License — “The Mama Bear”

Image for post

MIT carrying you because you were too lazy to pick out a license. Source.

The MIT license is the most popular license type. It’s a short, do what you want permissive license that pretty much just says “don’t sue me.”

It has the lowest barrier of entry to understand, and only requires that you include the original copyright and license somewhere in your project folder. It’s worth noting that someone could use your code to create a commercial closed-source software and profit from it. The hope is that you’ll give back to the community anyway.

The major difference between MIT and Apache is that MIT doesn’t explicitly mention patents. There’s an argument that this means that the MIT license gives an ‘implicit patent license’. That may be true, but the issue has not yet been litigated, so it also may not be true.

The GPL License — “The Papa Bear”

Image for post

GPL doesn’t like it when you try to close off its derivatives (cubs). 

GPL is not a permissive license. It’s firmly copyleft or ‘share-alike’. If you create a derivative work based on the code under certain circumstances, you must provide your source code. The ideas of ‘derivative work’ and ‘certain circumstances’ are legal concepts, and this type of license can add a lot more legal and technical complexity. For this reason, some companies will avoid any and all GPL-licensed projects period.

These types of licenses are occasionally thought of as having the possibility of ‘infecting’ code since once you use code under this license to build your own project, you must share your own software under the same license, even if its a small part of your overall project.

A way some have gotten around this GPL requirement is releasing the source code for the base version, but offering premium features and add-ons that cost money. This is the so-called ‘freemium’ model.

The Apache — “The Just Right”

Image for post

Was this article just an excuse to post a lot of cute pictures of bear cubs? Probably. Source.

Apache is also permissive and is similar to MIT in many ways. It’s a bit longer and deals with certain specific circumstances, which can occasionally help avoid conflicts, but also makes the license more complex.

Apache notably deals with the two additional classes of intellectual property: trademarks and patents. Apache explicitly states that it doesn’t grant permission for use of trademarks of the licensor (so don’t use trademark material except to say “Hey, this is derived from X.”)

As for patents, if you start patent litigation, you automatically terminate any patent licenses granted by the Apache license. This is an attempt to discourage the creation of patents down the line on software and to stymie patent trolls.

With Apache, you must include the copyright and license — like with MIT — but you must also state to your users when you’ve made significant changes to the software and if the library you’re borrowing from has a “NOTICE” file, you must include this file when you distribute your own work. You can also add to this file yourself for future contributors and spin-offs.

And Many More…

Of course, there are dozens of other licenses floating out there.

The Berkeley Software Distribution (BSD) is another popular permissive license you may run into. Additionally, some tech communities have their own preferred licenses. I’m not going to focus on the individual licenses, their version numbers, or their history, because honestly none of that matters. What matters is what you should consider when evaluating any license. Here are a few criteria to think through:

  • Type — Is this license copyleft or permissive? Which matches what I want for the long-term future of my code? Do I care more about wide adoption or ensuring open-source status? What’s my development philosophy? Am I worried about my project being forced to be open through use of a copyleft license? Am I worried someone might take my work and make money off of it without me?
  • Code Use, Modification & Distribution — What am I allowed to do with this code? What do I want others to be able to do with this code? Can I distribute it? Under what circumstances? Can others?
  • User Obligations — What do you have to do? Do you need to include a specific file in your library? Do you need to cite to anything or anyone? Do I want anyone else to have to do these things?
  • Patent Licensing — Do I anticipate needing to patent what I make? Am I building something commercial? Do I want to eventually have my product acquired by someone else?
  • Trademark Grants — Do I need access to the trademark of the licensor? Do I want someone to be able to use my trademarks?

Wrapping Things Up

The big distinction between most open source software licensing is whether its copyleft (if you use this, you’ll probably have to make your source code open source) and permissive (you can do what you want subject to explicit limitations). MIT is the most commonly used license, and probably serves your purposes, although Apache does have some language that explicitly deals with patents and trademarks. GPL, the copyleft license we talked about here, is good if you want to ensure that all the downstream developers will be open source, but this may discourage adoption of your code.

Andddddd one more for the road:

Image for post

Work that camera, bear!


Finally, some additional resources if you want to learn more or if you need help choosing a license. However, if you have a serious legal question you should always seek out qualified legal counsel.

I highly recommend this site for help in choosing a license:

Leave a Reply