Interest in deep learning keeps growing, and with the hype at its peak, a lot of people are looking for the best deep learning library to build their new app or bring their company into the modern age. There are many deep learning toolkits to choose from, ranging from long-used, well-supported, and robust academic libraries to new, state-of-the-art, industry-backed platforms.

At Curalate, we’ve been working on deep learning problems since 2014, meaning we’ve had the chance to watch the deep learning community and its open source libraries grow. We have also had the fortunate (unfortunate?) experience of using a few of the deep learning libraries in our production services and applications, and along the way we have learned a lot about what to look for in a deep learning library when building reliable, production-ready applications and services. In this post, I’ll share the lessons we learned in the hope that they help you in your search for the perfect deep learning library match. You might even find that your best fit is using more than one!

Important factors

The specific needs of your application/service

The platform you are developing on and deploying to.

Do you develop on OSX? Linux? Windows? Do you plan on having your application run in a web browser? On a smartphone? On a massive multi-node GPU cluster? It’s not surprising that each of the libraries has prioritized different environments, and some will work much better for your specific situation.

The specific deep net architecture you are trying to implement

If you are just trying to implement a typical, pre-trained classification net, this factor may not be as important for you. Some libraries are more performant and appropriate for certain types of deep nets (LSTMs, RNNs), but more on this later.

API language requirements

If you already have a code base written in language A, you probably would like to keep it that way without having to figure out some convoluted way to fit a deep net interface in language B into it. Luckily, it seems that most of the common languages are covered at this point in at least one of the libraries, or in an external community project.

Codebase Quality

Is the code base actively maintained?

How healthy is the project in terms of maintainers? Is there a large group/company committing time and resources to the library’s development? If you find a bug or issue with the library, how long is it going to take for it to get addressed?
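One rough way to put numbers on these questions is to look at the project's issue tracker. The sketch below computes a median resolution time and an open-issue percentage from a handful of made-up issue records. The data and the `issue_health` helper are hypothetical; in practice you would pull the dates from the tracker's own export or API.

```python
from datetime import date
from statistics import median

# Hypothetical issue records: (opened, closed). closed is None if still open.
issues = [
    (date(2017, 1, 1), date(2017, 1, 5)),
    (date(2017, 2, 1), date(2017, 2, 11)),
    (date(2017, 3, 1), None),
    (date(2017, 3, 10), date(2017, 4, 9)),
]

def issue_health(issues):
    """Return (median days to close, percent of issues still open)."""
    days_to_close = [(closed - opened).days
                     for opened, closed in issues if closed is not None]
    open_count = sum(1 for _, closed in issues if closed is None)
    median_days = median(days_to_close) if days_to_close else None
    open_pct = 100.0 * open_count / len(issues)
    return median_days, open_pct

median_days, open_pct = issue_health(issues)
print(f"median resolution: {median_days} days, open issues: {open_pct:.0f}%")
```

A long median or a high open percentage is not damning on its own, but it is a useful prompt to go read a few of the stale issues before committing to the library.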

Release status of the library itself

Is the library, or a certain feature/API you are going to need, still considered to be in an Alpha or Beta state? Has the library been used enough to have most of the kinks ironed out?

Ease of Use

Train to production pipeline

Your model training code and production code do not have to run in the same environments or even the same language. Can you train your model with a quick-to-prototype language in a documented, version-controlled, repeatable way so you can research new and different models for your application? Then can you deploy your saved model in a fairly quick and painless fashion? That may be through the same library with a different language API, using a library’s prebuilt production-serving framework, or even converting your model from one library to another that is better suited for your target platform.
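To make the split between training and serving concrete, here is a minimal sketch that uses a toy linear model instead of a deep net. The training side fits the parameters and exports them to JSON; the serving side only needs to parse that artifact and run inference. The function names and the JSON layout are assumptions for illustration, and real libraries ship their own model formats.

```python
import json

def train_linear(xs, ys):
    """Training side: fit y = w*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    return {"w": w, "b": mean_y - w * mean_x}

def export_model(params, version):
    """Serialize to a language-neutral artifact that can be version-controlled."""
    return json.dumps({"version": version, "params": params})

def load_and_predict(blob, x):
    """Serving side: parse the artifact and run inference, no training code needed."""
    model = json.loads(blob)
    p = model["params"]
    return p["w"] * x + p["b"]

blob = export_model(train_linear([0, 1, 2, 3], [1, 3, 5, 7]), version="v1")
print(load_and_predict(blob, 10))
```

The point of the exercise is the boundary: as long as both sides agree on the saved format, the serving code could just as well be written in another language or run on another platform.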

Keras support

Can the library be used as a backend for Keras? Keras is not a deep learning library per se, but a library that sits on top of other deep learning libraries and provides a single, easy-to-use, high-level interface for writing and training deep learning models. What it lacks in optimizations, it makes up for by being great for beginners, with great documentation and a modular, object-oriented design.

Dynamic vs Static computation

Now we could write a whole blog post on this topic alone, but to keep it brief, do you want to work with a static computation graph API that follows a symbolic programming paradigm? Or do you want a dynamic computation graph API that follows an imperative programming paradigm?

  • Static Computation Graphing
    • You define the deep net once and use a session to execute ops in the net many times.
    • The library can optimize the net before you use it, so the nets end up being more efficient with memory and speed.
    • Good for fixed-size nets (feed-forward, CNNs)
    • Leads to a more verbose API built around a domain-specific language (DSL) that is harder to debug
    • Offers better loading and model management with regard to system resources.
  • Dynamic Computation Graphing
    • Nets are built and rebuilt at runtime and executed line by line as you define them. This lets you use standard imperative language (think Python) statements, features, and control structures.
    • Tends to be more flexible and useful when the net structure needs to change at runtime, as in RNNs
    • Makes debugging easy since an error is thrown at the specific line in the dynamic graph at runtime, rather than in a single call that executes the net after it’s compiled.
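The difference is easier to see in code. The sketch below is a toy written in plain Python, not any library's actual API: the `Node`, `placeholder`, and `run` names are made up for illustration. The static half builds a graph object first and evaluates it later with fed-in values; the dynamic half simply runs Python, so loops and branches can depend on the data.

```python
# --- Static style: describe the computation first, run it later. ---
class Node:
    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

def placeholder(name):
    return Node("placeholder", name=name)

def add(a, b):
    return Node("add", (a, b))

def mul(a, b):
    return Node("mul", (a, b))

def run(node, feed):
    """Toy 'session': walk the prebuilt graph and evaluate it with real values."""
    if node.op == "placeholder":
        return feed[node.name]
    left, right = (run(n, feed) for n in node.inputs)
    return left + right if node.op == "add" else left * right

x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = add(mul(x, w), b)  # nothing is computed here; y is only a graph
static_out = [run(y, {"x": v, "w": 2.0, "b": 1.0}) for v in (1.0, 2.0, 3.0)]

# --- Dynamic style: the "graph" is whatever Python executes on this call. ---
def dynamic_net(sequence, w, b):
    state = 0.0
    for value in sequence:   # loop length depends on the input itself
        state = state * w + value
        if state > 100:      # ordinary control flow, ordinary debugger
            break
    return state + b

dynamic_out = dynamic_net([1.0, 2.0, 3.0], w=2.0, b=1.0)
print(static_out, dynamic_out)
```

In the static half, a mistake in the graph only surfaces inside `run`, far from the line that built the offending node. In the dynamic half, the traceback points at the line that failed.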



How good is the documentation? Are there coding examples that cover most of the use cases you need? Are you used to getting your documentation in a certain style from a specific company?

Community support

How large is the community? Just because a deep learning library is really good does not mean people are actually using it. Are you going to be able to find 3rd party blog posts, code samples, and tutorials using the library? If you run into a problem, what is the chance you are going to find someone on Stack Overflow with the answer to your problem?


Does the research community actively use the library to develop state-of-the-art deep learning models and solutions? A lot of state-of-the-art discoveries made by the academic community require modification to the deep learning libraries themselves and it’s pretty common for research groups to release their source code for conference papers to the public. Most of these new models will be released as pretrained models and listed in a Model Zoo specific to the library. Porting these solutions between libraries is not a trivial task if you are not comfortable reimplementing the research paper.


Performance with specific network structures

How fast does your planned network structure run on each of the deep learning libraries? Will you be able to train and prototype your models faster on one vs another? If you are deploying to a service, how many requests per second can you expect to run through the library?
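If you want to answer these questions for your own model, a small timing harness goes a long way. The following is a minimal sketch: `stand_in_model` is a placeholder for whatever forward pass your candidate library exposes, and the warmup and run counts are arbitrary assumptions you should tune.

```python
import time

def measure_throughput(predict, batch, warmup=3, runs=20):
    """Return requests per second for `predict` on a fixed batch."""
    for _ in range(warmup):  # let caches and lazy initialization settle
        predict(batch)
    start = time.perf_counter()
    for _ in range(runs):
        predict(batch)
    elapsed = time.perf_counter() - start
    return runs * len(batch) / elapsed

def stand_in_model(batch):
    """Placeholder for a real forward pass; swap in your library's call."""
    return [sum(x * x for x in example) for example in batch]

batch = [[float(i)] * 256 for i in range(32)]
rps = measure_throughput(stand_in_model, batch)
print(f"{rps:,.0f} requests/second at batch size {len(batch)}")
```

Run the same harness at several batch sizes and on the hardware you plan to deploy to, since relative rankings between libraries can change with both.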


How well does the library scale when you start providing it with more resources to meet your production load? Can you save money by using a more efficient scaling library over another? (Cloud GPU instances can be really expensive)
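Scaling efficiency is usually expressed as the fraction of ideal linear speedup you actually get, and it translates directly into cost. Here is a small worked sketch; all of the throughput and price figures are hypothetical.

```python
def scaling_efficiency(single_device_throughput, n_devices, measured_throughput):
    """Fraction of ideal linear speedup actually achieved."""
    ideal = single_device_throughput * n_devices
    return measured_throughput / ideal

def cost_per_million_samples(samples_per_sec, n_devices, dollars_per_device_hour):
    """Dollars spent to push one million samples through the cluster."""
    hours = 1_000_000 / samples_per_sec / 3600
    return hours * n_devices * dollars_per_device_hour

# Hypothetical: one GPU does 100 samples/s, eight GPUs together do 680 samples/s.
efficiency = scaling_efficiency(100, 8, 680)
cost = cost_per_million_samples(samples_per_sec=680, n_devices=8,
                                dollars_per_device_hour=0.90)
print(f"efficiency: {efficiency:.0%}, cost per million samples: ${cost:.2f}")
```

With these made-up numbers the cluster reaches 85% of ideal scaling. A library that scaled worse on the same hardware would need more GPU-hours for the same work.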

The Libraries

Caffe, with its unparalleled performance and well-tested C++ codebase, was basically the first mainstream, production-grade deep learning library. Caffe is good for implementing CNNs, image processing, and for fine-tuning pre-trained nets. In fact, you can do all of these things while writing little to no code. You just place your training/validation data (mainly pictures) in a specific folder, set up config files for the deep net and its training parameters, and then call a precompiled Caffe binary that trains your net.

Being first to market means that a lot of early research and models were written with Caffe, and the research that built off of that forked and continued to use the same code base. Because of this, you will find a lot of state-of-the-art work, even to this day, still using Caffe despite its limitations. A lot of these models can be found in the Caffe Model Zoo, which is one of the first and largest (if not the largest) model zoos.

But now we have to start talking about its limitations. Caffe was built and designed around its original intended use case: conventional CNN applications. Because of this, Caffe is not very flexible. Overall, it’s not very good for RNNs and LSTM networks. Even with its adoption of CMake, building the library can still be a pain (especially for non-Linux environments). It has little support for multiple GPUs (training only) and can only be deployed to a server environment. The configuration files that define the deep net structure are very cumbersome. The prototxt for ResNet-152 is 6775 lines long!

In Caffe, the deep net is treated as a collection of layers, as opposed to nodes of single tensor operations. Layers can be thought of as a composition of multiple tensor operations. These layers are not very flexible and there are a lot of them that duplicate similar logic internally. Because Caffe does not support auto differentiation, if you want to develop new layer types, you have to define the full forward and backward gradient updates yourself. You can define these layers in Caffe’s Python interface, but unlike other libraries where the Python interface is accelerated by their underlying C implementations, Caffe Python layers run in Python.
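To give a feel for what "define the full forward and backward" means in practice, here is a minimal sketch of a hand-written layer in plain Python. It is not Caffe's layer interface; the `ScaledTanh` class and its methods are made up for illustration. The gradient is derived by hand and then checked against a finite-difference estimate, which is the kind of sanity check you end up writing when the library cannot differentiate for you.

```python
import math

class ScaledTanh:
    """Toy layer y = a * tanh(x) with a hand-written backward pass."""
    def __init__(self, a):
        self.a = a

    def forward(self, xs):
        self.tanh_xs = [math.tanh(x) for x in xs]  # cached for backward
        return [self.a * t for t in self.tanh_xs]

    def backward(self, grad_out):
        # dy/dx = a * (1 - tanh(x)^2), derived by hand
        return [g * self.a * (1.0 - t * t)
                for g, t in zip(grad_out, self.tanh_xs)]

def numerical_grad(layer, xs, i, eps=1e-6):
    """Central finite difference of sum(forward(xs)) with respect to xs[i]."""
    hi = xs[:i] + [xs[i] + eps] + xs[i + 1:]
    lo = xs[:i] + [xs[i] - eps] + xs[i + 1:]
    return (sum(layer.forward(hi)) - sum(layer.forward(lo))) / (2 * eps)

layer = ScaledTanh(a=1.7)
xs = [-1.0, 0.0, 0.5]
layer.forward(xs)
analytic = layer.backward([1.0, 1.0, 1.0])
numeric = [numerical_grad(layer, xs, i) for i in range(len(xs))]
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"max gradient error: {max_error:.2e}")
```

With auto differentiation you would write only `forward` and the library would derive `backward` for you, which is a large part of why newer libraries are quicker to prototype in.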

So should you use Caffe? If you are looking to reimplement some specific model from a research paper from 2015 using existing, open source code, it is not a bad library. If you are looking for raw performance and are not opposed to using a C++ library and API on a GPU server for your service/app, Caffe is still one of the fastest libraries around for fully connected networks.

But because of its limitations and technical debt, a lot of the community and its efforts have moved on from Caffe in some form or another. Caffe is a special case when it comes to model converters: it is the best supported library, with converters to almost all other deep learning libraries, which makes it easier to move your work off of it. The creator of Caffe was hired by Google to work on their deep learning library TensorFlow, and then by Facebook to create a successor to Caffe, the appropriately named Caffe2.

  • Torch
  • Lua, C++
  • Deepmind, NYU, IDIAP
  • Watches: 675, Stars: 7,761, Forks: 2,254, Avg Issue Resolution: 55 Days, Open issues: 33%*
  • No Keras support
  • Research Citations: 955
  • Model zoo

Torch and PyTorch are related by much more than just their name. Torch was one of the original, academic-created deep learning libraries. While it may not have as much research citing its use, it still has a very large community around it. Many of the researchers who originally worked on Torch moved to Facebook. Unsurprisingly, Facebook has since developed the successor to Torch in the form of PyTorch. PyTorch and Torch use the same underlying C libraries (TH, THC, THNN, and THCUNN), which give them very similar performance characteristics. When it comes to typical deep learning architectures, Torch offers performance that is among the fastest, though not the very fastest, with GPU scaling efficiency that matches the best.

Where Torch and PyTorch differ is in their interface, API, and graphing paradigms. Torch was written with a Lua API, which can be a major barrier to entry for most people. While you can do research and development in Lua, it doesn’t have the massive community backing and vast open source libraries that Python does, so it can be quite limiting. Torch uses a static graph paradigm like Caffe’s. Also like Caffe, it does not have any auto-differentiation capabilities, meaning that if you want to implement new tensor operations for your deep net, you have to write the backward gradient calculations yourself. It does have a pretty substantial model zoo of pre-trained models.

PyTorch was made with the goal of fixing or modernizing various issues with Torch, and the result is probably one of the best currently available libraries for research and development. PyTorch, as the name suggests, has a very well designed Python API. It supports both dynamic graph programming and auto differentiation, for all of the easy-to-debug and easy-to-prototype goodness. PyTorch also has its own visualization dashboard called Visdom, which, while more limited than TensorBoard (more on this later), is still very helpful for development.

So should you use Torch or PyTorch? Specifically for research and development of new models, PyTorch is probably the best option right now. Even though PyTorch is still very new, most people in the deep learning field would agree that you should use it over classic Torch. That is not to say Torch does not have its advantages. Because of its age, it has a much larger backlog of research citing its use and is more stable than PyTorch, but both of these advantages will be lost over time. If you are looking for a library to deploy into any kind of production environment, then you should probably look elsewhere.

TensorFlow, without a doubt, is currently the biggest player in the deep learning field, and for good reason. TensorFlow is Google’s attempt to build a single deep learning framework for everything deep learning related. There is very little that TensorFlow does not do well. Because it was created by Google, it was built with massive distributed computing in mind, but it also has mobile development capabilities in the form of TensorFlow Mobile and TensorFlow Lite. Its documentation is also considered one of the best. It covers the multiple API languages that TensorFlow supports, and if you consider the interfaces made by 3rd parties in the community, TensorFlow even has APIs for C#, Haskell, Julia, Ruby, Rust, and Scala. Speaking of that community, TensorFlow has the largest community of any of the deep learning libraries and currently has the most research activity.

From the beginning, TensorFlow was made with a clear static graph API that was easy to use, but as interests and needs change in the machine learning field, it recently gained support for dynamic graph functionality in the form of TensorFlow Fold. TensorFlow has Keras support, making it very easy for beginners, and even has its own custom version of Keras built into the Python API.

When Google first released TensorFlow, they also released TensorBoard, a data visualization tool created to help you understand the flow of tensors through your model for debugging, optimization, and just understanding the complex and confusing nature of deep learning models. You can use TensorBoard to visualize your TensorFlow model, plot summary metrics about the execution of your model, and show additional data like images that pass through it.

Now what about deploying your models once you have finished training them? Well, Google also has a solution for that in TensorFlow Serving, a flexible, high-performance serving system for ML models, designed for production environments. It comes in the form of modular C++ libraries, binaries, and Docker/Kubernetes containers that can be used as an RPC server or a set of libraries. There are even Google CloudML services set up with it to get your model up in production in no time. TensorFlow Serving’s main goal is to optimize for throughput with little to no downtime. It includes a built-in scheduler that aims for the efficiency of mini-batching requests through the model and can manage multiple models at once running on shared hardware. Currently the API only supports prediction, but it will support regression, classification, and multi-inference soon.
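The idea behind mini-batching requests is simple even though a production scheduler is not. The sketch below is a toy in plain Python and is not how TensorFlow Serving is implemented: it drains a queue of pending requests in fixed-size chunks and calls the model once per chunk instead of once per request. A real scheduler also has to deal with timeouts, threads, and multiple models, all of which are left out here. The `serve_in_batches` and `double_model` names are hypothetical.

```python
from collections import deque

def serve_in_batches(requests, model, max_batch_size):
    """Drain (request_id, input) pairs, running the model once per batch."""
    queue = deque(requests)
    responses, model_calls = {}, 0
    while queue:
        size = min(max_batch_size, len(queue))
        batch = [queue.popleft() for _ in range(size)]
        ids, inputs = zip(*batch)
        outputs = model(list(inputs))  # one call amortized over the batch
        model_calls += 1
        responses.update(zip(ids, outputs))
    return responses, model_calls

def double_model(inputs):
    """Stand-in for a batched forward pass."""
    return [x * 2 for x in inputs]

pending = [(f"req-{i}", i) for i in range(10)]
responses, model_calls = serve_in_batches(pending, double_model, max_batch_size=4)
print(f"{len(responses)} responses from {model_calls} model calls")
```

On a GPU, one call over a batch of four is typically far cheaper than four calls over single inputs, which is where the throughput gain comes from.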

Now, TensorFlow is not perfect. Both Serving and Fold are still in their early days of development, so they might not be something you want to rely on. None of the APIs outside of the Python API are covered by the API stability promises. But the biggest issue with TensorFlow when compared to the other libraries is performance.

There is no real way to get around the issue; TensorFlow is just slower and more of a resource hog when compared to the other libraries. Looking at performance across your typical deep net architectures you can expect to see other libraries perform up to twice as fast as TensorFlow at similar batch sizes. You should avoid TensorFlow in general if you need performant Recurrent nets (RNNs) or Long Short Term Memory nets (LSTMs). TensorFlow is even the worst at scaling efficiency when compared to the other libraries despite its focus on distributed computing.

So should you use TensorFlow? We wouldn’t blame you if you did and would probably suggest it for 80% of the possible use cases out there. Especially if you are new to the deep learning field and want to work with a library and ecosystem that has solutions for almost everything you could possibly need. But, if you are willing to put in the extra time and effort, you can find a much more performant and equally-featured experience with other libraries.

CNTK, the Microsoft Cognitive Toolkit, was originally created by MSR Speech researchers several years ago but has evolved into much more. It is a unified framework for building deep nets, recurrent nets (RNNs), Long Short Term Memory nets (LSTMs), convolutional nets (CNNs), and Deep Structured Semantic Models (DSSMs). It can pretty much work for all types of deep learning applications, from speech/text to vision.

CNTK supports distributed training like TensorFlow and Torch. It even supports a proprietary, commercially-licensed, 1-bit Stochastic Gradient Descent algorithm that significantly improves distributed performance. Thanks to CNTK’s early focus on language models, it is 5-10 times faster than the other libraries when running dynamic network structures such as RNNs and LSTMs.
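The general idea behind 1-bit gradient compression, as we understand it, is to send only one bit per gradient value between workers and carry the rounding error forward into the next step. The sketch below is our own toy illustration of that idea in plain Python, not CNTK's implementation. The function names, and the choice of the per-sign mean as the reconstruction value, are assumptions.

```python
def one_bit_quantize(grad, residual):
    """Quantize (grad + residual) to one bit per value.

    Returns (bits, scales, new_residual). Only `bits` and the two `scales`
    would need to cross the network.
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    pos = [v for v in corrected if v >= 0]
    neg = [v for v in corrected if v < 0]
    pos_scale = sum(pos) / len(pos) if pos else 0.0
    neg_scale = sum(neg) / len(neg) if neg else 0.0
    bits = [v >= 0 for v in corrected]
    decoded = [pos_scale if bit else neg_scale for bit in bits]
    # Error feedback: whatever was lost is added to the next gradient.
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, (pos_scale, neg_scale), new_residual

def decode(bits, scales):
    """Receiver side: rebuild an approximate gradient from bits and scales."""
    pos_scale, neg_scale = scales
    return [pos_scale if bit else neg_scale for bit in bits]

grad = [0.5, -0.25, 1.5, -0.75]
bits, scales, residual = one_bit_quantize(grad, [0.0] * len(grad))
print(decode(bits, scales), residual)
```

Each value shrinks from a 32-bit float to a single bit plus two shared scales, which is why this kind of scheme can cut communication cost so sharply in distributed training.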

The biggest reason to use CNTK is if you or your company traditionally works with Microsoft software and products. CNTK is one of the few libraries to have first class support for running on Windows, with additional support for running on Linux and NO support for OSX. It has direct support for deploying to a Microsoft Azure production environment and APIs that properly support Microsoft’s languages of choice. Its model zoo is even set up in a very “MSDN documentation” fashion.

The main downside to CNTK is that it lacks support from both the general research and software dev communities. Microsoft may be using it internally for a lot of their services and probably has the reliability to support it, but it is just having trouble gaining market share (like many of Microsoft’s recent endeavors).

So should you use CNTK? If you are used to developing in Visual Studio and need an API for your .NET application, there probably is no better fit. But there are better options out there for most OSX/Linux devs with better all-around support. Also, if you are trying to do research and development that is not specific to LSTMs or RNNs, there are more appropriate libraries.

MXNet is one of the newest players in the deep learning field but has been gaining ground fast. Originally created at the University of Washington and Carnegie Mellon University, it has been adopted by both the Apache Foundation and Amazon Web Services as their deep learning library of choice, and both have put their development efforts behind it.

MXNet supports almost all of the features the other libraries support. It has the largest selection of officially supported languages for its APIs, and it can run on everything from a web browser or a mobile phone to a massive distributed server farm. In fact, Amazon has found that you can get up to 85% scaling efficiency with MXNet. In most other cases, MXNet has some of the best performance when running typical deep learning architectures.

MXNet supports both static graph programming and dynamic graph programming with the raw MXNet and Gluon APIs respectively. The Gluon API is also MXNet’s clear, concise, and simple API for deep learning created in collaboration with AWS and Microsoft in the same spirit as Keras, but MXNet also supports Keras if you prefer it. MXNet also has its own serving framework for getting your trained MXNet models into production and has extra support for running on AWS. It even has its own TensorBoard implementation that provides much of the same functionality as the TensorFlow equivalent.

MXNet does have notable weaknesses that make working with it a little more annoying. The documentation could be much better. The APIs went through a few changes before the first 1.0 release and the documentation reflects this, which can get a little confusing in some places. In terms of community support, it’s not the worst or the best, but somewhere in the middle. A notable number of people use it, including for research, and there are plenty of usage examples for different net types along with its model zoo.

So should you use MXNet? If you are willing to put the time in and deal with some of the pain points that come from it being a younger deep learning library, it is probably the best option for 80% of use cases, along with TensorFlow. We would especially suggest it over TensorFlow if performance is a big concern of yours. If you are looking for the most flexible library, one that gives you as many options as possible in your train to production pipeline with a native API for your production code, it’s probably the best option.

The Other Libraries

Now, the previous 6 deep learning libraries are by no means the only options available to you. They are just the biggest players and arguably the most relevant for 2018. There are many more to choose from that may better fit your specific needs (deployment destination, non-English documentation/community, hardware, etc.). We will try to briefly cover them here and provide a jumping-off point if you want to dig deeper into one of them.


  • Theano
  • Python API
  • University of Montreal
  • Future work on the project has stopped; may it rest in peace
  • Watches: 573, Stars: 8041, Forks: 2426, Median Issue Resolution: 12 days, Open issues: 19%*
  • Research Citations: 1,080
  • Makes you do a lot of things from scratch, which leads to more verbose code.
  • Single GPU support only
  • Numerous open-source deep learning libraries have been built on top of Theano, including Keras, Lasagne, and Blocks
  • No real reason to use over TensorFlow unless you are working with old code.


  • Caffe2
  • C++, Python APIs
  • Facebook
  • Watches: 552, Stars: 7631, Forks: 1821, Median Issue Resolution: 55 Days, Open issues: 33%*
  • Caffe2 is Facebook’s second entry into the deep learning library ecosystem.
  • It is built with a focus more on mobile and industrial-strength production applications over development and research.
  • Where Caffe only supported single GPU training, Caffe2 is built to run utilizing both multiple GPUs on a single host and multiple hosts with single to multiple GPUs.


  • Core ML
  • Swift, Objective-C APIs
  • Apple
  • Closed source
  • Not a full DL library (you cannot use it to train models at the moment), but mainly focused on deploying pre-trained models optimized for Apple devices
    • If you need to train your own model, you will need to use one of the above libraries
    • Model converters available for Keras, Caffe, Scikit-learn, libSVM, XGBoost, MXNet, and TensorFlow


  • PaddlePaddle
  • Python API
  • Baidu
  • Watches: 558, Stars: 6580, Forks: 1756, Median Issue Resolution: 7 days, Open issues: 24%*
  • One of the newest libraries available
  • Chinese documentation with an English translation
  • Has the potential to become a big player in the market


  • Neon
  • Python API
  • Intel
  • Watches: 351, Stars: 3437, Forks: 778, Median Issue Resolution Time: 28 days, Open issues: 16%*
  • Written with Intel MKL accelerated hardware in mind (Intel Xeon and Phi processors)


  • Chainer
  • Python API
  • Preferred Networks
  • Watches: 310, Stars: 3595, Forks: 949, Median Issue Resolution Time: 31 days, Open issues: 13%*
  • Research Citations: 207
  • Dynamic computation graph
  • Smaller company effort with a Japanese and English community


  • Deeplearning4j (DL4J)
  • Java, Scala APIs
  • Skymind
  • Watches: 792, Stars: 8527, Forks: 4120, Median Issue Resolution Time: 19 days, Open issues: 21%*
  • Written with Java and the JVM in mind
  • Keras Support (Python API)
  • DL4J can take advantage of distributed computing frameworks including Hadoop and Apache Spark.
  • On multi-GPUs, it is equal to Caffe in performance.
  • Can import models from TensorFlow
  • Uses ND4J (Numpy for the JVM)


  • DyNet
  • C++, Python APIs
  • Carnegie Mellon University
  • Watches: 178, Stars: 2189, Forks: 527, Median Issue Resolution Time: 4 days, Open issues: 16%*
  • Dynamic computation graph
  • Small user community


  • MatConvNet
  • MATLAB APIs
  • Watches: 113, Stars: 959, Forks: 633, Median Issue Resolution Time: 96 days, Open issues: 53%*
  • A MATLAB toolbox implementing Convolutional Neural Networks (CNNs) for computer vision applications


  • Python, C APIs
  • Watches: 520, Stars: 6276, Forks: 3072, Median Issue Resolution Time: 55 days, Open issues: 78%*
  • Very small open source effort with a laid back dev group
  • Not useful for production environments


  • Leaf
  • Rust API
  • autumnai
  • Watches: 195, Stars: 5229, Forks: 265, Median Issue Resolution Time: 131 days, Open issues: 58%*
  • Support for the lib looks to be dead


Choose either TensorFlow or MXNet for probably about 80% of use cases (TensorFlow if you prioritize community support and documentation, MXNet if you need performance). Look at PyTorch if you are mainly looking for something to develop/train new models. If you love Microsoft and are developing for a .NET environment in Windows and Visual Studio, try out CNTK. Look into Core ML if you are just deploying models to Apple devices, and Deeplearning4j if you really like to keep things JVM focused.

* Numbers taken at time of writing, expected to change.