Learning Machine Learning - Blog posts on Machine Learning. Deep Learning for Computer Vision - libraries, models and ideas.
http://luizgh.github.io/
Wed, 05 Apr 2023 13:58:17 +0000

On using double-backpropagation on pytorch

<p>While doing some experiments that required double-backpropagation in pytorch (i.e. when you need the gradient of a gradient operation), I ran into some unexpected behavior. I found little information about it online, so I decided to write this short note.</p>
<p><strong>TL;DR</strong>: If you need to compute the gradients through another gradient operation, you need to set the option <code class="language-plaintext highlighter-rouge">create_graph=True</code> on <code class="language-plaintext highlighter-rouge">torch.autograd.grad</code>. This is described in the <a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.grad">Pytorch documentation</a>.</p>
<h1 id="the-issue">The issue</h1>
<p>Suppose you need to train a model with gradient descent, and the loss function (or any other part of the computational graph) requires the usage of a derivative. For instance, in Contractive Autoencoders (CAE), the gradient of a part of the network (the Jacobian of the hidden representation w.r.t. the input) is part of the contractive loss itself. In this case, when computing the gradient of the contractive loss w.r.t. the weights (for training the model), you need to take second-order derivatives (see the <a href="http://www.icml-2011.org/papers/455_icmlpaper.pdf">paper</a>). The same problem appears in other tasks, such as meta-learning.</p>
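<p>As a concrete (hypothetical) scalar illustration of this pattern - not the CAE objective itself - consider a loss whose own gradient appears in the total loss. The names below (<code class="language-plaintext highlighter-rouge">recon_loss</code>, <code class="language-plaintext highlighter-rouge">g</code>) are illustrative:</p>

```python
import torch

# Hypothetical scalar stand-in for a loss whose gradient appears in the
# total loss (not the actual CAE objective from the paper):
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)

recon_loss = (w * x) ** 2  # w^2 x^2 = 36

# Keep the gradient in the graph so we can differentiate through it:
g = torch.autograd.grad(recon_loss, w, create_graph=True)[0]  # 2 w x^2 = 24

total_loss = recon_loss + g ** 2  # penalize the gradient magnitude
total_loss.backward()             # requires second-order derivatives

# By hand: d(total)/dw = 2 w x^2 + 2 g * (2 x^2) = 24 + 2 * 24 * 8 = 408
print(w.grad.item())
```

<p>Without <code class="language-plaintext highlighter-rouge">create_graph=True</code> in the call above, the penalty term would contribute nothing to <code class="language-plaintext highlighter-rouge">w.grad</code>, which is exactly the issue described next.</p>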
<p>When implementing this in pytorch, you may use the autograd function <code class="language-plaintext highlighter-rouge">torch.autograd.grad</code> to compute the first-order gradients, use the result in the computation of the loss, and then backpropagate. Something along these lines:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">partial_loss</span> <span class="o">=</span> <span class="n">loss_function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">grad</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">partial_loss</span><span class="p">,</span> <span class="n">w</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">total_loss</span> <span class="o">=</span> <span class="n">partial_loss</span> <span class="o">+</span> <span class="n">torch</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">grad</span><span class="p">)</span>
<span class="n">total_loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span></code></pre></figure>
<p>Although this looks good, and it will <em>actually run</em>, it will not compute what you want: the <code class="language-plaintext highlighter-rouge">total_loss.backward()</code> operation will not back-propagate through the <code class="language-plaintext highlighter-rouge">grad</code> variable.</p>
<h1 id="a-simpler-example-that-we-can-use-to-identify-the-problem">A simpler example that we can use to identify the problem</h1>
<p>Let’s create a toy example with only a few variables, so that we can check the math by hand. Let’s consider the following variables:</p>
\[a = 1 \qquad b = 2 \qquad c = a^2 b \qquad d = \Big(a + \frac{\partial c}{\partial a}\Big) b\]
<p>Finally, let’s say we need to compute \(\frac{\partial d}{\partial a}\). We can do this analytically for this small problem:</p>
\[\frac{\partial c}{\partial a} = 2ab\]
\[\frac{\partial d}{\partial a} = \frac{\partial (a + 2ab) b}{\partial a} = b(1 + 2b) = 2 (1 + 4) = 10\]
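<p>We can also sanity-check this result numerically with a central finite difference, using plain Python (no autograd involved):</p>

```python
# Numerical check of the derivation above: with b = 2,
# d(a) = (a + 2 a b) * b, so dd/da should be b (1 + 2 b) = 10.
def d(a, b=2.0):
    return (a + 2 * a * b) * b

eps = 1e-6
dd_da = (d(1.0 + eps) - d(1.0 - eps)) / (2 * eps)  # central finite difference
print(dd_da)  # approximately 10
```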
<p>Now, let’s see what pytorch does for us:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">torch</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>  <span class="c1"># requires_grad needs a floating-point tensor</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">2.0</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span>
<span class="n">dc_da</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">a</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">d</span> <span class="o">=</span> <span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">dc_da</span><span class="p">)</span> <span class="o">*</span> <span class="n">b</span>
<span class="n">dd_da</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">a</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'c: {}, dc_da: {}, d: {}, dd_da: {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">dc_da</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">dd_da</span><span class="p">))</span>
<span class="c1"># c: 2.0, dc_da: 4.0, d: 10.0, dd_da: 2.0</span></code></pre></figure>
<p>We were expecting \(\frac{\partial d}{\partial a}\) to be 10, but pytorch computed it as 2. The reason is that, by default, <code class="language-plaintext highlighter-rouge">torch.autograd.grad</code> does not create a node in the graph that can be backpropagated through. In this example, when computing \(\frac{\partial d}{\partial a}\), pytorch effectively considered \(\frac{\partial c}{\partial a}\) a constant (with respect to \(a\)), and therefore took the gradient as \(\frac{\partial d}{\partial a} = \frac{\partial (a + \text{const}) b}{\partial a} = b = 2\).</p>
<p>To obtain the correct answer, we need to pass the option <code class="language-plaintext highlighter-rouge">create_graph=True</code> when computing <code class="language-plaintext highlighter-rouge">dc_da</code>:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">torch</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">2.0</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span>
<span class="n">dc_da</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">create_graph</span><span class="o">=</span><span class="bp">True</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">d</span> <span class="o">=</span> <span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">dc_da</span><span class="p">)</span> <span class="o">*</span> <span class="n">b</span>
<span class="n">dd_da</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">a</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'c: {}, dc_da: {}, d: {}, dd_da: {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">dc_da</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">dd_da</span><span class="p">))</span>
<span class="c1"># c: 2.0, dc_da: 4.0, d: 10.0, dd_da: 10.0</span></code></pre></figure>
<h1 id="conclusion">Conclusion</h1>
<p>I found it a little tricky that pytorch did not give any errors, and simply assumed that when you compute a gradient w.r.t. a variable, you will not want to backpropagate through this node. This is counter-intuitive to me, since in all other cases the default in pytorch <em>is</em> to backpropagate (e.g. in some iterative optimizations, you need to explicitly use <code class="language-plaintext highlighter-rouge">tensor.detach()</code> to avoid backpropagating through a node). I hope this note helps other people having issues with double-backpropagation in pytorch.</p>
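<p>For completeness, a minimal sketch of that default behavior: gradients flow through every tensor unless you explicitly <code class="language-plaintext highlighter-rouge">detach()</code> it:</p>

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x
z = y.detach() * x  # detach(): treat y as a constant w.r.t. x
z.backward()

# dz/dx = y (held constant) = x^2 = 9, not d(x^3)/dx = 27
print(x.grad.item())
```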
Fri, 22 Jun 2018 00:00:00 +0000
http://luizgh.github.io/libraries/2018/06/22/pytorch-doublebackprop/

Tutorial on convolutional neural networks

<p>In October of this year, I was invited to present a tutorial on convolutional neural networks for students of the Universidade Federal do Paraná (UFPR) and PUC-PR. I created a tutorial focused on the practical side: how to define and train these models using the Theano and Lasagne libraries (the concepts are the same as for Tensorflow - in particular, the use of symbolic computation).</p>
<p>The slides and Python exercises (iPython notebooks) can be found on my <a href="https://github.com/luizgh/intro_to_cnns">github page</a>.</p>
<p>The tutorial is divided into three parts:</p>
<ul>
<li>Introduction to machine learning and Theano
<ul>
<li>Exercise: define and train logistic regression on a synthetic dataset, using Theano</li>
</ul>
</li>
<li>Convolutional neural networks (CNNs)
<ul>
<li>Exercise: define and train neural networks on the MNIST dataset, using Theano and Lasagne</li>
</ul>
</li>
<li>Transfer learning using CNNs
<ul>
<li>Exercise: use a network pre-trained on ImageNet to solve an image classification problem</li>
</ul>
</li>
</ul>
<p>Note: GPUs are not required to complete the tutorials.</p>
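<p>As a rough illustration of the first exercise (the actual notebooks use Theano; this sketch uses only numpy, on a made-up synthetic dataset), logistic regression trained by gradient descent looks like this:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, linearly separable 2-class dataset (a stand-in for the
# tutorial's synthetic dataset):
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * X.T @ (p - y) / len(y)        # gradient of the cross-entropy loss
    b -= lr * np.mean(p - y)

accuracy = np.mean((p > 0.5) == (y > 0.5))
```

<p>The Theano version in the notebooks expresses the same computation symbolically, letting the library derive the gradients automatically.</p>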
Fri, 02 Dec 2016 00:00:00 +0000
http://luizgh.github.io/tutorials/2016/12/02/introcnns/

Notes on R-CNN

<p>In this post I review the article <strong>Rich feature hierarchies for accurate object detection and semantic segmentation</strong> (<a href="http://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf">link</a>). This article was presented at CVPR 2014, and introduces the method R-CNN (Regions with CNN features).</p>
<h1 id="summary-of-the-paper">Summary of the paper</h1>
<p>The paper introduces a pipeline for object detection. The idea is to use Convolutional Neural Networks to detect objects in images in three steps: 1) proposing multiple regions in an image (~2,000 regions), 2) classifying each region, and 3) filtering the results using non-maximum suppression.</p>
<p>The region proposal step consists of selecting regions of interest (bounding boxes around objects) in a class-independent manner. To enable comparison with previous work, the authors use a method called “selective search” [1].</p>
<p>The classification of each region is done by first extracting features using a CNN (pre-trained on ImageNet and fine-tuned in the VOC dataset), and classifying with a linear SVM trained for each class (with “hard negative mining”). Since the CNN has a fixed-sized input (and the size of the regions vary), the authors adopted a simple approach of warping the region to the input size of the CNN.</p>
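<p>A minimal sketch of that warping step (nearest-neighbor sampling, ignoring details such as the context padding used in the paper; <code class="language-plaintext highlighter-rouge">warp_region</code> is a hypothetical helper, not code from the authors):</p>

```python
import numpy as np

def warp_region(img, box, out_size=(227, 227)):
    """Crop box = (x1, y1, x2, y2) and warp it anisotropically to a fixed
    size via nearest-neighbor sampling; aspect ratio is NOT preserved,
    which is the distortion discussed below."""
    x1, y1, x2, y2 = box
    region = img[y1:y2, x1:x2]
    h, w = region.shape[:2]
    rows = (np.arange(out_size[0]) * h // out_size[0]).astype(int)
    cols = (np.arange(out_size[1]) * w // out_size[1]).astype(int)
    return region[rows][:, cols]

img = np.zeros((100, 100))
patch = warp_region(img, (10, 20, 40, 80))  # a 30x60 region becomes 227x227
```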
<p>The last step is a greedy non-maximum suppression, applied to each class independently, that rejects a region if its IoU (intersection-over-union) overlap with another, higher-scoring region is larger than a learned threshold.</p>
<p>This method was tested on the VOC 2007 dataset, obtaining a mean average precision (mAP) of 58.5, compared to 34.3 for the previous state of the art.</p>
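<p>Greedy NMS can be sketched in a few lines (in the paper the threshold is learned per class; the helper names here are illustrative):</p>

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh):
    """Greedily keep the highest-scoring box, dropping remaining boxes
    that overlap a kept box by more than `thresh` IoU."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7], thresh=0.5)  # the second box is suppressed
```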
<h1 id="comments">Comments and opinions</h1>
<p>The overall idea of the paper is interesting, and it seems a natural extension of previous methods (propose regions -> extract features -> classify regions), using a CNN as a feature extractor instead of using hand-crafted features.</p>
<p>On the other hand, the authors make some choices that seemed strange to me. The first is the “warping” of the image region to the standard CNN size. This seems awkward, since it distorts the objects depending on how well the bounding box was selected. In my opinion, it would be more natural to crop a patch around the bounding box for classification, and keep the bounding box only for reporting where the object is. Another decision I found strange was the use of SVMs for classification: since the authors already fine-tune the network on the target task (VOC), it seems more natural to me to use the softmax outputs directly.</p>
<h1 id="references">References</h1>
<p>[1] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.</p>
Thu, 03 Mar 2016 00:00:00 +0000
http://luizgh.github.io/reviews/2016/03/03/rcnn/

Deep Residual Learning for Image Recognition

<p>In this post I review the article <strong>Deep Residual Learning for Image Recognition</strong> (<a href="http://arxiv.org/abs/1512.03385">link to arXiv</a>), which won 1st place in the ILSVRC 2015 classification task (ImageNet) [1] with 3.57% top-5 error, using a CNN containing <strong>152 layers</strong> (!), yet with reasonable computational cost (less than previous models such as VGG-19, which contained 19 layers [2]). This architecture was also used to win 1st place in the ImageNet detection and localization tasks, as well as 1st place in the COCO 2015 competitions [3] (in detection and segmentation). That’s quite an impressive feat.</p>
<h1 id="summary-of-the-paper">Summary of the paper</h1>
<p>The central idea of this paper is to explore network depth, and in particular, how to train networks with hundreds of layers. Experimental results (from previous research) demonstrate the benefits of depth in Convolutional Neural Networks. However, training deeper networks presents some challenges. One problem is vanishing/exploding gradients, which has been handled to a large extent with Batch Normalization [4]. However, even with gradients properly flowing to the first layers, naively training deeper and deeper networks does not usually increase performance. As the authors note, <strong>as we increase depth, accuracy saturates, and then degrades rapidly</strong>. Most surprisingly, not only does the testing error get worse, but the training error as well. Now, consider the following insight: given a trained network with <strong>L</strong> layers, we could build a network with <strong>L + n</strong> layers, where the last n layers are the identity mapping. Clearly, this network has the same error as the one with L layers, and optimizing it should get us a lower (or equal) training error. However, training a network with <strong>L + n</strong> layers from scratch often gives worse training performance than a network with <strong>L</strong> layers, showing that some networks are harder to optimize.</p>
<p>With this insight, the authors propose to address this “degradation problem” by letting the layers learn a <strong>residual mapping</strong>. That is, instead of the layers learning a transformation \(\mathcal{H}(\textbf{x})\), they consider that this transformation decomposes as the input plus a residual: \(\mathcal{H}(\textbf{x}) = \textbf{x} + \mathcal{F}(\textbf{x})\), and learn \(\mathcal{F}(\textbf{x})\) only. The authors argue that, in the extreme case where the identity mapping is optimal, it is easier for the layers to learn \(\mathcal{F}(\textbf{x}) = \textbf{0}\) than to learn an identity transformation (from a stack of non-linear layers).</p>
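<p>The forward pass of such a block can be sketched in a few lines (a simplified stand-in for the paper’s blocks: plain matrix products instead of convolutions, no biases or batch normalization):</p>

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x), where F is a two-layer transformation (ReLU in between)."""
    h = np.maximum(0.0, x @ W1)  # first layer + ReLU
    return x + h @ W2            # skip connection adds the input back

x = np.array([[1.0, -2.0, 3.0]])
d = x.shape[1]
# With all-zero weights, F(x) = 0 and the block is exactly the identity
# mapping - the easy-to-reach solution the authors argue for:
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```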
<p>In their experiments, the authors consider networks with 18, 34, 50, 101 and 152 layers, trained on the ImageNet dataset. The residual computation is added after blocks of 2 or 3 layers (meaning that each block of 2 or 3 layers computes the residual of the transformation for the next level). Their results show that this strategy is quite effective in making the learning problem easier to optimize: without the residual architecture, the network with 34 layers performs worse (even on the training set) than the 18-layer network. With their strategy, increasing the number of layers continues to improve performance, not only on the training set, but on the testing set as well. Surprisingly, the 152-layer network achieves 4.49% top-5 error, which is <strong>lower than the results from ensembles of models used in previous years</strong>. The authors also performed experiments on the smaller CIFAR-10 dataset, with tens of layers, and an extreme case of 1202 layers. With 110 layers, they achieved 6.43% error (state of the art for this dataset). More surprisingly, the model with over a thousand layers converged, reaching a training error lower than 0.1% (although it overfit, achieving 7.93% error on the test set).</p>
<h1 id="comments">Comments and opinions</h1>
<p>Similarly to Batch Normalization, I found it impressive that the idea behind this paper is quite simple, yet performs incredibly well in practice (for instance, I found this model much simpler than the ILSVRC 2014 winner, GoogLeNet, with its Inception modules).</p>
<p>On the other hand, I found it hard to interpret what the model is doing. When I think about what a network with multiple layers is computing, I picture each layer projecting the input to a different feature space, slowly disentangling the inputs, so that they can be linearly classified in the last layer (e.g. I imagine something like <a href="http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/">this</a>). However, in this formulation, we explicitly add back the input when computing the output of the layer (or, the stack of layers to be more precise): \(y = \textbf{x} + \mathcal{F}(\textbf{x})\). Intuitively, this seems to pull the output of the layer closer to the original feature space, which would make it harder to disentangle the factors of variation in the input. It would be interesting to generate visualizations for this network (e.g. <a href="http://yosinski.com/deepvis">of this kind</a>) to help understand what the network is learning.</p>
<h1 id="references">References</h1>
<p>[1] Russakovsky, Olga, et al. “Imagenet large scale visual recognition challenge.” International Journal of Computer Vision 115.3 (2015): 211-252.</p>
<p>[2] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).</p>
<p>[3] Lin, Tsung-Yi, et al. “Microsoft coco: Common objects in context.” Computer Vision–ECCV 2014. Springer International Publishing, 2014. 740-755.</p>
<p>[4] Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).</p>
Mon, 22 Feb 2016 00:00:00 +0000
http://luizgh.github.io/reviews/2016/02/22/residual-learning/

Notes on Batch Normalization

<p>In this post, I will briefly review the paper <strong>Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</strong> (<a href="http://arxiv.org/abs/1502.03167">paper on arXiv</a>). This paper was posted to arXiv in February 2015, and was well received and discussed in the community (with 145 citations as of February 2016, according to Google Scholar).</p>
<h1 id="summary-of-the-paper">Summary of the paper</h1>
<p>The central idea of the paper is to accelerate training by reducing the <em>Internal Covariate Shift</em> of the network. The authors argue that one of the problems in training deep neural networks is that the distribution of each layer’s inputs changes over time (since they are the outputs of previous layers), which they call <em>Internal Covariate Shift</em>. The objective of Batch Normalization is to reduce this problem, in order to accelerate training.</p>
<p>Previous research [1] has shown that networks converge faster when the inputs are <em>whitened</em> - that is, normalized to have zero mean and unit variance, and decorrelated (diagonal covariance). This paper brings this idea to the inputs of the other layers (i.e. the outputs of previous layers in the network). The solution proposed by the authors is to normalize the units of a layer, before the activation, using statistics from mini-batches of data. In practice, let’s consider a unit \(x\), a neuron in the pre-activation output of a layer in the network. We first calculate a normalized version of this unit:</p>
\[\hat{x} = \frac{x - \text{E}[x]}{\sqrt{\text{Var}[x]}}\]
<p>where the expectation and variance are calculated over a mini-batch of examples. Note that this formulation only normalizes the mean to zero and the variance to one, but does not decorrelate the units within the layer. The authors argue that calculating the covariance matrix for the units would not be practical given the size of the mini-batches vs. the number of units in each layer.</p>
<p>One problem with simply normalizing the units is that it can change the representation power of the network (e.g. in the case of a sigmoid, it can confine the units to the (nearly) linear part of the activation). In order not to lose representation power, the authors introduce two other parameters (per neuron in the network), \(\gamma\) and \(\beta\), that can “undo” this normalization:</p>
\[y = \gamma \hat{x} + \beta\]
<p>With these two parameters, the output has the same representation power, and the network can undo the normalization, if this is the optimal thing to do.</p>
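<p>The training-time forward pass described above can be sketched as follows (a simplified stand-in: per-feature statistics for a fully-connected layer, ignoring the running averages used at inference time):</p>

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # a mini-batch of 64, 4 units
# With gamma = 1 and beta = 0, the output is just the normalized activations:
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

<p>Setting \(\gamma = \sqrt{\text{Var}[x]}\) and \(\beta = \text{E}[x]\) would recover the original activations, which is the “undo” capability the authors describe.</p>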
<p>Besides the theoretical arguments for using this strategy, the authors report several advantages found during experiments, conducted using the ImageNet dataset, with modified versions of the Inception[2] model. Most notably, using Batch Normalization (BN), they were able to use a <strong>30x larger learning rate</strong> for training, obtaining the same level of performance with <strong>14 times fewer training steps</strong>. The authors also noted that using BN reduced the need for Dropout, showing that it can help regularize the network. Lastly, the authors used an ensemble of 6 models trained with BN to achieve the state-of-the-art results on the ImageNet LSVRC challenge (4.82% error in the test set).</p>
<h1 id="comments-and-opinions">Comments and opinions</h1>
<p>I found it very surprising that the benefits of using Batch Normalization are so large, for such a simple idea. In general, the paper is well written, the claims are well founded and the ideas are easy to follow.</p>
<p>One thing I liked about this approach, besides speeding up training, is that it makes the network much more stable to the initial values assigned to the weights. In particular, the authors show that the back-propagation through a layer with Batch Normalization is invariant to the scale of its parameters.
This means that using Batch Normalization requires less time tweaking the initial parameter values.
I recently implemented Batch Normalization for training a 7-layer CNN on the CIFAR-10 dataset. When initializing the network with small random weights (e.g. \(W \sim \text{U}(-a,a)\) with \(a = 0.001\)), the network did not train at all without BN. I noticed that with this (poor) initialization, the pre-activation outputs shrank at each layer of the network, reaching the order of \(10^{-11}\) in the last layer and compromising training. Surprisingly, just adding BN to the last layer (right before the softmax) was enough for the network to train properly, even with this bad initialization scheme.</p>
<p>One thing in particular remained unclear to me after reading the article: why Batch Normalization helps regularize the network. The authors simply state that “a training example is seen in conjunction with other examples, and the training network no longer producing deterministic values for a given training example”. In my opinion, this does not fully explain why it is so effective at regularizing the network (to the point of not requiring dropout). It seems that there is still a lot of room to explore and understand this idea in future research.</p>
<h1 id="references">References</h1>
<p>[1] LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and K., Muller (eds.), Neural Networks: Tricks of the trade. Springer, 1998b.</p>
<p>[2] Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.</p>
Fri, 19 Feb 2016 00:00:00 +0000
http://luizgh.github.io/reviews/2016/02/19/notes-on-batch-normalization/

Using Lasagne for training Deep Neural Networks

<p>There are a lot of <a href="http://deeplearning.net/software_links/">Deep Learning libraries</a> out there, and the <em>best</em> library really depends on what you are trying to do.</p>
<p>After using the libraries <strong>cuda-convnet</strong> and <strong>Caffe</strong> for a while, I found out that I needed more flexibility in the models, in terms of defining the objective functions and in controlling the way samples are selected / augmented during training.</p>
<p>Looking at alternatives, the best options to achieve what I wanted were <a href="http://torch.ch/">Torch</a> and <a href="https://github.com/Theano/Theano">Theano</a>. Both libraries are flexible and fast, and I chose Theano because of the language (Python vs Lua). There are several libraries built on top of Theano that make it even easier to specify and train neural networks. One that I found very interesting is <a href="http://lasagne.readthedocs.org/">Lasagne</a>.</p>
<p>Lasagne is a library built on top of Theano, but it does not hide the Theano symbolic variables, so you can manipulate them very easily to modify the model or the learning procedure in any way you want.</p>
<p>This post is intended for people who are somewhat familiar with training Neural Networks and would like to get to know the Lasagne library. Below we consider a simple example to get started with Lasagne, highlighting some features I found useful in this library.</p>
<h2>CNN training with lasagne</h2>
<p>We will train a Convolutional Neural Network (CNN) on the MNIST dataset, and see how easy it is to make changes in the model / training algorithm / loss function using this library.
First, install the Lasagne library following <a href="http://lasagne.readthedocs.org/en/latest/user/installation.html">these instructions</a>. The actual <strong>code</strong> to accompany this blog post, as an iPython notebook, can be found here: <a href="https://github.com/luizgh/lasagne_basics">show me the code!</a></p>
<p>Now, let’s describe the problem at hand.</p>
<h3>The problem</h3>
<p>We will consider the MNIST classification problem. This is a dataset of handwritten digits, where the objective is to classify small images (28x28 pixels) as a digit from 0 to 9. The samples on the dataset look like this:</p>
<p><img src="http://luizgh.github.io/assets/lasagne_basics/mnist_samples.png" alt="mnist samples" class="centered" />
<em>Samples from the MNIST dataset</em></p>
<p>From a high level, what we want to do is define a model that identifies the digit (\(y \in [0,1,...,9]\)) from an image \(x \in \mathbb{R}^{28*28}\). Our model will have 10 outputs, each representing how confident the model is that the image is a particular digit (the probability \(P(y \vert x)\)). We then define a <strong>cost function</strong> that measures how <em>wrong</em> our model is on a set of images - that is, we show a bunch of images, and check whether the model is accurate in predicting \(y\). We start our model with random parameters (so in the beginning it will make a lot of mistakes), and iteratively modify the parameters so that it makes fewer errors.</p>
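<p>As a library-free illustration of these two pieces (Lasagne and Theano provide them as built-ins; the scores below are made up), the 10 output probabilities and the cost for one example can be computed like this:</p>

```python
import numpy as np

def softmax(z):
    """Turn 10 raw scores into probabilities that sum to 1."""
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nll(probs, y):
    """Negative log-likelihood of the correct class y: the cost is low
    when the model assigns high probability to the right digit."""
    return -np.log(probs[y])

scores = np.array([2.0, 0.5, -1.0, 0.1, 0.0, 0.3, -0.5, 1.2, 0.7, -0.2])
p = softmax(scores)  # P(y | x) for the 10 digits
loss = nll(p, 0)     # cost if the true digit is 0
```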
<h3>The model</h3>
<p>Let’s consider a Convolutional Neural Network model proposed by Yann LeCun in the early 90’s. In particular, we will consider a variant of the original architecture called LENET-5 [1]:</p>
<p><img src="http://luizgh.github.io/assets/lasagne_basics/lenet5.png" alt="LENET-5" class="centered" />
<em>The LENET-5 architecture</em></p>
<h3 id="defining-the-model-in-lasagne">Defining the model in lasagne</h3>
<p>We will start by defining the model using the Lasagne library. The first step is creating symbolic variables for input of the network (images) and the output - 10 neurons predicting the probability of each digit (0-9) given the image:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data_size</span> <span class="o">=</span> <span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">)</span> <span class="c1"># Batch size x Img Channels x Height x Width</span>
<span class="n">output_size</span> <span class="o">=</span> <span class="mi">10</span> <span class="c1"># We will run the example on mnist - 10 digits</span>

<span class="n">input_var</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">tensor4</span><span class="p">(</span><span class="s">'input'</span><span class="p">)</span>
<span class="n">target_var</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">ivector</span><span class="p">(</span><span class="s">'targets'</span><span class="p">)</span></code></pre></figure>
<p>In this example, we named the inputs as <em>input_var</em> and the outputs as <em>target_var</em>. Notice that these are symbolic variables: they don’t actually contain the values. Instead, they represent these variables in a series of computations (called a computational graph). The idea is that you specify a series of operations, and later you <strong>compile</strong> a function, so that you can actually pass inputs and receive outputs.</p>
<p>This may be hard to grasp initially, but it is what allows Theano to automatically calculate gradients (derivatives), which is great for trying out new things, and it also enables the library to optimize your code.</p>
<p>Defining the model in Lasagne can be done very easily. The library implements most commonly used layer types, and their declaration is very straightforward:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code"><pre><span class="n">net</span> <span class="o">=</span> <span class="p">{}</span>

<span class="c1">#Input layer:
</span><span class="n">net</span><span class="p">[</span><span class="s">'data'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">InputLayer</span><span class="p">(</span><span class="n">data_size</span><span class="p">,</span> <span class="n">input_var</span><span class="o">=</span><span class="n">input_var</span><span class="p">)</span>

<span class="c1">#Convolution + Pooling
</span><span class="n">net</span><span class="p">[</span><span class="s">'conv1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Conv2DLayer</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'data'</span><span class="p">],</span> <span class="n">num_filters</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span> <span class="n">filter_size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">net</span><span class="p">[</span><span class="s">'pool1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Pool2DLayer</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'conv1'</span><span class="p">],</span> <span class="n">pool_size</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">net</span><span class="p">[</span><span class="s">'conv2'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Conv2DLayer</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'pool1'</span><span class="p">],</span> <span class="n">num_filters</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">filter_size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">net</span><span class="p">[</span><span class="s">'pool2'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Pool2DLayer</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'conv2'</span><span class="p">],</span> <span class="n">pool_size</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

<span class="c1">#Fully-connected + dropout
</span><span class="n">net</span><span class="p">[</span><span class="s">'fc1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">DenseLayer</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'pool2'</span><span class="p">],</span> <span class="n">num_units</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">net</span><span class="p">[</span><span class="s">'drop1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">DropoutLayer</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'fc1'</span><span class="p">],</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>

<span class="c1">#Output layer:
</span><span class="n">net</span><span class="p">[</span><span class="s">'out'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">DenseLayer</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'drop1'</span><span class="p">],</span> <span class="n">num_units</span><span class="o">=</span><span class="n">output_size</span><span class="p">,</span>
<span class="n">nonlinearity</span><span class="o">=</span><span class="n">lasagne</span><span class="p">.</span><span class="n">nonlinearities</span><span class="p">.</span><span class="n">softmax</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Lasagne does not specify a “model” class, so the convention is to create a dictionary that contains all the layers (called <strong>net</strong> in this example).</p>
<p>The definition of each layer consists of the input for that layer, followed by its parameters. In line 7 we define the first convolutional layer, <strong>conv1</strong>: it receives input from the layer <strong>data</strong>, and has <strong>6</strong> filters of size <strong>5x5</strong>.</p>
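<p>As a sanity check on this architecture, we can trace the image dimensions by hand: a “valid” convolution shrinks each spatial dimension by filter_size - 1, and pooling divides it by pool_size. A quick sketch (the helper names are just for illustration):</p>

```python
# Tracing the spatial dimensions through the network defined above:
# valid convolutions shrink each side by (filter_size - 1),
# pooling divides it by pool_size.
def conv_out(size, filter_size):
    return size - filter_size + 1

def pool_out(size, pool_size):
    return size // pool_size

size = 28                                # MNIST images are 28x28
size = pool_out(conv_out(size, 5), 2)    # conv1: 24x24, pool1: 12x12
size = pool_out(conv_out(size, 5), 2)    # conv2: 8x8,   pool2: 4x4
print(size)                              # 4
print(10 * size * size)                  # 160 inputs reach fc1
```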
<h3 id="defining-the-cost-function-and-the-update-rule">Defining the cost function and the update rule</h3>
<p>We now have our model defined. The next step is defining the cost (loss) function that we want to minimize. For classification problems, the standard choice is the cross entropy loss, which is implemented in Lasagne. We will also add some regularization in the form of L2 weight decay.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre><span class="c1">#Define hyperparameters. These could also be symbolic variables
</span><span class="n">lr</span> <span class="o">=</span> <span class="mf">1e-2</span>
<span class="n">weight_decay</span> <span class="o">=</span> <span class="mf">1e-5</span>
<span class="c1">#Loss function: mean cross-entropy
</span><span class="n">prediction</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">get_output</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'out'</span><span class="p">])</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">objectives</span><span class="p">.</span><span class="n">categorical_crossentropy</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">target_var</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c1">#Also add weight decay to the cost function
</span><span class="n">weightsl2</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">regularization</span><span class="p">.</span><span class="n">regularize_network_params</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'out'</span><span class="p">],</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">regularization</span><span class="p">.</span><span class="n">l2</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">+=</span> <span class="n">weight_decay</span> <span class="o">*</span> <span class="n">weightsl2</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>In line 6, we get a symbolic variable for the output of the network (which is our prediction \(P(y \vert x)\)). We then compute the cross entropy loss: since we will be training with mini-batches, line 7 returns a vector of losses (one for each example), and in line 8 we take their average over the mini-batch.</p>
<p>We then add regularization in line 11. It is worth noting how easy it is to add elements to the cost function: looking at lines 11 and 12, in order to add weight decay we simply sum the L2 penalty, scaled by <em>weight_decay</em>, into the loss variable.</p>
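<p>In plain NumPy terms, the cost that this graph computes looks roughly like the following sketch (with made-up predictions and a single stand-in weight matrix):</p>

```python
import numpy as np

# Mean categorical cross-entropy over the mini-batch, plus an L2
# penalty on the weights scaled by weight_decay (a sketch of what
# the symbolic expressions above compute).
predictions = np.array([[0.7, 0.2, 0.1],    # P(y|x) for 2 examples
                        [0.1, 0.8, 0.1]])
targets = np.array([0, 1])                  # correct classes

# -log of the probability assigned to the correct class, one per example
per_example = -np.log(predictions[np.arange(len(targets)), targets])
loss = per_example.mean()

weight_decay = 1e-5
weights = np.ones((3, 3))                   # stand-in for network weights
loss = loss + weight_decay * (weights ** 2).sum()
```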
<p>For training the model, we need to calculate the partial derivatives of the loss with respect to the weights in our model. Here is where Theano really shines: since we defined the computations using symbolic math, it can automatically calculate the derivatives of an arbitrary loss function with respect to the weights.</p>
<p>Lastly, we need to select an optimization procedure, which defines how the parameters of the model are updated.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="c1">#Get the update rule for Stochastic Gradient Descent
</span><span class="n">params</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">get_all_params</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'out'</span><span class="p">],</span> <span class="n">trainable</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">updates</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">updates</span><span class="p">.</span><span class="n">sgd</span><span class="p">(</span>
<span class="n">loss</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="n">lr</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Here we used standard Stochastic Gradient Descent (SGD), which is a very straightforward procedure, but more advanced methods such as Nesterov Momentum and ADAM are just as easy to use (see the <a href="https://github.com/luizgh/lasagne_basics">code</a> for examples). Note that the functions in <em>lasagne.updates</em> also encapsulate the call to Theano to obtain the gradients (the partial derivatives of the loss with respect to the parameters).</p>
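<p>The update rule that <em>lasagne.updates.sgd</em> produces for each parameter amounts to a step against the gradient. A minimal NumPy sketch of one such step:</p>

```python
import numpy as np

# One SGD step: move each parameter against its gradient,
# scaled by the learning rate (what lasagne.updates.sgd builds
# symbolically for every parameter of the network).
def sgd_step(param, grad, learning_rate):
    return param - learning_rate * grad

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])    # d(loss)/d(w), obtained by Theano
w = sgd_step(w, g, learning_rate=1e-2)
print(w)                     # [ 0.995 -1.995]
```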
<h3 id="compiling-the-training-and-testing-functions">Compiling the training and testing functions</h3>
<p>We now have all the variables that define our model and how to train it. The next step is to actually compile the functions that we can run to perform training and testing.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="n">train_fn</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">([</span><span class="n">input_var</span><span class="p">,</span> <span class="n">target_var</span><span class="p">],</span> <span class="n">loss</span><span class="p">,</span> <span class="n">updates</span><span class="o">=</span><span class="n">updates</span><span class="p">)</span>
<span class="n">test_prediction</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">get_output</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'out'</span><span class="p">],</span> <span class="n">deterministic</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">test_loss</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">objectives</span><span class="p">.</span><span class="n">categorical_crossentropy</span><span class="p">(</span><span class="n">test_prediction</span><span class="p">,</span>
<span class="n">target_var</span><span class="p">)</span>
<span class="n">test_loss</span> <span class="o">=</span> <span class="n">test_loss</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">test_acc</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">eq</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">test_prediction</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">target_var</span><span class="p">),</span>
<span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">floatX</span><span class="p">)</span>
<span class="n">val_fn</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">([</span><span class="n">input_var</span><span class="p">,</span> <span class="n">target_var</span><span class="p">],</span> <span class="p">[</span><span class="n">test_loss</span><span class="p">,</span> <span class="n">test_acc</span><span class="p">])</span>
<span class="n">get_preds</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">([</span><span class="n">input_var</span><span class="p">],</span> <span class="n">test_prediction</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>The first line compiles the training function <strong>train_fn</strong>, which has an “updates” rule. Whenever we call this function, it updates the parameters of the model.</p>
<p>\(\DeclareMathOperator*{\argmax}{arg\,max}\)
We have defined two functions for test: the first is <strong>val_fn</strong>, that returns the average loss and classification accuracy of a set of images and labels \((x,y)\), and <strong>get_preds</strong>, that returns the predictions \(P(y \vert x)\), given a set of images \(x\). The accuracy is calculated as follows: we consider that the model predicts the class \(y\) that has the largest value of \(P(y \vert x)\) for a given image \(x\). That is \(\hat{y} = \argmax_y{P(y \vert x)}\). We compare this prediction with the ground truth, and take the average value over the entire test set.</p>
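<p>The accuracy expression has a direct NumPy analogue, which may make the symbolic version easier to read:</p>

```python
import numpy as np

# NumPy analogue of the symbolic accuracy above: predict the class
# with the highest probability and compare with the ground truth.
def accuracy(predictions, targets):
    predicted_classes = np.argmax(predictions, axis=1)
    return np.mean(predicted_classes == targets)

predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.4, 0.3]])
targets = np.array([0, 1, 2])
print(accuracy(predictions, targets))   # 2 of 3 correct
```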
<h3 id="training-the-model">Training the model</h3>
<p>To train the model, we need to call the training function <strong>train_fn</strong> for mini-batches of the training set, until a stopping criterion is met.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1">#Run the training function per mini-batches.
</span><span class="n">n_examples</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">n_batches</span> <span class="o">=</span> <span class="n">n_examples</span> <span class="o">/</span> <span class="n">batch_size</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
<span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">n_batches</span><span class="p">):</span>
<span class="n">x_batch</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">[</span><span class="n">batch</span><span class="o">*</span><span class="n">batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">batch</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">batch_size</span><span class="p">]</span>
<span class="n">y_batch</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">[</span><span class="n">batch</span><span class="o">*</span><span class="n">batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">batch</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">batch_size</span><span class="p">]</span>
<span class="n">train_fn</span><span class="p">(</span><span class="n">x_batch</span><span class="p">,</span> <span class="n">y_batch</span><span class="p">)</span> <span class="c1"># This is where the model gets updated</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Here we simply run the model for a fixed number of epochs (iterations over the entire training set). In each epoch, we use mini-batches: a small set of examples that is used to calculate the derivatives of the loss with respect to the weights, and update the model. Since our training function returns the loss of the mini-batch, we could also track it to monitor progress (this is done in the <a href="https://github.com/luizgh/lasagne_basics">code</a>).</p>
<p>Running this code on a Tesla C2050 GPU takes around 10 seconds per epoch. I ran it for 50 epochs for a total of 490 seconds (a little over 8 minutes).</p>
<h3 id="testing-the-model">Testing the model</h3>
<p>Now that the model is trained, it is very easy to get predictions on the test set. Let’s now get the accuracy on the testing set:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">loss</span><span class="p">,</span> <span class="n">acc</span> <span class="o">=</span> <span class="n">val_fn</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">test_error</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">acc</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test error: %f'</span> <span class="o">%</span> <span class="n">test_error</span><span class="p">)</span></code></pre></figure>
<p>With the model trained for 50 epochs, the test error we achieve is \(0.8\%\). All 10000 test images are classified in 0.52 seconds.</p>
<p>And that is it. We have defined our model, trained it for a fixed number of epochs on the training set, and evaluated its performance on a testing set.</p>
<p>Here are some predictions made by this model:</p>
<p><img src="http://luizgh.github.io/assets/lasagne_basics/randompreds.png" alt="random predictions" class="centered" />
<em>Predictions of random images from the testing set</em></p>
<p>The model seems to be doing a pretty good job. Let’s now take a look on some cases where the model failed to predict the correct class:</p>
<p><img src="http://luizgh.github.io/assets/lasagne_basics/errors.png" alt="errors" class="centered" />
<em>Incorrect predictions in the testing set</em></p>
<p>There is certainly room for improvement in the model, but it is entertaining to see that most of the cases the model gets wrong are genuinely hard to recognize.</p>
<h3 id="making-changes">Making changes</h3>
<p>The nice thing about this library is that it is very easy to try out different things. For instance, it is easy to change the model architecture, by adding / removing layers, and changing their parameters. Other libraries (such as cuda-convnet) require that you specify the parameters in a file, which is harder to use if you want to, for instance, try out different numbers of neurons in a given layer (in an automated way).</p>
<p>Another thing that is easy to do in lasagne is using more advanced optimization algorithms. In the <a href="https://github.com/luizgh/lasagne_basics">code</a> I added an ipython notebook that trains the same network architecture using Stochastic Gradient Descent (SGD) and some more advanced techniques: RMSProp and ADAM. Here is a plot of the progress of the training error over time (in epochs - the number of passes through the training set):</p>
<p><img src="http://luizgh.github.io/assets/lasagne_basics/training_loss.png" alt="mnist samples" class="centered" />
<em>Training progress with different optimization algorithms</em></p>
<p>For this dataset and model, ADAM was much superior to classical Stochastic Gradient Descent - for instance, by the second pass over the training set, ADAM had already matched the training loss that SGD reached after 10 epochs. Testing out different optimization algorithms is very easy in Lasagne - it requires changing a single line of code.</p>
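<p>For reference, a single ADAM step can be sketched in NumPy as follows (an illustration of the update rule itself, not of the Lasagne implementation):</p>

```python
import numpy as np

# One ADAM update: running averages of the gradient and its square,
# with bias correction, give a per-parameter adaptive step size.
def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # running mean of gradients
    v = b2 * v + (1 - b2) * g ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)          # correct the bias towards zero
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, np.array([0.5]), m, v, t=1)
```

<p>Compare this with the plain SGD step, which is just w - lr * g: ADAM rescales each step using the history of gradients, which is what makes it converge so much faster here.</p>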
<p>Other things you can easily do:</p>
<ul>
<li> Add terms to the cost function. Just add something to the "loss" variable that is used for defining the updates. Theano will take care of calculating the derivatives with respect to the parameters. For instance, you may want to penalize the weights on a given layer more than the others, or you may want to jointly optimize another criterion, etc.</li>
<li> It is very easy to obtain the representations at an intermediate layer (which can be used for Transfer Learning, for instance)
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">output_at_layer_fc1</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">get_output</span><span class="p">(</span><span class="n">net</span><span class="p">[</span><span class="s">'fc1'</span><span class="p">])</span>
<span class="n">get_representation</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">([</span><span class="n">input_var</span><span class="p">],</span> <span class="n">output_at_layer_fc1</span><span class="p">)</span></code></pre></figure>
</li>
<li>You can fine-tune pre-trained models. By default, the weights are initialized at random (in a good way, following [2]), but you can also initialize the layers with pre-trained weights:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">net</span><span class="p">[</span><span class="s">'conv1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lasagne</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Conv2DLayer</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">num_filters</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">filter_size</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
<span class="n">W</span><span class="o">=</span><span class="n">pretrainedW</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="n">pretrainedB</span><span class="p">)</span></code></pre></figure>
</li>
</ul>
<p>There are models pre-trained on ImageNet and other datasets in the <a href="https://github.com/Lasagne/Recipes/tree/master/modelzoo">Model Zoo</a>.</p>
<h3 id="references">References</h3>
<p>[1] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541-551.</p>
<p>[2] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics (pp. 249-256).</p>
Tue, 08 Dec 2015 00:00:00 +0000
http://luizgh.github.io/libraries/2015/12/08/getting-started-with-lasagne/