
<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural Networks and Deep Learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2014">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <link rel="icon" href="nnadl_favicon.ICO" />
    <title>Neural Networks and Deep Learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS":
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="">CHAPTER 1</a></h1>
  <h1 class="chapter_title"><a href="">Using neural nets to recognize handwritten digits</a></h1></div><div class="section"><div id="toc">
<p class="toc_title"><a href="index.html">Neural Networks and Deep Learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output  from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});</script>

<p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a><a href="chap6.html">Deep learning</a><div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><a href="chap6.html#introducing_convolutional_networks"><li>Introducing convolutional networks</li></a><a href="chap6.html#convolutional_neural_networks_in_practice"><li>Convolutional neural networks in practice</li></a><a href="chap6.html#the_code_for_our_convolutional_networks"><li>The code for our convolutional networks</li></a><a href="chap6.html#recent_progress_in_image_recognition"><li>Recent progress in image recognition</li></a><a href="chap6.html#other_approaches_to_deep_neural_nets"><li>Other approaches to deep neural nets</li></a><a href="chap6.html#on_the_future_of_neural_networks"><li>On the future of neural networks</li></a></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() {
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});
});</script>


<p class="toc_not_mainchapter"><a href="sai.html">
Appendix: Is there a <i>simple</i> algorithm for intelligence?</a></p>
<p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<span class="sidebar_title">Sponsors</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>


<!--
<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
href="bugfinder.html">Bugfinder Hall of Fame</a>.  </p>

<p class="sidebar">The book is currently a beta release, and is still
under active development.  Please send error reports to
mn@michaelnielsen.org.  For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>
-->

<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
        href="bugfinder.html">Bugfinder Hall of Fame</a>,
and to the <a
href="translators.html">translators</a> who made the Japanese edition possible.

</p>


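<p class="sidebar">The book is currently a beta release, and is still under active development.
Please send error reports to mn@michaelnielsen.org, and questions about the Japanese edition to muranushi@gmail.com.
For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>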


<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="http://eepurl.com/BYr9L">Mailing list for book announcements</a>
</p>

<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Michael Nielsen's project announcement mailing list</a>
</p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
  By <a href="http://michaelnielsen.org">Michael Nielsen</a> / September-December 2014 <br >  Translated by the <a href="https://github.com/nnadl-ja/nnadl_site_ja">"Neural Networks and Deep Learning" translation project</a>
</p>
</div>
</p><p>
  <!--The human visual system is one of the wonders of the world.  Consider the following sequence of handwritten digits:-->

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:
<a name="complete_zero"></a></p><p><center><img src="images/digits.png" width="160px"></center> </p><p>

  <!--  Most people effortlessly recognize those digits as 504192.  That ease
is deceptive.  In each hemisphere of our brain, humans have a primary
visual cortex, also known as V1, containing 140 million neurons, with
tens of billions of connections between them.  And yet human vision
involves not just V1, but an entire series of visual cortices - V2,
V3, V4, and V5 - doing progressively more complex image processing.
We carry in our heads a supercomputer, tuned by evolution over
hundreds of millions of years, and superbly adapted to understand the
visual world.  Recognizing handwritten digits isn't easy.  Rather, we
humans are stupendously, astoundingly good at making sense of what our
eyes show us.  But nearly all that work is done unconsciously.  And so
we don't usually appreciate how tough a problem our visual systems
solve.-->

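Most people effortlessly recognize those digits as 504192. That ease is
deceptive. In each hemisphere of our brain, humans have a primary visual
cortex, also known as V1, containing 140 million neurons, with tens of
billions of connections between them. And yet human vision involves not
just V1, but an entire series of visual cortices (V2, V3, V4, and V5) doing
progressively more complex image processing.
We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously, and so we don't usually appreciate how tough a problem our visual systems solve.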

</p><p>

<!--  The difficulty of visual pattern recognition becomes apparent if you
attempt to write a computer program to recognize digits like those
above.  What seems easy when we do it ourselves suddenly becomes
extremely difficult.  Simple intuitions about how we recognize shapes
- "a 9 has a loop at the top, and a vertical stroke in the bottom
right" - turn out to be not so simple to express algorithmically.
When you try to make such rules precise, you quickly get lost in a
morass of exceptions and caveats and special cases.  It seems
hopeless.-->

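The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes ("a 9 has a loop at the top, and a vertical stroke in the bottom right") turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions, caveats, and special cases. It seems hopeless.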


</p><p></p><p>

<!--  Neural networks approach the problem in a different way.  The idea is
to take a large number of handwritten digits, known as training
examples,-->

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,


</p><p><center><img src="images/mnist_100_digits.png" width="440px"></center></p><p>

  <!--  and then develop a system which can learn from those training
examples. In other words, the neural network uses the examples to
automatically infer rules for recognizing handwritten digits.
Furthermore, by increasing the number of training examples, the
network can learn more about handwriting, and so improve its accuracy.
So while I've shown just 100 training digits above, perhaps we could
build a better handwriting recognizer by using thousands or even
millions or billions of training examples.-->

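and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.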


</p><p>


<!--  In this chapter we'll write a computer program implementing a neural
network that learns to recognize handwritten digits.  The program is
just 74 lines long, and uses no special neural network libraries.  But
this short program can recognize digits with an accuracy over 96
percent, without human intervention.  Furthermore, in later chapters
we'll develop ideas which can improve accuracy to over 99 percent.  In
fact, the best commercial neural networks are now so good that they
are used by banks to process cheques, and by post offices to recognize
addresses.-->

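In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.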


</p><p>

<!--  We're focusing on handwriting recognition because it's an excellent
prototype problem for learning about neural networks in general.  As a
prototype it hits a sweet spot: it's challenging - it's no small
feat to recognize handwritten digits - but it's not so difficult as
to require an extremely complicated solution, or tremendous
computational power.  Furthermore, it's a great way to develop more
advanced techniques, such as deep learning.  And so throughout the
book we'll return repeatedly to the problem of handwriting
recognition.  Later in the book, we'll discuss how these ideas may be
applied to other problems in computer vision, and also in speech,
natural language processing, and other domains.-->

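  We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging (it's no small feat to recognize handwritten digits), but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.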


</p><p>

<!--  Of course, if the point of the chapter was only to write a computer
program to recognize handwritten digits, then the chapter would be
much shorter!  But along the way we'll develop many key ideas about
neural networks, including two important types of artificial neuron
(the perceptron and the sigmoid neuron), and the standard learning
algorithm for neural networks, known as stochastic gradient descent.
Throughout, I focus on explaining <em>why</em> things are done the way
they are, and on building your neural networks intuition.  That
requires a lengthier discussion than if I just presented the basic
mechanics of what's going on, but it's worth it for the deeper
understanding you'll attain.  Amongst the payoffs, by the end of the
chapter we'll be in position to understand what deep learning is, and
why it matters.-->

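Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining <em>why</em> things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.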


</p><p>

  <!--  <h3><a name="perceptrons"></a><a href="#perceptrons">Perceptrons</a></h3>-->
   <h3><a name="perceptrons"></a><a href="#perceptrons">Perceptrons</a></h3>


</p><p><!--What is a neural network?  To get started, I'll explain a type of
artificial neuron called a <em>perceptron</em>.
Perceptrons were
<a href="http://books.google.ca/books/about/Principles_of_neurodynamics.html?id=7FhRAAAAMAAJ">developed</a>
in the 1950s and 1960s by the scientist
<a href="http://en.wikipedia.org/wiki/Frank_Rosenblatt">Frank
  Rosenblatt</a>, inspired by earlier
<a href="http://scholar.google.ca/scholar?cluster=4035975255085082870">work</a>
by <a href="http://en.wikipedia.org/wiki/Warren_McCulloch">Warren
  McCulloch</a> and
<a href="http://en.wikipedia.org/wiki/Walter_Pitts">Walter
  Pitts</a>.  Today, it's more common to use other
models of artificial neurons - in this book, and in much modern work
on neural networks, the main neuron model used is one called the
<em>sigmoid neuron</em>.  We'll get to sigmoid neurons shortly.  But to
understand why sigmoid neurons are defined the way they are, it's
worth taking the time to first understand perceptrons. -->

  What is a neural network? To get started, I'll explain a type of artificial neuron called a <em>perceptron</em>.
  Perceptrons were
  <a href="http://books.google.ca/books/about/Principles_of_neurodynamics.html?id=7FhRAAAAMAAJ">developed</a>
  in the 1950s and 1960s by the scientist
  <a href="http://en.wikipedia.org/wiki/Frank_Rosenblatt">Frank Rosenblatt</a>, inspired by earlier
  <a href="http://scholar.google.ca/scholar?cluster=4035975255085082870">work</a> by
  <a href="http://en.wikipedia.org/wiki/Warren_McCulloch">Warren McCulloch</a> and
  <a href="http://en.wikipedia.org/wiki/Walter_Pitts">Walter Pitts</a>.
  Today, it's more common to use other models of artificial neurons.
  In this book, and in much modern work on neural networks, the main neuron model used is one called the <em>sigmoid neuron</em>.
  We'll get to sigmoid neurons shortly.
  But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

</p><p>

<!--
  So how do perceptrons work?  A perceptron takes several binary inputs,
  $x_1, x_2, \ldots$, and produces a single binary output:-->
So how do perceptrons work? A perceptron takes several binary inputs (each $0$ or $1$),
$x_1, x_2, \ldots$, and produces a single binary output ($0$ or $1$):
<center>
<img src="images/tikz0.png"/>
</center>
<!--
In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.
In general it could have more or fewer inputs.  Rosenblatt proposed a
simple rule to compute the output.  He introduced
<em>weights</em>, $w_1,w_2,\ldots$, real numbers
expressing the importance of the respective inputs to the output.  The
neuron's output, $0$ or $1$, is determined by whether the weighted sum
$\sum_j w_j x_j$ is less than or greater than some <em>threshold
  value</em>.  Just like the weights, the
threshold is a real number which is a parameter of the neuron.  To put
it in more precise algebraic terms:
-->

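In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.
In general it could have more or fewer inputs.
Rosenblatt proposed a simple rule to compute the output.
He introduced <em>weights</em>, $w_1,w_2,\ldots$,
real numbers expressing the importance of the respective inputs to the output.
The perceptron's output, $0$ or $1$, is determined by whether the weighted sum
$\sum_j w_j x_j$ is less than or greater than some <em>threshold value</em>.
Just like the weights, the threshold is a real number which is a parameter of the perceptron.
To put it in more precise algebraic terms: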

<a class="displaced_anchor" name="eqtn1"></a>\begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\
      1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold}
      \end{array} \right.
\tag{1}\end{eqnarray}
<!--
That's all there is to how a perceptron works!</p><p>That's the basic mathematical model.  A way you can think about the
perceptron is that it's a device that makes decisions by weighing up
evidence.  Let me give an example.  It's not a very realistic example,
but it's easy to understand, and we'll soon get to more realistic
examples.  Suppose the weekend is coming up, and you've heard that
there's going to be a cheese festival in your city.  You like cheese,
and are trying to decide whether or not to go to the festival.  You
  might make your decision by weighing up three factors:
-->

That's all there is to how a perceptron works!
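</p><p>
To make the rule concrete, here is a minimal Python sketch of Equation (1). The function name <code>perceptron_output</code> and the example weights and threshold are illustrative assumptions, not anything prescribed by the model itself.

```python
def perceptron_output(weights, inputs, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


# Hypothetical values: three binary inputs, equal weights, threshold 0.75.
print(perceptron_output([0.5, 0.5, 0.5], [1, 1, 0], threshold=0.75))
print(perceptron_output([0.5, 0.5, 0.5], [1, 0, 0], threshold=0.75))
```

Note that a weighted sum exactly equal to the threshold gives output $0$, matching the $\leq$ case in Equation (1).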
</p><p>
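That's the basic mathematical model. A way you can think about the perceptron is that
it's a device that makes decisions by weighing up evidence. Let me give an example.
It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples.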
</p><p>
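  Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors: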
<!--
<ol>
<li> Is the weather good?
<li> Does your boyfriend or girlfriend want to accompany you?
<li> Is the festival near public transit? (You don't own a car).
</ol>
-->
<ol>
<li> Is the weather good?
<li> Does your partner want to accompany you?
<li> Is the festival near public transit? (You don't own a car.)
</ol>

<!--
We can represent these three factors by corresponding binary variables
$x_1, x_2$, and $x_3$.  For instance, we'd have $x_1 = 1$ if the
weather is good, and $x_1 = 0$ if the weather is bad.  Similarly, $x_2
= 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if
not.  And similarly again for $x_3$ and public transit.
-->

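We can represent these three factors by corresponding binary variables $x_1, x_2$, and $x_3$.
For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad.
Similarly, $x_2 = 1$ if your partner wants to go, and $x_2 = 0$ if not.
And similarly again for $x_3$ and public transit.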


</p><p>
<!--
  Now, suppose you absolutely adore cheese, so much so that you're happy
to go to the festival even if your boyfriend or girlfriend is
uninterested and the festival is hard to get to.  But perhaps you
really loathe bad weather, and there's no way you'd go to the festival
if the weather is bad.  You can use perceptrons to model this kind of
decision-making.  One way to do this is to choose a weight $w_1 = 6$
for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions.
The larger value of $w_1$ indicates that the weather matters a lot to
you, much more than whether your boyfriend or girlfriend joins you, or
the nearness of public transit.  Finally, suppose you choose a
threshold of $5$ for the perceptron.  With these choices, the
perceptron implements the desired decision-making model, outputting
$1$ whenever the weather is good, and $0$ whenever the weather is bad.
It makes no difference to the output whether your boyfriend or
  girlfriend wants to go, or whether public transit is nearby.
-->
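Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your partner is uninterested and the festival is hard to get to.
But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad.
You can use perceptrons to model this kind of decision-making. One way to do this is to choose
a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions.
The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your partner joins you, or the nearness of public transit.
Finally, suppose you choose a threshold of $5$ for the perceptron.
With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your partner wants to go, or whether public transit is nearby.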
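</p><p>
As a quick check of this claim, the following Python sketch enumerates all eight input combinations using the weights and threshold from the text. The helper name <code>perceptron_output</code> and the variable names are assumptions made for illustration.

```python
from itertools import product


def perceptron_output(weights, inputs, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


WEIGHTS = [6, 2, 2]  # weather, partner, public transit (values from the text)
THRESHOLD = 5

# Map every combination of binary inputs (x1, x2, x3) to the perceptron's decision.
decisions = {
    inputs: perceptron_output(WEIGHTS, inputs, THRESHOLD)
    for inputs in product([0, 1], repeat=3)
}

for (x1, x2, x3), output in decisions.items():
    print(f"weather={x1} partner={x2} transit={x3} -> go={output}")
```

With bad weather the weighted sum is at most $2 + 2 = 4$, which never exceeds $5$; with good weather it is at least $6$, which always does. So the output simply tracks $x_1$.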

</p><p>

  <!--
By varying the weights and the threshold, we can get different models
of decision-making.  For example, suppose we instead chose a threshold
of $3$.  Then the perceptron would decide that you should go to the
festival whenever the weather was good <em>or</em> when both the
festival was near public transit <em>and</em> your boyfriend or
girlfriend was willing to join you.  In other words, it'd be a
different model of decision-making.  Dropping the threshold means
you're more willing to go to the festival.
-->
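By varying the weights and the threshold, we can get different models of decision-making.
For example, suppose we instead chose a threshold of $3$.
Then the perceptron would decide that you should go to the festival
whenever the weather was good <em>or</em> when both the festival was near public transit <em>and</em> your partner was willing to join you. In other words, it'd be a different model of decision-making.
Dropping the threshold means you're more willing to go to the festival.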
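</p><p>
The same kind of check works for the lower threshold. This sketch, again using hypothetical helper names, compares the perceptron with threshold $3$ against the logical rule "good weather, or both partner and transit" on every input.

```python
from itertools import product


def perceptron_output(weights, inputs, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


def go_to_festival(x1, x2, x3, threshold=3):
    """Cheese-festival perceptron with the weights from the text."""
    return perceptron_output([6, 2, 2], [x1, x2, x3], threshold)


# Compare against the rule: weather good OR (partner AND transit).
matches_rule = all(
    go_to_festival(x1, x2, x3) == int(x1 == 1 or (x2 == 1 and x3 == 1))
    for x1, x2, x3 in product([0, 1], repeat=3)
)
print(matches_rule)
```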

</p><p>
<!--
  Obviously, the perceptron isn't a complete model of human
decision-making!  But what the example illustrates is how a perceptron
can weigh up different kinds of evidence in order to make decisions.
And it should seem plausible that a complex network of perceptrons
  could make quite subtle decisions:
-->
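Obviously, the perceptron isn't a complete model of human decision-making!
But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions: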

<center>
<img src="images/tikz1.png"/>
</center>
<!--
In this network, the first column of perceptrons - what we'll call
the first <em>layer</em> of perceptrons - is making three very simple
decisions, by weighing the input evidence.  What about the perceptrons
in the second layer?  Each of those perceptrons is making a decision
by weighing up the results from the first layer of decision-making.
In this way a perceptron in the second layer can make a decision at a
more complex and more abstract level than perceptrons in the first
layer.  And even more complex decisions can be made by the perceptron
in the third layer.  In this way, a many-layer network of perceptrons
can engage in sophisticated decision making.
-->
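In this network, the first column of perceptrons (what we'll call the first <em>layer</em> of perceptrons) is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.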
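</p><p>
Here is a small sketch of the layering idea in Python. It is not the network in the figure: the two-layer structure, the weights, and the thresholds below are hypothetical values chosen so that the second layer makes a decision ("exactly one input is on") that no single perceptron in the first layer makes.

```python
def perceptron_output(weights, inputs, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


def layer_output(layer, inputs):
    """Evaluate every (weights, threshold) perceptron in a layer on the same inputs."""
    return [perceptron_output(weights, inputs, threshold)
            for weights, threshold in layer]


def network_output(layers, inputs):
    """Feed the inputs through each layer in turn; each layer reads the previous one."""
    activations = list(inputs)
    for layer in layers:
        activations = layer_output(layer, activations)
    return activations


# Hypothetical parameters, for illustration only.
first_layer = [
    ([1, 1], 0),   # fires if at least one input is on
    ([1, 1], 1),   # fires only if both inputs are on
]
second_layer = [
    ([1, -1], 0),  # fires if "at least one" holds but "both" does not
]
layers = [first_layer, second_layer]

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), network_output(layers, (x1, x2)))
```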


</p><p>
  <!--
  Incidentally, when I defined perceptrons I said that a perceptron has
just a single output.  In the network above the perceptrons look like
they have multiple outputs.  In fact, they're still single output.
The multiple output arrows are merely a useful way of indicating that
the output from a perceptron is being used as the input to several
other perceptrons.  It's less unwieldy than drawing a single output
  line which then splits.
  -->
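Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.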


</p><p>

  <!--
  Let's simplify the way we describe perceptrons.  The condition $\sum_j
w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two
notational changes to simplify it.
 The first change is to write
$\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$,
where $w$ and $x$ are vectors whose components are the weights and
inputs, respectively.  The second change is to move the threshold to
the other side of the inequality, and to replace it by what's known as
the perceptron's <em>bias</em>, $b \equiv
-\mbox{threshold}$.  Using the bias instead of the threshold, the
perceptron rule can be
  rewritten:
-->
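  Let's simplify the way we describe perceptrons.
  The condition for a perceptron to output $1$, $\sum_j w_j x_j > \mbox{threshold}$, is cumbersome, and we can make two notational changes to simplify it.
  The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively.
  The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's <em>bias</em>, $b \equiv -\mbox{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten: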
<a class="displaced_anchor" name="eqtn2"></a>\begin{eqnarray}
  \mbox{output} = \left\{
    \begin{array}{ll}
      0 & \mbox{if } w\cdot x + b \leq 0 \\
      1 & \mbox{if } w\cdot x + b > 0
    \end{array}
  \right.
\tag{2}\end{eqnarray}
<!--
You can think of the bias as a measure of how easy it is to get the
perceptron to output a $1$.  Or to put it in more biological terms,
the bias is a measure of how easy it is to get the perceptron to
<em>fire</em>.  For a perceptron with a really big bias, it's extremely
easy for the perceptron to output a $1$.  But if the bias is very
negative, then it's difficult for the perceptron to output a $1$.
Obviously, introducing the bias is only a small change in how we
describe perceptrons, but we'll see later that it leads to further
notational simplifications.  Because of this, in the remainder of the
book we won't use the threshold, we'll always use the bias.
-->
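You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$.
Or, to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to <em>fire</em>.
For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$.
But if the bias is very negative, then it's difficult for the perceptron to output a $1$.
Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications.
Because of this, in the remainder of the book we won't use the threshold; we'll always use the bias.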
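</p><p>
As a minimal sketch of the rule in Equation (2), the perceptron can be written as a short Python function. The function name <CODE>perceptron</CODE> and the example weights are illustrative assumptions, not something defined in the text.

```python
def perceptron(weights, inputs, bias):
    """Return 1 if w . x + b > 0, otherwise 0 (the rule in Equation (2))."""
    # Dot product w . x, written out as the sum over j of w_j * x_j.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


# Illustrative values only: a threshold of 5 corresponds to a bias of -5.
print(perceptron([6, 2, 2], [1, 0, 0], -5))  # w . x + b = 1
print(perceptron([6, 2, 2], [0, 1, 1], -5))  # w . x + b = -1
```

The first call returns 1 and the second returns 0. The boundary case where $w \cdot x + b$ is exactly zero also gives 0, matching the $\leq$ in the first row of Equation (2).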

</p><p>

<!--
  I've described perceptrons as a method for weighing evidence to make
decisions.  Another way perceptrons can be used is to compute the
elementary logical functions we usually think of as underlying
computation, functions such as <CODE>AND</CODE>, <CODE>OR</CODE>, and
<CODE>NAND</CODE>.  For example, suppose we have a perceptron with two
inputs, each with weight $-2$, and an overall bias of $3$.  Here's our
perceptron:
-->
I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation,
functions such as <CODE>AND</CODE>, <CODE>OR</CODE>, and <CODE>NAND</CODE>. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:

<center>
<img src="images/tikz2.png"/>
</center>
<!--
Then we see that input $00$ produces output $1$, since
$(-2)*0+(-2)*0+3 = 3$ is positive.  Here, I've introduced the $*$
symbol to make the multiplications explicit.  Similar calculations
show that the inputs $01$ and $10$ produce output $1$.  But the input
$11$ produces output $0$, since $(-2)*1+(-2)*1+3 = -1$ is negative.
And so our perceptron implements a <CODE>NAND</CODE>
gate!
-->

Then we see that input $00$ produces output $1$,
since $(-2)*0+(-2)*0+3 = 3$ is positive.
(Here, I've introduced the $*$ symbol to make the multiplications explicit.)
Similar calculations show that the inputs $01$ and $10$ also produce output $1$, since $(-2)*0+(-2)*1+3 = 1$ is positive. But the input $11$ produces output $0$,
since $(-2)*1+(-2)*1+3 = -1$ is negative.
And so our perceptron implements a <CODE>NAND</CODE> gate!
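</p><p>
The four cases can also be checked mechanically. The following is a small sketch; the name <CODE>nand_perceptron</CODE> is an assumption made for illustration, while the weights and bias are the ones given in the text.

```python
def nand_perceptron(x1, x2):
    # Weights -2 and -2 with bias 3, as in the text.
    z = (-2) * x1 + (-2) * x2 + 3
    return 1 if z > 0 else 0


# Print the full truth table: inputs followed by the output.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, nand_perceptron(x1, x2))
```

Only the input $11$ gives output $0$, which is the truth table of <CODE>NAND</CODE>.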


</p><p>
  <a name="universality"></a>
</p><p>
<!--
  The <CODE>NAND</CODE> example shows that we can use perceptrons to
  compute simple logical functions.  In fact, we can use networks of
  perceptrons to compute <em>any</em> logical function at all.  The
  reason is that the <CODE>NAND</CODE> gate is universal for
  computation, that is, we can build any computation up out of
  <CODE>NAND</CODE> gates.  For example, we can use <CODE>NAND</CODE>
  gates to build a circuit which adds two bits, $x_1$ and $x_2$.  This
  requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a
  carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e.,
  the carry bit is just the bitwise product $x_1 x_2$:
-->
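  The <CODE>NAND</CODE> example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute <em>any</em> logical function at all. The reason is that the <CODE>NAND</CODE> gate is universal for computation, that is, we can build any computation up out of <CODE>NAND</CODE> gates. For example, we can use <CODE>NAND</CODE> gates to build a circuit which adds two bits, $x_1$ and $x_2$.
  Representing the sum takes two binary digits. The low digit is the bitwise sum, $x_1 \oplus x_2$. The high digit is a carry bit which is set to $1$ only when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the logical product $x_1$ <CODE>AND</CODE> $x_2$: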

<center>
 <img src="images/tikz3.png"/>
</center>

<!--
To get an equivalent network of perceptrons we replace all the
<CODE>NAND</CODE> gates by perceptrons with two inputs, each with weight
$-2$, and an overall bias of $3$.  Here's the resulting network.  Note
that I've moved the perceptron corresponding to the bottom right
<CODE>NAND</CODE> gate a little, just to make it easier to draw the arrows
on the diagram:-->

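To get an equivalent network of perceptrons we replace all the
<CODE>NAND</CODE> gates by perceptrons with two inputs, each with weight
$-2$, and an overall bias of $3$.
Here's the resulting network.
Note that I've moved the perceptron corresponding to the bottom right <CODE>NAND</CODE> gate a little, just to make it easier to draw the arrows on the diagram: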


<center>
 <img src="images/tikz4.png"/>
</center>
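</p><p>
A sketch of such an adder in Python may help make the construction concrete. The wiring below is the standard five-gate <CODE>NAND</CODE> half adder, which I am assuming matches the figure; the names <CODE>nand</CODE>, <CODE>half_adder</CODE>, and <CODE>n1</CODE> to <CODE>n3</CODE> are illustrative only.

```python
def nand(a, b):
    # A perceptron with weights -2, -2 and bias 3.
    return 1 if (-2) * a + (-2) * b + 3 > 0 else 0


def half_adder(x1, x2):
    """Add two bits using only NAND perceptrons; return (sum bit, carry bit)."""
    n1 = nand(x1, x2)       # first perceptron, fed directly by both inputs
    n2 = nand(x1, n1)
    n3 = nand(x2, n1)
    bit_sum = nand(n2, n3)  # equals x1 XOR x2
    carry = nand(n1, n1)    # n1 is used twice as input: equals x1 AND x2
    return bit_sum, carry


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, half_adder(x1, x2))
```

Note that <CODE>n1</CODE> is passed twice to the perceptron computing the carry bit, so one perceptron's output is used for both inputs of another.
</p><p>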

<!--
One notable aspect of this network of perceptrons is that the output
from the leftmost perceptron is used twice as input to the bottommost
perceptron.  When I defined the perceptron model I didn't say whether
this kind of double-output-to-the-same-place was allowed.  Actually,
it doesn't much matter.  If we don't want to allow this kind of thing,
then it's possible to simply merge the two lines, into a single
connection with a weight of -4 instead of two connections with -2
weights.  (If you don't find this obvious, you should stop and prove
to yourself that this is equivalent.)  With that change, the network
looks as follows, with all unmarked weights equal to -2, all biases
equal to 3, and a single weight of -4, as marked: -->

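One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron.
When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed.
Actually, it doesn't much matter.
If we don't want to allow this kind of thing, then it's possible to simply merge the two lines
into a single connection with a weight of $-4$
instead of two connections with weights of $-2$.
(If you don't find this obvious, you should stop and prove to yourself that this is equivalent.)
With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to $3$, and a single weight of $-4$, as marked: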


<center>
 <img src="images/tikz5.png"/>
</center>
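</p><p>
For readers who want to check the merged connection, here is one way to see it, offered as a sketch rather than the only argument. Write $y$ for the output of the leftmost perceptron (the symbol $y$ is introduced here only for illustration). In the unmerged network the bottommost perceptron receives $y$ on both of its inputs, so its weighted sum plus bias is
\begin{eqnarray}
  (-2)*y + (-2)*y + 3 = (-4)*y + 3, \nonumber
\end{eqnarray}
which is exactly the quantity computed by a single connection of weight $-4$ with the same bias of $3$, whether $y$ is $0$ or $1$.
</p><p>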


<!--
Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables
floating to the left of the network of perceptrons.  In fact, it's
conventional to draw an extra layer of perceptrons - the <em>input
  layer</em> - to encode the inputs:
-->
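Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons, the <em>input layer</em>, to encode the inputs: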


<center>
 <img src="images/tikz6.png"/>
</center>
<!--
This notation for input perceptrons, in which we have an output, but
no inputs,-->

This notation for input perceptrons, in which we have an output, but no inputs,
<center>
<img src="images/tikz7.png"/>
</center>
<!--is a shorthand.-->
<!--
It doesn't actually mean a perceptron with no inputs.
To see this, suppose we did have a perceptron with no inputs.  Then
the weighted sum $\sum_j w_j x_j$ would always be zero, and so the
perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$.  That
is, the perceptron would simply output a fixed value, not the desired
valued ($x_1$, in the example above). It's better to think of the
input perceptrons as not really being perceptrons at all, but rather
special units which are simply defined to output the desired values,
$x_1, x_2,\ldots$.
--->
is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$
would always be zero, and so the perceptron would output
$1$ if $b > 0$, and
$0$ if $b \leq 0$.
That is, the perceptron would simply output a fixed value, not the desired value
($x_1$, in the example above).
It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2,\ldots$.


</p><p>

  <!--
  The adder example demonstrates how a network of perceptrons can be
used to simulate a circuit containing many <CODE>NAND</CODE> gates.  And
because <CODE>NAND</CODE> gates are universal for computation, it follows
  that perceptrons are also universal for computation.
-->
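  The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many <CODE>NAND</CODE> gates.
  And because <CODE>NAND</CODE> gates are universal for computation (that is, any function can be computed from them alone), it follows that perceptrons are also universal for computation.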

</p><p>
<!--
  The computational universality of perceptrons is simultaneously
reassuring and disappointing.  It's reassuring because it tells us
that networks of perceptrons can be as powerful as any other computing
device.  But it's also disappointing, because it makes it seem as
though perceptrons are merely a new type of <CODE>NAND</CODE> gate.
  That's hardly big news!
-->
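  The computational universality of perceptrons is simultaneously reassuring and disappointing.
  It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of <CODE>NAND</CODE> gate. That's hardly big news!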

</p><p>
<!--
  However, the situation is better than this view suggests.  It turns
out that we can devise <em>learning
  algorithms</em> which can
automatically tune the weights and biases of a network of artificial
neurons.  This tuning happens in response to external stimuli, without
direct intervention by a programmer.  These learning algorithms enable
us to use artificial neurons in a way which is radically different to
conventional logic gates.  Instead of explicitly laying out a circuit
of <CODE>NAND</CODE> and other gates, our neural networks can simply learn
to solve problems, sometimes problems where it would be extremely
  difficult to directly design a conventional circuit.
  -->

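However, the situation is better than this view suggests. It turns out that we can devise <em>learning algorithms</em> which can automatically tune the weights and biases of a network of artificial neurons.
This tuning happens in response to external stimuli, without direct intervention by a programmer.
These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of <CODE>NAND</CODE> and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.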

</p><p>

  <!--<h3><a name="sigmoid_neurons"></a><a href="#sigmoid_neurons">Sigmoid neurons</a></h3>-->
  <h3><a name="sigmoid_neurons"></a><a href="#sigmoid_neurons">Sigmoid neurons</a></h3>

</p><p>

<!--
  Learning algorithms sound terrific.  But how can we devise such
algorithms for a neural network?  Suppose we have a network of
perceptrons that we'd like to use to learn to solve some problem.  For
example, the inputs to the network might be the raw pixel data from a
scanned, handwritten image of a digit.  And we'd like the network to
learn weights and biases so that the output from the network correctly
classifies the digit.  To see how learning might work, suppose we make
a small change in some weight (or bias) in the network.  What we'd
like is for this small change in weight to cause only a small
corresponding change in the output from the network.  As we'll see in
a moment, this property will make learning possible.  Schematically,
here's what we want (obviously this network is too simple to do
  handwriting recognition!):
-->
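  Learning algorithms sound terrific. But how can we devise such algorithms for a neural network?
  Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):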


</p><p><center>
<img src="images/tikz8.png"/>
</center></p><p>
<!--
  If it were true that a small change in a weight (or bias) causes only
a small change in output, then we could use this fact to modify the
weights and biases to get our network to behave more in the manner we
want.  For example, suppose the network was mistakenly classifying an
image as an "8" when it should be a "9".  We could figure out how
to make a small change in the weights and biases so the network gets a
little closer to classifying the image as a "9".  And then we'd
repeat this, changing the weights and biases over and over to produce
  better and better output.  The network would be learning.
-->
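If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.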

</p><p>

<!--
  The problem is that this isn't what happens when our network contains
perceptrons.  In fact, a small change in the weights or bias of any
single perceptron in the network can sometimes cause the output of
that perceptron to completely flip, say from $0$ to $1$.  That flip
may then cause the behaviour of the rest of the network to completely
change in some very complicated way.  So while your "9" might now be
classified correctly, the behaviour of the network on all the other
images is likely to have completely changed in some hard-to-control
way.  That makes it difficult to see how to gradually modify the
weights and biases so that the network gets closer to the desired
behaviour.  Perhaps there's some clever way of getting around this
problem.  But it's not immediately obvious how we can get a network of
  perceptrons to learn.
-->
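The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network either leaves that perceptron's output unchanged or causes it to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way.
So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour.
Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.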
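</p><p>
To make the all-or-nothing behaviour concrete, here is a small sketch of a single perceptron whose first weight is nudged by $0.001$. The particular weights, inputs, and bias are illustrative assumptions chosen so that one nudge has no effect and another flips the output.

```python
def perceptron(w, x, b):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0


x = [1.0, 1.0]  # fixed inputs (illustrative)
b = -3.0        # fixed bias (illustrative)

# Far from the decision boundary a small nudge changes nothing.
print(perceptron([1.0, 1.0], x, b), perceptron([1.001, 1.0], x, b))

# At the boundary the same small nudge flips the output completely.
print(perceptron([1.5, 1.5], x, b), perceptron([1.501, 1.5], x, b))
```

The first line prints two identical outputs, while the second line prints a $0$ followed by a $1$. There is no weight change, however small, that moves the output by only a little.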

</p><p>

<!--
  We can overcome this problem by introducing a new type of artificial
neuron called a <em>sigmoid</em> neuron.
Sigmoid neurons are similar to perceptrons, but modified so that small
changes in their weights and bias cause only a small change in their
output.  That's the crucial fact which will allow a network of sigmoid
  neurons to learn.
-->
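We can overcome this problem by introducing a new type of artificial neuron called a <em>sigmoid</em> neuron.
Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.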

</p><p>

<!--
  Okay, let me describe the sigmoid neuron.  We'll depict sigmoid
neurons in the same way we depicted perceptrons:
-->
Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

<center>
<img src="images/tikz9.png"/>
</center>

<!--
Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2,
\ldots$.  But instead of being just $0$ or $1$, these inputs can also
take on any values <em>between</em> $0$ and $1$.  So, for instance,
$0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a
perceptron, the sigmoid neuron has weights for each input, $w_1, w_2,
\ldots$, and an overall bias, $b$.  But the output is not $0$ or $1$.
Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the
<em>sigmoid function</em>*<span class="marginnote">
*Incidentally, $\sigma$ is sometimes
  called the <em>logistic
    function</em>, and this
  new class of neurons called <em>logistic
    neurons</em>.  It's useful
  to remember this terminology, since these terms are used by many
  people working with neural nets.  However, we'll stick with the
  sigmoid terminology.</span>, and is defined
by:
-->
Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2,\ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values <em>between</em> $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron.
Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2,\ldots$, and an overall bias, $b$. But the output is not $0$ or $1$.
Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the <em>sigmoid function</em>*<span class="marginnote">
*Incidentally, $\sigma$ is sometimes called the
   <em>logistic function</em>, and this new class of neurons called
   <em>logistic neurons</em>.
   It's useful to remember this terminology,
   since these terms are used by many people working with neural nets.
However, we'll stick with the sigmoid terminology.</span>,
  and is defined by:

<a class="displaced_anchor" name="eqtn3"></a>\begin{eqnarray}
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{3}\end{eqnarray}
<!--
To put it all a little more explicitly, the output of a sigmoid neuron
with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is
-->
To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is

<a class="displaced_anchor" name="eqtn4"></a>\begin{eqnarray}
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\tag{4}\end{eqnarray}</p><p>
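Equations (3) and (4) translate almost directly into code. The following is a minimal sketch using only the Python standard library; the function names <CODE>sigmoid</CODE> and <CODE>sigmoid_neuron</CODE> and the example values are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Equation (3): 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(weights, inputs, bias):
    """Equation (4): sigma(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Inputs may be any values between 0 and 1, not only 0 or 1.
print(sigmoid_neuron([-2, -2], [0.638, 0.2], 3))
```

The printed value lies strictly between $0$ and $1$. One caveat of this simple form: <CODE>math.exp</CODE> overflows for extremely negative $z$, so a more careful implementation would treat that case separately.
</p><p>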

<!--
At first sight, sigmoid neurons appear very different to perceptrons.
The algebraic form of the sigmoid function may seem opaque and
forbidding if you're not already familiar with it.  In fact, there are
many similarities between perceptrons and sigmoid neurons, and the
algebraic form of the sigmoid function turns out to be more of a
technical detail than a true barrier to understanding.
-->
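At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it.
In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.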

</p>

<!--To understand the similarity to the perceptron model, suppose $z
\equiv w \cdot x + b$ is a large positive number.  Then $e^{-z}
\approx 0$ and so $\sigma(z) \approx 1$.  In other words, when $z = w
\cdot x+b$ is large and positive, the output from the sigmoid neuron
is approximately $1$, just as it would have been for a perceptron.
Suppose on the other hand that $z = w \cdot x+b$ is very negative.
Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$.  So when
$z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron
also closely approximates a perceptron.  It's only when $w \cdot x+b$
is of modest size that there's much deviation from the perceptron
model.</p>
<p>What about the algebraic form of $\sigma$?  How can we understand
that?  In fact, the exact form of $\sigma$ isn't so important - what
really matters is the shape of the function when plotted.  Here's the
shape:</p>
<p>-->

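To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number.
Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$.
In other words, when $z = w \cdot x+b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron.
Suppose on the other hand that $z = w \cdot x+b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$.
So when $z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron.
It's only when $w \cdot x+b$ is of modest size that there's much deviation from the perceptron model.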
</p>
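<p>
A quick numerical comparison illustrates this. The sketch below evaluates the perceptron rule and the sigmoid function at a few sample values of $z$; the helper names <CODE>step</CODE> and <CODE>sigmoid</CODE> and the sample values are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    # Perceptron rule: 1 if z > 0, otherwise 0.
    return 1 if z > 0 else 0


# Columns: z, perceptron output, sigmoid output.
for z in (-10.0, -1.0, 0.5, 1.0, 10.0):
    print(z, step(z), round(sigmoid(z), 5))
```

At $z = \pm 10$ the two outputs agree to within about $0.00005$, while at $z = \pm 1$ they differ by more than $0.2$.
</p>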
<p>
What about the algebraic form of $\sigma$? How can we understand that?
In fact, the exact form of $\sigma$ isn't so important; what really matters is the shape of the function when plotted. Here's the shape:
</p>

<div id="sigmoid_graph"><a name="sigmoid_graph"></a></div>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script>
function s(x) {return 1/(1+Math.exp(-x));}
var m = [40, 120, 50, 120];
var height = 290 - m[0] - m[2];
var width = 600 - m[1] - m[3];
var xmin = -5;
var xmax = 5;
var sample = 400;
var x1 = d3.scale.linear().domain([0, sample]).range([xmin, xmax]);
var data = d3.range(sample).map(function(d){ return {
        x: x1(d),
        y: s(x1(d))};
    });
var x = d3.scale.linear().domain([xmin, xmax]).range([0, width]);
var y = d3.scale.linear()
                .domain([0, 1])
                .range([height, 0]);
var line = d3.svg.line()
    .x(function(d) { return x(d.x); })
    .y(function(d) { return y(d.y); })
var graph = d3.select("#sigmoid_graph")
    .append("svg")
    .attr("width", width + m[1] + m[3])
    .attr("height", height + m[0] + m[2])
    .append("g")
    .attr("transform", "translate(" + m[3] + "," + m[0] + ")");
var xAxis = d3.svg.axis()
                  .scale(x)
                  .tickValues(d3.range(-4, 5, 1))
                  .orient("bottom")
graph.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0, " + height + ")")
    .call(xAxis);
var yAxis = d3.svg.axis()
                  .scale(y)
                  .tickValues(d3.range(0, 1.01, 0.2))
                  .orient("left")
                  .ticks(5)
graph.append("g")
    .attr("class", "y axis")
    .call(yAxis);
graph.append("path").attr("d", line(data));
graph.append("text")
     .attr("class", "x label")
     .attr("text-anchor", "end")
     .attr("x", width/2)
     .attr("y", height+35)
     .text("z");
graph.append("text")
        .attr("x", (width / 2))
        .attr("y", -10)
        .attr("text-anchor", "middle")
        .style("font-size", "16px")
        .text("sigmoid function");
</script>
</p>
<!--
<p>This shape is a smoothed out version of a step function:</p>
-->
<p>This shape is a smoothed out version of a step function:</p>
<p>
<div id="step_graph"></div>
<script>
function s(x) {return x < 0 ? 0 : 1;}
var m = [40, 120, 50, 120];
var height = 290 - m[0] - m[2];
var width = 600 - m[1] - m[3];
var xmin = -5;
var xmax = 5;
var sample = 400;
var x1 = d3.scale.linear().domain([0, sample]).range([xmin, xmax]);
var data = d3.range(sample).map(function(d){ return {
        x: x1(d),
        y: s(x1(d))};
    });
var x = d3.scale.linear().domain([xmin, xmax]).range([0, width]);
var y = d3.scale.linear()
                .domain([0,1])
                .range([height, 0]);
var line = d3.svg.line()
    .x(function(d) { return x(d.x); })
    .y(function(d) { return y(d.y); })
var graph = d3.select("#step_graph")
    .append("svg")
    .attr("width", width + m[1] + m[3])
    .attr("height", height + m[0] + m[2])
    .append("g")
    .attr("transform", "translate(" + m[3] + "," + m[0] + ")");
var xAxis = d3.svg.axis()
                  .scale(x)
                  .tickValues(d3.range(-4, 5, 1))
                  .orient("bottom")
graph.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0, " + height + ")")
    .call(xAxis);
var yAxis = d3.svg.axis()
                  .scale(y)
                  .tickValues(d3.range(0, 1.01, 0.2))
                  .orient("left")
                  .ticks(5)
graph.append("g")
    .attr("class", "y axis")
    .call(yAxis);
graph.append("path").attr("d", line(data));
graph.append("text")
     .attr("class", "x label")
     .attr("text-anchor", "end")
     .attr("x", width/2)
     .attr("y", height+35)
     .text("z");
graph.append("text")
        .attr("x", (width / 2))
        .attr("y", -10)
        .attr("text-anchor", "middle")
        .style("font-size", "16px")
        .text("step function");
</script>
</p>

<!--<p>If $\sigma$ had in fact been a step function, then the sigmoid neuron
would <em>be</em> a perceptron, since the output would be $1$ or $0$
depending on whether $w\cdot x+b$ was positive or
negative*<span class="marginnote">
*Actually, when $w \cdot x +b = 0$ the perceptron
  outputs $0$, while the step function outputs $1$.  So, strictly
  speaking, we'd need to modify the step function at that one point.
  But you get the idea.</span>.  By using the actual $\sigma$ function we
get, as already implied above, a smoothed out perceptron.  Indeed,
it's the smoothness of the $\sigma$ function that is the crucial fact,
not its detailed form.  The smoothness of $\sigma$ means that small
changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will
produce a small change $\Delta \mbox{output}$ in the output from the
neuron.  In fact, calculus tells us that $\Delta \mbox{output}$ is
well approximated by
<a class="displaced_anchor" name="eqtn5"></a>\begin{eqnarray}
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\tag{5}\end{eqnarray}
where the sum is over all the weights, $w_j$, and $\partial \,
\mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial
b$ denote partial derivatives of the $\mbox{output}$ with respect to
$w_j$ and $b$, respectively.  Don't panic if you're not comfortable
with partial derivatives!  While the expression above looks
complicated, with all the partial derivatives, it's actually saying
something very simple (and which is very good news): $\Delta
\mbox{output}$ is a <em>linear function</em> of the changes $\Delta w_j$
and $\Delta b$ in the weights and bias.  This linearity makes it easy
to choose small changes in the weights and biases to achieve any
desired small change in the output.  So while sigmoid neurons have
much of the same qualitative behaviour as perceptrons, they make it
much easier to figure out how changing the weights and biases will
change the output.</p>-->

<p>
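If $\sigma$ had in fact been a step function, then the sigmoid neuron would <em>be</em> a perceptron,
since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative*
<span class="marginnote">
  *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$.
  So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.
</span>.
By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron.
Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form.
The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron.
In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by
<a class="displaced_anchor" name="eqtn5"></a>\begin{eqnarray}
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\tag{5}\end{eqnarray}
where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively.
Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a <em>linear function</em> of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias.
This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output.
So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.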
</p>
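<p>
Equation (5) can be checked numerically for a single sigmoid neuron. The sketch below relies on a standard calculus fact that is not derived in this section, namely $\sigma'(z) = \sigma(z)(1-\sigma(z))$, so that $\partial \, \mbox{output} / \partial w_j = \sigma'(z) x_j$ and $\partial \, \mbox{output} / \partial b = \sigma'(z)$. All numerical values and names are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(w, x, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Illustrative weights, inputs, bias, and small changes (not from the text).
w, x, b = [0.5, -1.0], [0.8, 0.3], 0.1
dw, db = [0.001, -0.002], 0.0005

# Partial derivatives via the chain rule, assuming
# sigma'(z) = sigma(z) * (1 - sigma(z)):
#   d output / d w_j = sigma'(z) * x_j,   d output / d b = sigma'(z)
out = neuron(w, x, b)
slope = out * (1.0 - out)

# Right-hand side of Equation (5): linear in the changes dw and db.
predicted = sum(slope * xj * dwj for xj, dwj in zip(x, dw)) + slope * db

# The actual change in output after applying the changes.
new_w = [wj + dwj for wj, dwj in zip(w, dw)]
actual = neuron(new_w, x, b + db) - out

print(predicted, actual)
```

For changes this small the two printed numbers agree closely, and the agreement improves as the changes shrink, which is what "well approximated" means here.
</p>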

<!--<p>If it's the shape of $\sigma$ which really matters, and not its exact
form, then why use the particular form used for $\sigma$ in
Equation <span id="margin_128354118695_reveal" class="equation_link">(3)</span><span id="margin_128354118695" class="marginequation" style="display: none;"><a href="chap1.html#eqtn3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \sigma(z) \equiv \frac{1}{1+e^{-z}} \nonumber\end{eqnarray}</a></span><script>$('#margin_128354118695_reveal').click(function() {$('#margin_128354118695').toggle('slow', function() {});});</script>?  In fact, later in the book we will
occasionally consider neurons where the output is $f(w \cdot x + b)$
for some other <em>activation function</em> $f(\cdot)$.  The main thing
that changes when we use a different activation function is that the
particular values for the partial derivatives in
Equation <span id="margin_231995366761_reveal" class="equation_link">(5)</span><span id="margin_231995366761" class="marginequation" style="display: none;"><a href="chap1.html#eqtn5" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b \nonumber\end{eqnarray}</a></span><script>$('#margin_231995366761_reveal').click(function() {$('#margin_231995366761').toggle('slow', function() {});});</script> change.  It turns out that when we
compute those partial derivatives later, using $\sigma$ will simplify
the algebra, simply because exponentials have lovely properties when
differentiated.  In any case, $\sigma$ is commonly-used in work on
neural nets, and is the activation function we'll use most often in
this book.</p>
<p>How should we interpret the output from a sigmoid neuron?  Obviously,
one big difference between perceptrons and sigmoid neurons is that
sigmoid neurons don't just output $0$ or $1$.  They can have as output
any real number between $0$ and $1$, so values such as $0.173\ldots$
and $0.689\ldots$ are legitimate outputs.  This can be useful, for
example, if we want to use the output value to represent the average
intensity of the pixels in an image input to a neural network.  But
sometimes it can be a nuisance.  Suppose we want the output from the
network to indicate either "the input image is a 9" or "the input
image is not a 9".  Obviously, it'd be easiest to do this if the
output was a $0$ or a $1$, as in a perceptron.  But in practice we can
set up a convention to deal with this, for example, by deciding to
interpret any output of at least $0.5$ as indicating a "9", and any
output less than $0.5$ as indicating "not a 9".  I'll always
explicitly state when we're using such a convention, so it shouldn't
cause any confusion.</p>
-->

<p>If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in
Equation <span id="margin_128354118695_reveal" class="equation_link">(3)</span><span id="margin_128354118695" class="marginequation" style="display: none;"><a href="chap1.html#eqtn3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \sigma(z) \equiv \frac{1}{1+e^{-z}} \nonumber\end{eqnarray}</a></span><script>$('#margin_128354118695_reveal').click(function() {$('#margin_128354118695').toggle('slow', function() {});});</script>?
In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other <em>activation function</em> $f(\cdot)$.
The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in
Equation <span id="margin_231995366761_reveal" class="equation_link">(5)</span><span id="margin_231995366761" class="marginequation" style="display: none;"><a href="chap1.html#eqtn5" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b \nonumber\end{eqnarray}</a></span><script>$('#margin_231995366761_reveal').click(function() {$('#margin_231995366761').toggle('slow', function() {});});</script> change.
  It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated.
  In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.
</p>
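<p>
As a brief illustration of those lovely properties (a standard calculus result, stated here as a sketch rather than something this section derives), differentiating Equation (3) gives
\begin{eqnarray}
  \sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \sigma(z)\,(1-\sigma(z)), \nonumber
\end{eqnarray}
where the second equality uses $1-\sigma(z) = e^{-z}/(1+e^{-z})$. The derivative can therefore be computed from the value of $\sigma(z)$ alone, with no further exponentials to evaluate.
</p>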
<p>
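How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or $1$.
They can have as output any real number between $0$ and $1$, so values such as $0.173\ldots$ and $0.689\ldots$ are legitimate outputs.
This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network.
But sometimes it can be a nuisance.
Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9".
Obviously, it'd be easiest to do this if the output was a $0$ or a $1$, as in a perceptron.
But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a "9", and any output less than $0.5$ as indicating "not a 9".
I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.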
</p>
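<p>
Such a convention amounts to a one-line rule. Here is a sketch following the original text's convention, in which an output of at least $0.5$ counts as a "9"; the function name <CODE>is_nine</CODE> and the <CODE>cutoff</CODE> parameter are illustrative assumptions.

```python
def is_nine(output, cutoff=0.5):
    # Convention: an output of at least the cutoff means "the input image is a 9".
    return output >= cutoff


print(is_nine(0.689))  # True
print(is_nine(0.173))  # False
print(is_nine(0.5))    # True: the boundary value counts as a "9"
```

The choice of what to do exactly at the cutoff is arbitrary; what matters is that it is stated and applied consistently.
</p>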

<p>
<h4><a name="exercises_191892"></a><a href="#exercises_191892">Exercises</a></h4><ul>
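<li><strong>Sigmoid neurons simulating perceptrons, part I</strong> $\mbox{}$ <br/>
  Suppose we take all the weights and biases in a network of perceptrons,
  and multiply them by a positive constant, $c > 0$.
  Show that the behaviour of the network doesn't change.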
<!--Suppose we take all the weights and biases in a network of
  perceptrons, and multiply them by a positive constant, $c > 0$.
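  Show that the behaviour of the network doesn't change.--></p><p><li><strong>Sigmoid neurons simulating perceptrons, part II</strong> $\mbox{}$
  <br/>
  Suppose we have the same setup as the last problem, a network of perceptrons.
  Suppose also that the overall input to the network of perceptrons has been chosen.
  We won't need the actual input value, we just need the input to have been fixed.
  Suppose the weights and biases are such that $w \cdot x + b \neq 0$
  for the input $x$ to any particular perceptron in the network.
  Now replace all the perceptrons in the network by sigmoid neurons,
  and multiply the weights and biases by a positive constant $c > 0$.
  Show that in the limit as $c \rightarrow \infty$
  the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons.
  How can this fail when $w \cdot x + b = 0$
  for one of the perceptrons?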
  <!--Suppose we have the same setup as the last problem - a
  network of perceptrons.  Suppose also that the overall input to the
  network of perceptrons has been chosen.  We won't need the actual
  input value, we just need the input to have been fixed.  Suppose the
  weights and biases are such that $w \cdot x + b \neq 0$ for the
  input $x$ to any particular perceptron in the network.  Now replace
  all the perceptrons in the network by sigmoid neurons, and multiply
  the weights and biases by a positive constant $c > 0$. Show that in
  the limit as $c \rightarrow \infty$ the behaviour of this network of
  sigmoid neurons is exactly the same as the network of perceptrons.
  How can this fail when $w \cdot x + b = 0$ for one of the
  perceptrons?-->
</ul></p><!-- <p><h3><a name="the_architecture_of_neural_networks"></a><a href="#the_architecture_of_neural_networks">The architecture of neural networks</a></h3></p> -->
<p><h3><a name="the_architecture_of_neural_networks"></a><a href="#the_architecture_of_neural_networks">The architecture of neural networks</a></h3></p><p>
<!-- In the next section I'll introduce a neural network that can do a
pretty good job classifying handwritten digits.  In preparation for
that, it helps to explain some terminology that lets us name different
parts of a network.  Suppose we have the network: -->
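In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network: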
<center>
<img src="images/tikz10.png"/>
</center>
<!-- As mentioned earlier, the leftmost layer in this network is called the
input layer, and the neurons within the
layer are called <em>input neurons</em>.
The rightmost or <em>output</em> layer
contains the <em>output neurons</em>, or,
as in this case, a single output neuron.  The middle layer is called a
<em>hidden layer</em>, since the neurons in
this layer are neither inputs nor outputs.  The term "hidden"
perhaps sounds a little mysterious - the first time I heard the term
I thought it must have some deep philosophical or mathematical
significance - but it really means nothing more than "not an input
or an output".  The network above has just a single hidden layer, but
some networks have multiple hidden layers.  For example, the following
four-layer network has two hidden layers: -->

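As mentioned earlier, the leftmost layer in this network is called the <em>input layer</em>, and the neurons within the layer are called <em>input neurons</em>. The rightmost or <em>output layer</em> contains the <em>output neurons</em>, or, as in this case, a single output neuron. The middle layer is called a <em>hidden layer</em>, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious. The first time I heard the term I thought it must have some deep philosophical or mathematical significance, but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers: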
<center>
<img src="images/tikz11.png"/>
</center>
<!-- Somewhat confusingly, and for historical reasons, such multiple layer
networks are sometimes called <em>multilayer perceptrons</em> or
<em>MLPs</em>, despite being made up of sigmoid neurons, not
perceptrons.  I'm not going to use the MLP terminology in this book,
since I think it's confusing, but wanted to warn you of its existence. -->
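Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called <em>multilayer perceptrons</em> or <em>MLPs</em>, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.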

</p><p>
<!-- The design of the input and output layers in a network is often
straightforward.  For example, suppose we're trying to determine
whether a handwritten image depicts a "9" or not.  A natural way to
design the network is to encode the intensities of the image pixels
into the input neurons. If the image is a $64$ by $64$ greyscale
image, then we'd have $4,096 = 64 \times 64$ input neurons, with the
intensities scaled appropriately between $0$ and $1$.  The output
layer will contain just a single neuron, with output values of less
than $0.5$ indicating "input image is not a 9", and values greater
than $0.5$ indicating "input image is a 9 ". -->
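The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values greater than $0.5$ indicating "input image is a 9", and values of $0.5$ or less indicating "input image is not a 9".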
</p><p></p><p></p><p>
<!-- While the design of the input and output layers of a neural network is
often straightforward, there can be quite an art to the design of the
hidden layers.  In particular, it's not possible to sum up the design
process for the hidden layers with a few simple rules of thumb.
Instead, neural networks researchers have developed many design
heuristics for the hidden layers, which help people get the behaviour
they want out of their nets.  For example, such heuristics can be used
to help determine how to trade off the number of hidden layers against
the time required to train the network.  We'll meet several such
design heuristics later in this book.  -->

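While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.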
</p><p>
<!-- Up to now, we've been discussing neural networks where the output from
one layer is used as input to the next layer.  Such networks are
called <em>feedforward</em>
neural networks.  This means there are no loops in the network -
information is always fed forward, never fed back.  If we did have
loops, we'd end up with situations where the input to the $\sigma$
function depended on the output.  That'd be hard to make sense of, and
so we don't allow such loops. -->
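Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called <em>feedforward neural networks</em>. This means there are no loops in the network: information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.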
</p><p>
<!-- However, there are other models of artificial neural networks in which
feedback loops are possible.  These models are called
<a href="http://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent
  neural networks</a>. The idea in these models is to have neurons which
fire for some limited duration of time, before becoming quiescent.
That firing can stimulate other neurons, which may fire a little while
later, also for a limited duration.  That causes still more neurons to
fire, and so over time we get a cascade of neurons firing.  Loops
don't cause problems in such a model, since a neuron's output only
affects its input at some later time, not instantaneously. -->
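However, there are other models of artificial neural networks in which feedback loops are possible. These models are called <em><a href="http://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural networks</a></em>. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.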
</p><p></p><p>
<!-- Recurrent neural nets have been less influential than feedforward
networks, in part because the learning algorithms for recurrent nets
are (at least to date) less powerful.  But recurrent networks are
still extremely interesting.  They're much closer in spirit to how our
brains work than feedforward networks.  And it's possible that
recurrent networks can solve important problems which can only be
solved with great difficulty by feedforward networks.  However, to
limit our scope, in this book we're going to concentrate on the more
widely-used feedforward networks. -->
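Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.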
</p>
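<p>
The feedforward idea described above, where each layer's output becomes the next layer's input and nothing is fed back, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the small 2-3-1 network, and its weights and biases are all made up for the example and are not taken from the text.
</p>

```python
import math


def sigmoid(z):
    """Sigmoid function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(weights, biases, inputs):
    """Activations of one layer: sigmoid(w . x + b) for each neuron's weight row."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feedforward(layers, inputs):
    """Pass the inputs through each (weights, biases) pair in order, never looping back."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_output(weights, biases, activation)
    return activation


# Hypothetical 2-3-1 network with made-up weights and biases.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
output = feedforward(layers, [0.6, 0.9])
print(output)
```

<p>
Because the loop visits each layer exactly once, in order, the input to any $\sigma$ never depends on that neuron's own output.
</p>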
<p><h3><a name="a_simple_network_to_classify_handwritten_digits"></a><a href="#a_simple_network_to_classify_handwritten_digits">
<!-- A simple network to classify handwritten digits -->
A simple network to classify handwritten digits
</a></h3></p>
<p>

<!-- Having defined neural networks, let's return to handwriting
recognition.  We can split the problem of recognizing handwritten
digits into two sub-problems.  First, we'd like a way of breaking an
image containing many digits into a sequence of separate images, each
containing a single digit.  For example, we'd like to break the image -->
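Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image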
</p><p><center><img src="images/digits.png" width="300px"></center></p><p>
<!-- into six separate images, -->
into six separate images,
</p>
<p><center><img src="images/digits_separate.png" width="440px"></center> </p><p>
<!-- We humans solve this <em>segmentation
  problem</em> with ease, but it's challenging
for a computer program to correctly break up the image.  Once the
image has been segmented, the program then needs to classify each
individual digit.  So, for instance, we'd like our program to
recognize that the first digit above, -->
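We humans solve this <em>segmentation problem</em> with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,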
</p>
<p><center><img src="images/mnist_first_digit.png" width="64px"></center></p>
<p>
<!-- is a 5. -->
is a 5.
</p>
<p>
<!-- We'll focus on writing a program to solve the second problem, that is,
classifying individual digits.  We do this because it turns out that
the segmentation problem is not so difficult to solve, once you have a
good way of classifying individual digits.  There are many approaches
to solving the segmentation problem.  One approach is to trial many
different ways of segmenting the image, using the individual digit
classifier to score each trial segmentation.  A trial segmentation
gets a high score if the individual digit classifier is confident of
its classification in all segments, and a low score if the classifier
is having a lot of trouble in one or more segments.  The idea is that
if the classifier is having trouble somewhere, then it's probably
having trouble because the segmentation has been chosen incorrectly.
This idea and other variations can be used to solve the segmentation
problem quite well.  So instead of worrying about segmentation we'll
concentrate on developing a neural network which can solve the more
interesting and difficult problem, namely, recognizing individual
handwritten digits. -->
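We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.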
</p>
<p>
<!-- To recognize individual digits we will use a three-layer neural
network: -->
To recognize individual digits we will use a three-layer neural network:
</p>
<p><center>
<img src="images/tikz12.png"/>
</center></p>
<p>
<!-- The input layer of the network contains neurons encoding the values of
the input pixels.  As discussed in the next section, our training data
for the network will consist of many $28$ by $28$ pixel images of
scanned handwritten digts, and so the input layer contains $784 = 28
\times 28$ neurons.  For simplicity I've omitted most of the $784$
input neurons in the diagram above.  The input pixels are greyscale,
with a value of $0.0$ representing white, a value of $1.0$
representing black, and in between values representing gradually
darkening shades of grey. -->
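The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.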
</p>
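<p>
To make the input encoding concrete, here is a small sketch of turning a $28 \times 28$ image into a $784$-entry input vector. It rests on assumptions not stated in the text: the raw image is taken to be a nested list of 8-bit integers in which $0$ is white and $255$ is black, and the helper name <code>flatten_image</code> is hypothetical.
</p>

```python
def flatten_image(pixels, max_value=255):
    """Turn a 2D list of raw pixel values into a flat list of floats in [0, 1]."""
    return [value / max_value for row in pixels for value in row]


# A blank (all white) 28x28 image with a single fully black pixel.
image = [[0] * 28 for _ in range(28)]
image[3][5] = 255

x = flatten_image(image)
print(len(x))          # 784
print(x[3 * 28 + 5])   # 1.0
```

<p>
Rows are laid end to end, so the pixel at row $r$, column $c$ ends up at position $28r + c$ of the vector.
</p>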
<p>
<!-- The second layer of the network is a hidden layer.  We denote the
number of neurons in this hidden layer by $n$, and we'll experiment
with different values for $n$.  The example shown illustrates a small
hidden layer, containing just $n = 15$ neurons. -->
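The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.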
</p>
<p>
<!-- The output layer of the network contains 10 neurons.  If the first
neuron fires, i.e., has an output $\approx 1$, then that will indicate
that the network thinks the digit is a $0$.  If the second neuron
fires then that will indicate that the network thinks the digit is a
$1$.  And so on.  A little more precisely, we number the output
neurons from $0$ through $9$, and figure out which neuron has the
highest activation value.  If that neuron is, say, neuron number $6$,
then our network will guess that the input digit was a $6$.  And so on
for the other output neurons. -->
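The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.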
</p>
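<p>
Reading off the network's guess from the ten output activations amounts to finding the position of the largest value. A minimal sketch, with a hypothetical helper name and made-up activation values:
</p>

```python
def predicted_digit(activations):
    """Return the index (0 through 9) of the output neuron with the highest activation."""
    return max(range(len(activations)), key=lambda j: activations[j])


# Made-up activations for the ten output neurons, numbered 0 through 9.
outputs = [0.02, 0.01, 0.05, 0.00, 0.03, 0.10, 0.91, 0.04, 0.02, 0.01]
print(predicted_digit(outputs))  # 6
```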
<p>
<!-- You might wonder why we use $10$ output neurons.  After all, the goal
of the network is to tell us which digit ($0, 1, 2, \ldots, 9$)
corresponds to the input image.  A seemingly natural way of doing that
is to use just $4$ output neurons, treating each neuron as taking on a
binary value, depending on whether the neuron's output is closer to
$0$ or to $1$.  Four neurons are enough to encode the answer, since
$2^4 = 16$ is more than the 10 possible values for the input digit.
Why should our network use $10$ neurons instead?  Isn't that
inefficient?  The ultimate justification is empirical: we can try out
both network designs, and it turns out that, for this particular
problem, the network with $10$ output neurons learns to recognize
digits better than the network with $4$ output neurons.  But that
leaves us wondering <em>why</em> using $10$ output neurons works better.
Is there some heuristic that would tell us in advance that we should
use the $10$-output encoding instead of the $4$-output encoding? -->
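You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering <em>why</em> using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?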
</p>
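<p>
For reference, the $4$-output alternative discussed above would encode each digit in binary. The sketch below only illustrates that encoding (the helper name <code>four_bit_code</code> is an assumption); it says nothing about how well a network would learn it.
</p>

```python
def four_bit_code(digit):
    """Binary encoding of a digit 0 through 9 as four bits, most significant bit first."""
    return [(digit >> k) & 1 for k in (3, 2, 1, 0)]


for d in (0, 6, 9):
    print(d, four_bit_code(d))
```

<p>
Only $10$ of the $16$ possible bit patterns are used, and the first entry (the most significant bit) is $1$ only for the digits $8$ and $9$.
</p>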
<p>
<!-- To understand why we do this, it helps to think about what the neural
network is doing from first principles.  Consider first the case where
we use $10$ output neurons.  Let's concentrate on the first output
neuron, the one that's trying to decide whether or not the digit is a
$0$.  It does this by weighing up evidence from the hidden layer of
neurons.  What are those hidden neurons doing?  Well, just suppose for
the sake of argument that the first neuron in the hidden layer detects
whether or not an image like the following is present: -->
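To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present: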
</p>
<p><center><img src="images/mnist_top_left_feature.png" width="130px"></center></p>
<p>
<!-- It can do this by heavily weighting input pixels which overlap with
the image, and only lightly weighting the other inputs.  In a similar
way, let's suppose for the sake of argument that the second, third,
and fourth neurons in the hidden layer detect whether or not the
following images are present: -->

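It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present: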

</p>
<p><center><img src="images/mnist_other_features.png" width="424px"></center></p>
<p>
<!-- As you've may have guessed, these four images together make up the $0$
image that we saw in the line of digits shown -->
As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown <a href="#complete_zero">earlier</a>:
<!-- <a href="#complete_zero">earlier</a>: -->
</p><p><center><img src="images/mnist_complete_zero.png" width="130px"></center></p>
<p>
<!-- So if all four of these hidden neurons are firing then we can conclude
that the digit is a $0$.  Of course, that's not the <em>only</em> sort
of evidence we can use to conclude that the image was a $0$ - we
could legitimately get a $0$ in many other ways (say, through
translations of the above images, or slight distortions).  But it
seems safe to say that at least in this case we'd conclude that the
input was a $0$. -->

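So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the <em>only</em> sort of evidence we can use to conclude that the image was a $0$: we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.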
</p><p></p><p></p><p></p>
<p>
<!-- Supposing the neural network functions in this way, we can give a
plausible explanation for why it's better to have $10$ outputs from
the network, rather than $4$.  If we had $4$ outputs, then the first
output neuron would be trying to decide what the most significant bit
of the digit was.  And there's no easy way to relate that most
significant bit to simple shapes like those shown above.  It's hard to
imagine that there's any good historical reason the component shapes
of the digit will be closely related to (say) the most significant bit
in the output. -->

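Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$. If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.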
</p><p>
<!-- Now, with all that said, this is all just a heuristic.  Nothing says
that the three-layer neural network has to operate in the way I
described, with the hidden neurons detecting simple component shapes.
Maybe a clever learning algorithm will find some assignment of weights
that lets us use only $4$ output neurons.  But as a heuristic the way
of thinking I've described works pretty well, and can save you a lot
of time in designing good neural network architectures. -->

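Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.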
</p><p><h4><a name="exercise_513527"></a><a href="#exercise_513527">
<!-- Exercise -->
Exercise
</a></h4><ul>
<li>
<!-- There is a way of determining the bitwise representation of a
digit by adding an extra layer to the three-layer network above.
The extra layer converts the output from the previous layer into a
binary representation, as illustrated in the figure below.  Find a
set of weights and biases for the new output layer.  Assume that the
first $3$ layers of neurons are such that the correct output in the
third layer (i.e., the old output layer) has activation at least
$0.99$, and incorrect outputs have activation less than $0.01$. -->

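There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.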
</li>
</ul></p><p><center>
<img src="images/tikz13.png"/>
</center></p><p></p><p></p><p></p><p>
<!-- <h3><a name="learning_with_gradient_descent"></a><a href="#learning_with_gradient_descent">Learning with gradient descent</a></h3> -->
<h3><a name="learning_with_gradient_descent"></a><a href="#learning_with_gradient_descent">Learning with gradient descent</a></h3>
</p><p></p><p>
<!--
Now that we have a design for our neural network, how can it learn to
recognize digits?  The first thing we'll need is a data set to learn
from - a so-called training data set.  We'll use the
 <a href="http://yann.lecun.com/exdb/mnist/">MNIST
  data set</a>, which contains tens of thousands of scanned images of
handwritten digits, together with their correct classifications.
MNIST's name comes from the fact that it is a modified subset of two
data sets collected by
<a href="http://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology">NIST</a>,
the United States' National Institute of Standards and
Technology. Here's a few images from MNIST:
-->
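Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from, a so-called training data set. We'll use the <a href="http://yann.lecun.com/exdb/mnist/">MNIST data set</a>, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by <a href="http://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology">NIST</a>, the United States' National Institute of Standards and Technology. Here are a few images from MNIST: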
</p><p><center><img src="images/digits_separate.png" width="420px"></center> </p><p>

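As you can see, these digits are, in fact, the same as those shown at the <a href="#complete_zero">beginning of this chapter</a> as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!</p><p>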

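The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a <em>different</em> set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.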
<!--
As you can see, these digits are, in fact, the same as those shown
at the <a href="#complete_zero">beginning of this chapter</a> as a challenge
to recognize.  Of course, when testing our network we'll ask it to
recognize images which aren't in the training set!</p><p>The MNIST data comes in two parts.  The first part contains 60,000
images to be used as training data.  These images are scanned
handwriting samples from 250 people, half of whom were US Census
Bureau employees, and half of whom were high school students.  The
images are greyscale and 28 by 28 pixels in size.  The second part of
the MNIST data set is 10,000 images to be used as test data.  Again,
these are 28 by 28 greyscale images.  We'll use the test data to
evaluate how well our neural network has learned to recognize digits.
To make this a good test of performance, the test data was taken from
a <em>different</em> set of 250 people than the original training data
(albeit still a group split between Census Bureau employees and high
school students).  This helps give us confidence that our system can
recognize digits from people whose writing it didn't see during
training.
-->
</p><p>
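We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 =
784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector.
For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network.
Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.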
<!--
We'll use the notation $x$ to denote a training input.  It'll be
convenient to regard each training input $x$ as a $28 \times 28 =
784$-dimensional vector.  Each entry in the vector represents the grey
value for a single pixel in the image.  We'll denote the corresponding
desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector.
For example, if a particular training image, $x$, depicts a $6$, then
$y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from
the network.  Note that $T$ here is the transpose operation, turning a
row vector into an ordinary (column) vector.
!-->
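</p>
<p>
The desired output vector $y(x)$ described above has a $1$ in the position of the correct digit and $0$ everywhere else. A minimal sketch of building it, where the helper name <code>desired_output</code> is an assumption made for this example:
</p>

```python
def desired_output(digit):
    """Hypothetical helper: 10-dimensional vector with 1.0 at position `digit`, 0.0 elsewhere."""
    y = [0.0] * 10
    y[digit] = 1.0
    return y


print(desired_output(6))
```

<p>
A plain Python list does not distinguish row vectors from column vectors, so the transpose in $y(x)$ has no counterpart in this sketch.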
</p><p>
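What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a <em>cost function</em>*<span class="marginnote">
*Sometimes referred to as a
  <em>loss</em> or <em>objective</em> function.  We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.</span>: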
<a class="displaced_anchor" name="eqtn6"></a>\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2.
\tag{6}\end{eqnarray}

<!--What we'd like is an algorithm which lets us find weights and biases
so that the output from the network approximates $y(x)$ for all
training inputs $x$.  To quantify how well we're achieving this goal
we define a <em>cost function</em>*<span class="marginnote">
*Sometimes referred to as a
  <em>loss</em> or <em>objective</em> function.  We use the term cost
  function throughout this book, but you should note the other
  terminology, since it's often used in research papers and other
  discussions of neural networks. </span>:
<a class="displaced_anchor" name="eqtn6"></a>\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2.
\tag{6}\end{eqnarray}

-->

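Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the <em>quadratic</em> cost function; it's also sometimes known as the <em>mean squared error</em> or just <em>MSE</em>. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large, since that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as <em>gradient descent</em>.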

<!--
Here, $w$ denotes the collection of all weights in the network, $b$
all the biases, $n$ is the total number of training inputs, $a$ is the
vector of outputs from the network when $x$ is input, and the sum is
over all training inputs, $x$.  Of course, the output $a$ depends on
$x$, $w$ and $b$, but to keep the notation simple I haven't explicitly
indicated this dependence.  The notation $\| v \|$ just denotes the
usual length function for a vector $v$.  We'll call $C$ the
<em>quadratic</em> cost function; it's also
sometimes known as the <em>mean squared error</em> or just <em>MSE</em>.
Inspecting the form of the quadratic cost function, we see that
$C(w,b)$ is non-negative, since every term in the sum is non-negative.
Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx
0$, precisely when $y(x)$ is approximately equal to the output, $a$,
for all training inputs, $x$.  So our training algorithm has done a
good job if it can find weights and biases so that $C(w,b) \approx 0$.
By contrast, it's not doing so well when $C(w,b)$ is large - that
would mean that $y(x)$ is not close to the output $a$ for a large
number of inputs.  So the aim of our training algorithm will be to
minimize the cost $C(w,b)$ as a function of the weights and biases.
In other words, we want to find a set of weights and biases which make
the cost as small as possible.  We'll do that using an algorithm known
as <em>gradient descent</em>.</p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p>
-->
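</p>
<p>
As a sketch of how the quadratic cost of Equation (6) could be evaluated, the following function takes the desired outputs $y(x)$ and the network outputs $a$ as two lists of vectors. The function name and the toy three-dimensional vectors are assumptions chosen for illustration; a real run would use $10$-dimensional outputs for every training image.
</p>

```python
def quadratic_cost(desired, actual):
    """C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(desired)
    total = 0.0
    for y, a in zip(desired, actual):
        # Squared length of the difference vector y - a.
        total += sum((yj - aj) ** 2 for yj, aj in zip(y, a))
    return total / (2 * n)


# Two toy training examples with made-up outputs.
ys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
good = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]   # close to the desired outputs
bad = [[0.1, 0.8, 0.1], [0.9, 0.1, 0.0]]    # far from the desired outputs

print(quadratic_cost(ys, good))
print(quadratic_cost(ys, bad))
```

<p>
The outputs that sit close to $y(x)$ give a cost near $0$, while the poor outputs give a much larger cost, matching the reading of $C(w,b)$ given above.
</p>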

<p>
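Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.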

<!--
Why introduce the quadratic cost?  After all, aren't we primarily
interested in the number of images correctly classified by the
network?  Why not try to maximize that number directly, rather than
minimizing a proxy measure like the quadratic cost?  The problem with
that is that the number of images correctly classified is not a smooth
function of the weights and biases in the network.  For the most part,
making small changes to the weights and biases won't cause any change
at all in the number of training images classified correctly.  That
makes it difficult to figure out how to change the weights and biases
to get improved performance.  If we instead use a smooth cost function
like the quadratic cost it turns out to be easy to figure out how to
make small changes in the weights and biases so as to get an
improvement in the cost.  That's why we focus first on minimizing the
quadratic cost, and only after that will we examine the classification
accuracy.
-->

</p><p></p>
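<p>
The contrast between the two measures can be seen on a toy problem. The sketch below uses a single sigmoid neuron with one scalar input; the data set, the function names, and the chosen weights are all made up for illustration. Nudging the weight leaves the number of correct classifications unchanged while the quadratic cost moves a little each time.
</p>

```python
import math


def sigmoid(z):
    """Sigmoid function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Toy data (made up): scalar inputs paired with 0/1 labels.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]


def accuracy_count(w, b):
    """Number of examples classified correctly using a 0.5 threshold on the output."""
    return sum(1 for x, y in data if (sigmoid(w * x + b) > 0.5) == (y == 1))


def cost(w, b):
    """Quadratic cost 1/(2n) * sum of (y - a)^2 for the single-neuron model."""
    return sum((y - sigmoid(w * x + b)) ** 2 for x, y in data) / (2 * len(data))


for w in (1.0, 1.01, 1.02):
    print(w, accuracy_count(w, 0.0), cost(w, 0.0))
```

<p>
The count gives no hint about which direction to move the weight, whereas the slowly decreasing cost does.
</p>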

<p>
Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation <span id="margin_589425638506_reveal" class="equation_link">(6)</span><span id="margin_589425638506" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_589425638506_reveal').click(function() {$('#margin_589425638506').toggle('slow', function() {});});</script>. Isn't this a rather <em>ad hoc</em> choice?
Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function and make some modifications. However, the quadratic cost function of Equation <span id="margin_436040069757_reveal" class="equation_link">(6)</span><span id="margin_436040069757" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_436040069757_reveal').click(function() {$('#margin_436040069757').toggle('slow', function() {});});</script> works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

<!--
Even given that we want to use a smooth cost function, you may still
wonder why we choose the quadratic function used in
Equation <span id="margin_589425638506_reveal" class="equation_link">(6)</span><span id="margin_589425638506" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_589425638506_reveal').click(function() {$('#margin_589425638506').toggle('slow', function() {});});</script>.  Isn't this a rather <em>ad
  hoc</em> choice?  Perhaps if we chose a different cost function we'd get
a totally different set of minimizing weights and biases?  This is a
valid concern, and later we'll revisit the cost function, and make
some modifications.  However, the quadratic cost function of
Equation <span id="margin_436040069757_reveal" class="equation_link">(6)</span><span id="margin_436040069757" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_436040069757_reveal').click(function() {$('#margin_436040069757').toggle('slow', function() {});});</script> works perfectly well for
understanding the basics of learning in neural networks, so we'll
stick with it for now.
-->

</p><p>
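Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed: the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called <em>gradient descent</em> which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.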

<!--
Recapping, our goal in training a neural network is to find weights
and biases which minimize the quadratic cost function $C(w, b)$.  This
is a well-posed problem, but it's got a lot of distracting structure
as currently posed - the interpretation of $w$ and $b$ as weights
and biases, the $\sigma$ function lurking in the background, the
choice of network architecture, MNIST, and so on.  It turns out that
we can understand a tremendous amount by ignoring most of that
structure, and just concentrating on the minimization aspect.  So for
now we're going to forget all about the specific form of the cost
function, the connection to neural networks, and so on.  Instead,
we're going to imagine that we've simply been given a function of many
variables and we want to minimize that function.  We're going to
develop a technique called <em>gradient descent</em> which can be used
to solve such minimization problems.  Then we'll come back to the
specific function we want to minimize for neural networks.
-->

</p><p>

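Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function; we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$: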

<!--
Okay, let's suppose we're trying to minimize some function, $C(v)$.
This could be any real-valued function of many variables, $v = v_1,
v_2, \ldots$.  Note that I've replaced the $w$ and $b$ notation by $v$
to emphasize that this could be any function - we're not
specifically thinking in the neural networks context any more.  To
minimize $C(v)$ it helps to imagine $C$ as a function of just two
variables, which we'll call $v_1$ and $v_2$:
-->

</p><p><center><img src="images/valley.png" width="542px"></center>
</p><p>
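What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly <em>too</em> simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.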

<!--
What we'd like is to find where $C$ achieves its global minimum.  Now,
of course, for the function plotted above, we can eyeball the graph
and find the minimum.  In that sense, I've perhaps shown slightly
<em>too</em> simple a function! A general function, $C$, may be a
complicated function of many variables, and it won't usually be
possible to just eyeball the graph to find the minimum.
-->

</p><p>

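One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want <em>far</em> more variables: the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!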

<!--
One way of attacking the problem is to use calculus to try to find the
minimum analytically.  We could compute derivatives and then try using
them to find places where $C$ is an extremum.  With some luck that
might work when $C$ is a function of just one or a few variables.  But
it'll turn into a nightmare when we have many more variables.  And for
neural networks we'll often want <em>far</em> more variables - the
biggest neural networks have cost functions which depend on billions
of weights and biases in an extremely complicated way.  Using calculus
to minimize that just won't work!
-->

</p><p>
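(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)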

<!--
(After asserting that we'll gain insight by imagining $C$ as a
function of just two variables, I've turned around twice in two
paragraphs and said, "hey, but what if it's a function of many more
than two variables?"  Sorry about that.  Please believe me when I say
that it really does help to imagine $C$ as a function of two
variables.  It just happens that sometimes that picture breaks down,
and the last two paragraphs were dealing with such breakdowns.  Good
thinking about mathematics often involves juggling multiple intuitive
pictures, learning when it's appropriate to use each picture, and when
it's not.)
-->

</p><p><a name="gradient_descent"></a>
</p><p>

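Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$. Those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.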

<!--
Okay, so calculus doesn't work.  Fortunately, there is a beautiful
analogy which suggests an algorithm which works pretty well.  We start
by thinking of our function as a kind of a valley.  If you squint just
a little at the plot above, that shouldn't be too hard.  And we
imagine a ball rolling down the slope of the valley.  Our everyday
experience tells us that the ball will eventually roll to the bottom
of the valley.  Perhaps we can use this idea as a way to find a
minimum for the function?  We'd randomly choose a starting point for
an (imaginary) ball, and then simulate the motion of the ball as it
rolled down to the bottom of the valley.  We could do this simulation
simply by computing derivatives (and perhaps some second derivatives)
of $C$ - those derivatives would tell us everything we need to know
about the local "shape" of the valley, and therefore how our ball
should roll.
-->

</p><p>

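Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously: we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?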

<!--
Based on what I've just written, you might suppose that we'll be
trying to write down Newton's equations of motion for the ball,
considering the effects of friction and gravity, and so on.  Actually,
we're not going to take the ball-rolling analogy quite that seriously
- we're devising an algorithm to minimize $C$, not developing an
accurate simulation of the laws of physics!  The ball's-eye view is
meant to stimulate our imagination, not constrain our thinking.  So
rather than get into all the messy details of physics, let's simply
ask ourselves: if we were declared God for a day, and could make up
our own laws of physics, dictating to the ball how it should roll,
what law or laws of motion could we pick that would make it so the
ball always rolled to the bottom of the valley?
-->

</p><p>

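To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows: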

<a class="displaced_anchor" name="eqtn7"></a>\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2.
\tag{7}\end{eqnarray}

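We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the <em>gradient</em> of $C$ to be the vector of partial derivatives, $\left(\frac{\partial
    C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.: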

<a class="displaced_anchor" name="eqtn8"></a>\begin{eqnarray}
  \nabla C \equiv \left( \frac{\partial C}{\partial v_1},
  \frac{\partial C}{\partial v_2} \right)^T.
\tag{8}\end{eqnarray}

<!--
To make this question more precise, let's think about what happens
when we move the ball a small amount $\Delta v_1$ in the $v_1$
direction, and a small amount $\Delta v_2$ in the $v_2$ direction.
Calculus tells us that $C$ changes as follows:
<a class="displaced_anchor" name="eqtn7"></a>\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2.
\tag{7}\end{eqnarray}
We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as
to make $\Delta C$ negative; i.e., we'll choose them so the ball is
rolling down into the valley.  To figure out how to make such a choice
it helps to define $\Delta v$ to be the vector of changes in $v$,
$\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the
transpose operation, turning row vectors into column vectors.  We'll
also define the <em>gradient</em> of $C$
to be the vector of partial derivatives, $\left(\frac{\partial
    C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$.  We
denote the gradient vector by $\nabla C$, i.e.:
<a class="displaced_anchor" name="eqtn8"></a>\begin{eqnarray}
  \nabla C \equiv \left( \frac{\partial C}{\partial v_1},
  \frac{\partial C}{\partial v_2} \right)^T.
\tag{8}\end{eqnarray}
-->
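</p><p>
As a quick numerical sanity check of Equation (7), the sketch below compares the actual change in $C$ with the first-order estimate built from the partial derivatives in Equation (8). The function <code>C</code> (a simple bowl, $v_1^2 + 2 v_2^2$), the helper names and the step sizes are all illustrative assumptions, not anything taken from the text above.

```python
def C(v1, v2):
    # Hypothetical example function: a simple bowl-shaped valley.
    return v1 ** 2 + 2.0 * v2 ** 2


def grad_C(v1, v2):
    # Partial derivatives of the example function, worked out by hand.
    return (2.0 * v1, 4.0 * v2)


def compare(v, dv):
    """Return (actual change in C, first-order estimate) for a move dv from v."""
    actual = C(v[0] + dv[0], v[1] + dv[1]) - C(v[0], v[1])
    g = grad_C(v[0], v[1])
    estimate = g[0] * dv[0] + g[1] * dv[1]
    return actual, estimate


if __name__ == "__main__":
    for scale in (0.1, 0.01, 0.001):
        actual, estimate = compare((1.0, -0.5), (scale, -scale))
        print(scale, actual, estimate, abs(actual - estimate))
```

For this particular bowl the gap between the two numbers shrinks like the square of the step size, which is the error that the "$\approx$" in Equation (7) is hiding.
</p><p>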

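In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object, the vector defined above, which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.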

<!--
In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$
and the gradient, $\nabla C$.  Before getting to that, though, I want
to clarify something that sometimes gets people hung up on the
gradient.  When meeting the $\nabla C$ notation for the first time,
people sometimes wonder how they should think about the $\nabla$
symbol.  What, exactly, does $\nabla$ mean?  In fact, it's perfectly
fine to think of $\nabla C$ as a single mathematical object - the
vector defined above - which happens to be written using two
symbols.  In this point of view, $\nabla$ is just a piece of
notational flag-waving, telling you "hey, $\nabla C$ is a gradient
vector".  There are more advanced points of view where $\nabla$ can
be viewed as an independent mathematical entity in its own right (for
example, as a differential operator), but we won't need such points of
view.
-->

</p><p>
With these definitions, the expression <span id="margin_775012797590_reveal" class="equation_link">(7)</span><span id="margin_775012797590" class="marginequation" style="display: none;"><a href="chap1.html#eqtn7" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2 \nonumber\end{eqnarray}</a></span><script>$('#margin_775012797590_reveal').click(function() {$('#margin_775012797590').toggle('slow', function() {});});</script> for $\Delta C$ can be rewritten as

<a class="displaced_anchor" name="eqtn9"></a>\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v.
\tag{9}\end{eqnarray}

<!--
With these definitions, the expression <span id="margin_775012797590_reveal" class="equation_link">(7)</span><span id="margin_775012797590" class="marginequation" style="display: none;"><a href="chap1.html#eqtn7" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2 \nonumber\end{eqnarray}</a></span><script>$('#margin_775012797590_reveal').click(function() {$('#margin_775012797590').toggle('slow', function() {});});</script> for
$\Delta C$ can be rewritten as
<a class="displaced_anchor" name="eqtn9"></a>\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v.
\tag{9}\end{eqnarray}
-->

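This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose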

<a class="displaced_anchor" name="eqtn10"></a>\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\tag{10}\end{eqnarray}

<!--
This equation helps explain why $\nabla C$ is called the gradient
vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as
we'd expect something called a gradient to do.  But what's really
exciting about the equation is that it lets us see how to choose
$\Delta v$ so as to make $\Delta C$ negative.  In particular, suppose
we choose
<a class="displaced_anchor" name="eqtn10"></a>\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\tag{10}\end{eqnarray}
-->

where $\eta$ is a small, positive parameter (known as the <em>learning rate</em>). Then Equation <span id="margin_268985733901_reveal" class="equation_link">(9)</span><span id="margin_268985733901" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_268985733901_reveal').click(function() {$('#margin_268985733901').toggle('slow', function() {});});</script>
  tells us that $\Delta C \approx
-\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|
\nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in <span id="margin_47692521241_reveal" class="equation_link">(10)</span><span id="margin_47692521241" class="marginequation" style="display: none;"><a href="chap1.html#eqtn10" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta v = -\eta \nabla C \nonumber\end{eqnarray}</a></span><script>$('#margin_47692521241_reveal').click(function() {$('#margin_47692521241').toggle('slow', function() {});});</script> (within, of course, the limits of the approximation in Equation <span id="margin_188819216022_reveal" class="equation_link">(9)</span><span id="margin_188819216022" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_188819216022_reveal').click(function() {$('#margin_188819216022').toggle('slow', function() {});});</script>). This is exactly the property we wanted! And so we'll take Equation <span id="margin_189316591389_reveal" class="equation_link">(10)</span><span id="margin_189316591389" class="marginequation" style="display: none;"><a href="chap1.html#eqtn10" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta v = -\eta \nabla C \nonumber\end{eqnarray}</a></span><script>$('#margin_189316591389_reveal').click(function() {$('#margin_189316591389').toggle('slow', function() {});});</script> to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation <span id="margin_744144000697_reveal" class="equation_link">(10)</span><span id="margin_744144000697" class="marginequation" style="display: none;"><a href="chap1.html#eqtn10" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta v = -\eta \nabla C \nonumber\end{eqnarray}</a></span><script>$('#margin_744144000697_reveal').click(function() {$('#margin_744144000697').toggle('slow', function() {});});</script> to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

<a class="displaced_anchor" name="eqtn11"></a>\begin{eqnarray}
  v \rightarrow v' = v -\eta \nabla C.
\tag{11}\end{eqnarray}

<!--
where $\eta$ is a small, positive parameter (known as the
<em>learning rate</em>).
Then Equation <span id="margin_268985733901_reveal" class="equation_link">(9)</span><span id="margin_268985733901" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_268985733901_reveal').click(function() {$('#margin_268985733901').toggle('slow', function() {});});</script> tells us that $\Delta C \approx
-\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$.  Because $\|
\nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e.,
$C$ will always decrease, never increase, if we change $v$ according
to the prescription in <span id="margin_47692521241_reveal" class="equation_link">(10)</span><span id="margin_47692521241" class="marginequation" style="display: none;"><a href="chap1.html#eqtn10" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta v = -\eta \nabla C \nonumber\end{eqnarray}</a></span><script>$('#margin_47692521241_reveal').click(function() {$('#margin_47692521241').toggle('slow', function() {});});</script>.  (Within, of
course, the limits of the approximation in
Equation <span id="margin_188819216022_reveal" class="equation_link">(9)</span><span id="margin_188819216022" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_188819216022_reveal').click(function() {$('#margin_188819216022').toggle('slow', function() {});});</script>).  This is exactly the property we wanted!
And so we'll take Equation <span id="margin_189316591389_reveal" class="equation_link">(10)</span><span id="margin_189316591389" class="marginequation" style="display: none;"><a href="chap1.html#eqtn10" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta v = -\eta \nabla C \nonumber\end{eqnarray}</a></span><script>$('#margin_189316591389_reveal').click(function() {$('#margin_189316591389').toggle('slow', function() {});});</script> to define the
"law of motion" for the ball in our gradient descent algorithm.
That is, we'll use Equation <span id="margin_744144000697_reveal" class="equation_link">(10)</span><span id="margin_744144000697" class="marginequation" style="display: none;"><a href="chap1.html#eqtn10" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta v = -\eta \nabla C \nonumber\end{eqnarray}</a></span><script>$('#margin_744144000697_reveal').click(function() {$('#margin_744144000697').toggle('slow', function() {});});</script> to compute a
value for $\Delta v$, then move the ball's position $v$ by that
amount:
<a class="displaced_anchor" name="eqtn11"></a>\begin{eqnarray}
  v \rightarrow v' = v -\eta \nabla C.
\tag{11}\end{eqnarray}
-->

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until, we hope, we reach a global minimum.

<!--
Then we'll use this update rule again, to make another move.  If we
keep doing this, over and over, we'll keep decreasing $C$ until - we
hope - we reach a global minimum.
-->
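</p><p>
The repeated update $v \rightarrow v' = v - \eta \nabla C$ can be sketched in a few lines. This is a minimal illustration under stated assumptions: the valley $C(v) = v_1^2 + 2 v_2^2$, its hand-computed gradient, the starting point and the value of $\eta$ are all made up for the example.

```python
def grad_C(v):
    # Gradient of the hypothetical valley C(v) = v1^2 + 2*v2^2.
    return [2.0 * v[0], 4.0 * v[1]]


def gradient_descent(v, eta, steps):
    """Apply v -> v - eta * grad C(v) repeatedly and return the final v."""
    v = list(v)
    for _ in range(steps):
        g = grad_C(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v


if __name__ == "__main__":
    # The minimum of this valley is at (0, 0); the result should be very close.
    print(gradient_descent([1.0, -0.5], eta=0.1, steps=100))
```

Note that the loop only ever asks for the gradient at the current position; it never needs a formula for where the minimum is.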

</p><p>
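Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the <em>opposite</em> direction, "falling down" the slope of the valley. We can visualize it like this: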

<!--
Summing up, the way the gradient descent algorithm works is to
repeatedly compute the gradient $\nabla C$, and then to move in the
<em>opposite</em> direction, "falling down" the slope of the valley.
We can visualize it like this:
-->

</p><p><center><img src="images/valley_with_ball.png" width="542px"></center>
</p><p>
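Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!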

<!--
Notice that with this rule gradient descent doesn't reproduce real
physical motion.  In real life a ball has momentum, and that momentum
may allow it to roll across the slope, or even (momentarily) roll
uphill.  It's only after the effects of friction set in that the ball
is guaranteed to roll down into the valley.  By contrast, our rule for
choosing $\Delta v$ just says "go down, right now".  That's still a
pretty good rule for finding the minimum!
-->

</p><p>
To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation <span id="margin_565114699479_reveal" class="equation_link">(9)</span><span id="margin_565114699479" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_565114699479_reveal').click(function() {$('#margin_565114699479').toggle('slow', function() {});});</script> is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation <span id="margin_496706523868_reveal" class="equation_link">(9)</span><span id="margin_496706523868" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_496706523868_reveal').click(function() {$('#margin_496706523868').toggle('slow', function() {});});</script> remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

<!--
To make gradient descent work correctly, we need to choose the
learning rate $\eta$ to be small
enough that Equation <span id="margin_565114699479_reveal" class="equation_link">(9)</span><span id="margin_565114699479" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_565114699479_reveal').click(function() {$('#margin_565114699479').toggle('slow', function() {});});</script> is a good approximation.  If
we don't, we might end up with $\Delta C > 0$, which obviously would
not be good!  At the same time, we don't want $\eta$ to be too small,
since that will make the changes $\Delta v$ tiny, and thus the
gradient descent algorithm will work very slowly.  In practical
implementations, $\eta$ is often varied so that
Equation <span id="margin_496706523868_reveal" class="equation_link">(9)</span><span id="margin_496706523868" class="marginequation" style="display: none;"><a href="chap1.html#eqtn9" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_496706523868_reveal').click(function() {$('#margin_496706523868').toggle('slow', function() {});});</script> remains a good approximation, but the
algorithm isn't too slow.  We'll see later how this
works.
-->
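</p><p>
The trade-off in choosing $\eta$ is easy to see on a toy problem. The sketch below assumes a one-variable cost $C(v) = v^2$ (so $C'(v) = 2v$) and three arbitrary learning rates; none of these values come from the text, they are only there to show the three regimes.

```python
def cost(v):
    # Hypothetical one-variable cost with its minimum at v = 0.
    return v * v


def run(eta, steps=20, v=1.0):
    """Return the list of costs seen while applying v -> v - eta * C'(v)."""
    history = [cost(v)]
    for _ in range(steps):
        v = v - eta * 2.0 * v  # C'(v) = 2v for this example
        history.append(cost(v))
    return history


if __name__ == "__main__":
    for eta in (0.001, 0.1, 1.1):
        print(eta, run(eta)[-1])
```

With this particular cost, each step multiplies $v$ by $1 - 2\eta$. A tiny $\eta$ barely moves, a moderate $\eta$ converges quickly, and $\eta = 1.1$ overshoots so badly that $\Delta C > 0$ on every step and the cost grows.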

</p><p>
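I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1,
\ldots, \Delta v_m)^T$ is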

<!--
I've explained gradient descent when $C$ is a function of just two
variables.  But, in fact, everything works just as well even when $C$
is a function of many more variables.  Suppose in particular that $C$
is a function of $m$ variables, $v_1,\ldots,v_m$.  Then the change
$\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1,
\ldots, \Delta v_m)^T$ is
-->

<a class="displaced_anchor" name="eqtn12"></a>\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v,
\tag{12}\end{eqnarray}

where the gradient $\nabla C$ is the vector

<!--
where the gradient $\nabla C$ is the vector
-->

<a class="displaced_anchor" name="eqtn13"></a>\begin{eqnarray}
  \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots,
  \frac{\partial C}{\partial v_m}\right)^T.
\tag{13}\end{eqnarray}

Just as for the two variable case, we can choose

<!--
Just as for the two variable case, we can
choose
-->

<a class="displaced_anchor" name="eqtn14"></a>\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\tag{14}\end{eqnarray}

and we're guaranteed that our (approximate) expression <span id="margin_563463857829_reveal" class="equation_link">(12)</span><span id="margin_563463857829" class="marginequation" style="display: none;"><a href="chap1.html#eqtn12" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_563463857829_reveal').click(function() {$('#margin_563463857829').toggle('slow', function() {});});</script> for $\Delta C$ will be negative (provided $\nabla C \neq 0$). This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

<!--
and we're guaranteed that our (approximate)
expression <span id="margin_563463857829_reveal" class="equation_link">(12)</span><span id="margin_563463857829" class="marginequation" style="display: none;"><a href="chap1.html#eqtn12" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}</a></span><script>$('#margin_563463857829_reveal').click(function() {$('#margin_563463857829').toggle('slow', function() {});});</script> for $\Delta C$ will be negative.
This gives us a way of following the gradient to a minimum, even when
$C$ is a function of many variables, by repeatedly applying the update
rule
-->

<a class="displaced_anchor" name="eqtn15"></a>\begin{eqnarray}
  v \rightarrow v' = v-\eta \nabla C.
\tag{15}\end{eqnarray}

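You can think of this update rule as <em>defining</em> the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work: several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.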

<!--
You can think of this update rule as <em>defining</em> the gradient
descent algorithm.  It gives us a way of repeatedly changing the
position $v$ in order to find a minimum of the function $C$.  The rule
doesn't always work - several things can go wrong and prevent
gradient descent from finding the global minimum of $C$, a point we'll
return to explore in later chapters.  But, in practice gradient
descent often works extremely well, and in neural networks we'll find
that it's a powerful way of minimizing the cost function, and so
helping the net learn.
-->
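</p><p>
Nothing in the update rule depends on the number of variables, which the following sketch tries to make concrete. It assumes a made-up cost of $m$ variables and, purely for illustration, estimates each partial derivative with a central finite difference; that estimate costs two evaluations of $C$ per variable per step, so it is a teaching device rather than a recommendation for real networks.

```python
def numerical_gradient(C, v, h=1e-6):
    """Estimate each partial derivative of C at v by a central difference."""
    grad = []
    for j in range(len(v)):
        up = list(v)
        down = list(v)
        up[j] += h
        down[j] -= h
        grad.append((C(up) - C(down)) / (2.0 * h))
    return grad


def gradient_descent(C, v, eta=0.1, steps=200):
    """Repeatedly apply v -> v - eta * grad C(v) for a function of m variables."""
    v = list(v)
    for _ in range(steps):
        g = numerical_gradient(C, v)
        v = [vj - eta * gj for vj, gj in zip(v, g)]
    return v


def example_cost(v):
    # Hypothetical cost of m variables whose minimum sits at v_j = j.
    return sum((vj - j) ** 2 for j, vj in enumerate(v))


if __name__ == "__main__":
    # Start from the origin in five dimensions; expect roughly [0, 1, 2, 3, 4].
    print(gradient_descent(example_cost, [0.0] * 5))
```

The same two functions work unchanged for any length of <code>v</code>; only the cost of computing the gradient grows.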

</p><p>

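Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C
\cdot \Delta v$. We'll constrain the size of the move so that $\|
\Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.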

<!--
Indeed, there's even a sense in which gradient descent is the optimal
strategy for searching for a minimum.  Let's suppose that we're trying
to make a move $\Delta v$ in position so as to decrease $C$ as much as
possible.  This is equivalent to minimizing $\Delta C \approx \nabla C
\cdot \Delta v$.  We'll constrain the size of the move so that $\|
\Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$.  In other
words, we want a move that is a small step of a fixed size, and we're
trying to find the movement direction which decreases $C$ as much as
possible.  It can be proved that the choice of $\Delta v$ which
minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$,
where $\eta = \epsilon / \|\nabla C\|$ is determined by the size
constraint $\|\Delta v\| = \epsilon$.  So gradient descent can be
viewed as a way of taking small steps in the direction which does the
most to immediately decrease $C$.
-->
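This claim can be checked numerically, though a check is not a proof. The sketch below assumes two variables, an arbitrary gradient vector and an arbitrary $\epsilon$; it scans many moves of length $\epsilon$ around the circle, keeps the one with the smallest $\nabla C \cdot \Delta v$, and compares it with $-\eta \nabla C$ for $\eta = \epsilon / \|\nabla C\|$.

```python
import math


def best_direction(grad, epsilon, samples=3600):
    """Scan moves of length epsilon and return the one minimizing grad . dv."""
    best_dv, best_value = None, float("inf")
    for k in range(samples):
        angle = 2.0 * math.pi * k / samples
        dv = (epsilon * math.cos(angle), epsilon * math.sin(angle))
        value = grad[0] * dv[0] + grad[1] * dv[1]
        if best_value > value:
            best_dv, best_value = dv, value
    return best_dv


def gradient_descent_move(grad, epsilon):
    """The move -eta * grad with eta = epsilon / ||grad||."""
    eta = epsilon / math.hypot(grad[0], grad[1])
    return (-eta * grad[0], -eta * grad[1])


if __name__ == "__main__":
    grad = (3.0, -4.0)  # hypothetical gradient at the current position
    print(best_direction(grad, 0.01))
    print(gradient_descent_move(grad, 0.01))
```

The two printed moves agree up to the angular resolution of the scan, which is what the assertion above predicts.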

</p><p>
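<h4><a name="exercises_647181"></a><a href="#exercises_647181">Exercises</a></h4><ul>
<li> Prove the assertion made just above: among all moves with $\|\Delta v\| = \epsilon$, the choice $\Delta v = -\eta \nabla C$ with $\eta = \epsilon / \|\nabla C\|$ minimizes $\nabla C \cdot \Delta v$.  <em>Hint:</em> If you're not already familiar with the
    <a href="http://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy-Schwarz inequality</a>, you may find it helpful to familiarize yourself with it.
</p><p><li> I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?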
</ul></p><p>

<!--
<h4><a name="exercises_647181"></a><a href="#exercises_647181">Exercises</a></h4><ul>
<li> Prove the assertion of the last paragraph.  <em>Hint:</em> If
    you're not already familiar with the
    <a href="http://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy-Schwarz
      inequality</a>, you may find it helpful to familiarize yourself
    with it.</p><p><li> I explained gradient descent when $C$ is a function of two
  variables, and when it's a function of more than two variables.
  What happens when $C$ is a function of just one variable?  Can you
  provide a geometric interpretation of what gradient descent is doing
  in the one-dimensional case?
</ul></p><p>
-->

</p><p>
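People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*<span class="marginnote">
*Actually, more like half a trillion, since
  $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial
  v_k \partial v_j$.  Still, you get the point.</span>!  That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.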

<!--
People have investigated many variations of gradient descent,
including variations that more closely mimic a real physical ball.
These ball-mimicking variations have some advantages, but also have a
major disadvantage: it turns out to be necessary to compute second
partial derivatives of $C$, and this can be quite costly.  To see why
it's costly, suppose we want to compute all the second partial
derivatives $\partial^2 C/ \partial v_j \partial v_k$.  If there are a
million such $v_j$ variables then we'd need to compute something like
a trillion (i.e., a million squared) second partial
derivatives*<span class="marginnote">
*Actually, more like half a trillion, since
  $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial
  v_k \partial v_j$.  Still, you get the point.</span>!  That's going to be
computationally costly.  With that said, there are tricks for avoiding
this kind of problem, and finding alternatives to gradient descent is
an active area of investigation.  But in this book we'll use gradient
descent (and variations) as our main approach to learning in neural
networks.
-->
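</p><p>
The counting argument is simple enough to write down directly. In the sketch below the function name is an assumption made for illustration; it returns the number of ordered pairs $(j, k)$ and the number of distinct second partial derivatives once the symmetry $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$ is used.

```python
def second_derivative_counts(m):
    """Return (all ordered pairs, distinct pairs using symmetry) for m variables."""
    all_pairs = m * m
    # Pairs with j different from k are counted once, plus the m diagonal terms.
    distinct = m * (m + 1) // 2
    return all_pairs, distinct


if __name__ == "__main__":
    print(second_derivative_counts(10 ** 6))
```

By comparison, the gradient itself has only $m$ components, which is why first-order methods scale so much better.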

</p><p>
How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation <span id="margin_213844555977_reveal" class="equation_link">(6)</span><span id="margin_213844555977" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_213844555977_reveal').click(function() {$('#margin_213844555977').toggle('slow', function() {});});</script>. To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C
/ \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

<!--
How can we apply gradient descent to learn in a neural network?  The
idea is to use gradient descent to find the weights $w_k$ and biases
$b_l$ which minimize the cost in
Equation <span id="margin_213844555977_reveal" class="equation_link">(6)</span><span id="margin_213844555977" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_213844555977_reveal').click(function() {$('#margin_213844555977').toggle('slow', function() {});});</script>.  To see how this works, let's
restate the gradient descent update rule, with the weights and biases
replacing the variables $v_j$.  In other words, our "position" now
has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has
corresponding components $\partial C / \partial w_k$ and $\partial C
/ \partial b_l$.  Writing out the gradient descent update rule in
terms of components, we have
-->

<a class="displaced_anchor" name="eqtn16"></a><a class="displaced_anchor" name="eqtn17"></a>\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\
  b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}.
\tag{17}\end{eqnarray}

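By repeatedly applying this update rule we can "roll down the hill",
and hopefully find a minimum of the cost function.  In other words,
this is a rule which can be used to learn in a neural network.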
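
As a rough illustration of update rules (16) and (17), here is a minimal sketch of gradient descent on a toy cost with a single weight and a single bias.  The cost $C(w,b) = (w-3)^2 + (b+1)^2$, the function names, and the values of <tt>eta</tt> and <tt>steps</tt> are assumptions chosen for illustration; none of them come from the network code in this book.

```python
def grad_C(w, b):
    """Partial derivatives of the toy cost C(w, b) = (w - 3)**2 + (b + 1)**2."""
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


def gradient_descent(w, b, eta, steps):
    """Apply w -> w - eta*dC/dw and b -> b - eta*dC/db `steps` times."""
    for _ in range(steps):
        dC_dw, dC_db = grad_C(w, b)
        w = w - eta * dC_dw
        b = b - eta * dC_db
    return w, b


w, b = gradient_descent(w=0.0, b=0.0, eta=0.1, steps=100)
print(w, b)  # close to the minimum at w = 3, b = -1
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, which is the "rolling down the hill" behaviour described above in its simplest form.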

<!--
By repeatedly applying this update rule we can "roll down the hill",
and hopefully find a minimum of the cost function.  In other words,
this is a rule which can be used to learn in a neural network.
-->

</p><p>

There are a number of challenges in applying the gradient descent
rule.  We'll look into those in depth in later chapters.  But for now
I just want to mention one problem.  To understand what the problem
is, let's look back at the quadratic cost in
Equation <span id="margin_988567640552_reveal" class="equation_link">(6)</span><span id="margin_988567640552" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_988567640552_reveal').click(function() {$('#margin_988567640552').toggle('slow', function() {});});</script>.  Notice that this cost
function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an
average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual
training examples.  In practice, to compute the gradient $\nabla C$ we
need to compute the gradients $\nabla C_x$ separately for each
training input, $x$, and then average them, $\nabla C = \frac{1}{n}
\sum_x \nabla C_x$.  Unfortunately, when the number of training inputs
is very large this can take a long time, and learning thus occurs
slowly.

<!--
There are a number of challenges in applying the gradient descent
rule.  We'll look into those in depth in later chapters.  But for now
I just want to mention one problem.  To understand what the problem
is, let's look back at the quadratic cost in
Equation <span id="margin_988567640552_reveal" class="equation_link">(6)</span><span id="margin_988567640552" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_988567640552_reveal').click(function() {$('#margin_988567640552').toggle('slow', function() {});});</script>.  Notice that this cost
function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an
average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual
training examples.  In practice, to compute the gradient $\nabla C$ we
need to compute the gradients $\nabla C_x$ separately for each
training input, $x$, and then average them, $\nabla C = \frac{1}{n}
\sum_x \nabla C_x$.  Unfortunately, when the number of training inputs
is very large this can take a long time, and learning thus occurs
slowly.
-->

</p><p>

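An idea called <em>stochastic gradient descent</em> can be used to speed
up learning.  The idea is to estimate the gradient $\nabla C$ by
computing $\nabla C_x$ for a small sample of randomly chosen training
inputs.  By averaging over this small sample it turns out that we can
quickly get a good estimate of the true gradient $\nabla C$, and this
helps speed up gradient descent, and thus learning.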

<!--
An idea called <em>stochastic gradient descent</em> can be used to speed
up learning.  The idea is to estimate the gradient $\nabla C$ by
computing $\nabla C_x$ for a small sample of randomly chosen training
inputs.  By averaging over this small sample it turns out that we can
quickly get a good estimate of the true gradient $\nabla C$, and this
helps speed up gradient descent, and thus learning.
-->

</p><p>

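To make these ideas more precise, stochastic gradient descent works by
picking out a small number $m$ of randomly chosen training
inputs.  We'll label those random training inputs $X_1, X_2, \ldots,
X_m$, and refer to them as a <em>mini-batch</em>.  Provided the sample
size $m$ is large enough we expect that the average value of the
$\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla
C_x$, that is,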

<!--
To make these ideas more precise, stochastic gradient descent works by
randomly picking out a small number $m$ of randomly chosen training
inputs.  We'll label those random training inputs $X_1, X_2, \ldots,
X_m$, and refer to them as a <em>mini-batch</em>.  Provided the sample
size $m$ is large enough we expect that the average value of the
$\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla
C_x$, that is,
-->

<a class="displaced_anchor" name="eqtn18"></a>\begin{eqnarray}
  \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C,
\tag{18}\end{eqnarray}

where the second sum is over the entire set of training data.
Swapping sides we get

<!--
where the second sum is over the entire set of training data.
Swapping sides we get
-->

<a class="displaced_anchor" name="eqtn19"></a>\begin{eqnarray}
  \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}},
\tag{19}\end{eqnarray}

confirming that we can estimate the overall gradient by computing
gradients just for the randomly chosen mini-batch.
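
To make Equation (19) concrete, the sketch below compares the full-batch gradient with a mini-batch estimate for a toy one-parameter model.  The per-example cost $C_x = (wx - y)^2/2$, the synthetic data, and the helper names are assumptions made purely for illustration.

```python
import random


def grad_Cx(w, x, y):
    """Gradient with respect to w of the per-example cost (w*x - y)**2 / 2."""
    return (w * x - y) * x


def mean_gradient(w, examples):
    """Average the per-example gradients over the given examples."""
    return sum(grad_Cx(w, x, y) for x, y in examples) / len(examples)


random.seed(0)
# Synthetic training set: n = 1000 inputs with targets y = 3x.
inputs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
training_data = [(x, 3.0 * x) for x in inputs]

w = 0.0
full_gradient = mean_gradient(w, training_data)   # (1/n) * sum over all x
mini_batch = random.sample(training_data, 50)     # m = 50 randomly chosen inputs
estimate = mean_gradient(w, mini_batch)           # (1/m) * sum over the mini-batch
print(full_gradient, estimate)
```

The two printed numbers will generally differ somewhat, which is the statistical fluctuation one expects from sampling, but both point in the same direction for this toy problem.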

<!--
confirming that we can estimate the overall gradient by computing
gradients just for the randomly chosen mini-batch.
-->

</p><p>
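To connect this explicitly to learning in neural networks, suppose
$w_k$ and $b_l$ denote the weights and biases in our neural network.
Then stochastic gradient descent works by picking out a randomly
chosen mini-batch of training inputs, and training with those,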

<!--
To connect this explicitly to learning in neural networks, suppose
$w_k$ and $b_l$ denote the weights and biases in our neural network.
Then stochastic gradient descent works by picking out a randomly
chosen mini-batch of training inputs, and training with those,
-->

<a class="displaced_anchor" name="eqtn20"></a><a class="displaced_anchor" name="eqtn21"></a>\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\

  b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial b_l},
\tag{21}\end{eqnarray}

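where the sums are over all the training examples $X_j$ in the current
mini-batch.  Then we pick out another randomly chosen mini-batch and
train with those.  And so on, until we've exhausted the training
inputs, which is said to complete an
<em>epoch</em> of training.  At that point
we start over with a new training epoch.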

<!--
where the sums are over all the training examples $X_j$ in the current
mini-batch.  Then we pick out another randomly chosen mini-batch and
train with those.  And so on, until we've exhausted the training
inputs, which is said to complete an
<em>epoch</em> of training.  At that point
we start over with a new training epoch.
-->

</p><p>
Incidentally, it's worth noting that conventions vary about scaling of
the cost function and of mini-batch updates to the weights and biases.
In Equation <span id="margin_620946352852_reveal" class="equation_link">(6)</span><span id="margin_620946352852" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_620946352852_reveal').click(function() {$('#margin_620946352852').toggle('slow', function() {});});</script> we scaled the overall cost
function by a factor $\frac{1}{n}$.  People sometimes omit the
$\frac{1}{n}$, summing over the costs of individual training examples
instead of averaging.  This is particularly useful when the total
number of training examples isn't known in advance.  This can occur if
more training data is being generated in real time, for instance.
And, in a similar way, the mini-batch update rules <span id="margin_118229027727_reveal" class="equation_link">(20)</span><span id="margin_118229027727" class="marginequation" style="display: none;"><a href="chap1.html#eqtn20" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial w_k}  \nonumber\end{eqnarray}</a></span><script>$('#margin_118229027727_reveal').click(function() {$('#margin_118229027727').toggle('slow', function() {});});</script>
and <span id="margin_746944869735_reveal" class="equation_link">(21)</span><span id="margin_746944869735" class="marginequation" style="display: none;"><a href="chap1.html#eqtn21" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial b_l} \nonumber\end{eqnarray}</a></span><script>$('#margin_746944869735_reveal').click(function() {$('#margin_746944869735').toggle('slow', function() {});});</script> are sometimes written without the $\frac{1}{m}$ factor in
front of the sums.  Conceptually this makes little difference, since
it's equivalent to rescaling the learning rate $\eta$.  But when doing
detailed comparisons of different work it's worth watching out for.
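
The remark about rescaling $\eta$ can be checked numerically.  In the sketch below the gradient values, the learning rate, and both function names are made up for illustration: one update averages the mini-batch gradients as in Equation (20), the other sums them but is given the learning rate $\eta / m$.

```python
def update_with_average(w, grads, eta):
    """Equation (20): w -> w - (eta / m) * sum of the mini-batch gradients."""
    m = len(grads)
    return w - (eta / m) * sum(grads)


def update_with_sum(w, grads, eta):
    """Convention that omits the 1/m factor: w -> w - eta * sum of gradients."""
    return w - eta * sum(grads)


grads = [0.5, -0.25, 1.0, 0.75]   # hypothetical per-example gradients, m = 4
eta = 0.5

w_avg = update_with_average(1.0, grads, eta)
w_sum = update_with_sum(1.0, grads, eta / len(grads))   # rescaled learning rate
print(w_avg, w_sum)  # both conventions give the same new weight, 0.75
```

So a learning rate reported under one convention has to be divided (or multiplied) by the mini-batch size before it can be compared with one reported under the other.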

<!--
Incidentally, it's worth noting that conventions vary about scaling of
the cost function and of mini-batch updates to the weights and biases.
In Equation <span id="margin_620946352852_reveal" class="equation_link">(6)</span><span id="margin_620946352852" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_620946352852_reveal').click(function() {$('#margin_620946352852').toggle('slow', function() {});});</script> we scaled the overall cost
function by a factor $\frac{1}{n}$.  People sometimes omit the
$\frac{1}{n}$, summing over the costs of individual training examples
instead of averaging.  This is particularly useful when the total
number of training examples isn't known in advance.  This can occur if
more training data is being generated in real time, for instance.
And, in a similar way, the mini-batch update rules <span id="margin_118229027727_reveal" class="equation_link">(20)</span><span id="margin_118229027727" class="marginequation" style="display: none;"><a href="chap1.html#eqtn20" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial w_k}  \nonumber\end{eqnarray}</a></span><script>$('#margin_118229027727_reveal').click(function() {$('#margin_118229027727').toggle('slow', function() {});});</script>
and <span id="margin_746944869735_reveal" class="equation_link">(21)</span><span id="margin_746944869735" class="marginequation" style="display: none;"><a href="chap1.html#eqtn21" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial b_l} \nonumber\end{eqnarray}</a></span><script>$('#margin_746944869735_reveal').click(function() {$('#margin_746944869735').toggle('slow', function() {});});</script> sometimes omit the $\frac{1}{m}$ term out the
front of the sums.  Conceptually this makes little difference, since
it's equivalent to rescaling the learning rate $\eta$.  But when doing
detailed comparisons of different work it's worth watching out for.
-->

</p><p>
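We can think of stochastic gradient descent as being like political
polling: it's much easier to sample a small mini-batch than it is to
apply gradient descent to the full batch, just as carrying out a poll
is easier than running a full election.  For example, if we have a
training set of size $n = 60,000$, as in MNIST, and choose a
mini-batch size of (say) $m = 10$, this means we'll get a factor of
$6,000$ speedup in estimating the gradient!  Of course, the estimate
won't be perfect - there will be statistical fluctuations - but it
doesn't need to be perfect: all we really care about is moving in a
general direction that will help decrease $C$, and that means we don't
need an exact computation of the gradient.  In practice, stochastic
gradient descent is a commonly used and powerful technique for
learning in neural networks, and it's the basis for most of the
learning techniques we'll develop in this book.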

<!--
We can think of stochastic gradient descent as being like political
polling: it's much easier to sample a small mini-batch than it is to
apply gradient descent to the full batch, just as carrying out a poll
is easier than running a full election.  For example, if we have a
training set of size $n = 60,000$, as in MNIST, and choose a
mini-batch size of (say) $m = 10$, this means we'll get a factor of
$6,000$ speedup in estimating the gradient!  Of course, the estimate
won't be perfect - there will be statistical fluctuations - but it
doesn't need to be perfect: all we really care about is moving in a
general direction that will help decrease $C$, and that means we don't
need an exact computation of the gradient.  In practice, stochastic
gradient descent is a commonly used and powerful technique for
learning in neural networks, and it's the basis for most of the
learning techniques we'll develop in this book.
-->

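</p><p></p><p></p><p></p><p></p><p></p><p><h4><a name="exercise_263792"></a><a href="#exercise_263792">Exercise</a></h4><ul>
<li> An extreme version of gradient descent is to use a mini-batch
  size of just 1.  That is, given a training input, $x$, we update our
  weights and biases according to the rules $w_k \rightarrow w_k' =
  w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' =
  b_l - \eta \partial C_x / \partial b_l$.  Then we choose another
  training input, and update the weights and biases again.  And so on,
  repeatedly.  This procedure is known as <em>online</em>,
  <em>on-line</em>, or <em>incremental</em> learning.  In online learning,
  a neural network learns from just one training input at a time (just
  as human beings do).  Name one advantage and one disadvantage of
  online learning, compared to stochastic gradient descent with a
  mini-batch size of, say, $20$.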
</ul></p><p>

<!--
</p><p></p><p></p><p></p><p></p><p></p><p><h4><a name="exercise_263792"></a><a href="#exercise_263792">Exercise</a></h4><ul>
<li> An extreme version of gradient descent is to use a mini-batch
  size of just 1.  That is, given a training input, $x$, we update our
  weights and biases according to the rules $w_k \rightarrow w_k' =
  w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' =
  b_l - \eta \partial C_x / \partial b_l$.  Then we choose another
  training input, and update the weights and biases again.  And so on,
  repeatedly.  This procedure is known as <em>online</em>,
  <em>on-line</em>, or <em>incremental</em> learning.  In online learning,
  a neural network learns from just one training input at a time (just
  as human beings do).  Name one advantage and one disadvantage of
  online learning, compared to stochastic gradient descent with a
  mini-batch size of, say, $20$.
</ul></p><p>
-->

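Let me conclude this section by discussing a point that sometimes bugs
people new to gradient descent.  In neural networks the cost $C$ is,
of course, a function of many variables - all the weights and biases
- and so in some sense defines a surface in a very high-dimensional
space.  Some people get hung up thinking: "Hey, I have to be able to
visualize all these extra dimensions".  And they may start to worry:
"I can't think in four dimensions, let alone five (or five
million)".  Is there some special ability they're missing, some
ability that "real" supermathematicians have?  Of course, the answer
is no.  Even most professional mathematicians can't visualize four
dimensions especially well, if at all.  The trick they use, instead,
is to develop other ways of representing what's going on.  That's
exactly what we did above: we used an algebraic (rather than visual)
representation of $\Delta C$ to figure out how to move so as to
decrease $C$.  People who are good at thinking in high dimensions have
a mental library containing many different techniques along these
lines; our algebraic trick is just one example.  Those techniques may
not have the simplicity we're accustomed to when visualizing three
dimensions, but once you build up a library of such techniques, you
can get pretty good at thinking in high dimensions.  I won't go into
more detail here, but if you're interested then you may enjoy reading
<a href="http://mathoverflow.net/questions/25983/intuitive-crutches-for-higher-dimensional-thinking">this
  discussion</a> of some of the techniques professional mathematicians
use to think in high dimensions.  While some of the techniques
discussed are quite complex, much of the best content is intuitive and
accessible, and could be mastered by anyone.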

<!--
Let me conclude this section by discussing a point that sometimes bugs
people new to gradient descent.  In neural networks the cost $C$ is,
of course, a function of many variables - all the weights and biases
- and so in some sense defines a surface in a very high-dimensional
space.  Some people get hung up thinking: "Hey, I have to be able to
visualize all these extra dimensions".  And they may start to worry:
"I can't think in four dimensions, let alone five (or five
million)".  Is there some special ability they're missing, some
ability that "real" supermathematicians have?  Of course, the answer
is no.  Even most professional mathematicians can't visualize four
dimensions especially well, if at all.  The trick they use, instead,
is to develop other ways of representing what's going on.  That's
exactly what we did above: we used an algebraic (rather than visual)
representation of $\Delta C$ to figure out how to move so as to
decrease $C$.  People who are good at thinking in high dimensions have
a mental library containing many different techniques along these
lines; our algebraic trick is just one example.  Those techniques may
not have the simplicity we're accustomed to when visualizing three
dimensions, but once you build up a library of such techniques, you
can get pretty good at thinking in high dimensions.  I won't go into
more detail here, but if you're interested then you may enjoy reading
<a href="http://mathoverflow.net/questions/25983/intuitive-crutches-for-higher-dimensional-thinking">this
  discussion</a> of some of the techniques professional mathematicians
use to think in high dimensions.  While some of the techniques
discussed are quite complex, much of the best content is intuitive and
accessible, and could be mastered by anyone.
-->

</p><p></p><p>

  <h3><a name="implementing_our_network_to_classify_digits"></a><a href="#implementing_our_network_to_classify_digits">Implementing our network to classify digits</a></h3></p><p>
  <!--
  <h3><a name="implementing_our_network_to_classify_digits"></a><a href="#implementing_our_network_to_classify_digits">Implementing our network to classify digits</a></h3></p><p>
  -->

  <!--
  Alright, let's write a program that learns how to recognize
handwritten digits, using stochastic gradient descent and the MNIST
training data.  The first thing we need is to get the MNIST data.  If
you're a <tt>git</tt> user then you can obtain the data by cloning the
code repository for this book,</p><p><div class="highlight"><pre>git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
</pre></div>
</p><p>If you don't use <tt>git</tt> then you can download the data and code
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip">here</a>.</p><p>
-->

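Alright, let's write a program that learns how to recognize
handwritten digits, using stochastic gradient descent and the MNIST
training data.  The first thing we need is to get the MNIST data.  If
you're a <tt>git</tt> user then you can obtain the data by cloning the
code repository for this book,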
</p><p><div class="highlight"><pre>git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
</pre></div>
</p><p>If you don't use <tt>git</tt> then you can download the data and code
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip">here</a>.
</p><p>

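  Incidentally, when I described the MNIST data earlier, I said it was
split into 60,000 training images and 10,000 test images.  Actually,
we're going to split the data a little differently.  We'll leave the
test images as is, but split the 60,000-image MNIST training set into
two parts: a set of 50,000 images, which we'll use to train our neural
network, and a separate 10,000-image <em>validation set</em>.  We won't
use the validation data in this chapter, but later in the book we'll
find it useful in figuring out how to set certain
<em>hyper-parameters</em> of the neural network - things like the
learning rate, and so on, which aren't directly selected by our
learning algorithm.  Although the validation data isn't part of the
original MNIST specification, many people use MNIST in this fashion,
and the use of validation data is common in neural networks.  When I
refer to the "MNIST training data" from now on, I'll be referring to
our 50,000-image data set, not the original 60,000-image data set.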


*<span class="marginnote">

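*As noted earlier, the MNIST data set is based on data collected by NIST, the United States' National Institute of Standards and Technology.
To construct MNIST the NIST data was stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges.
See <a href="http://yann.lecun.com/exdb/mnist/">this link</a> for more details.  The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python.
I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (<a href="http://www.deeplearning.net/tutorial/gettingstarted.html">link</a>).</span></p><p></p><p>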
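
The 50,000 / 10,000 split described above amounts to slicing a list.  The sketch below uses a placeholder list of integers in place of the real 60,000 MNIST training images; the variable names are illustrative, and the repository's own loading code may organize this differently.

```python
# Placeholder for the 60,000 official MNIST training examples.
official_training_set = list(range(60000))

# First 50,000 examples: used for training in this book.
training_data = official_training_set[:50000]
# Remaining 10,000 examples: held back as validation data.
validation_data = official_training_set[50000:]

print(len(training_data), len(validation_data))  # 50000 10000
```

The test images are not involved in this split at all, so they remain untouched until we want to measure performance.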


Apart from the MNIST data we also need a Python library called <a href="http://numpy.org">Numpy</a>, for doing fast linear algebra.
If you don't already have Numpy installed, you can get it <a href="http://www.scipy.org/install.html">here</a>.</p><p>


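Let me explain the core features of the neural networks code, before giving a full listing, below.
The centerpiece is a <tt>Network</tt> class, which we use to represent a neural network.  Here's the code we use to initialize a <tt>Network</tt> object:</p><p>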


<div class="highlight"><pre><span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sizes</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sizes</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span> <span class="o">=</span> <span class="n">sizes</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
                        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>
</pre></div>
</p><p>
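In this code, the list <tt>sizes</tt> contains the number of neurons in the respective layers.
So, for example, if we want to create a <tt>Network</tt> object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code: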


<div class="highlight"><pre><span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
</pre></div><a name="weight_initialization"></a>

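The biases and weights in the <tt>Network</tt> object are all initialized randomly, using the Numpy <tt>np.random.randn</tt> function to generate Gaussian distributions with mean 0 and standard deviation 1.
This random initialization gives our stochastic gradient descent algorithm a place to start from.  In later chapters we'll find better ways of initializing the weights and biases, but this will do for now.
Note that the <tt>Network</tt> initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.</p><p>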
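
To see what the initialization produces, the sketch below mirrors it in plain Python, using <tt>random.gauss(0, 1)</tt> in place of <tt>np.random.randn</tt> and nested lists in place of Numpy matrices.  The helper <tt>init_parameters</tt> is hypothetical and exists only to show the shapes involved; unlike the <tt>Network</tt> code, it stores each bias vector as a flat list rather than a column.

```python
import random


def init_parameters(sizes):
    """Return (biases, weights) as nested lists of standard Gaussian samples.

    biases[i] has one entry per neuron in layer i+1 (no biases for the input
    layer); weights[i] has one row per neuron in layer i+1 and one column per
    neuron in layer i.
    """
    biases = [[random.gauss(0, 1) for _ in range(y)] for y in sizes[1:]]
    weights = [[[random.gauss(0, 1) for _ in range(x)] for _ in range(y)]
               for x, y in zip(sizes[:-1], sizes[1:])]
    return biases, weights


biases, weights = init_parameters([2, 3, 1])
print([len(b) for b in biases])                  # [3, 1]
print([(len(w), len(w[0])) for w in weights])    # [(3, 2), (1, 3)]
```

For a three-layer network there are therefore two bias vectors and two weight matrices, one pair for each layer after the input layer.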

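Note also that the biases and weights are stored as lists of Numpy matrices.
So, for example, <tt>net.weights[1]</tt> is a Numpy matrix storing the weights connecting the second and third layers of neurons.  (It's not the first and second layers, since Python's list indexing starts at <tt>0</tt>.)
Since <tt>net.weights[1]</tt> is rather verbose, let's just denote that matrix $w$.
It's a matrix such that $w_{jk}$ is the weight for the connection between
the $k$th neuron in the second layer and the $j$th neuron in the third layer.  This ordering of the $j$ and $k$ indices may seem strange.
Surely it would make more sense to swap the $j$ and $k$ indices around?  The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is: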

<a class="displaced_anchor" name="eqtn22"></a>\begin{eqnarray}
  a' = \sigma(w a + b).
\tag{22}\end{eqnarray}

There's quite a bit going on in this equation, so let's unpack it piece by piece.  $a$ is the vector of activations of the second layer of neurons.  To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases.
We then apply the function $\sigma$ elementwise to every entry in the vector $w a +b$.  (This is called <em>vectorizing</em> the function $\sigma$.)
It's easy to verify that Equation <span id="margin_530827989636_reveal" class="equation_link">(22)</span><span id="margin_530827989636" class="marginequation" style="display: none;"><a href="chap1.html#eqtn22" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  a' = \sigma(w a + b) \nonumber\end{eqnarray}</a></span><script>$('#margin_530827989636_reveal').click(function() {$('#margin_530827989636').toggle('slow', function() {});});</script>
gives the same result as our earlier rule,
Equation <span id="margin_163635338916_reveal" class="equation_link">(4)</span><span id="margin_163635338916" class="marginequation" style="display: none;"><a href="chap1.html#eqtn4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray}</a></span><script>$('#margin_163635338916_reveal').click(function() {$('#margin_163635338916').toggle('slow', function() {});});</script>,
for computing the output of a sigmoid neuron.

<h4><a name="exercise_695318"></a><a href="#exercise_695318">Exercise</a></h4><ul>

<li>Write out Equation <span id="margin_177086642938_reveal" class="equation_link">(22)</span><span id="margin_177086642938" class="marginequation" style="display: none;"><a href="chap1.html#eqtn22" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  a' = \sigma(w a + b) \nonumber\end{eqnarray}</a></span><script>$('#margin_177086642938_reveal').click(function() {$('#margin_177086642938').toggle('slow', function() {});});</script>
in component form, and verify that it gives the same result as the rule in Equation <span id="margin_215368916270_reveal" class="equation_link">(4)</span><span id="margin_215368916270" class="marginequation" style="display: none;"><a href="chap1.html#eqtn4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray}</a></span><script>$('#margin_215368916270_reveal').click(function() {$('#margin_215368916270').toggle('slow', function() {});});</script>
for computing the output of a sigmoid neuron.

</ul></p><p>With all this in mind, it's easy to write code computing the output from a <tt>Network</tt> instance.
We begin by defining the sigmoid function, and then use Numpy to define a vectorized form of it:

<div class="highlight"><pre><span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="k">return</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid</span><span class="p">)</span>
</pre></div>

We then add a <tt>feedforward</tt> method to the <tt>Network</tt> class, which, given an input <tt>a</tt> for the network, returns the corresponding output.  All the method does is apply Equation <span id="margin_323063151061_reveal" class="equation_link">(22)</span><span id="margin_323063151061" class="marginequation" style="display: none;"><a href="chap1.html#eqtn22" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  a' = \sigma(w a + b) \nonumber\end{eqnarray}</a></span><script>$('#margin_323063151061_reveal').click(function() {$('#margin_323063151061').toggle('slow', function() {});});</script>
for each layer:

<div class="highlight"><pre>    <span class="k">def</span> <span class="nf">feedforward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the output of the network if &quot;a&quot; is input.&quot;&quot;&quot;</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">a</span> <span class="o">=</span> <span class="n">sigmoid_vec</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span><span class="o">+</span><span class="n">b</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">a</span>
</pre></div>
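
For readers who want to see Equation (22) at work without Numpy, here is a minimal plain-Python sketch of the same layer-by-layer computation.  The function names and the tiny all-zero network are assumptions for illustration; the <tt>Network.feedforward</tt> method above remains the version used in this book.

```python
import math


def sigmoid(z):
    """The sigmoid function for a single number z."""
    return 1.0 / (1.0 + math.exp(-z))


def feedforward(a, biases, weights):
    """Apply a' = sigmoid(w a + b) once per layer, with lists as vectors."""
    for b, w in zip(biases, weights):
        # Entry j of the new activation uses row j of w and entry j of b.
        a = [sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j)
             for row, b_j in zip(w, b)]
    return a


# A [2, 3, 1] network whose weights and biases are all zero.
biases = [[0.0, 0.0, 0.0], [0.0]]
weights = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0, 0.0]]]
print(feedforward([1.0, 2.0], biases, weights))  # [0.5], since sigmoid(0) = 0.5
```

Writing the inner sum out by hand like this also shows why the $w_{jk}$ index ordering is convenient: each row of $w$ produces exactly one output activation.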
</p><p>
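Of course, the main thing we want our <tt>Network</tt> objects to do is to learn.
To that end we'll give them an <tt>SGD</tt> method which implements stochastic gradient descent.
Here's the code.  It's a little mysterious in a few places, but I'll break it down below, after the listing.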

</p><p><div class="highlight"><pre>    <span class="k">def</span> <span class="nf">SGD</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="n">eta</span><span class="p">,</span>
            <span class="n">test_data</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Train the neural network using mini-batch stochastic</span>
<span class="sd">        gradient descent.  The &quot;training_data&quot; is a list of tuples</span>
<span class="sd">        &quot;(x, y)&quot; representing the training inputs and the desired</span>
<span class="sd">        outputs.  The other non-optional parameters are</span>
<span class="sd">        self-explanatory.  If &quot;test_data&quot; is provided then the</span>
<span class="sd">        network will be evaluated against the test data after each</span>
<span class="sd">        epoch, and partial progress printed out.  This is useful for</span>
<span class="sd">        tracking progress, but slows things down substantially.&quot;&quot;&quot;</span>
        <span class="k">if</span> <span class="n">test_data</span><span class="p">:</span> <span class="n">n_test</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span>
        <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
            <span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span>
            <span class="n">mini_batches</span> <span class="o">=</span> <span class="p">[</span>
                <span class="n">training_data</span><span class="p">[</span><span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="o">+</span><span class="n">mini_batch_size</span><span class="p">]</span>
                <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">)]</span>
            <span class="k">for</span> <span class="n">mini_batch</span> <span class="ow">in</span> <span class="n">mini_batches</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">update_mini_batch</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">test_data</span><span class="p">:</span>
                <span class="k">print</span> <span class="s">&quot;Epoch {0}: {1} / {2}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
                    <span class="n">j</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">test_data</span><span class="p">),</span> <span class="n">n_test</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">print</span> <span class="s">&quot;Epoch {0} complete&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">j</span><span class="p">)</span>
</pre></div>
</p><p>
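The <tt>training_data</tt> is a list of tuples <tt>(x, y)</tt> representing the training inputs and corresponding desired outputs.
The variables <tt>epochs</tt> and <tt>mini_batch_size</tt> are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling.
<tt>eta</tt> is the learning rate, $\eta$.
If the optional argument <tt>test_data</tt> is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress.
This is useful for tracking progress, but slows things down substantially.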
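
The shuffle-and-slice step inside <tt>SGD</tt> can be tried on its own.  The sketch below is written for Python 3 (so it uses <tt>range</tt> where the listing uses <tt>xrange</tt>) and substitutes a short run of integers for real training data.  <tt>make_mini_batches</tt> is a hypothetical helper, not part of the book's code, and it shuffles a copy whereas the listing shuffles <tt>training_data</tt> in place.

```python
import random


def make_mini_batches(training_data, mini_batch_size):
    """Shuffle a copy of the data and cut it into consecutive mini-batches."""
    data = list(training_data)   # copy, so the caller's data is left unchanged
    random.shuffle(data)
    return [data[k:k + mini_batch_size]
            for k in range(0, len(data), mini_batch_size)]


mini_batches = make_mini_batches(range(10), mini_batch_size=4)
print([len(batch) for batch in mini_batches])  # [4, 4, 2]
```

When the mini-batch size does not divide the number of training examples, the final mini-batch is simply shorter, as the last length shows.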
</p><p>
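The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size.
This is an easy way of sampling randomly from the training data.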
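</p><p>
As a rough illustration of this shuffle-and-slice step, here is a minimal sketch on a toy list. It is written for Python 3, and the helper name <tt>make_mini_batches</tt> and the toy data are assumptions made for this example only. Unlike <tt>SGD</tt>, which shuffles <tt>training_data</tt> in place, the sketch shuffles a copy.

```python
import random

def make_mini_batches(training_data, mini_batch_size, rng=random):
    """Shuffle a copy of the data, then cut it into consecutive slices."""
    shuffled = list(training_data)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    return [shuffled[k:k + mini_batch_size]
            for k in range(0, n, mini_batch_size)]

# Toy "training data": ten (x, y) pairs standing in for MNIST images.
toy_data = [(i, i % 2) for i in range(10)]
batches = make_mini_batches(toy_data, 4, random.Random(0))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [4, 4, 2]: the last mini-batch may be smaller
```

Every training example lands in exactly one mini-batch per epoch, and the final mini-batch is simply shorter when the batch size does not divide the data evenly.
</p><p>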
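Then, for each <tt>mini_batch</tt>, we apply a single step of gradient descent. This is done by the code <tt>self.update_mini_batch(mini_batch, eta)</tt>,
which updates the network's weights and biases according to a single iteration of gradient descent, using just the training data in <tt>mini_batch</tt>. Here is the code for the <tt>update_mini_batch</tt> method: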

<div class="highlight"><pre>    <span class="k">def</span> <span class="nf">update_mini_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Update the network&#39;s weights and biases by applying</span>
<span class="sd">        gradient descent using backpropagation to a single mini batch.</span>
<span class="sd">        The &quot;mini_batch&quot; is a list of tuples &quot;(x, y)&quot;, and &quot;eta&quot;</span>
<span class="sd">        is the learning rate.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">mini_batch</span><span class="p">:</span>
            <span class="n">delta_nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_w</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
            <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">nb</span><span class="o">+</span><span class="n">dnb</span> <span class="k">for</span> <span class="n">nb</span><span class="p">,</span> <span class="n">dnb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_b</span><span class="p">)]</span>
            <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">nw</span><span class="o">+</span><span class="n">dnw</span> <span class="k">for</span> <span class="n">nw</span><span class="p">,</span> <span class="n">dnw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_w</span><span class="p">,</span> <span class="n">delta_nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nw</span>
                        <span class="k">for</span> <span class="n">w</span><span class="p">,</span> <span class="n">nw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nb</span>
                       <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">nb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="n">nabla_b</span><span class="p">)]</span>
</pre></div>

Most of the work is done by the line
<div class="highlight"><pre>            <span class="n">delta_nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_w</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</pre></div>

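This line invokes the <em>backpropagation</em> algorithm, which is a fast way of computing the gradient of the cost function.
So <tt>update_mini_batch</tt> works simply by computing these gradients for every training example in the <tt>mini_batch</tt>, and then updating <tt>self.weights</tt> and <tt>self.biases</tt> appropriately.</p><p>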
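To make the update rule concrete, here is a small sketch of the same accumulate-then-average step, using plain Python 3 lists instead of NumPy arrays. The function name <tt>averaged_step</tt> and the gradient values are made up for illustration; in the real method the per-example gradients come from <tt>self.backprop</tt>.

```python
def averaged_step(params, per_example_grads, eta):
    """Apply p - (eta / m) * (sum of per-example gradients) to each parameter."""
    m = len(per_example_grads)
    totals = [0.0 for _ in params]      # plays the role of nabla_w / nabla_b
    for grad in per_example_grads:      # one gradient per (x, y) in the batch
        totals = [t + g for t, g in zip(totals, grad)]
    return [p - (eta / m) * t for p, t in zip(params, totals)]

params = [1.0, -2.0]
grads = [[0.5, 1.0], [1.5, 3.0]]        # made-up gradients for two examples
new_params = averaged_step(params, grads, eta=0.5)
print(new_params)  # [0.5, -3.0]
```

Dividing by the mini-batch length turns the summed gradients into an average, so the effective step size does not depend on how many examples the mini-batch happens to contain.
</p><p>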

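I won't explain the code for <tt>self.backprop</tt> right now. We'll study how backpropagation works in the next chapter, including the code for <tt>self.backprop</tt>.
For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated with the training example <tt>x</tt>.</p><p>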
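One way to get a feel for what "the appropriate gradient" means is to check it numerically on a tiny case. The sketch below (Python 3, standard library only) uses a single sigmoid neuron with one weight and one bias, and assumes the quadratic cost $\frac{1}{2}(a-y)^2$ for a single example, which is consistent with <tt>cost_derivative</tt> returning <tt>output_activations-y</tt>. All function names here are hypothetical, and this is not the backpropagation algorithm itself, only a cross-check of the kind of quantity it returns.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, x, y):
    """Quadratic cost for one example and one sigmoid neuron."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def analytic_gradient(w, b, x, y):
    """Return (dC/db, dC/dw) obtained from the chain rule."""
    a = sigmoid(w * x + b)
    delta = (a - y) * a * (1.0 - a)   # cost derivative times sigmoid'(z)
    return delta, delta * x

def numeric_gradient(w, b, x, y, h=1e-6):
    """Central finite differences, as an independent cross-check."""
    d_b = (cost(w, b + h, x, y) - cost(w, b - h, x, y)) / (2 * h)
    d_w = (cost(w + h, b, x, y) - cost(w - h, b, x, y)) / (2 * h)
    return d_b, d_w

w, b, x, y = 0.3, -0.1, 2.0, 1.0
exact = analytic_gradient(w, b, x, y)
approx = numeric_gradient(w, b, x, y)
gradients_agree = all(math.isclose(e, n, rel_tol=1e-5, abs_tol=1e-8)
                      for e, n in zip(exact, approx))
print(exact, approx, gradients_agree)
```

Finite differences need two cost evaluations per parameter, which is why a faster method matters for a network with thousands of weights.
</p><p>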

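Let's look at the full program. This time it includes the parts I skipped over above, along with the documentation strings.
Apart from <tt>self.backprop</tt>, the program is self-explanatory; as already discussed, all the heavy lifting is done in <tt>self.SGD</tt> and <tt>self.update_mini_batch</tt>.
The <tt>self.backprop</tt> method uses a few extra functions to help compute the gradient, which I won't describe in detail here:
<tt>sigmoid_prime</tt>, which computes the derivative of the $\sigma$ function, its vectorized form <tt>sigmoid_prime_vec</tt>, and <tt>self.cost_derivative</tt>.
We'll look at them in detail in the next chapter, but you can get the gist just by looking at the code and documentation strings.
The program may look lengthy, but much of it is documentation strings intended to make the code easy to understand, and the code itself is meant to be simple to follow.
In fact, the program contains just 74 lines, excluding blank lines and comments.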
All the code may be found on <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/code/network.py">GitHub</a>.</p><p></p><p><div class="highlight"><pre><span class="sd">&quot;&quot;&quot;</span>


<span class="sd">network.py</span>
<span class="sd">~~~~~~~~~~</span>

<span class="sd">A module to implement the stochastic gradient descent learning</span>
<span class="sd">algorithm for a feedforward neural network.  Gradients are calculated</span>
<span class="sd">using backpropagation.  Note that I have focused on making the code</span>
<span class="sd">simple, easily readable, and easily modifiable.  It is not optimized,</span>
<span class="sd">and omits many desirable features.</span>
<span class="sd">&quot;&quot;&quot;</span>

<span class="c">#### Libraries</span>
<span class="c"># Standard library</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="c"># Third-party libraries</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>

<span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sizes</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;The list ``sizes`` contains the number of neurons in the</span>
<span class="sd">        respective layers of the network.  For example, if the list</span>
<span class="sd">        was [2, 3, 1] then it would be a three-layer network, with the</span>
<span class="sd">        first layer containing 2 neurons, the second layer 3 neurons,</span>
<span class="sd">        and the third layer 1 neuron.  The biases and weights for the</span>
<span class="sd">        network are initialized randomly, using a Gaussian</span>
<span class="sd">        distribution with mean 0, and variance 1.  Note that the first</span>
<span class="sd">        layer is assumed to be an input layer, and by convention we</span>
<span class="sd">        won&#39;t set any biases for those neurons, since biases are only</span>
<span class="sd">        ever used in computing the outputs from later layers.&quot;&quot;&quot;</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sizes</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span> <span class="o">=</span> <span class="n">sizes</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
                        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>

    <span class="k">def</span> <span class="nf">feedforward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the output of the network if ``a`` is input.&quot;&quot;&quot;</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">a</span> <span class="o">=</span> <span class="n">sigmoid_vec</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span><span class="o">+</span><span class="n">b</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">a</span>

    <span class="k">def</span> <span class="nf">SGD</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="n">eta</span><span class="p">,</span>
            <span class="n">test_data</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Train the neural network using mini-batch stochastic</span>
<span class="sd">        gradient descent.  The ``training_data`` is a list of tuples</span>
<span class="sd">        ``(x, y)`` representing the training inputs and the desired</span>
<span class="sd">        outputs.  The other non-optional parameters are</span>
<span class="sd">        self-explanatory.  If ``test_data`` is provided then the</span>
<span class="sd">        network will be evaluated against the test data after each</span>
<span class="sd">        epoch, and partial progress printed out.  This is useful for</span>
<span class="sd">        tracking progress, but slows things down substantially.&quot;&quot;&quot;</span>
        <span class="k">if</span> <span class="n">test_data</span><span class="p">:</span> <span class="n">n_test</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span>
        <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
            <span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span>
            <span class="n">mini_batches</span> <span class="o">=</span> <span class="p">[</span>
                <span class="n">training_data</span><span class="p">[</span><span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="o">+</span><span class="n">mini_batch_size</span><span class="p">]</span>
                <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">)]</span>
            <span class="k">for</span> <span class="n">mini_batch</span> <span class="ow">in</span> <span class="n">mini_batches</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">update_mini_batch</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">test_data</span><span class="p">:</span>
                <span class="k">print</span> <span class="s">&quot;Epoch {0}: {1} / {2}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
                    <span class="n">j</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">test_data</span><span class="p">),</span> <span class="n">n_test</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">print</span> <span class="s">&quot;Epoch {0} complete&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">j</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">update_mini_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Update the network&#39;s weights and biases by applying</span>
<span class="sd">        gradient descent using backpropagation to a single mini batch.</span>
<span class="sd">        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``</span>
<span class="sd">        is the learning rate.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">mini_batch</span><span class="p">:</span>
            <span class="n">delta_nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_w</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
            <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">nb</span><span class="o">+</span><span class="n">dnb</span> <span class="k">for</span> <span class="n">nb</span><span class="p">,</span> <span class="n">dnb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_b</span><span class="p">)]</span>
            <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">nw</span><span class="o">+</span><span class="n">dnw</span> <span class="k">for</span> <span class="n">nw</span><span class="p">,</span> <span class="n">dnw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_w</span><span class="p">,</span> <span class="n">delta_nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nw</span>
                        <span class="k">for</span> <span class="n">w</span><span class="p">,</span> <span class="n">nw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nb</span>
                       <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">nb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="n">nabla_b</span><span class="p">)]</span>

    <span class="k">def</span> <span class="nf">backprop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return a tuple ``(nabla_b, nabla_w)`` representing the</span>
<span class="sd">        gradient for the cost function C_x.  ``nabla_b`` and</span>
<span class="sd">        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar</span>
<span class="sd">        to ``self.biases`` and ``self.weights``.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="c"># feedforward</span>
        <span class="n">activation</span> <span class="o">=</span> <span class="n">x</span>
        <span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="c"># list to store all the activations, layer by layer</span>
        <span class="n">zs</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># list to store all the z vectors, layer by layer</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">activation</span><span class="p">)</span><span class="o">+</span><span class="n">b</span>
            <span class="n">zs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activation</span> <span class="o">=</span> <span class="n">sigmoid_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activations</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">activation</span><span class="p">)</span>
        <span class="c"># backward pass</span>
        <span class="n">delta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost_derivative</span><span class="p">(</span><span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> \
            <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
        <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
        <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="c"># Note that the variable l in the loop below is used a little</span>
        <span class="c"># differently to the notation in Chapter 2 of the book.  Here,</span>
        <span class="c"># l = 1 means the last layer of neurons, l = 2 is the</span>
        <span class="c"># second-last layer, and so on.  It&#39;s a renumbering of the</span>
        <span class="c"># scheme in the book, used here to take advantage of the fact</span>
        <span class="c"># that Python can use negative indices in lists.</span>
        <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span>
            <span class="n">spv</span> <span class="o">=</span> <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">spv</span>
            <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
            <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">test_data</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the number of test inputs for which the neural</span>
<span class="sd">        network outputs the correct result. Note that the neural</span>
<span class="sd">        network&#39;s output is assumed to be the index of whichever</span>
<span class="sd">        neuron in the final layer has the highest activation.&quot;&quot;&quot;</span>
        <span class="n">test_results</span> <span class="o">=</span> <span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">feedforward</span><span class="p">(</span><span class="n">x</span><span class="p">)),</span> <span class="n">y</span><span class="p">)</span>
                        <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="n">test_data</span><span class="p">]</span>
        <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="n">test_results</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">cost_derivative</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">output_activations</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the vector of partial derivatives \partial C_x /</span>
<span class="sd">        \partial a for the output activations.&quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">output_activations</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>

<span class="c">#### Miscellaneous functions</span>
<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;The sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sigmoid_prime</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Derivative of the sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_prime_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid_prime</span><span class="p">)</span>
</pre></div>
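</p><p>
The <tt>evaluate</tt> method above treats the index of the most active output neuron as the network's answer and counts how often that index matches the label. Here is a minimal sketch of that counting logic in plain Python 3, without NumPy. The helper names and the activation values are invented for the example and are not real network output.

```python
def predicted_digit(output_activations):
    """Index of the largest activation, like np.argmax on a flat list."""
    return max(range(len(output_activations)),
               key=lambda i: output_activations[i])

def count_correct(outputs_and_labels):
    """Count the pairs whose predicted digit equals the label."""
    return sum(int(predicted_digit(a) == y) for a, y in outputs_and_labels)

# Made-up output activations for three inputs, each paired with its label.
fake_results = [
    ([0.1, 0.8, 0.1], 1),   # predicts 1, label 1: correct
    ([0.7, 0.2, 0.1], 2),   # predicts 0, label 2: wrong
    ([0.2, 0.3, 0.5], 2),   # predicts 2, label 2: correct
]
num_correct = count_correct(fake_results)
print("{0} / {1}".format(num_correct, len(fake_results)))  # 2 / 3
```

The printed fraction has the same form as the per-epoch progress lines produced by <tt>SGD</tt> when <tt>test_data</tt> is supplied.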
</p><p>Now let's see how well this code recognizes handwritten digits.
We'll begin by loading in the MNIST data.
We'll use <tt>mnist_loader.py</tt> for this. Run the following commands in a Python shell:

</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
</pre></div>
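</p><p>Of course, this could also be done in a separate Python program,
but if you're following along it's probably easiest to do it in a Python shell.
</p><p>After loading the MNIST data, we'll set up a <tt>Network</tt> with $30$ hidden neurons.
We do this after importing the Python program listed above, which is named <tt>network</tt>.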

</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
</pre></div>
</p><p>
Finally, we'll use stochastic gradient descent to learn from the MNIST <tt>training_data</tt> over 30 epochs, with a mini-batch size of 10 and a learning rate of $\eta = 3.0$.
</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="n">test_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">)</span>
</pre></div>
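</p><p>
To get a feel for what those arguments imply, here is a rough sketch of how one epoch is cut into mini-batches. It is an illustration only, not the <tt>SGD</tt> method itself: the helper name <tt>make_mini_batches</tt> is hypothetical, and plain integers stand in for the 50,000 MNIST training pairs.

```python
import random

def make_mini_batches(training_data, mini_batch_size, rng):
    """Shuffle a copy of the data and cut it into consecutive mini-batches."""
    shuffled = list(training_data)
    rng.shuffle(shuffled)
    return [shuffled[k:k + mini_batch_size]
            for k in range(0, len(shuffled), mini_batch_size)]

# Stand-in for the 50,000 MNIST training pairs: plain integers.
stand_in_data = range(50000)
batches = make_mini_batches(stand_in_data, 10, random.Random(0))

print(len(batches))       # mini-batches, i.e. weight updates, per epoch
print(len(batches) * 30)  # total weight updates over 30 epochs
```

With a mini-batch size of 10 there are 5,000 weight updates in every epoch, which is a large part of why a full run takes a while.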
</p><p>
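Note that if you're running the code as you read along, it will take some time to execute.
On a typical machine (as of 2014) each epoch of training takes a few minutes.
I suggest you set things running, continue to read, and periodically check the output.
If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data.
Production code would be much faster:
these Python scripts are intended to help you understand how neural networks work, not to be high-performance code.
And, of course, once we've trained a network it can be run very quickly on almost any computing platform.
For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device.
In any case, here is a partial transcript of the output of one training run of the neural network.
The transcript shows the number of test images correctly recognized by the neural network after each epoch of training.
As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow.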
</p><p><div class="highlight"><pre>Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000
</pre></div>
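</p><p>
If you'd rather read the transcript as percentages, a small sketch like the following does the conversion and picks out the best epoch. It uses only the six lines shown above (the elided epochs are simply left out), and the variable names are my own.

```python
import re

# The lines shown in the transcript above; the elided epochs are omitted.
transcript = """\
Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000
"""

pattern = re.compile(r"Epoch (\d+): (\d+) / (\d+)")
results = [(int(e), int(c), int(n)) for e, c, n in pattern.findall(transcript)]

for epoch, correct, total in results:
    print("Epoch {0}: {1:.2f}%".format(epoch, 100.0 * correct / total))

# The best epoch is the one with the most correctly classified test images.
best_epoch, best_correct, best_total = max(results, key=lambda r: r[1])
print("Peak: epoch {0}, {1:.2f}%".format(
    best_epoch, 100.0 * best_correct / best_total))
```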
</p><p>
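The trained network gives us a classification rate of about 95 percent, peaking at 95.42 percent in epoch 28.
That's quite encouraging as a first attempt.
I should warn you, however, that if you run the code your results won't necessarily match mine exactly,
since we initialize the network with random weights and biases.
The results shown in this chapter are the best of three runs.</p><p>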

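Let's rerun the above experiment, changing the number of hidden neurons to 100.
As before, this will take a while to run (each epoch takes longer, because there are more hidden neurons), so it's wise to keep reading while the code executes.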

</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="n">test_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">)</span>
</pre></div>
</p><p>
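Sure enough, this improves the results to 96.59 percent.
At least in this case, using more hidden neurons helps us get better results.*<span class="marginnote">

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse.
Using the techniques introduced in Chapter 3 will greatly reduce the variation in performance across different training runs.</span>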

</p><p>
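Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, $\eta$.
As I mentioned above, these are known as hyper-parameters for our neural network, to distinguish them from the parameters (weights and biases) learnt by our learning algorithm.
If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be $\eta = 0.001$: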

</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.001</span><span class="p">,</span> <span class="n">test_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">)</span>
</pre></div>
</p><p>The results are much less encouraging, and progress is far slower:
<div class="highlight"><pre>Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000
</pre></div>

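However, you can see that the performance of the network is getting slowly better over time.
That suggests increasing the learning rate, say to $\eta = 0.01$.
If we do that, we get better results, which suggests increasing the learning rate again.
(If making a change improves things, try doing more!)
If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$, which is close to the value used in our earlier experiments.
So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.</p><p>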
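The way progress depends on the learning rate can be seen in miniature with a toy problem. The sketch below is an assumption-laden illustration, not the MNIST network: it runs plain gradient descent on the one-parameter cost $C(w) = w^2$, and the three values of $\eta$ are chosen to suit that toy cost, so they say nothing about which values work for MNIST.

```python
def descend(eta, steps=30, w=1.0):
    """Run gradient descent on the toy cost C(w) = w**2; return the final w."""
    for _ in range(steps):
        gradient = 2.0 * w      # dC/dw for the toy cost
        w = w - eta * gradient  # the usual gradient descent update
    return w

# Too small, reasonable, and too large for this particular toy cost.
for eta in (0.001, 0.4, 1.5):
    print("eta = {0}: w after 30 steps = {1:.3g}".format(eta, descend(eta)))
```

With the smallest rate, $w$ has barely moved from its starting point after 30 steps; with the middle rate it is essentially at the minimum $w = 0$; and with the largest rate every step overshoots and $w$ grows without bound. The slow-but-steady case and the hopeless case look quite different, which is the same kind of signal we read off the transcripts here.
</p><p>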

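In general, debugging a neural network can be challenging.
This is especially true when the initial choice of hyper-parameters produces results no better than random noise.
Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to $\eta = 100.0$: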
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">,</span> <span class="n">test_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">)</span>
</pre></div>
At this point we've actually gone too far, and the learning rate is too high:

<div class="highlight"><pre><span class="n">Epoch</span> <span class="mi">0</span><span class="p">:</span> <span class="mi">1009</span> <span class="o">/</span> <span class="mi">10000</span>
<span class="n">Epoch</span> <span class="mi">1</span><span class="p">:</span> <span class="mi">1009</span> <span class="o">/</span> <span class="mi">10000</span>
<span class="n">Epoch</span> <span class="mi">2</span><span class="p">:</span> <span class="mi">1009</span> <span class="o">/</span> <span class="mi">10000</span>
<span class="n">Epoch</span> <span class="mi">3</span><span class="p">:</span> <span class="mi">1009</span> <span class="o">/</span> <span class="mi">10000</span>
<span class="o">...</span>
<span class="n">Epoch</span> <span class="mi">27</span><span class="p">:</span> <span class="mi">982</span> <span class="o">/</span> <span class="mi">10000</span>
<span class="n">Epoch</span> <span class="mi">28</span><span class="p">:</span> <span class="mi">982</span> <span class="o">/</span> <span class="mi">10000</span>
<span class="n">Epoch</span> <span class="mi">29</span><span class="p">:</span> <span class="mi">982</span> <span class="o">/</span> <span class="mi">10000</span>
</pre></div>

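Now imagine that we were coming to this problem for the first time.
Of course, we <em>know</em> from our earlier experiments that the right thing to do is to decrease the learning rate.
But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do.
We might worry not only about the learning rate, but about every other aspect of our neural network.
We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn?
Or maybe we don't have enough training data to get meaningful learning?
Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network to learn to recognize handwritten digits at all?
Maybe the learning rate is too <em>low</em>? Or, maybe, the learning rate is too high?
When you're coming to a problem for the first time, you're not always sure.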

</p><p>
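The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it.
You need to learn that art of debugging in order to get good results from neural networks.
More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture.
We'll discuss all these at length through the book, including how I chose the hyper-parameters above.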

</p><p>
<h4><a name="exercise_420023"></a><a href="#exercise_420023">Exercise</a></h4><ul></p><p><li>
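Try creating a network with just two layers, an input layer and an output layer with no hidden layer, with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?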
</ul></p><p></p><p>
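Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward, but for completeness, here's the code.
The data structures used to store the MNIST data are described in the documentation strings: they are tuples and lists of Numpy <tt>ndarray</tt> objects. (Think of them as vectors if you're not familiar with <tt>ndarray</tt>s.)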

</p><p><div class="highlight"><pre><span class="sd">&quot;&quot;&quot;</span>
<span class="sd">mnist_loader</span>
<span class="sd">~~~~~~~~~~~~</span>

<span class="sd">A library to load the MNIST image data.  For details of the data</span>
<span class="sd">structures that are returned, see the doc strings for ``load_data``</span>
<span class="sd">and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the</span>
<span class="sd">function usually called by our neural network code.</span>
<span class="sd">&quot;&quot;&quot;</span>

<span class="c">#### Libraries</span>
<span class="c"># Standard library</span>
<span class="kn">import</span> <span class="nn">cPickle</span>
<span class="kn">import</span> <span class="nn">gzip</span>

<span class="c"># Third-party libraries</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">load_data</span><span class="p">():</span>
    <span class="sd">&quot;&quot;&quot;Return the MNIST data as a tuple containing the training data,</span>
<span class="sd">    the validation data, and the test data.</span>

<span class="sd">    The ``training_data`` is returned as a tuple with two entries.</span>
<span class="sd">    The first entry contains the actual training images.  This is a</span>
<span class="sd">    numpy ndarray with 50,000 entries.  Each entry is, in turn, a</span>
<span class="sd">    numpy ndarray with 784 values, representing the 28 * 28 = 784</span>
<span class="sd">    pixels in a single MNIST image.</span>

<span class="sd">    The second entry in the ``training_data`` tuple is a numpy ndarray</span>
<span class="sd">    containing 50,000 entries.  Those entries are just the digit</span>
<span class="sd">    values (0...9) for the corresponding images contained in the first</span>
<span class="sd">    entry of the tuple.</span>

<span class="sd">    The ``validation_data`` and ``test_data`` are similar, except</span>
<span class="sd">    each contains only 10,000 images.</span>

<span class="sd">    This is a nice data format, but for use in neural networks it&#39;s</span>
<span class="sd">    helpful to modify the format of the ``training_data`` a little.</span>
<span class="sd">    That&#39;s done in the wrapper function ``load_data_wrapper()``, see</span>
<span class="sd">    below.</span>
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">f</span> <span class="o">=</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s">&#39;../data/mnist.pkl.gz&#39;</span><span class="p">,</span> <span class="s">&#39;rb&#39;</span><span class="p">)</span>
    <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> <span class="n">cPickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">load_data_wrapper</span><span class="p">():</span>
    <span class="sd">&quot;&quot;&quot;Return a tuple containing ``(training_data, validation_data,</span>
<span class="sd">    test_data)``. Based on ``load_data``, but the format is more</span>
<span class="sd">    convenient for use in our implementation of neural networks.</span>

<span class="sd">    In particular, ``training_data`` is a list containing 50,000</span>
<span class="sd">    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray</span>
<span class="sd">    containing the input image.  ``y`` is a 10-dimensional</span>
<span class="sd">    numpy.ndarray representing the unit vector corresponding to the</span>
<span class="sd">    correct digit for ``x``.</span>

<span class="sd">    ``validation_data`` and ``test_data`` are lists containing 10,000</span>
<span class="sd">    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional</span>
<span class="sd">    numpy.ndarray containing the input image, and ``y`` is the</span>
<span class="sd">    corresponding classification, i.e., the digit values (integers)</span>
<span class="sd">    corresponding to ``x``.</span>

<span class="sd">    Obviously, this means we&#39;re using slightly different formats for</span>
<span class="sd">    the training data and the validation / test data.  These formats</span>
<span class="sd">    turn out to be the most convenient for use in our neural network</span>
<span class="sd">    code.&quot;&quot;&quot;</span>
    <span class="n">tr_d</span><span class="p">,</span> <span class="n">va_d</span><span class="p">,</span> <span class="n">te_d</span> <span class="o">=</span> <span class="n">load_data</span><span class="p">()</span>
    <span class="n">training_inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="p">(</span><span class="mi">784</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tr_d</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
    <span class="n">training_results</span> <span class="o">=</span> <span class="p">[</span><span class="n">vectorized_result</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">tr_d</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span>
    <span class="n">training_data</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="n">training_inputs</span><span class="p">,</span> <span class="n">training_results</span><span class="p">)</span>
    <span class="n">validation_inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="p">(</span><span class="mi">784</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">va_d</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
    <span class="n">validation_data</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="n">validation_inputs</span><span class="p">,</span> <span class="n">va_d</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">test_inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="p">(</span><span class="mi">784</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">te_d</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
    <span class="n">test_data</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="n">test_inputs</span><span class="p">,</span> <span class="n">te_d</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">vectorized_result</span><span class="p">(</span><span class="n">j</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Return a 10-dimensional unit vector with a 1.0 in the jth</span>
<span class="sd">    position and zeroes elsewhere.  This is used to convert a digit</span>
<span class="sd">    (0...9) into a corresponding desired output from the neural</span>
<span class="sd">    network.&quot;&quot;&quot;</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="n">e</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.0</span>
    <span class="k">return</span> <span class="n">e</span>
</pre></div>
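</p><p>
To make the two output formats concrete, here is a small sketch that applies the same reshaping to made-up data. The helper name <tt>wrap</tt> and the synthetic arrays are hypothetical; the sketch only mirrors the format conversion in <tt>load_data_wrapper</tt> and never reads <tt>mnist.pkl.gz</tt>.

```python
import numpy as np

def vectorized_result(j):
    """Return a 10-dimensional unit column vector with 1.0 in position j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def wrap(images, labels, one_hot):
    """Pair each image, reshaped to a (784, 1) column, with its label.

    Training data uses one-hot label vectors; validation and test data
    keep the plain integer labels."""
    inputs = [np.reshape(x, (784, 1)) for x in images]
    if one_hot:
        targets = [vectorized_result(y) for y in labels]
    else:
        targets = [int(y) for y in labels]
    return list(zip(inputs, targets))

# Synthetic stand-ins for MNIST: 5 "images" of 784 values and 5 labels.
fake_images = np.linspace(0.0, 1.0, 5 * 784).reshape(5, 784)
fake_labels = np.array([3, 0, 9, 1, 7])

training_data = wrap(fake_images, fake_labels, one_hot=True)
test_data = wrap(fake_images, fake_labels, one_hot=False)

x, y = training_data[0]
print(x.shape, y.shape)  # shapes of one training input and its target
print(test_data[0][1])   # a test label stays a plain integer
```

The sketch is written for Python 3, whereas the listing above is Python 2. If you port the loader itself, note that <tt>cPickle</tt> becomes <tt>pickle</tt>, that unpickling Numpy arrays saved by Python 2 needs <tt>encoding="latin1"</tt>, and that <tt>zip</tt> returns an iterator, which is why the sketch wraps it in <tt>list</tt>.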
</p><p>
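I said above that our program gets pretty good results. What does that mean?
Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well.
The simplest baseline of all, of course, is to randomly guess the digit. That will be right about ten percent of the time. We're doing much better than that!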

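</p><p>What about a less trivial baseline? Let's try an extremely simple idea:
we'll look at how <em>dark</em> an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1,
just because more pixels are blackened out, as the following examples illustrate: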
</p><p><center><img src="images/mnist_2_and_1.png" width="256px"></center></p><p>
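This suggests using the training data to compute the average darkness for each digit, $0, 1, 2,\ldots, 9$.
When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness.
This is a simple procedure, and is easy to code up, so I won't explicitly write out the code. If you're interested it's in the <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/code/mnist_average_darkness.py">GitHub
  repository</a>. But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e., 22.25 percent accuracy.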
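</p><p>
For readers who want to see the shape of the idea without opening the repository, here is a minimal sketch of an average-darkness classifier. It is my own illustration rather than the code in <tt>mnist_average_darkness.py</tt>: the function names are hypothetical, and it runs on a handful of made-up four-pixel "images" instead of MNIST.

```python
import numpy as np

def average_darknesses(images, labels):
    """Return a length-10 array holding, for each digit, the mean total
    darkness of its training images (NaN for digits with no examples)."""
    totals = images.sum(axis=1)
    averages = np.full(10, np.nan)
    for digit in range(10):
        mask = labels == digit
        if mask.any():
            averages[digit] = totals[mask].mean()
    return averages

def guess_digit(image, averages):
    """Guess the digit whose average darkness is closest to this image's."""
    return int(np.nanargmin(np.abs(averages - image.sum())))

# Made-up 4-pixel "images" (0.0 = white, 1.0 = black): two 1s and two 2s.
train_images = np.array([[0.0, 1.0, 1.0, 0.0],
                         [0.0, 1.0, 0.8, 0.0],
                         [1.0, 1.0, 1.0, 1.0],
                         [1.0, 0.9, 1.0, 1.0]])
train_labels = np.array([1, 1, 2, 2])

averages = average_darknesses(train_images, train_labels)
print(guess_digit(np.array([0.0, 1.0, 1.0, 0.1]), averages))  # a light image
print(guess_digit(np.array([1.0, 1.0, 1.0, 0.8]), averages))  # a dark image
```

The classifier reduces every image to a single number, its total darkness, so digits with similar amounts of ink are indistinguishable to it. That is why the method does not get far beyond random guessing.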
</p><p><a name="SVM"></a></p><p>
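It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range.
If you work a bit harder you can get up over 50 percent.
But to get much higher accuracies it helps to use established machine learning algorithms.
Let's try using one of the best known algorithms, the support vector machine (<em>SVM</em>).
If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work.
Instead, we'll use a Python library called <a href="http://scikit-learn.org/stable/">scikit-learn</a>, which provides a simple Python interface to a fast C-based library for SVMs known as <a href="http://www.csie.ntu.edu.tw/&#126;cjlin/libsvm/">LIBSVM</a>.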

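</p><p>If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/code/mnist_svm.py">here</a>.)
That's a big improvement over our naive approach of classifying an image based on how dark it is.
Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse.
In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.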
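</p><p>
To show how little code a default-settings SVM needs, here is a hedged sketch. It is not the linked <tt>mnist_svm.py</tt>, and it does not use MNIST: to stay self-contained it assumes scikit-learn is installed and uses the small 8 by 8 digits data set bundled with it, so its accuracy is not comparable with the 9,435 figure above. The split fraction and random seed are arbitrary choices.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Small 8x8 digit images bundled with scikit-learn (a stand-in for MNIST).
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# An SVM classifier with scikit-learn's default settings.
clf = svm.SVC()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
num_correct = int((predictions == y_test).sum())
print("{0} of {1} test images classified correctly".format(
    num_correct, len(y_test)))
```

The defaults of <tt>SVC</tt> have changed between scikit-learn releases, so the count you see may depend on the installed version.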

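</p><p>That's not the end of the story, however.
The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs.
SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance.
I won't explicitly do this search, but instead refer you to <a href="http://peekaboo-vision.blogspot.ca/">Andreas
  Mueller</a>'s <a href="http://peekaboo-vision.blogspot.de/2010/09/mnist-for-ever.html">blog post</a> if you'd like to know more.
Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up to 98.5 percent accuracy.
In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?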

</p><p>In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs.
The current (2014) record is classifying 9,979 of 10,000 images correctly.
This was done by <a href="http://www.cs.nyu.edu/&#126;wanli/">Li
  Wan</a>, <a href="http://www.matthewzeiler.com/">Matthew Zeiler</a>, Sixin
Zhang, <a href="http://yann.lecun.com/">Yann LeCun</a>, and
<a href="http://cs.nyu.edu/&#126;fergus/pmwiki/pmwiki.php">Rob Fergus</a>.
We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better,
since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

</p><p><center><img src="images/mnist_really_bad_images.png" width="560px"></center></p><p>
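I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images.
Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm.
But even the neural networks in the Wan et al. paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral is that for some problems: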

<center>
  sophisticated algorithm $\leq$ simple learning algorithm + good training data
</center></p><p><h3><a name="toward_deep_learning"></a><a href="#toward_deep_learning">Toward deep learning</a></h3></p><p>
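While our neural network gives impressive performance, that performance is somewhat mysterious.
The weights and biases in the network were discovered automatically.
And that means we don't immediately have an explanation of how the network does what it does.
Can we find some way to understand the principles by which our network is classifying handwritten digits?
And, given such principles, can we do better?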

</p><p>
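To put these questions more starkly, suppose that twenty or thirty years from now neural networks lead to artificial intelligence (AI).
Will we understand how such intelligent networks work?
Perhaps the networks will be opaque to us, with weights and biases we don't understand,
because they've been learned automatically.
In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain.
But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!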


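  </p><p>To address these questions, let's think back to the perceptron, the artificial neuron described at the start of the chapter. Suppose we want to determine whether an image shows a human face or not: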

</p><p> </p><p>  <span class="marginnote">Credits: 1. <a
  href="http://commons.wikimedia.org/wiki/User:ST">Ester Inbar</a>. 2.
  Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch
  (University of California, Santa Cruz), R. Bouwens (Leiden
  University), and the HUDF09 Team.  Click on the images for more
  details.</span></p><p>  <a
  href="http://commons.wikimedia.org/wiki/File:Kangaroo_ST_03.JPG"><img
  src="images/Kangaroo.JPG" height="190px"/></a> <a
  href="http://commons.wikimedia.org/wiki/File:Albert_Einstein_at_the_age_of_three_(1882).jpg"><img
  src="images/Einstein_crop.jpg" height="190px"/></a> <a
  href="http://commons.wikimedia.org/wiki/File:The_Hubble_eXtreme_Deep_Field.jpg"><img
  src="images/hubble.jpg" height="190px"/></a> </p><p>

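We could attack this problem the same way we attacked handwriting recognition: by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".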

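</p><p>Let's suppose we do this, but that we're not using a learning algorithm.
Instead, we're going to try to design a network by hand, choosing appropriate weights and biases.
How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems:
does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.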

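</p><p>If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.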
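</p><p>
This kind of weighing can be written down as a single hand-designed neuron. The sketch below is only an illustration of the heuristic: the weights and the bias are hypothetical values I picked by hand, and the inputs are assumed to be confidences between 0 and 1 supplied from somewhere else.

```python
import numpy as np

# One input per sub-question, in this order (the ordering is my own choice).
QUESTIONS = ("eye in the top left", "eye in the top right",
             "nose in the middle", "mouth in the bottom middle",
             "hair on top")

# Hypothetical hand-picked parameters: hair counts for less than the other
# features, and at least three of the four main features must be present.
WEIGHTS = np.array([2.0, 2.0, 2.0, 2.0, 1.0])
BIAS = -5.0

def is_face(answers):
    """answers holds one confidence in [0, 1] per question ("probably yes"
    can be something like 0.8). Returns True if the weighted evidence
    plus the bias is positive."""
    evidence = float(np.dot(WEIGHTS, np.asarray(answers, dtype=float)))
    return evidence + BIAS > 0.0

print(is_face([1, 1, 1, 1, 0]))          # every feature except hair
print(is_face([0.8, 0.8, 0.8, 0.8, 0]))  # "probably yes" answers
print(is_face([0, 0, 1, 0, 1]))          # mostly "no"
```

The first two calls come out as faces and the third does not, so a missing feature such as hair is tolerated as long as enough of the other evidence is present.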

</p><p>
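Of course, this is just a rough heuristic, and it suffers from many deficiencies.
Maybe the person is bald, so they have no hair.
And we can still recognize a face when only part of it is visible, when it is at an angle, or when some of it is obscured.
Still, the heuristic suggests the following:
if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face detection by combining the networks for the sub-problems.
Here's a possible architecture, with rectangles denoting the sub-networks.
Note that this isn't intended as a realistic approach to solving the face-detection problem;
rather, it's to help us build intuition about how networks function.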

</p><p><center>
<img src="images/tikz14.png"/>
</center></p><p>
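It's also plausible that the sub-networks can be decomposed.
Suppose we're considering the question:
"Is there an eye in the top left?" This can be decomposed into questions such as:
"Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on.
Of course, these questions should really include positional information as well,
such as "Is the eyebrow in the top left, and above the iris?",
but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed: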
</p><p><center>
<img src="images/tikz15.png"/>
</center></p><p>
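Those questions too can be broken down, further and further through multiple layers.
Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels.
Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image.
Such questions can be answered by single neurons connected to the raw pixels in the image.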
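</p><p>
To make the pixel-level end of this decomposition concrete, here is a toy sketch with two layers. A first-layer neuron looks at a 3 by 3 patch of raw pixels and asks "is there a vertical bar here?", and a second-layer neuron combines two such answers. Everything in it is a hypothetical hand design: the template weights, the biases, the tiny 3 by 6 image, and the convention that 0.0 is white and 1.0 is ink.

```python
import numpy as np

# Hypothetical 3x3 template: reward ink in the centre column and
# penalise ink anywhere else in the patch.
BAR_WEIGHTS = np.array([[-1.0, 1.0, -1.0],
                        [-1.0, 1.0, -1.0],
                        [-1.0, 1.0, -1.0]])
BAR_BIAS = -2.5

def fires(weights, bias, inputs):
    """Perceptron rule: 1.0 if the weighted sum plus bias is positive."""
    return 1.0 if float(np.sum(weights * inputs)) + bias > 0 else 0.0

def vertical_bar_at(image, row, col):
    """First layer: a single neuron wired directly to the raw pixels of
    the 3x3 patch whose top-left corner is (row, col)."""
    return fires(BAR_WEIGHTS, BAR_BIAS, image[row:row + 3, col:col + 3])

def two_bars(image):
    """Second layer: fires only if both first-layer neurons fire."""
    answers = np.array([vertical_bar_at(image, 0, 0),
                        vertical_bar_at(image, 0, 3)])
    return fires(np.array([1.0, 1.0]), -1.5, answers)

image = np.zeros((3, 6))
image[:, 1] = 1.0  # a vertical bar in the left patch
image[:, 4] = 1.0  # a vertical bar in the right patch
print(two_bars(image))
```

The second-layer neuron never sees a pixel. It only sees the answers to simpler questions, which is the pattern the sub-network diagrams above describe.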


</p><p>
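The end result is a network which breaks down a very complicated question (does this image show a face or not) into very simple questions answerable at the level of single pixels.
It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts.
Networks with this kind of many-layer structure, with two or more hidden layers, are called <em>deep neural networks</em>.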

</p><p></p><p></p><p>
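Of course, I haven't said how to do this recursive decomposition into sub-networks.
It certainly isn't practical to hand-design the weights and biases in the network.
Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases (and thus, the hierarchy of concepts) from training data.
Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks.
Unfortunately, except for a few special architectures, they didn't have much luck.
The networks would learn, but very slowly, too slowly to be useful in practice.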

</p><p>
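Since 2006, a set of techniques has been developed that enable learning in deep neural nets.
These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas.
These techniques have enabled much deeper (and larger) networks to be trained, and people now routinely train networks with 5 to 10 layers.
And it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer.
The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts.
It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs.
Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls.
Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.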

</p><p>
</div><div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2014

<br/>
<br/>

This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>.  This means you're free to copy, share, and
build on this book, but not to sell it.  If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Tue Sep  2 09:19:44 2014
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
  ga('send', 'pageview');

</script>
</body>
</html>