indico's Head of Research, Alec Radford, led a workshop on general sequence learning using recurrent neural networks at Next.ML in San Francisco. His presentation and workshop resources are available below, free of charge.
---
Recurrent Neural Networks hold great promise as general sequence learning algorithms. As such, they are a very promising tool for text analysis. However, outside of very specific use cases like handwriting recognition and, more recently, machine translation, they have not seen widespread use. Why has this been the case?
In this workshop, Alec will introduce RNNs as a concept, sketch how to implement them, and cover the tricks necessary to make them work well. With the basics covered, he'll investigate using RNNs as general text classification and regression models, examining where they succeed and where they fail compared to more traditional text analysis models.
Finally, he'll introduce a simple Python and Theano library for training RNNs with a scikit-learn style interface and show how to use it through several hands-on tutorials on real-world text datasets.
---
Next.ML was created to help you use the latest machine learning techniques the minute you leave the workshop. Learn from industry-leading data scientists at Next.ML on April 27th, 2015 at the Microsoft NERD Center -- for more info, visit http://Next.ML
---
Resources
Passage - https://github.com/IndicoDataSolutions/Passage
Presentation Video - https://www.youtube.com/watch?v=VINCQghQRuM
1. General Sequence Learning using Recurrent Neural Networks
2. How ML
Numerical input (-0.15, 0.2, 0, 1.5): great!
Categorical input (A, B, C, D): great!
Text input ("The cat sat on the mat."): uhhh…
3. How text is dealt with (ML perspective)
Text → Features (bow, TFIDF, LSA, etc.) → Linear Model (SVM, softmax)
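To make the pipeline concrete, here is a minimal sketch of this traditional setup using scikit-learn; the toy texts, labels, and parameter choices are illustrative, not from the workshop.

```python
# A minimal sketch of the traditional pipeline: bag-of-words/TF-IDF features
# feeding a linear model. Toy data and parameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["the cat sat on the mat", "the dog barked at the mailman"]
labels = [0, 1]

model = Pipeline([
    ("features", TfidfVectorizer()),   # text -> sparse TF-IDF feature vectors
    ("classifier", LinearSVC()),       # linear model on top of the features
])
model.fit(texts, labels)
print(model.predict(["the cat sat quietly"]))
```

Swapping in `TfidfVectorizer(ngram_range=(1, 2))` gives the bigram variant discussed below; the feature count grows quickly as longer n-grams are added.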
4. Structure is important!
The cat sat on the mat.
sat the on mat cat the (same words, structure destroyed)
● For certain tasks, structure is essential:
○ Humor
○ Sarcasm
● For certain tasks, ngrams can get you a long way:
○ Sentiment Analysis
○ Topic detection
● Specific words can be strong indicators:
○ useless, fantastic (sentiment)
○ hoop, green tea, NASDAQ (topic)
5. Structure is hard
Ngrams are the typical way of preserving some structure:
the cat, cat sat, sat on, on the, the mat
Beyond bi- or tri-grams, occurrences become very rare and dimensionality becomes huge (1-10 million+ features).
6. How text is dealt with (ML perspective)
Text → Features (bow, TFIDF, LSA, etc.) → Linear Model (SVM, softmax)
7. How text should be dealt with?
Text → RNN → Linear Model (SVM, softmax)
9. How an RNN works
[Diagram: the words "the cat sat on the mat" are fed in one at a time via input-to-hidden projections]
10. How an RNN works
[Diagram: adds hidden-to-hidden projections carrying the hidden state from one step to the next]
11. How an RNN works
[Diagram: the hidden state is updated at every word of the sequence]
12. How an RNN works
[Diagram: labels the projections (activities x weights) and the activities (vectors of values)]
13. How an RNN works
[Diagram: the final hidden activity is the learned representation of the sequence]
14. How an RNN works
[Diagram: a hidden-to-output projection turns the hidden activity into a prediction, e.g. "cat"]
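The diagrams above correspond to very little code. Below is a hedged numpy sketch of the forward pass (not from the workshop materials): embedding lookup, input-to-hidden and hidden-to-hidden projections with an activation, then a hidden-to-output projection. All sizes and random weights are illustrative.

```python
# Minimal numpy sketch of the RNN forward pass in the diagrams: each word's
# embedding is projected input-to-hidden, combined with the previous hidden
# state via a hidden-to-hidden projection, and the final hidden activity is
# projected hidden-to-output. Sizes and random weights are illustrative.
import numpy as np

vocab_size, embed_dim, hidden_dim, n_classes = 6, 3, 4, 6
rng = np.random.RandomState(0)

E    = rng.randn(vocab_size, embed_dim) * 0.1   # embedding matrix
W_xh = rng.randn(embed_dim, hidden_dim) * 0.1   # input to hidden
W_hh = rng.randn(hidden_dim, hidden_dim) * 0.1  # hidden to hidden
W_hy = rng.randn(hidden_dim, n_classes) * 0.1   # hidden to output

tokens = [0, 1, 2, 3, 0, 4]                     # "the cat sat on the mat"
h = np.zeros(hidden_dim)                        # initial hidden activity
for t in tokens:
    x = E[t]                                    # embedding lookup
    h = np.tanh(x @ W_xh + h @ W_hh)            # projection + activation

logits = h @ W_hy                               # hidden to output
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the outputs
print(probs)
```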
15. From text to RNN input
String input: "The cat sat on the mat."
Tokenize: the cat sat on the mat .
Assign index: 0 1 2 3 0 4 5
Embedding lookup: each index selects a row of a learned matrix, e.g.
 2.5  0.3 -1.2
 0.2 -3.3  0.7
-4.1  1.6  2.8
 1.1  5.7 -0.2
 1.4  0.6 -3.9
-3.8  1.5  0.1
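As a concrete sketch of these steps, using the vocabulary and matrix values above (rows ordered by token index):

```python
# Sketch of the text-to-RNN-input steps on this slide: tokenize, map each
# token to an index, then look up that index's row in a learned matrix.
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}
embedding = np.array([
    [ 2.5,  0.3, -1.2],   # the
    [ 0.2, -3.3,  0.7],   # cat
    [-4.1,  1.6,  2.8],   # sat
    [ 1.1,  5.7, -0.2],   # on
    [ 1.4,  0.6, -3.9],   # mat
    [-3.8,  1.5,  0.1],   # .
])

tokens = "The cat sat on the mat .".lower().split()   # tokenize
indices = [vocab[t] for t in tokens]                  # assign index -> [0, 1, 2, 3, 0, 4, 5]
inputs = embedding[indices]                           # embedding lookup, shape (7, 3)
print(indices)
print(inputs)
```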
16. You can stack them too
[Diagram: recurrent layers stacked on top of each other; the hidden activities of one layer are the inputs to the next, with a hidden-to-output projection at the top predicting "cat"]
17. But aren’t RNNs unstable?
Simple RNNs trained with SGD are unstable/difficult to learn.
But modern RNNs with various tricks blow up much less often!
● Gating Units
● Gradient Clipping
● Steeper gates
● Better initialization
● Better optimizers
● Bigger datasets
18. Simple Recurrent Unit
[Diagram: at each step the unit combines the input x_t with the previous hidden state h_t-1 through an element-wise addition (+) and an activation function to produce h_t, then repeats for x_t+1 to produce h_t+1; the legend marks the routes information can propagate along and the connections that modify information flow and values]
19. Gated Recurrent Unit - GRU
[Diagram: a reset gate r and an update gate z are computed from x_t and h_t-1; a candidate state ~h is formed and combined with h_t-1 in the proportions 1-z and z using element-wise multiplication (⊙) and element-wise addition (+) to give h_t; the legend marks the routes information can propagate along and the connections that modify information flow and values]
20. Gated Recurrent Unit - GRU
[Diagram: the same GRU unit unrolled over two steps, producing h_t from x_t and h_t-1, then h_t+1 from x_t+1 and h_t]
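In code, one GRU step looks roughly like the numpy sketch below, following the standard formulation the diagram depicts (reset gate r, update gate z, candidate state, and the (1-z)/z mix). Weight shapes and values are illustrative, not from the workshop.

```python
# A single GRU step in numpy: reset gate r, update gate z, candidate state
# h~, and the (1 - z) / z element-wise mix of old and candidate state.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    # W, U, b each hold parameters for the r, z, and candidate transforms.
    r = sigmoid(x_t @ W["r"] + h_prev @ U["r"] + b["r"])               # reset gate
    z = sigmoid(x_t @ W["z"] + h_prev @ U["z"] + b["z"])               # update gate
    h_tilde = np.tanh(x_t @ W["h"] + (r * h_prev) @ U["h"] + b["h"])   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                            # element-wise mix

embed_dim, hidden_dim = 3, 4
rng = np.random.RandomState(0)
W = {k: rng.randn(embed_dim, hidden_dim) * 0.1 for k in "rzh"}
U = {k: rng.randn(hidden_dim, hidden_dim) * 0.1 for k in "rzh"}
b = {k: np.zeros(hidden_dim) for k in "rzh"}

h = np.zeros(hidden_dim)
for x in rng.randn(6, embed_dim):    # a toy sequence of 6 input vectors
    h = gru_step(x, h, W, U, b)
print(h)
```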
21. Gating is important
For sentiment analysis of longer sequences of text (a paragraph or so), a simple RNN has difficulty learning at all, while a gated RNN does so easily.
22. Which One?
There are two types of gated RNNs:
● Gated Recurrent Units (GRU) by K. Cho, recently introduced and used for machine translation and speech recognition tasks.
● Long Short-Term Memory (LSTM) by S. Hochreiter and J. Schmidhuber, which has been around since 1997 and has been used far more. Various modifications to it exist.
23. Which One?
GRU is simpler, faster, and optimizes quicker (at least on sentiment). Because it has only two gates (compared to four), it is approximately 1.5-1.75x faster in a Theano implementation.
If you have a huge dataset and don't mind waiting, LSTM may be better in the long run due to its greater complexity - especially if you add peephole connections.
24. Exploding Gradients?
Exploding gradients are a major problem for traditional RNNs trained with SGD, and one source of RNNs' reputation for being hard to train.
In 2012, R. Pascanu and T. Mikolov proposed clipping the norm of the gradient to alleviate this.
Modern optimizers don't seem to have this problem - at least for text classification.
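A sketch of the gradient-norm clipping idea (the threshold value here is arbitrary):

```python
# Sketch of gradient norm clipping: if the global norm of the gradients
# exceeds a threshold, rescale them so the norm equals the threshold.
import numpy as np

def clip_gradient_norm(grads, threshold=5.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]    # global norm = 13
print(clip_gradient_norm(grads, threshold=5.0))     # rescaled to norm 5
```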
25. Better Gating Functions
Interesting paper at a NIPS workshop (Q. Lyu, J. Zhu): make the gates "steeper" so they change more rapidly from "off" to "on", letting the model learn to use them more quickly.
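One simple way to read "steeper" is scaling the pre-activation before the sigmoid so the gate saturates over a narrower input range; the sketch below illustrates that idea and is not necessarily the exact formulation from the cited paper.

```python
# Illustrative sketch of a "steeper" gate: scaling the pre-activation before
# the sigmoid makes the gate swing from ~0 to ~1 over a narrower input range.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def steep_sigmoid(a, slope=3.0):    # slope > 1 => steeper gate
    return sigmoid(slope * a)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))        # standard gate values
print(steep_sigmoid(x))  # steeper gate: closer to hard on/off
```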
26. Better Initialization
Andrew Saxe last year showed that initializing weight matrices with random orthogonal matrices works better than random Gaussian (or uniform) matrices.
In addition, Richard Socher (and more recently Quoc Le) have used identity initialization schemes, which work great as well.
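A hedged sketch of both schemes for a square hidden-to-hidden weight matrix; the 512-unit size is just an example.

```python
# Sketch of the two initializations mentioned: a random orthogonal matrix
# (here via the SVD of a Gaussian matrix) and identity initialization for
# square hidden-to-hidden weights. Sizes are illustrative.
import numpy as np

def orthogonal_init(shape, rng=np.random):
    a = rng.randn(*shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else v     # pick the factor with the right shape

def identity_init(size, scale=1.0):
    return scale * np.eye(size)

W_hh_ortho = orthogonal_init((512, 512))
W_hh_ident = identity_init(512)
print(np.allclose(W_hh_ortho @ W_hh_ortho.T, np.eye(512)))   # True: orthogonal
```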
28. Comparing Optimizers
Adam (D. Kingma) combines the early optimization speed of Adagrad (J. Duchi) with the better later convergence of various other methods like Adadelta (M. Zeiler) and RMSprop (T. Tieleman).
Warning: the generalization performance of Adam seems slightly worse for smaller datasets.
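For reference, one Adam update step looks like the following sketch, with the commonly cited default hyperparameters; it is purely illustrative.

```python
# Sketch of one Adam update step: exponential moving averages of the gradient
# and its square, bias correction, then a scaled parameter step.
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([1.0, -2.0])
m = np.zeros_like(param)
v = np.zeros_like(param)
for t in range(1, 4):                        # a few steps on a toy objective
    grad = 2 * param                         # gradient of sum(param**2)
    param, m, v = adam_step(param, grad, m, v, t)
print(param)
```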
29. It adds up
Training is up to 10x more efficient once you add all the tricks together, compared to a naive implementation, and much more stable - it rarely diverges.
In wall-clock time it is around 7.5x faster, since the various tricks add a bit of computation time.
30. Too much? - Overfitting
RNNs can overfit very well, as we will see. As they continue to fit the training dataset, their performance on test data will plateau or even worsen.
Keep track of this using a validation set: save the model at each iteration over the training data and pick the earliest one with the best validation performance.
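A sketch of that validation-tracking procedure is below; the `train_one_epoch` and `validation_score` functions are hypothetical placeholders, not part of any library mentioned here.

```python
# Sketch of the procedure on this slide: after each pass over the training
# data, evaluate on a validation set, snapshot the model, and keep the
# earliest iteration with the best validation score. The train_one_epoch and
# validation_score callables are hypothetical placeholders.
import copy

def fit_with_validation_tracking(model, train_data, valid_data, n_epochs,
                                 train_one_epoch, validation_score):
    best_score, best_model = None, None
    for epoch in range(n_epochs):
        train_one_epoch(model, train_data)
        score = validation_score(model, valid_data)
        if best_score is None or score > best_score:    # strictly better only,
            best_score = score                          # so ties keep the earliest
            best_model = copy.deepcopy(model)           # snapshot of the best model
    return best_model, best_score
```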
31. The Showdown
Model #1 Model #2
+ 512 dim
embedding
512 dim
hidden state
output
Using bigrams and grid search on min_df for
vectorizer and regularization coefficient for model.
Using whatever I tried that worked :)
Adam, GRU, steeper sigmoid gates, ortho/identity
init are good defaults
33. Effect of Dataset Size
● RNNs have poor generalization properties on small datasets.
○ With 1K labeled examples they are 25-50% worse than a linear model…
● RNNs have better generalization properties on large datasets.
○ With 1M labeled examples they are 0-30% better than a linear model.
● The crossover happens somewhere between 10K and 1M examples, depending on the dataset.
34. The Thing we don't talk about
For 1 million paragraph-sized text examples to converge:
● The linear model takes 30 minutes on a single CPU core.
● The RNN takes 90 minutes on a Titan X.
● The RNN takes five days on a single CPU core.
The RNN is about 250x slower on CPU than the linear model (roughly 7,200 minutes vs 30)… This is why we use GPUs.
37. [Visualization: model activations highlighting quantities of time, qualifiers, product nouns, and punctuation]
Much cooler: the model also begins to learn components of language from only binary sentiment labels.
38. The library - Passage
● Tiny RNN library built on top of Theano
● https://github.com/IndicoDataSolutions/Passage
● Still alpha - we’re working on it!
● Supports simple, LSTM, and GRU recurrent layers
● Supports multiple recurrent layers
● Supports deep input to and deep output from hidden layers
○ no deep transitions currently
● Supports embedding and one-hot input representations
● Can be used for both regression and classification problems
○ Regression needs preprocessing for stability - working on it
● Much more in the pipeline
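For orientation, here is a rough sketch of how a model was put together with Passage's scikit-learn-style interface, based on its README at the time; class and argument names may have changed since, so treat them as assumptions and check the repository.

```python
# Rough sketch of training a binary sentiment classifier with Passage, based
# on the scikit-learn-style interface in its README at the time. Class and
# argument names here are assumptions; check the repository for the exact API.
from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GRU, Dense
from passage.models import RNN

train_text = ["this movie was fantastic", "utterly useless film"]   # toy data
train_labels = [1, 0]

tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)

layers = [
    Embedding(size=128, n_features=tokenizer.n_features),   # learned embedding lookup
    GRU(size=128),                                           # gated recurrent layer
    Dense(size=1, activation='sigmoid'),                     # binary sentiment output
]

model = RNN(layers=layers, cost='BinaryCrossEntropy')
model.fit(train_tokens, train_labels)
predictions = model.predict(tokenizer.transform(["surprisingly great"]))
```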
50. Summary
● RNNs look to be a competitive tool for text analysis in certain situations.
● Especially if you have a large 1M+ example dataset
○ A GPU or great patience is essential
● Otherwise it can be difficult to justify over linear models
○ Speed
○ Complexity
○ Poor generalization with small datasets