60 minute blitz uses stacked Dense layers with no activation function #339

@staticfloat

Description

In the 60 minute blitz tutorial, we use a sequence of stacked Dense layers, each with no activation function. This doesn't make much sense, as a composition of linear operators always collapses to a single linear operator:

julia> using Flux
       # Three stacked Dense layers with no activation (i.e. identity).
       model = Chain(
           Dense(200, 120, bias=false),
           Dense(120, 84, bias=false),
           Dense(84, 10, bias=false),
       )

       # The same map as a single Dense layer built from the product
       # of the three weight matrices.
       model_condensed = Chain(
           Dense(model[3].W * model[2].W * model[1].W),
       )

       # The two models agree up to floating-point rounding.
       x = randn(200)
       sum(abs, model(x) .- model_condensed(x))
2.4189600187907168e-6

Yes, machine precision/rounding means the two are not exactly equivalent, but you get no material benefit from the stacked Dense layers, and in fact you pay a performance penalty from shuffling the same values in and out of CPU cache.
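If you want to check the overhead yourself, here is a minimal sketch, reusing `model`, `model_condensed`, and `x` from the snippet above and assuming the BenchmarkTools package is installed (actual timings will vary by machine):

julia> using BenchmarkTools
       @btime model($x);
       @btime model_condensed($x);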

It would be better to either add nonlinearities between these Dense layers to increase model flexibility, or replace them with a single Dense layer that drops directly from 200 inputs to 10 outputs. For example:
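A minimal sketch of both options, keeping the tutorial's layer sizes (the names `model_nonlinear` and `model_single` are just for illustration, and `relu` is one reasonable choice of activation, not necessarily what the tutorial should settle on):

julia> using Flux
       # Option 1: insert nonlinearities so the composition no longer
       # collapses to a single linear map.
       model_nonlinear = Chain(
           Dense(200, 120, relu),
           Dense(120, 84, relu),
           Dense(84, 10),
       )

       # Option 2: accept that the model is linear and use one layer.
       model_single = Dense(200, 10)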
