Hey, great series! Love it.
One thing though: I think you got the batch training concept wrong. In batch training the weights are not updated AS WE TRAIN EACH INPUT over the batch. We accumulate the weight deltas and apply them all at the end of the batch.
The reason we do this is that our final AIM is to minimize the cost (error) over the whole dataset (which is hard, hence batches). If this is confusing and I'm not explaining it well, you can check 3Blue1Brown's 4-video playlist on neural networks. He explains this in the 3rd or 4th video, I think. Good luck all
You are correct, and I am sorry for this. It was a long time ago, and I have since realised this was very wrong :P
I implemented the same code as you, but I did not get the expected output when I trained the sets of data as you did at 1:50 in your video. Please help me.
I assume there is some issue in your backpropagation then. Maybe go ahead and check it line by line :)
Thanks! this network runs about 10 times faster than the one from Coding Train!
Mario Velez, there is an even faster version. I uploaded it to GitHub and it supports convolution etc. Look for Luecx on GitHub; under my repositories you should find something called AILibrary.
Great video again, but this time I have a question about the logic of the algorithm:
My first thought was: why would you train the same training pairs multiple times instead of just training each once and increasing the learning rate?
But then I asked myself whether it's true that any time you adjust the weights, the older adjustments tend to fade out in effectiveness. In that case you would flatten that "fade out" curve by increasing the number of training loops and decreasing the learning rate. (I hope I explained my thoughts unambiguously; if not, feel free to ask me back.)
But training multiple times with a lower learning rate just seems to cover up the problem and not solve it.
So the real question is: how do we train the algorithm so that every input-target pair has the same influence on the weight adjustments (without training them infinitely many times)?
Hi Finn, thanks for all the videos, I'm following them with great intensity and they're helping me a lot! I have a question though, why is 8:40 a bad idea? What happens if you have network sizes of 4, 1000, 3, 2? Also, is there a rule for how many layers your network should have? Thanks a lot!
+Anon ymous that is actually really tough to answer, but I will give it a try:
Generally there is no formula to get the perfect architecture, but think about it this way:
As described in the video about the backprop algorithm, the network can be understood as a function that takes an input vector and the weights/biases and returns the output vector.
The more weights you have, the more complex your function can be. A network is pretty much trying to interpolate a dataset.
Let's say you have 100 points on a 2D graph. You can create a polynomial function with 100 coefficients so that your function passes through every point.
But this is not a good interpolation. You want to have a lot of data, generalise from it, and create a polynomial function of maybe 3rd, 4th or 5th degree with only a few coefficients. This means that your interpolation is way more general.
In a network you want to find the optimal architecture that neither overfits nor overgeneralises your data.
Neurons in a layer can be compared to pattern recognisers that are looking for a specific pattern in the previous layer.
So the first hidden layer is looking for n patterns in the input layer. The second hidden layer compares these pattern values and looks for new patterns.
Now let's assume that you have 3 input neurons and 1000 neurons in the 2nd layer. So you are checking for 1000 patterns that are being created by 3 input neurons.
This definitely sounds like overfitting. It would mean that (1,0,0) in the input layer would trigger completely different neurons in the next layer than (0.9,0,0) would, because the pattern is different. But the network should probably respond the same way to both inputs. So you need fewer neurons in the second layer to detect fewer patterns.
Finn Eggers Thanks for the reply. For my network I will have 35 or 60 inputs (to be decided), and I only have 150 training sets. My output size will be the same as the input size. For 35 inputs, would a network size of 100, 75, 60, 50, 40, 35 work? And for 60 inputs, 200, 150, 120, 100, 85, 75, 65, 60? So the first hidden layer should ideally be a lot larger than the input layer, and then each hidden layer should reduce in size?
+Anon ymous mhm.. I am not quite sure what you mean.
Can you tell me what you are trying to make your network learn?
If the input size is 30, you cannot use a network (60,45,30...) because the input data needs to have the same size as the number of neurons in the input layer.
Sorry I wasn't clear.
I'm not sure if my network will consist of 35 inputs or 60 inputs, so I'm exploring both options for now. After doing some more research I will conclude this on my own.
Regarding each layer size, I'm just trying to get a general idea. What do you think of what I suggested below?
Option 1 - 35 Inputs:
35 (Input layer size) -> 75 (first hidden layer size) -> 60 (second hidden layer size) -> 50 -> 40 -> 35 (output layer size).
Option 2 - 60 Inputs:
60 (Input layer size) -> 200 (first hidden layer size) -> 120 -> 100 -> 85 -> 75 -> 65 -> 60 (output layer size).
I only have 150 sets of inputs to train with. I hope I explained it a bit better :)
Yeah, that might work. It's usually really hard to tell if that's a good structure. By the looks of it, you might have too many layers in the second example with 60 inputs, especially when you have, what, 150 training sets? Just calculate the number of weights and you will see that your weights / data ratio might be way too high, which results in overfitting. Overfitting cannot be seen during the training stage, only when you run your network on some new data. You should play around with that.
I've seen one approach some people use to create their networks: they start with no hidden layer, only their input and output size, and run the training. If the overall error stops decreasing and is still really bad, they add a new layer with n neurons, where n is most likely in between the input and the output size.
If the error was too high at that point, the network probably did not have enough weights, which resulted in overgeneralisation.
So they keep adding layers until the result is what they want. Try this for yourself, because if you start with a huge network your overall network error will be really small when training but your error rate on new data later on might be extremely high.
Here is an image that illustrates overfitting and underfitting (overgeneralisation): i2.wp.com/www.geeksprogramming.com/wp-content/uploads/2016/08/ML3.png?resize=838%2C300&ssl=1
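To make the "calculate the number of weights" advice concrete, here is a minimal Java sketch that counts the weights of a fully connected network from its layer sizes. The class and method names (WeightCountSketch, countWeights) are made up for illustration and are not part of the code from the videos; biases are left out for brevity.

```java
/** Sketch: count the weights of a fully connected network from its layer sizes. */
class WeightCountSketch {

    // Each pair of adjacent layers contributes (previous size * next size) weights.
    // Biases are ignored here; they would add one value per non-input neuron.
    static int countWeights(int... layerSizes) {
        int total = 0;
        for (int i = 1; i < layerSizes.length; i++) {
            total += layerSizes[i - 1] * layerSizes[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int samples = 150; // number of training sets mentioned in the thread
        int option1 = countWeights(35, 75, 60, 50, 40, 35);
        int option2 = countWeights(60, 200, 120, 100, 85, 75, 65, 60);
        // prints "13525 weights, 90 per sample"
        System.out.println(option1 + " weights, " + option1 / samples + " per sample");
        // prints "71650 weights, 477 per sample"
        System.out.println(option2 + " weights, " + option2 / samples + " per sample");
    }
}
```

With only 150 training sets, both options have far more weights than data points, which is the ratio the reply above warns about.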
Great series! But there is one thing I didn't really get yet: How do I know how many layers and how many neurons in each layer I should use for a specific problem? Thank you!
Hello. Do you have another link for the code? It seems that mediafire is currently not working well.
Okay, so I am not 100% sure about this but:
When using a batch algorithm, the system weights are kept constant while computing the error associated with each sample in the input. Which means that what you should be doing is (in pseudocode):
loop for each batch
    for each training item in batch
        compute weights and bias deltas for curr item
        accumulate the deltas
    end for
    adjust weights and bias values using accumulated deltas
end loop
Whereas online training goes like the one you have shown us:
for each training data item
    compute weights and bias deltas for curr item
    adjust weights and bias values using deltas
end for
So your approach isn't really a batch training approach as much as a batched online one.
P.S. Really nice tutorials though, I just thought I should point this out.
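The difference between the two loops in the pseudocode above can be sketched in Java roughly as follows. This is not the Network class from the videos: it assumes a hypothetical single linear neuron without bias and a squared error, just to keep the example self-contained, and all names are invented for illustration.

```java
import java.util.Arrays;

/** Sketch: one linear neuron, squared error, batch update vs. online update. */
class BatchSketch {

    // Hypothetical model: output = w[0]*x[0] + w[1]*x[1] + ... (no bias, for brevity).
    static double predict(double[] w, double[] x) {
        double sum = 0;
        for (int i = 0; i < w.length; i++) sum += w[i] * x[i];
        return sum;
    }

    // Gradient of 0.5 * (output - target)^2 with respect to each weight.
    static double[] gradient(double[] w, double[] x, double target) {
        double err = predict(w, x) - target;
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++) g[i] = err * x[i];
        return g;
    }

    // Batch step: weights stay fixed while the deltas are accumulated, then one update.
    static void batchStep(double[] w, double[][] xs, double[] ts, double eta) {
        double[] acc = new double[w.length];
        for (int s = 0; s < xs.length; s++) {
            double[] g = gradient(w, xs[s], ts[s]);
            for (int i = 0; i < w.length; i++) acc[i] += g[i];
        }
        for (int i = 0; i < w.length; i++) w[i] -= eta * acc[i] / xs.length;
    }

    // Online step: the weights change after every single sample.
    static void onlineStep(double[] w, double[][] xs, double[] ts, double eta) {
        for (int s = 0; s < xs.length; s++) {
            double[] g = gradient(w, xs[s], ts[s]);
            for (int i = 0; i < w.length; i++) w[i] -= eta * g[i];
        }
    }

    public static void main(String[] args) {
        double[][] xs = {{1, 0}, {0, 1}, {1, 1}};
        double[] ts = {2, 3, 5}; // generated by target = 2*x0 + 3*x1
        double[] w = {0, 0};
        for (int epoch = 0; epoch < 2000; epoch++) batchStep(w, xs, ts, 0.1);
        System.out.println(Arrays.toString(w)); // values close to 2 and 3
    }
}
```

Averaging the accumulated deltas over the batch size is one common choice; summing without dividing also works if the learning rate is scaled accordingly.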
Actually you are right.
I've seen an implementation that does it like this, but yes: this is not "true stochastic gradient descent".
Also the way I am choosing training samples randomly is not perfect.
At the point I made these videos, I just had bad sources and I totally agree with you.
The problem is that the sources on the web differ in many small points and a lot of them are not correct, and sadly I kind of copied this badly.
However, this series should be a starting point for beginners in this topic, and smaller projects still work fine.
I see that you have done a bit more in this field, as I have lately.
I'm really excited about your video series Finn!! I have a question, which I hope isn't too silly. When you are back propagating the conversion of one number to another number... how do you get to a point where it improves its ability to do this with any number? The way it's set up is great in that it shows the learning of one conversion. I'm just not clear on how it gets from that to an overall strategy.
If I'm understanding anything correctly in the process, the bias/deltas need to be collected and averaged against particular inputs? So a database of deltas with a key relation to a particular number would allow it to solve within X fewer training rounds?
Don't worry, your question is not silly. If you have any questions after this, feel free to ask.
So, what is a neural network? The simplest answer: It is a function approximator.
What kind of function? Any. Really!
Let's talk about simple one dimensional mathematical functions like f(x) = x^2.
We can make a neural network learn that (kinda).
Just give it, like, 10 x-values with corresponding f(x)-values and you will get a pretty good approximation for f(x).
Now let's talk about the weights. Well yeah, we are taking the average of the deltas for different input/output values and changing the weights accordingly.
A database would only work if the network topology (layer sizes etc.) is constant.
What a neural network does is the following:
We give it something to "learn". For that, we put something into the network and see what it calculates. After that, we tell it what the output should be.
The output depends on nothing but the input and the weights. Now, we can define something like an error of the network.
And "simple" calculus: We take the derivative of the error function with respect to each weight and change those accordingly.
This might not be the best answer to your question because I was not quite sure what you were asking in the second part. Hope this explains a bit. Feel free to ask further.
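The "derivative of the error function with respect to each weight" step can be illustrated with a very small Java sketch. It assumes a hypothetical single sigmoid neuron with one weight and one bias and the error 0.5 * (output - target)^2; the names are invented for illustration, and the full network in the videos applies the same idea to every weight via backpropagation.

```java
/** Sketch: gradient of the error with respect to one weight of a sigmoid neuron. */
class GradientSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Error of a single neuron: 0.5 * (sigmoid(w*x + b) - target)^2.
    static double error(double w, double b, double x, double target) {
        double out = sigmoid(w * x + b);
        return 0.5 * (out - target) * (out - target);
    }

    // Chain rule: dE/dw = (out - target) * out * (1 - out) * x.
    static double dErrorDw(double w, double b, double x, double target) {
        double out = sigmoid(w * x + b);
        return (out - target) * out * (1.0 - out) * x;
    }

    public static void main(String[] args) {
        double w = 0.3, b = 0.0, x = 0.5, target = 0.8, eta = 0.5;
        for (int step = 0; step < 5; step++) {
            System.out.println("error = " + error(w, b, x, target));
            w -= eta * dErrorDw(w, b, x, target); // move the weight against the gradient
        }
    }
}
```

Each printed error should be slightly smaller than the previous one, since the weight is nudged in the direction that reduces the error.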
Thank you so much for your answer. It seems I need to better understand the derivative of the error and maybe create a system that tracks errors from an extended history. I was initially looking at it perhaps from the wrong direction and thinking I could use a sort of snapshot of the neural net at the point of success to seed the next set of calculations instead of a randomized initialization.
I really hope to see more videos from you in machine learning. Please keep up the awesome work!
NR Updates if you like, we can talk on Discord, Skype etc. later today and I will try to explain it as well as possible.
Hi Finn, I was just kind of reprocessing your answers. In your initial response you wrote: "So, what is a neural network? The simplest answer: It is a function approximator. "
This is actually my fascination with these networks. I have an existing NLP project with a series of complex functions which I want to map out and replicate with a series of neural nets. It's a rather large project so it may require a NN composed of other NNs. (If that makes any sense). If you would like to see what I'm trying to do, feel free to ask to see what I'm working with on Discord when you have the time. I am very much looking forward to a greater understanding of the concepts you are working with here!
Sure, I will be home at roughly 4 pm CEST. Got a free evening. Just tell me when you have time.
Hey, I know this has been uploaded for a while, but I am getting a NullPointerException on the if statement in the train method from your previous video. It starts at this train method then spreads to the train method with the trainset. Can anybody help me?
Well... I just figured it out, in case anyone ever comes across the same problem. I had a typo in the trainset train method. The second for loop was " for (int b = 0; i < batch_size; b++) { " instead of " for (int b = 0; b < batch_size; b++) { ".
Why is it not a good idea to have a net like 4,1000,4,2?
Going from 4 to 1000 is bad to begin with, because you should think about neurons as things that can find a structure in the input data. Now, when you go from 4 to 1000, there is no way you can find 1000 distinct patterns in the 4 values.
Also, when you go from 1000 to 4, there are definitely more than just 4 patterns to detect in the 1000 neurons.
Great video, but the extractBatch function throws a NullPointerException. Can someone tell me why that happens?
Btw, I downloaded the file in the description, so it should be working.
Kristóf Bácskai at what point is the null pointer thrown?
Hey, I don't know if you still have this issue, but I also had it and it was a problem with NetworkTools.randomValues(). It wasn't returning numbers in the specified range - change it from
int n = (int)(Math.random() * (upperBound - lowerBound + 1) + lowerBound);
to
int n = (int)(Math.random() * (upperBound - lowerBound ));
and it should work correctly.
@@vitorbasso The + 1 was wrong. But your code would return a value in the range from 0 to (upper - lower).
You need to add the lower range.
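Putting the two corrections together (drop the + 1, keep the lower bound), a minimal sketch of a helper that returns an index in the half-open range [lowerBound, upperBound) might look like this. The class and method names are invented for illustration and are not the ones from NetworkTools; the second method shows the same range using the standard library's ThreadLocalRandom.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: random int with lowerBound inclusive and upperBound exclusive. */
class RandomRangeSketch {

    // Math.random() is in [0, 1), so the scaled value is in [0, upper - lower);
    // truncating and then adding lowerBound gives [lowerBound, upperBound).
    static int randomInRange(int lowerBound, int upperBound) {
        return (int) (Math.random() * (upperBound - lowerBound)) + lowerBound;
    }

    // Equivalent range using the JDK helper (requires lowerBound < upperBound).
    static int randomInRangeJdk(int lowerBound, int upperBound) {
        return ThreadLocalRandom.current().nextInt(lowerBound, upperBound);
    }

    public static void main(String[] args) {
        int[] data = new int[10];
        // Safe as an array index: the result is never data.length.
        int index = randomInRange(0, data.length);
        System.out.println("picked index " + index);
    }
}
```

With the exclusive upper bound, passing the array length as upperBound can never produce an out-of-range index.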
Hey, I've got a quick, easy question: how does it work when you have input and output values above 1? How do you convert them so they work, and then convert them back to get the wanted value? For example, input 4.5 gives me 100 in the output.
Well, just plug it in. That's not a big deal. Your input value does not have to be between 0 and 1. The same goes for your output. It is recommended, but it is not required.
You could use a logarithmic function to map your values first, something like f(x) = log(x+1), with the inverse function g(x) = e^x - 1.
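The suggested mapping f(x) = log(x+1) and its inverse g(x) = e^x - 1 can be sketched in Java with Math.log1p and Math.expm1. The class and method names are invented for illustration. Note as an assumption on my part: if the output layer uses a sigmoid, it can only produce values between 0 and 1, so a mapped target such as log(101) would still need additional scaling.

```java
/** Sketch: map values with f(x) = log(x + 1) and back with g(y) = e^y - 1. */
class LogScalingSketch {

    // f(x) = log(x + 1); Math.log1p computes this accurately for small x.
    static double encode(double x) {
        return Math.log1p(x);
    }

    // g(y) = e^y - 1, the inverse of encode.
    static double decode(double y) {
        return Math.expm1(y);
    }

    public static void main(String[] args) {
        double input = encode(4.5);    // roughly 1.70
        double target = encode(100.0); // roughly 4.62
        System.out.println("network input:  " + input);
        System.out.println("network target: " + target);
        // After training, a network output is mapped back the same way:
        System.out.println("decoded target: " + decode(target)); // roughly 100
    }
}
```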
Oh thanks for the quick response ;p
You are welcome ;)
Hey Finn, you are probably not looking at these comments any more, but can you please help me with my code? My problem:
The network is working but giving me very small numbers. I tried to run the code you have at the start of the video with the two inputs and outputs, and I got this as the output:
[8.794263284359918E-6, 8.569996486197259E-6] [8.408882815470081E-6, 8.191886381195247E-6]. Why are these so small? They don't really match up with the target. Any help will be awesome! Thank you.
P.S. I need to make a neural network that plays tic tac toe. So, I was thinking of using minimax to find the target values and then running them through this network. Will that be possible?
Sorry for the trouble, I found my error and fixed it. Works great now! But can you still give me some advice on how I would go about making that tic tac toe game? Please and thank you.
@@tusharwani3146 What was your error? I have the same issue.
@@ates641 My error was in the backpropError method. I had one of the signs wrong. After fixing that, everything worked!
@@tusharwani3146 Thanks for quick reply. I will check my method hope it works...
@@ates641 I am assuming you have checked for big mistakes. So, check for small trivial errors that might make the code misbehave.
Sorry for hogging the comment section, but I just discovered something interesting. I was getting surprisingly high error scores on my test data.
Training essentially manipulates the inner workings of the network to fit the last-trained data, meaning it 'forgets' the previously trained data, right?
My training set consists of 120 items. My testing set consists of 40 items. In an attempt to understand further, I used my TESTING DATA to train with first! After which I then trained the training data (so total of 160 items trained). When I then tested with the testing data, to my surprise the error scores were only about 3% lower than if I didn't train the test data.
Do you have any idea how to fix this? My test data should have had a much lower error score, no? Currently I'm working with few layers (around 5 total). Is this a common phenomenon by any chance that has an easy solution? :) :)
P.S. Running the code in a Linux VM (VMware) gave me a +30% speed boost. Surprisingly running Linux natively or on a VM makes no difference in execution speed. Also, I ported the code to C++ for Linux which, SOMEHOW (WTF??), ran 20% slower than Java. Thought you might find this info useful in case you ever need a speed boost. ;)
Well, try to use fewer layers. I've made some simulations and I also discovered that when you have too many layers it will totally fuck up. So try it with fewer layers. Usually your architecture is wrong if something like that happens.
I'm only using 2 hidden layers... Maybe I did something wrong with the L2 regularization? I'll go investigate. Thanks.
I undid the L2 regularization but I still have the same issue. Maybe I need more layers then?
So is it normal for a network to (completely) 'forget' training data as more training occurs?
Can you tell me how your network looks and how much data you have? What does the data look like, etc.?
Hi Finn, I did some more reading on the L2 regularisation and found this article: jamesmccaffrey.wordpress.com/2017/06/29/implementing-neural-network-l2-regularization/ It has a very nice and easy snippet of code but I thought I'd run it by you in case I'm doing it wrong. In the updateWeights() method I'm adding this line in the for-loop (right before the weights get increased): weights[layer][neuron][prevNeuron] -= lambda * weights[layer][neuron][prevNeuron];
As recommended in the article, I will leave lambda as a high value, starting at 0.97 and seeing how it goes.
This is correct, right? Thanks very much for all the help!
I quickly read the article as well and I would have gone for that solution too. I am afraid that a starting value of 0.97 might be too high, but that's up to you :). Go ahead and test it. In theory your term should be correct.
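To see why 0.97 might be too high, here is a minimal Java sketch of the weight decay line from the question, applied to a weights[layer][neuron][prevNeuron] array. It is not the article's code and not the Network class from the videos; the class name, method name and the null check for a missing input-layer entry are assumptions made for illustration.

```java
/** Sketch: apply the decay step w -= lambda * w to every weight. */
class WeightDecaySketch {

    // weights[layer][neuron][prevNeuron]; a null layer entry (for example the
    // input layer, which has no incoming weights) is skipped. This layout is assumed.
    static void applyWeightDecay(double[][][] weights, double lambda) {
        for (double[][] layer : weights) {
            if (layer == null) continue;
            for (double[] neuron : layer) {
                for (int prev = 0; prev < neuron.length; prev++) {
                    neuron[prev] -= lambda * neuron[prev];
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][][] strong = {null, {{1.0, -2.0}}};
        double[][][] mild = {null, {{1.0, -2.0}}};
        applyWeightDecay(strong, 0.97); // each weight keeps only about 3% of its value
        applyWeightDecay(mild, 0.01);   // each weight keeps 99% of its value
        System.out.println(strong[1][0][0] + " vs " + mild[1][0][0]);
    }
}
```

Since w - lambda * w equals (1 - lambda) * w, a lambda of 0.97 removes 97% of every weight on each update, which would largely undo whatever the gradient step has learned.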
Wow man, you're awesome