Blog posts

2023

Product management for AI products - old wine in new bottle?

Published:

Netflix recently posted a job opening for an AI Product Manager role with a staggering annual salary of up to $900,000. The question arises: why such high compensation for an AI PM? When was the last time a Product Manager was paid that much?

The answer lies in the specialized skill set required for product management in the realm of AI. The traditional principles and experiences in product management for standard software products do not suffice for AI products. Product management in AI development demands a significant upgrade.

Before delving into the specifics of 'what' product management skills to enhance and 'how' to upgrade them to thrive in the AI product era, it's crucial to address the 'why.' Why do traditional product management methodologies and approaches fall short in the AI-first world? Let's explore the fundamental changes from traditional software to AI software that necessitate the evolution of product management skills:

  1. Stochastic vs Deterministic: Traditional software (referred to as software 1.0) operates deterministically, meaning it consistently produces the same output for a given system and input. In contrast, AI software (referred to as software 2.0) is stochastic, meaning it can yield varying outputs for the same system and input. (A toy illustration of this contrast appears after this list.)

    As a product manager, meticulous control over the user experience is paramount, especially to prevent negative user experiences. In the realm of software 1.0, this entails thorough scenario planning within the Product Requirements Document (PRD) to anticipate and design for potential errors. Once the software undergoes comprehensive testing, scenarios that pass ensure a reliable user experience in production, barring any unforeseen bugs.

    However, the landscape shifts with AI software (software 2.0). Even after rigorous testing, AI systems may still generate incorrect responses due to their stochastic nature. Such occurrences are not considered bugs but rather inherent to the probabilistic framework of AI software. Acknowledging that no AI system achieves perfect accuracy, product managers must proactively devise user experiences capable of managing errors and gracefully handling failures. This necessitates a specialized understanding of AI design processes for product managers to effectively navigate this terrain.

  2. Designing for loops (control loop, feedback loop, human-in-the-loop): Given the stochastic nature of AI systems, errors are inevitable. As a product manager, it's imperative to establish control loops to swiftly determine whether the AI's output is beneficial to the end user. If not, end users should be empowered to take control of the system, override the AI's decisions, or stop AI recommendations altogether.

    Similarly, the product team must institute feedback loops for each AI feature to assess its efficacy for end users. This involves gathering data across various scenarios to ascertain when the feature performs optimally and when it falters.

    Additionally, designing for humans-in-the-loop is crucial, enabling the seamless transition of control from AI systems to human operators when necessary.

  3. Data Collection: It's widely recognized that AI relies heavily on data. However, the responsibility of data collection doesn't fall on AI scientists; rather, it lies with the product team. Why? Because the product team holds ownership of the product, making them best suited to embed the necessary instrumentation for data collection.

    This is where PMs need to understand the lingo of data systems and work with data engineering teams to create roadmaps & sprints for data collection. In large AI-first MNCs, this is a very specialized role called Data PM.

  4. Introducing Users to AI: When integrating AI into your products for the first time, it's vital to introduce AI to your users delicately. Setting the right expectations and assuring users that they can regain control if they're uncomfortable is essential. Achieving this demands highly specialized flows & UX.
  5. Economics of AI: AI constitutes a costly endeavor, encompassing expenses for data acquisition, cleaning, and storage to create datasets, computational resources, and talent acquisition. Given the high costs involved, not every potential use case justifies the investment. Hence, it's imperative to select use cases judiciously.

    Within a company, the business team comprehends business dynamics but often lacks technical expertise, while individual contributors (ICs) in the AI team possess technical proficiency but may not grasp business intricacies. In this landscape, product managers play a pivotal role, bridging the gap between technology, user needs, and business objectives. Consequently, the responsibility of evaluating and determining the feasibility & viability of AI use cases predominantly falls on the shoulders of product managers.

  6. The AI development process is very different: The transition from traditional software development (Software 1.0) to AI-driven systems (Software 2.0) introduces significant differences in how product management approaches roadmaps and timelines. Here are some implications for product owners:
    • Uncertainty in Improvement: Unlike traditional software where incremental improvements are often achievable by addressing specific issues or corner cases, AI-driven systems may hit performance plateaus where further improvements are challenging. This uncertainty must be factored into product roadmaps and timelines, as achieving desired performance levels may require more extensive reevaluation and experimentation.
    • Extended Development Cycles: Improving AI-driven systems often involves revisiting the underlying algorithms, data pipelines, or model architectures, which can be time-consuming processes. Product owners need to allocate sufficient time in their roadmaps for research, experimentation, and validation of new approaches, potentially extending development cycles beyond what is typical for traditional software updates.
    • Resource Allocation: Given the longer development cycles associated with improving AI systems, product owners may need to allocate additional resources, both in terms of engineering talent and computational resources. This may require reprioritizing other initiatives or increasing investment in AI research and development to meet performance targets within the desired timeframe.
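To make the contrast in point (1) concrete, here is a toy Python sketch (purely illustrative - the function names and canned responses are made up) of a deterministic software-1.0 function next to a stochastic software-2.0-style response:

import random

def deterministic_tax(income):
    # Software 1.0: the same input always produces the same output.
    return 0.3 * income

def stochastic_reply(prompt):
    # Stand-in for an AI feature (software 2.0): the output is sampled,
    # so the same prompt can yield a different answer on every call.
    candidates = [
        "Sure, here is a summary of your notes...",
        "I could not find anything relevant.",
        "Here are three options you could consider...",
    ]
    return random.choice(candidates)

print(deterministic_tax(1000), deterministic_tax(1000))  # always identical
print(stochastic_reply("summarize my notes"))
print(stochastic_reply("summarize my notes"))             # may differ run to run

The first pair of outputs always matches; the second pair may not - and that gap is exactly what the control loops, feedback loops and error-handling UX discussed above have to account for.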

2022

Alexa AI and Layoffs

Published:

Lessons from Recent Layoffs in AI

Yesterday I was reading an article on recent layoffs at Alexa voice assistant unit in Amazon. This got me thinking. I believe the whole episode contains some critical lessons for many AI teams:

Key Lessons:

  • You can have the most cutting-edge technology, one that even your competition tries to copy, yet if it doesn't move business needles it is a liability.
  • Business & Product took a leap of faith that users would order goods purely by voice, without looking at the items even once. This and the other monetization efforts didn't pan out, and everything fell apart.
  • Mistakes made by the product team are super expensive. PMs must find ways to list and validate assumptions, especially leap-of-faith ones, as quickly as possible. In the case of Alexa, I wonder if there were ways to test these without having to invest so much time & effort.
  • Alexa single-handedly contributed to so many developments in speech tech; it is an absolute marvel. Yet it has been reduced to setting alarms, asking how the weather is outside, and playing songs - primarily due to points (2) and (3).
  • Business needs to understand that building cutting-edge tech takes time - no matter how much you may push, it can't be done in days, weeks or sometimes even months.

#Learnings_From_AI_Trenches #business #technology #ai #BusinessOfAI #machinelearning #nlp #nlproc #tech #deeplearning #PayItForward #naturallanguageprocessing #speechrecognition

Moonlighting & Side Hustle - Problem and Solution

Published:

Moonlighting in tech has emerged as a major stress point for companies. Below is my personal take on it:
The core reason, IMO, is that companies/consultants project every job as very high-end work (a once-in-a-lifetime opportunity), but the reality is often very different. Being a knowledge industry, tech leaders talk about hunting like a lion (hunt, rest, repeat), but in reality they do nothing to facilitate it and make you work like a cow (graze all day). All this, along with 'fairness' in compensation and hire-and-fire policies, means there is nothing left of employee loyalty. Now that employees are giving employers a dose of their own medicine (no-shows on the day of joining, quick churn), there is a hue and cry! IMO the blame lies equally on both sides.

Possible solution? Both employees and employers must understand that a professional relationship is a two-way street. Both will have to do their part to solve it:

  • Organizations must clearly specify the minimum working hours they expect. Be explicit - say Mon-Fri (10am-7pm) or whatever works for them. Be reasonable - don't say 24 hours a day, 7 days a week!
  • In their free time, employees should be free to pursue their interests - some like to chill and watch Netflix, others like to work on side gigs. That should be entirely the employee's choice - organizations should not dictate this.
  • Organizations should make the moonlighting policy super transparent. Create a system where employees must inform the employer of all side gigs. In return, employers must not object to side gigs unless there is a clear conflict of interest.
  • Organizations should evaluate employees purely on their performance in their organization. If the performance is great, reward them accordingly. If not, put them on a PIP or let them go. Unless a side gig is clearly impacting an employee's performance, it should not concern the employer.
  • Employees must understand that companies will be more transparent only if employees don't cheat. So:
    • Only one full-time employment. You can't take two overlapping jobs. To be more precise - the time commitment of one job must not conflict with the time commitment of any other gig.
    • Employees can have side gigs, but only in their free time (time not committed to the organization).
    • Employees must inform the employer of all side gigs.
    • Employers must not object to side gigs unless there is a clear conflict of interest.
    • An employee's side gig cannot be with competitor companies for up to 1 year (maybe even 2 years) after leaving any employer. Organizations must have a strong right to protect IP, business strategy, internal details etc. Any clear conflict of interest must be strictly prohibited.
  • Despite all this, if an employee is found violating the moonlighting policy, organizations should have the legal right to take very strict action and penalize them heavily.

A bunch of companies such as Swiggy have created such policies. 100% transparency on both sides is the only way forward to create a true knowledge economy. IMO the way to attract really smart people in a knowledge economy is flexibility, challenging work and transparency.


2021

Innovation in AI Teams

Published:

Innovation and AI are two words that often go hand in hand. However, depending on the type of setting an AI team is embedded in, the semantics of 'Innovation' can vary.
If you are part of a research lab, this often means the AI team works on fundamental problems and creates fundamentally new techniques to power cutting-edge solutions. They publish a lot and advance the entire area in a fundamental way - this is their true north star. It goes without saying that this means lots of publications, paper presentations, talks & tutorials at research conferences, very close collaboration with academia, etc.
When it comes to AI teams working in product settings, the innovation is of a very different type, and in this post and the next I will try and focus on this facet. There are mainly 4 types of innovation that happen in such teams:

  • Finding newer use cases and touch points in your product(s) where AI can deliver a completely new experience to the user: This innovation often creates a USP/key differentiator for the product compared to the competition in the respective segment. And if the feature is a killer one, it changes the user's behavior, which in turn changes user expectations. The competition then has no option but to play the catch-up game. This creates a very strong moat. For example, YouTube uses AI to generate subtitles in multiple languages on the fly. They have fundamentally changed user expectations, and other platforms have to offer a similar feature.

    This kind of innovation is often not as easy as it may sound. I have written about this in an earlier post (ref: Hobbiton - Use cases where AI can deliver great returns). It requires first-principles thinking: product managers who can envision features in a completely new way using AI, very close collaboration between AI & product teams to understand what is and isn't possible, and very good execution to deliver an A-class experience (despite the fact that the AI model will make mistakes).

  • Building the dataset for the problem at hand: For AI teams embedded in product settings, building a comprehensive dataset for the problem at hand is often 80% of the battle. Building such a dataset is never just writing a bunch of queries on a massive data store at your disposal. AI teams often have to come up with very smart strategies and hacks to augment data and arrive at a comprehensive dataset. Also, this is not a one-time activity but often a highly iterative one: what additional type of data is needed comes from the kind of mistakes your models are making.

    I will soon be writing a dedicated post on some of the best examples I have seen/heard on innovative ideas to collect data.

  • Publishing papers/patents (doesn't happen that often): AI teams embedded in product teams do write patents and papers - but this is seldom their true north star.

So what about coming up with newer techniques? AI teams embedded in product teams rarely do this. Why? Because in most cases, known techniques with a comprehensive dataset and rigorous evaluation can by themselves take your AI solution very, very far on business metrics. From there on it is mostly a game of diminishing returns. Read about the "later part of the S curve" in my earlier post on "[ROI in AI](https://www.linkedin.com/posts/activity-6842736295173775360-etDf/)". This holds unless AI is the backbone of your product and your core value proposition.

To read more on AI in industry vs academia, please read my earlier post "[Why machine learning in the industry is different from that in academia](https://www.linkedin.com/posts/activity-6836567604467970048-nQu8/)".
#Learnings_From_AI_Trenches

ROI in AI - Expectations vs Reality

Published:

A lot of companies and teams are betting big on DeepTech like AI to drive the next phase of growth. While AI can certainly be the “energy” to help you move to the next orbit, most teams falter big time. While there are many reasons for this, one key reason is the ill-founded expectations by senior people (VPs, CXOs, business leaders etc) in the company. Today’s post will explore this facet of AI - often the least spoken one.


Most leaders/companies start AI in their teams/product lines rather late. Traditional corporate wisdom has always advocated "quick" wins, because that is how one establishes oneself and gets the key stakeholders super excited. So they advocate the same in AI!

This is also driven by:

  • Articles on AI in popular media with lofty claims (remember the humanoid dancing at Tesla AI Day, or Sophia - not many people understand these are mere marketing and perception gimmicks)
  • The thought that AI, at the end of the day, is code - so it cannot be very different from software engineering, and the issues are mere initial hiccups, which is always the case with any new tech, so it must be with AI

So most leaders expect the ROI curve in AI to be a vertical takeoff. This expectation is further fueled by the fact that AI doesn't come cheap - data, data engineering & AI teams, talent cost, hardware cost, data collection & labeling cost and time, MLOps, etc. Read my previous write-up titled the economics of AI.

This is where most orgs/teams make THE mistake. The true ROI curve in AI is very, very different from their expectations. As shown in the figure below, it follows an S-shaped curve.

  • Initial lower part of the S: One needs to invest a lot of time and energy with very little ROI - finding the right use case, the right quality and quantity of data and labels (more on this in another post), the right modeling approach, the right metrics, etc. This can take anywhere between 1-3 years.
  • Middle part of the S: This is where one gets the highest ROI per unit effort, because most of the hard work has been done in the first 1-3 years and all key stakeholders have a better understanding of the AI journey. This phase is all about building the 2nd/3rd version of the AI systems.
  • Later upper part of the S: This is where the ROI starts to stagnate again. In this phase your team is building the 6th/7th version of the system and trying to push system performance into the upper 90s. This often requires completely new approaches and new algorithms. It's not just adding a couple of lines of code!

Another thing most folks don't necessarily understand is that each of the 3 phases of the curve requires a very different strategy - type of goals, kind of skill sets, focus areas, etc.

Making AI Work For You

Published:

I am often asked why AI is a hard technology (both in terms of time and money). There are multiple reasons for it:

  • Finding the right use case: finding a problem that is important to its users and can be solved well by AI. I talked earlier about this here

  • Having the right data: most companies have tonnes of data but seldom have the data for the problem at hand. In AI it's not just the quantity of data that matters but also the quality. Do you have training data that resembles the production data very closely? Does it have enough data points representing the various outliers, the mean data points, data points near the actual decision boundary, etc.?
  • Labeling cost - if you need to get your data manually labeled, this will need lots of time and money. The process of labeling itself might not be straightforward. How do you benchmark the quality of labeling? If your labelers are making mistakes, this can jeopardize the entire project.
  • Compute cost - the cost of the compute needed for training the model and for inference. This is especially crucial if your solution uses deep learning.
  • Going from a solution to the solution: in AI there is no one way to solve a problem. There are multiple ways, and each will give you a different accuracy. Getting from a model with acceptable performance to a model with fabulous performance is often a very long journey.
  • Fitting the model into the product to get an AI-driven feature: it's not just a simple API integration. One needs to figure out the best UX to serve the predictions (should it be intrusive or offer soft suggestions?), a UX that can gracefully handle the scenarios where the model makes mistakes, a mechanism to let the human take over when things go wrong for the end user, instrumentation to continuously evaluate the model's performance on production data, and instrumentation to collect data for future modeling efforts.
  • Need to retrain the model: many AI systems require retraining the model every 15-30 days on the latest data.
  • Talent: a DS/ML team alone cannot pull off all of the above. You need good and diverse talent for this - ML/DS folks who can evaluate the various approaches depending on the situation and take the most suitable one to build the right model (and not blindly apply DL), data engineers to build data pipelines to collect all the data, a product manager who can figure out the right use case, build instrumentation, and design UX to handle mistakes, and an AI leader who understands this entire life cycle, works very closely with the business, and speaks their language. This talent doesn't come cheap.
For an AI effort to fail, going wrong on just one of the above is enough.

Hobbiton - Use cases where AI can deliver great returns

Published:

One often comes across organizations trying to apply AI to problems when it is really not needed. Such organizations often end up doing so because either someone in the org wants to look cool or a CXO in the company wants to start doing AI (mostly out of FOMO or ill-informed media articles with lofty claims). AI is just a tool to solve problems (though a powerful one), just like other problem-solving tools such as engineering, or operations. Today AI can solve some problems quickly & very well which was not possible until even a decade ago. But that doesn’t mean AI is a sledgehammer for every difficult problem. Despite crazy advances in the last decade, AI is a very nascent, fragile and expensive technology and must be applied after due thought and not as a go-to approach for every problem in the world.

One of my favorite examples of this is Google's approach to "searching for special characters in Google Docs". Here is the problem statement: most users who create docs or decks use special characters once in a while, but it is very hard to find the right special character when you need it. Given the large number of special characters that exist, one cannot show all of them to the user; to make them quickly accessible, one needs the ability to search for them. However, there is one particular problem with special characters - most users cannot remember them by name (try recalling the names of 10 or more special characters yourself). This renders "textual search" useless. So, how does one solve this? Visual search!

Google used a very simple but powerful observation - most users can easily recall what the special character they need looks like, even if they cannot recall its name. So they provided a small sketch pad for users to draw what they remember and used concepts from computer vision (Sketch-RNN) to suggest a few of the closest options based on the visual match. The same concept was later used to power AutoDraw.

I often use this example to drive home the point that one often needs to think from first principles when thinking of AI-powered use cases. For the above-mentioned example, the more you think the more you will realize:

  • The problem could not have been solved by traditional engineering approaches.
  • It is a very well-defined narrow problem statement.
  • The core AI concept (Sketch-RNN) was just right to solve this problem statement well.
  • The solution (powered by AI in this case) really helped to solve the user’s pain point and deliver happiness.

Having said that - it is not that this application of AI changed the fate of Google Docs. More on changing the trajectory of products using AI in a separate post. Stay tuned!

2016

Great resources on CNN

Published:

I am new to the world of CNNs. While I have worked with neural nets in the area of text, when it comes to CNNs I have little knowledge. There are some awesome resources on the internet to learn CNNs from ground zero.

Basics of CNN: Guide-To-Understanding-Convolutional-Neural-Networks, a 3-part blog series. It builds the initial intuition very well.

Another good resource is this Blog by Ujjwal

Adam Harley's blog on the basics of CNNs. It's a bit lengthy, but if you are comfortable with the key ideas discussed in the first 2 blogs by Adesh and Ujjwal, this is a great post.

Wanna get a feel of CNNs in action? Check out this awesome visualization by Adam Harley. I highly recommend playing with it.

Dynamic Plotting in Python

Published:

In my work I often want to visualize how the data or some aspect of a model changes. This requires plots/graphs that get updated dynamically (live): I have a bunch of values, one or more of which keep changing with time, and I want to visualize these changes using plots.

When I looked for a solution, I did not come across an elegant one, so I tried piecing together a solution myself.

For simplicity, imagine we have M data points with a set of initial values. These values then get updated/changed, so we have a new set of M values. This happens N times. The whole history can be captured via a 2D matrix A, where each row is one set of values.

We record the initial set of values; then these values get updated, then the update happens again, and again. Row 1 stores the initial values, row 2 stores the subsequent updated values, row 3 stores the values after the next update, and so on. In essence, the matrix A stores the entire history of values.

Now we would like to plot these values: the graph starts with the values in the first row, then gets updated to the values in the 2nd row, then to the values in the 3rd row, and so on. This notebook is about how to plot such graphs.

This is especially useful in machine learning, where you want to visualize how a particular property of the model/data evolves with time (training). Our matrix is initialized randomly.
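Here is a minimal sketch of the idea (not the notebook code itself), assuming matplotlib in interactive mode and a history matrix with one snapshot of the M values per row:

import numpy as np
import matplotlib.pyplot as plt

M, N = 10, 50                  # M values per snapshot, N snapshots (updates)
A = np.random.random((N, M))   # row i holds the M values after the i-th update

plt.ion()                      # interactive mode: the figure refreshes without blocking
fig, ax = plt.subplots()
line, = ax.plot(A[0])          # start with the initial set of values
ax.set_ylim(0, 1)

for i in range(1, N):
    line.set_ydata(A[i])       # swap in the next row of the history matrix
    ax.set_title("update %d" % i)
    plt.pause(0.1)             # brief pause so each update is visible

plt.ioff()
plt.show()

Each iteration swaps the next row of A into the existing line object, so the figure behaves like a live plot of the evolving values.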


All the code can be found in the following Jupyter notebook. If you face any issue with the code, please open a new issue on GitHub.

2014

Deploying ML models - Part 2

Published:

Deploy ML model using Falcon

Falcon

In the last blog, we used Flask to deploy our ML model. Today, we will explore a different framework to achieve the same - Falcon. Falcon is pretty simple to work with. Let's get coding:

Install Falcon

Load your virtual environment and execute

pip install falcon
                            

I used Python 2.7 and falcon==1.1.0. This will get Falcon installed.

Bare bones Example

The code consists of 2 files - app.py (contains the app structure and endpoints) and functionality.py (contains the code to support the functionality).

app.py

import falcon
import functionality

api = application = falcon.API()

hello_world = functionality.hello_world()

api.add_route('/hi', hello_world)

All this code does is:

  • Imports the falcon package and creates a basic application.
  • Accesses the hello_world object from functionality.py and creates an endpoint 'hi' where, when data is sent via a POST request, an appropriate function of the hello_world object is invoked.

functionality.py

import falcon
import json

class hello_world(object):
    def __init__(self):
        print "init"

    def on_post(self, req, resp):
        print "post: hello_world"
        result = {}

        result['msg'] = "hello world"
        resp.status = falcon.HTTP_200
        resp.body = json.dumps(result, encoding='utf-8')

Here, the code does the following:

  • Creates a class hello_world with on_post(), which takes a request and response object.
  • Consumes the request object and sets the attributes of the response object.

Show time:

Prerequisite: You need to install the gunicorn server. Load your virtual environment and execute:

pip install gunicorn
                            

I used Python 2.7 and gunicorn==19.6.0

Run the following steps in terminal:

  1. Use cd to navigate to the directory where app.py is saved.
  2. gunicorn app

This should get the gunicorn server up and running :

gunicorn server up and running

Let's hit the server with a POST request. Open another terminal window and type in:

curl -i -H "Content-Type: application/json" -X POST http://127.0.0.1:8000/hi

On the client terminal, you should see a 200 OK followed by a JSON containing our "hello world" message.

On the server terminal you should see our print statement in on_post().

Congrats, your first application in Falcon is up and working!

Wanna read more on Falcon? Head to Falcon’s page.

FALCON VS FLASK:

I am told Falcon is lightweight. I am not an expert on this topic, here are a couple of links if you wish to take a deep dive into Flask vs Falcon: Python web frameworks benchmarking, reddit thread, hacker news, slant.

We now have a service up and running. To serve an actual model from it, you can extend the resource along the lines of the sketch below, and then go ahead and add more functionality like resetting the model, training the model if not already trained, and a lot more.
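For illustration, here is a minimal sketch (not the code from the linked repo) of how the same Falcon app could serve predictions from a pickled scikit-learn model, mirroring the Flask /predict example from Part 1; the model path and the X1/X2 input fields are assumptions carried over from that example:

import json
import pickle

import falcon
import numpy as np

class predict(object):
    def __init__(self, model_file_path="./../models/final_model.pkl"):
        self.model_file_path = model_file_path
        self.model = None          # loaded lazily on the first request

    def on_post(self, req, resp):
        payload = json.load(req.stream)    # expects {"X1": ..., "X2": ...}
        X = np.array([[np.float64(payload['X1']), np.float64(payload['X2'])]])

        if self.model is None:
            self.model = pickle.load(open(self.model_file_path, 'rb'))

        result = {'prediction': float(self.model.predict(X)[0])}
        resp.status = falcon.HTTP_200
        resp.body = json.dumps(result)

# in app.py:
# api.add_route('/predict', predict())

You would then run gunicorn exactly as above and POST JSON such as {"X1": "61.1", "X2": "17.3"} to /predict.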

If you face any issue with the code, please open a "New issue" on Github. All the above code files are available here.

Deploying ML models - Part 1

Published:

Deploy ML model using Flask

Flask

Flask is a lightweight Python web framework to create microservices. Wanna read more? I am an ML guy, and this sounds complex :-( Rather than reading, let's quickly code a simple one!

Install Flask

To install Flask, use the following command:

pip install flask

I used Python 2.7 and Flask version 0.11.1

Bare Bones Example

Open an editor and copy-paste the code from my GitHub repository:

from flask import Flask

app = Flask(__name__)

@app.route('/1')   # path to resource on server
def index_1():     # action to take
    return "Hello_world 1"

@app.route('/2')
def index_2():
    return "Hello_world 2"

if __name__ == '__main__':
    app.run(debug=True)

To run this:

  1. Save it as simple_app.py
  2. Install Flask in your virtual environment using pip install Flask
  3. Open terminal, go to the directory where simple_app.py is saved. Run the following two commands:
export FLASK_APP=simple_app.py
flask run

This should have the Flask server up and running on http://127.0.0.1:5000

Flask Server up and running

If you look at the code carefully, it says we have 2 resources with relative URIs /1 and /2. Let's access them. Go to your browser and type http://127.0.0.1:5000/1

This should fire the index_1() function and give the following output on the command prompt:

Output from function index_1

Likewise, http://127.0.0.1:5000/2 should work. This is a simple Flask application. (Oh yeah! This sounds easy, let's move on.)

REST (in peace)

There are a couple of terms that are part and parcel of microservices. Let's quickly understand something about them:

  1. API: Application Program Interface - set of routines, protocols, and tools for building software applications.
  2. API Endpoint: It's one end of a communication channel, so often this would be represented as the URL of a server or service. In our example "http://127.0.0.1:5000/1".
  3. REST: the underlying architectural principle of the web. Read this awesome StackOverflow answer, this brilliant post from Ryan Tomayko and this post from Martin Fowler to understand it.

In a nutshell, you need to have GET, POST, PUT and DELETE.

Let's add this to our code. To see this in action, run the server (like previously), go to the terminal and type:

curl -i http://localhost:5000/tasks
                            

or

curl -i -X GET http://localhost:5000/tasks
                                

Both commands will give the same output:

Your server terminal will show "200" (success) for both the requests.

RESTful App

Let's add the other parts of REST to our code. Here it is. (A minimal sketch of what such a tasks service can look like follows the commands below.) To see this in action, run the server (like previously), go to the terminal and type:

  1. Get all tasks:
curl -i http://localhost:5000/tasks/
  2. Get a specific task:
curl -i http://localhost:5000/tasks/2

Since there is no task with id=4, let's see what happens when we try to get that:

curl -i http://localhost:5000/tasks/4

Error. Task with id=4 does not exist

  3. Add a task:
curl -i -H "Content-Type: application/json" -X POST -d '{"title":"Read a book"}' http://localhost:5000/tasks/
  4. Update a specific task:
curl -i -H "Content-Type: application/json" -X PUT -d '{"done":true}' http://localhost:5000/tasks/2
  5. Delete a specific task:
curl -i -X DELETE http://localhost:5000/tasks/2

Task with id=2 successfully deleted
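For reference, here is a minimal sketch of what a tasks service behind these curl commands might look like (the actual code lives in the linked repo and may differ; the in-memory task list here is just for illustration):

from flask import Flask, jsonify, request, abort

app = Flask(__name__)

# toy in-memory "database" of tasks
tasks = [
    {'id': 1, 'title': 'Buy groceries', 'done': False},
    {'id': 2, 'title': 'Learn Flask', 'done': False},
]

@app.route('/tasks/', methods=['GET'], strict_slashes=False)
def get_tasks():
    return jsonify({'tasks': tasks})

@app.route('/tasks/<int:task_id>', methods=['GET'])
def get_task(task_id):
    task = [t for t in tasks if t['id'] == task_id]
    if not task:
        abort(404)                         # no such task
    return jsonify({'task': task[0]})

@app.route('/tasks/', methods=['POST'], strict_slashes=False)
def create_task():
    if not request.json or 'title' not in request.json:
        abort(400)
    task = {'id': tasks[-1]['id'] + 1 if tasks else 1,
            'title': request.json['title'],
            'done': False}
    tasks.append(task)
    return jsonify({'task': task}), 201

@app.route('/tasks/<int:task_id>', methods=['PUT'])
def update_task(task_id):
    task = [t for t in tasks if t['id'] == task_id]
    if not task or not request.json:
        abort(404)
    task[0].update(request.json)
    return jsonify({'task': task[0]})

@app.route('/tasks/<int:task_id>', methods=['DELETE'])
def delete_task(task_id):
    task = [t for t in tasks if t['id'] == task_id]
    if not task:
        abort(404)
    tasks.remove(task[0])
    return jsonify({'result': True})

if __name__ == '__main__':
    app.run(debug=True)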

ML APP

It's all great until now, but what about the core issue - an ML model as a microservice? Here we go:

from flask import Flask, jsonify, request, abort
import pickle
import numpy as np

app = Flask(__name__)
model_file_path = "./../models/final_model.pkl"
model = None

def get_prediction(X):
    X_i = X.reshape(1, 2)
    print "X_i.shape"
    print X_i.shape

    # load the model if not already done
    global model
    if model is None:
        model = pickle.load(open(model_file_path, 'rb'))

    # make prediction and return the same
    prediction = model.predict(X_i)[0]
    return prediction

@app.route('/predict', methods=['POST'])
def predict():
    if not request.json or not 'X1' in request.json or not 'X2' in request.json:
        abort(404)

    X1 = request.json['X1']
    X2 = request.json['X2']

    # components arrive as unicode strings; convert them explicitly
    X1_ = np.float64(X1)
    X2_ = np.float64(X2)
    X = np.array([X1_, X2_])
    prediction = get_prediction(X)

    return jsonify({'prediction': prediction})

if __name__ == '__main__':
    app.run(debug=True)

To run this code, get the server up and running and then fire the command below from another terminal.

curl -i -H "Content-Type: application/json" -X POST -d '{"X1":"61.1", "X2":"17.3"}' http://localhost:5000/predict

The complete code is here. It has a lot of print statements; you should see these statements printing the relevant stuff on the server terminal.

Model prediction

Let's understand some key aspects of this code. Here, predict() handles any POST request coming to /predict on the server. We extract the payload - the components of the input vector. These cannot be sent from the client as numpy arrays; they come in as unicode strings. Hence, we transform them explicitly into numpy.float64 and then build a numpy array on the server side. There is a more elegant way to do this, which we will see in a short while.

Once we have the payload in the right format, we invoke the function get_prediction(). Its main job is to load the model into memory if not already loaded, and to fire the model on the input to get the prediction.

We now have a model that is up and running as a service. Now you can go ahead and add more functionality like resetting the model, training the model if not already trained, and a lot more.

If you face any issue with the code, please open a "New issue" on Github. All the above 4 code files are available here.

Deploying Machine Learning models

Published:

Quick ways to deploy and test ML models

Introduction

I have often heard data scientists/ML folks asking, "I have an ML model that's doing well on test data; how do I deploy it in a production environment?" Recently I have been exploring exactly this, so I am sharing some of my findings and ways to achieve it. The focus here will be on deployment, not on developing a model - hence, for simplicity, we will assume we have a pre-trained, fine-tuned scikit-learn model ready to be deployed.

This is a 3-part blog series. We will see how to deploy an ML model as a microservice, in 3 different ways, using:

  1. Flask
  2. Falcon
  3. Jupyter notebook as service (Wow!)

Microservice - what and why ?

Microservice is an architecture pattern. It can best be thought of as the complete opposite of a monolithic architecture (problems). The central idea is to break the system/application into small chunks (services) based on functionality. Each chunk does a specific job and does only that. These services talk to each other using HTTP/REST (synchronously or asynchronously). Want to take a deep dive? I suggest reading this Quora answer and Martin Fowler's article.

Gradients - summary

Published:

Take home on Computing Gradients that go into training Neural Nets

Generalization

In this post, based on our conclusions in the last post, we will try and generalise a strategy to compute gradients for arbitrary networks, as shown in the figure below:

Imagine we have a (feed-forward) network with 1 input layer \(L_0\), 1 output layer \(L_3\), and 2 hidden layers \(L_1\) and \(L_2\). Further, let \(l_i\) be the output of layer \(L_i\). Also, by design, \(l_0 = X\) [input] and \(l_3 = \hat{y}\) [output]. Let \(W_{ij}\) be the weights between layers \(L_i\) and \(L_j\). We have 3 weight matrices - \(W_{01}\), \(W_{12}\), and \(W_{23}\).
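Putting the recipe from the earlier parts together, here is a minimal numpy sketch of the forward and backward pass for this \(L_0 \rightarrow L_1 \rightarrow L_2 \rightarrow L_3\) network, assuming sigmoid activations, the half squared-error loss used throughout, and arbitrary toy layer widths:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy dimensions: 3 inputs, hidden layers of width 4 and 2, scalar output
X = np.random.random((1, 3))      # l_0 = X
y = np.array([[0.0]])

W_01 = np.random.random((3, 4))   # weights between L_0 and L_1
W_12 = np.random.random((4, 2))   # weights between L_1 and L_2
W_23 = np.random.random((2, 1))   # weights between L_2 and L_3

alpha = 0.5

for _ in range(1000):
    # forward pass: l_i is the output of layer L_i
    l_1 = sigmoid(X.dot(W_01))
    l_2 = sigmoid(l_1.dot(W_12))
    l_3 = sigmoid(l_2.dot(W_23))                     # = y_hat

    # backward pass: delta at the output, then pushed back layer by layer
    delta_3 = (l_3 - y) * l_3 * (1 - l_3)
    delta_2 = delta_3.dot(W_23.T) * l_2 * (1 - l_2)
    delta_1 = delta_2.dot(W_12.T) * l_1 * (1 - l_1)

    # gradients: dL/dW_ij = (l_i)^T . delta_j, followed by a gradient step
    W_23 -= alpha * l_2.T.dot(delta_3)
    W_12 -= alpha * l_1.T.dot(delta_2)
    W_01 -= alpha * X.T.dot(delta_1)

The pattern is the same at every layer: a delta is formed at the output, each earlier delta is the next delta pushed back through the transposed weight matrix and scaled by the local sigmoid derivative, and each gradient is the layer's input (transposed) times the delta of the layer it feeds.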


Gradients - Part 4b

Published:

Part 4 of computing gradients for training Neural Nets

2 layer network, single training example (vector)

This is in continuation of last post where we derived gradients for 2 layer network where hidden layer has only 1 node.

1 hidden layer with \(\geq\) 2 nodes

We will derive gradients for hidden layer with 2 nodes. 3 or more nodes is a straight forward extension.

Neural net with 2 layer, 2 nodes in hidden layer. Input is a vector

First layer (\(l_1\)) has weight matrix $$[W_1]_{\scriptscriptstyle 3 \times 2}$$:

$$ \begin{equation} W_1 = \begin{bmatrix} w_{1}^{11} & w_{1}^{12} \\ w_{1}^{21} & w_{1}^{22} \\ w_{1}^{31} & w_{1}^{32} \\ \end{bmatrix} \end{equation} $$

Second layer (\(l_2\)) has weight matrix $$[W_2]_{\scriptscriptstyle 2 \times 1}$$:

$$ \begin{equation} W_2 = \begin{bmatrix} w_{2}^{1} \\ w_{2}^{2} \\ \end{bmatrix} \end{equation} $$
Input & Output definitions

Exactly same as previous setting. Input is \((\vec{X},y)\): \(\vec{X}\) is a vector, while \(y\) is a scalar.

\(X = [x^1 ~~x^2 ~~x^3]\)       \(x^i = i^{th}\) component of \(\vec{X}\).

Thus, in matrix form \(x,y\) are \([X]_{\scriptscriptstyle 1\times 3}\) and \([y]_{\scriptscriptstyle 1\times 1}\).

Let \( l_1 \) be the output of layer 1 (the hidden layer in this case). In matrix format, \([l_1]_{\scriptscriptstyle 1\times 2}\)

$$ \begin{align} l_1 & = \sigma ([X] . [W_1]) \label{ref130} \tag{30} \\ & = \frac{1}{1 + e^{-[X] . [W_1]}} \label{ref131} \tag{31} \\ \end{align} $$

\( l_1 \) has 2 components - \( l_1^1 \) and \( l_1^2 \), given by:

$$ \begin{equation} l_1=\begin{bmatrix} l_{1}^{1} & l_{1}^{2} \\ \end{bmatrix} \label{ref132} \tag{32} \end{equation} $$ $$ \begin{align} l_1^1 = \frac{1}{1 + e^{-(x^1 \times w_1^{11} + x^2 \times w_1^{21} + x^3 \times w_1^{31})}} \label{ref133} \tag{33} \\ l_1^2 = \frac{1}{1 + e^{-(x^1 \times w_1^{12} + x^2 \times w_1^{22} + x^3 \times w_1^{32})}} \label{ref134} \tag{34} \\ \end{align} $$

Let \( \hat{y} \) be predicted output. Then as per diagram, it is also the output of layer 2 (\( l_2 \)). In matrix format, \([\hat{y}]_{\scriptscriptstyle 1\times 1}\)

$$ \begin{align} \hat{y} & = \sigma ([l_1] . [W_2]) \label{ref135} \tag{35} \\ & = \frac{1}{1 + e^{-[l_1] . [W_2]}} \label{ref136} \tag{36} \\ & = \frac{1}{1 + e^{-(l_1^1 w_2^1 + l_1^2 w_2^2)}} \label{ref137} \tag{37} \\ \end{align} $$

Loss

Like before, we will use half of squared error loss. \( L = \frac{1}{2} (y - \hat{y})^{2} \)

Gradients

There are 2 sets of gradients: \(\nabla_{W_1} L\) and \(\nabla_{W_2} L\). Let us first compute \(\nabla_{W_2} L\)

\([\nabla_{W_2} L]_{\scriptscriptstyle 2\times 1}\)

$$ \begin{equation} \nabla_{W_2} L = \frac{\partial L}{\partial W_2} \\ \nabla_{W_2} L = \begin{bmatrix} \frac{\partial L}{\partial w_{2}^{1}} \\ \frac{\partial L}{\partial w_{2}^{2}} \\ \end{bmatrix} \label{ref140} \tag{40} \end{equation} $$

Let's first compute \(\frac{\partial L}{\partial w_{2}^{1}}\):

$$ \begin{align} \frac{\partial L}{\partial w_2^1} & = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w_2^1} \label{ref141} \tag{41}\\ \frac{\partial L}{\partial \hat{y}} &= \frac{1}{2} \times 2 \times (y - \hat{y})^{1} \times (-1) \label{ref142} \tag{42}\\ \frac{\partial \hat{y}}{\partial w_2^1} &= \big{(} \frac{1}{1 + e^{-[l_1] . [W_2]}} \big{)} \times \big{(}1- \frac{1}{1 + e^{-[l_1] . [W_2]}} \big{)} \times l_1^1 \dots && \text{using \eqref{ref137}} \\ & = \sigma ([l_1] . [W_2]) \times (1- \sigma ([l_1] . [W_2])) \times l_1^1 \dots && \text{using \eqref{ref136}} \label{ref143} \tag{43}\\ & = \hat{y} \times (1- \hat{y}) \times l_1^1 \dots && \text{using \eqref{ref135}} \label{ref144} \tag{44}\\ \end{align} $$

Substituting \eqref{ref143} & \eqref{ref144} in \eqref{ref141}, we get

$$ \begin{align} \frac{\partial L}{\partial w_2^1} &= \big{(} (-1) \times (y - \hat{y}) \big{)} \times \big{(} \hat{y} \times (1- \hat{y}) \times l_1^1 \big{)}\\ \\ &= -(y - \hat{y}) \times \hat{y} \times (1- \hat{y}) \times l_1^1 \\ &= (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \times l_1^1 \label{ref145} \tag{45} \\ \end{align} $$

Let,

\begin{align} \Delta l_{2} = (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \label{ref146} \tag{46} \\ \end{align}

Then, eq \eqref{ref145} reduces to:

$$ \begin{align} \frac{\partial L}{\partial w_2^1} &= \Delta l_{2} \times l_1^1 \label{ref147} \tag{47} \\ \end{align} $$

Likewise,

$$ \begin{align} \frac{\partial L}{\partial w_2^2} & = \Delta l_{2} \times l_1^2 \label{ref148} \tag{48} \\ \end{align} $$

Using eq \eqref{ref147} and \eqref{ref148} in \eqref{ref140}, we get:

$$ \begin{equation} \nabla_{W_2} L = \begin{bmatrix} \Delta l_{2} \times l_1^1 \\ \Delta l_{2} \times l_1^2 \\ \end{bmatrix} \label{ref149} \tag{49} \end{equation} $$

$$ \frac{\partial L}{\partial W_2} = \left[ \begin{array}{c} l_1^1 \\ l_1^2 \end{array} \right]_{\scriptscriptstyle 2 \times 1}. \left[ \Delta l_{2} \right]_{\scriptscriptstyle 1 \times 1} $$

$$ \begin{align} \frac{\partial L}{\partial W_2} &= [l_{1}]^\intercal . \Delta l_{2} \label{ref1491} \tag{49.1} \\ \end{align} $$

\([\nabla_{W_1} L]_{\scriptscriptstyle 3\times 2}\)

$$ \begin{equation} \nabla_{W_1} L = \frac{\partial L}{\partial W_1} \\ \nabla_{W_1} L = \begin{bmatrix} \frac{\partial L}{\partial w_{1}^{11}} & \frac{\partial L}{\partial w_{1}^{12}}\\ \frac{\partial L}{\partial w_{1}^{21}} & \frac{\partial L}{\partial w_{1}^{22}}\\ \frac{\partial L}{\partial w_{1}^{31}} & \frac{\partial L}{\partial w_{1}^{32}}\\ \end{bmatrix} \label{ref150} \tag{50} \end{equation} $$

Let's first focus on \(\frac{\partial L}{\partial w_{1}^{11}}\):

$$ \begin{align} \frac{\partial L}{\partial w_1^{11}} & = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial l_1^1} \times \frac{\partial l_1^1}{\partial w_1^{11}}\label{ref151} \tag{51}\\ \frac{\partial L}{\partial \hat{y}} &= -(y - \hat{y}) \label{ref152} \tag{52}\\ \frac{\partial \hat{y}}{\partial l_1^1} &= \bigg( \frac{1}{1 + e^{-[l_1] . [W_2]}} \bigg) \times \bigg(1- \frac{1}{1 + e^{-[l_1] . [W_2]}} \bigg) \times w_2^1 \\ & = \sigma ([l_1] . [W_2]) \times (1- \sigma ([l_1] . [W_2])) \times w_2^1 \dots \text{(using \eqref{ref137})} \label{ref153} \tag{53}\\ & = \hat{y} \times (1- \hat{y}) \times w_2^1 \dots \text{(using \eqref{ref135})} \label{ref154} \tag{54}\\ \frac{\partial l_1^1}{\partial w_1^{11}} &= \bigg( \frac{1}{1 + e^{-(x^1 \times w_1^{11} + x^2 \times w_1^{21} + x^3 \times w_1^{31})}} \bigg) \times \bigg( 1 - \bigg( \frac{1}{1 + e^{-(x^1 \times w_1^{11} + x^2 \times w_1^{21} + x^3 \times w_1^{31})}} \bigg) \bigg) \times x^1 \dots \text{(using \eqref{ref133})} \label{ref155} \tag{55}\\ &= (l_1^1) \times (1-l_1^1) \times x^1 \label{ref156} \tag{56} \end{align} $$

Using eq \eqref{ref152}, \eqref{ref154} and \eqref{ref156} in eq \eqref{ref151}

$$ \begin{align} \frac{\partial L}{\partial w_1^{11}} & = -(y - \hat{y}) \times \hat{y} \times (1- \hat{y}) \times w_2^1 \times l_1^1 \times (1-l_1^1) \times x^1 \\ & = (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \times w_2^1 \times l_1^1 \times (1-l_1^1) \times x^1 \label{ref157} \tag{57} \\ \end{align} $$

Now, using eq \eqref{ref146} in \eqref{ref157}

$$ \begin{align} \frac{\partial L}{\partial w_1^{11}} & = \Delta l_{2} \times w_2^1 \times l_1^1 \times (1-l_1^1) \times x^1 \label{ref158} \tag{58} \\ \end{align} $$

Further, let

$$ \begin{align} \Delta l_{1}^{1} = \Delta l_{2} * w_2^1 \times (l_1^1) \times (1-l_1^1) \label{ref159} \tag{59} \end{align} $$

Then,

$$ \begin{align} \frac{\partial L}{\partial w_1^{11}} & = \Delta l_{1}^{1} \times x^1 \label{ref160} \tag{60} \end{align} $$

Likewise,

$$ \begin{align} \frac{\partial L}{\partial w_1^{21}} & = \Delta l_{2} * w_2^1 \times (l_1^1) \times (1-l_1^1) \times x^2 \\ &= \Delta l_{1}^{1} \times x^2 \label{ref161} \tag{61} \end{align} $$

$$ \begin{align} \frac{\partial L}{\partial w_1^{31}} & = \Delta l_{2} * w_2^1 \times (l_1^1) \times (1-l_1^1) \times x^3 \\ &= \Delta l_{1}^{1} \times x^3 \label{ref162} \tag{62} \\ \end{align} $$

Before we proceed further, Eq \eqref{ref160}, \eqref{ref161} and \eqref{ref162} are key take home equations.

Now, lets focus on \(\frac{\partial L}{\partial w_{1}^{12}}\):

$$ \begin{align} \frac{\partial L}{\partial w_1^{12}} & = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial l_1^2} \times \frac{\partial l_1^2}{\partial w_1^{12}}\label{ref163} \tag{63}\\ \frac{\partial L}{\partial \hat{y}} &= -(y - \hat{y}) \label{ref164} \tag{64}\\ \frac{\partial \hat{y}}{\partial l_1^2} &= \big{(} \frac{1}{1 + e^{-[l_1] . [W_2]}} \big{)} \times \big{(}1- \frac{1}{1 + e^{-[l_1] . [W_2]}} \big{)} * w_2^2 \\ & = \sigma ([l_1] . [W_2]) \times (1- \sigma ([l_1] . [W_2])) * w_2^2 \dots && \text{using \eqref{ref137}} \label{ref165} \tag{65}\\ & = \hat{y} \times (1- \hat{y}) * w_2^2 \dots && \text{using \eqref{ref135}} \label{ref166} \tag{66}\\ \frac{\partial l_1^2}{\partial w_1^{12}} &= \big{(} \frac{1}{1 + e^{-(x^1 \times w_1^{12} + x^2 \times w_1^{22} + x^3 \times w_1^{32})}} \big{)} \times \big{(} 1 - \big{(} \frac{1}{1 + e^{-(x^1 \times w_1^{12} + x^2 \times w_1^{22} + x^3 \times w_1^{32})}} \big{)} \big{)} \times x^1 \dots && \text{using \eqref{ref133}} \label{ref167} \tag{67}\\ &= (l_1^2) \times (1-l_1^2) \times x^1 \label{ref168} \tag{68} \end{align} $$

$$ \begin{align} \frac{\partial L}{\partial w_1^{12}} & = -(y - \hat{y}) \times \hat{y} \times (1- \hat{y}) * w_2^2 \times (l_1^2) \times (1-l_1^2) \times x^1 \\ & = (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) * w_2^2 \times (l_1^2) \times (1-l_1^2) \times x^1 \label{ref169} \tag{69} \end{align} $$

Now, using eq \eqref{ref146} in \eqref{ref169}

$$ \begin{align} \frac{\partial L}{\partial w_1^{12}} & = \Delta l_{2} * w_2^2 \times (l_1^2) \times (1-l_1^2) \times x^1 \label{ref170} \tag{70} \\ \end{align} $$

Further, let

$$ \begin{align} \Delta l_{1}^{2} = \Delta l_{2} * w_2^2 \times (l_1^2) \times (1-l_1^2) \label{ref171} \tag{71} \end{align} $$

Then,

$$ \begin{align} \frac{\partial L}{\partial w_1^{12}} & = \Delta l_{1}^{2} \times x^1 \label{ref172} \tag{72} \end{align} $$

Likewise,

$$ \begin{align} \frac{\partial L}{\partial w_1^{22}} & = \Delta l_{2} \times w_2^2 \times (l_1^2) \times (1-l_1^2) \times x^2 \\ &= \Delta l_{1}^{2} \times x^2 \label{ref173} \tag{73} \end{align} $$ $$ \begin{align} \frac{\partial L}{\partial w_1^{32}} & = \Delta l_{2} \times w_2^2 \times (l_1^2) \times (1-l_1^2) \times x^3 \\ &= \Delta l_{1}^{2} \times x^3 \label{ref174} \tag{74} \\ \end{align} $$

Now, we have the pieces. We just need to assemble them.

Using Eq \eqref{ref160}, \eqref{ref161}, \eqref{ref162} and \eqref{ref172}, \eqref{ref173}, \eqref{ref174} in \eqref{ref150}

$$ \begin{equation} \nabla_{W_1} L = \frac{\partial L}{\partial W_1} \\ \nabla_{W_1} L = \begin{bmatrix} \Delta l_{1}^{1} \times x^1 & \Delta l_{1}^{2} \times x^1\\ \Delta l_{1}^{1} \times x^2 & \Delta l_{1}^{2} \times x^2\\ \Delta l_{1}^{1} \times x^3 & \Delta l_{1}^{2} \times x^3\\ \end{bmatrix} \label{ref175} \tag{75} \end{equation} $$

Using the notation used in Eq \eqref{ref132}, let

$$ \begin{equation} \Delta l_1=\begin{bmatrix} \Delta l_{1}^{1} & \Delta l_{1}^{2} \\ \end{bmatrix} \label{ref176} \tag{76} \end{equation} $$

Using Eq \eqref{ref176} in \eqref{ref175}, we get:

$$ \begin{align} \left[ \frac{\partial L}{\partial W_1} \right]_{\scriptscriptstyle 3 \times 2} = \left[ \begin{array}{c} x^1 \\ x^2 \\ x^3 \end{array} \right]_{\scriptscriptstyle 3 \times 1} . \left[ \Delta l_{1}^{1} \quad \Delta l_{1}^{2} \right]_{\scriptscriptstyle 1 \times 2} \label{ref177} \tag{77} \end{align} $$

Note that,

$$ \begin{align} \Delta l_1 = \left[ \Delta l_{1}^{1} \quad \Delta l_{1}^{2} \right] \tag{76} \end{align} $$ $$ \begin{align} \Delta l_{1}^{1} = \Delta l_{2} * w_2^1 \times (l_1^1) \times (1-l_1^1) \tag{59} \end{align} $$ $$ \begin{align} \Delta l_{1}^{2} = \Delta l_{2} * w_2^2 \times (l_1^2) \times (1-l_1^2) \tag{71} \end{align} $$

Thus,

$$ \begin{align} \frac{\partial L}{\partial W_1} &= [X]^\intercal . \Delta l_{1} \label{ref178} \tag{78} \\ \end{align} $$

Therefore,

$$ \begin{align} \Delta l_1 &= \left[ \Delta l_{2} * w_2^1 \times (l_1^1) \times (1-l_1^1) \quad \Delta l_{2} * w_2^2 \times (l_1^2) \times (1-l_1^2) \right] \\ &= (\left[ \Delta l_{2} \right]_{\scriptscriptstyle 1 \times 1} . \left[ W_2 ^\intercal \right]_{\scriptscriptstyle 1 \times 2}) \times (l_1) \times (1-l_1) \label{ref179} \tag{79} \end{align} $$

To summarize the key equations:

\begin{align} \Delta l_{2} &= (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \tag{46} \\ \end{align} \begin{align} \Delta l_{1} &= (\left[ \Delta l_{2} \right] . \left[ W_2 ^\intercal \right]) \times (l_1) \times (1-l_1) \tag{79}\\ \end{align}

$$ \begin{align} \frac{\partial L}{\partial W_1} &= [X]^\intercal . \Delta l_{1} \tag{78} \\ \end{align} $$

$$ \begin{align} \frac{\partial L}{\partial W_2} &= [l_{1}]^\intercal . \Delta l_{2} \tag{49.1} \\ \end{align} $$

Now, using these 4 equations - \eqref{ref146}, \eqref{ref179}, \eqref{ref1491}, and \eqref{ref178} - one can directly code a (bare-bones) training algorithm. The following code is adapted from the blog post of Andrew Trask:

import numpy as np

X = np.array([[0, 0, 1]])   # [X] is 1 x 3, as in the derivation
y = np.array([[0]])         # [y] is 1 x 1

alpha = 0.5  # learning rate - hyperparameter

W_1 = np.random.random((3, 2))
W_2 = np.random.random((2, 1))

for i in range(1000):

    # forward pass
    layer_1 = 1 / (1 + np.exp(-np.dot(X, W_1)))
    layer_2 = 1 / (1 + np.exp(-np.dot(layer_1, W_2)))

    # deltas - eq (46) and eq (79)
    layer_2_delta = (layer_2 - y) * (layer_2 * (1 - layer_2))
    layer_1_delta = layer_2_delta.dot(W_2.T) * (layer_1 * (1 - layer_1))

    # gradient steps - eq (49.1) and eq (78)
    W_2 = W_2 - alpha * (layer_1.T).dot(layer_2_delta)
    W_1 = W_1 - alpha * (X.T).dot(layer_1_delta)

Gradients - Part 4a

Published:

Part 4 of computing gradients for training Neural Nets

2 layer network, single training example (vector)

In this post we will consider 2 types of networks. In the first network, the hidden layer has only 1 node, while in the second network the hidden layer has more than 1 node. Let's start with the case of a hidden layer with only 1 node. This part is a bit lengthy, but fundamentally not very different from the earlier parts. Keep calm and read on :-)

1 hidden layer with 1 node

Neural net with 2 layer, 1 node in hidden layer. Input is a vector

First layer (\(l_1\)) has weight matrix \([W_1]_{3 \times 1}\)

$$ \begin{equation} W_1=\begin{bmatrix} w_{1}^{1} \\ w_{1}^{2} \\ w_{1}^{3} \\ \end{bmatrix} \end{equation} $$

Second layer (\(l_2\)) has weight matrix \([W_2]_{1 \times 1}\) which is a scalar

$$ \begin{equation} W_2=\begin{bmatrix} w_{2} \\ \end{bmatrix} \end{equation} $$
Input & Output definitions

Input is \((\vec{X},y)\) : \(\vec{X}\) is a vector, while \(y\) is a scalar.

\(X = [x^1 ~~x^2 ~~x^3]\)       \(x^i = i^{th}\) component of \(\vec{X}\).

Thus, in matrix form, \(X, y\) are \([X]_{1\times 3}\) and \([y]_{1\times 1}\).

Let \( l_1 \) be output of layer 1. In matrix format, \([l_1]_{1\times 1}\)

$$ \begin{align} l_1 & = \sigma ([X] . [W_1]) \label{ref0} \tag{1.1} \\ & = \frac{1}{1 + e^{-[X] . [W_1]}} \label{ref10} \tag{1.2} \\ & = \frac{1}{1 + e^{-(x^1 \times w_1^1 + x^2 \times w_1^2 + x^3 \times w_1^3)}} \label{ref101} \tag{1.3} \end{align} $$

\( \hat{y} \) is predicted output. In matrix format, \([\hat{y}]_{1\times 1}\)

$$ \begin{align} \hat{y} & = \sigma ([l_1] . [W_2]) \label{ref11} \tag{2.1} \\ & = \frac{1}{1 + e^{-[l_1] . [W_2]}} \label{ref112} \tag{2.2} \\ & = \frac{1}{1 + e^{-(l_1 w_2)}} \label{ref113} \tag{2.3} \\ \end{align} $$
Loss

Like before, we will use half of squared error loss. \( L = \frac{1}{2} (y - \hat{y})^{2} \)

Gradients

There are 2 sets of gradients: \(\nabla_{W_1} L\) and \(\nabla_{W_2} L\). Let us first compute \(\nabla_{W_2} L\)

\([\nabla_{W_2} L]_{1\times 1}\)

$$ \begin{align} \frac{\partial L}{\partial W_2} & = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial W_2} \label{ref12} \tag{3}\\ \frac{\partial L}{\partial \hat{y}} &= \frac{1}{2} \times 2 \times (y - \hat{y})^{1} \times (-1) \label{ref13} \tag{4}\\ \frac{\partial \hat{y}}{\partial W_2} &= \bigg( \frac{1}{1 + e^{-[l_1] . [W_2]}} \bigg) \times \bigg(1- \frac{1}{1 + e^{-[l_1] . [W_2]}} \bigg) \times l_1 \\ & = \sigma ([l_1] . [W_2]) \times (1- \sigma ([l_1] . [W_2])) \times l_1 \dots && \text{using \eqref{ref11}} \label{ref14} \tag{5}\\ & = \hat{y} \times (1- \hat{y}) \times l_1 \dots && \text{using \eqref{ref11}} \label{ref15} \tag{6}\\ \end{align} $$

Substituting \eqref{ref13} & \eqref{ref15} in \eqref{ref12}, we get

$$ \begin{align} \frac{\partial L}{\partial W_2} &= \bigg( (-1) \times (y - \hat{y}) \bigg) \times \bigg( \hat{y} \times (1- \hat{y}) \times l_1 \bigg)\\ &= \bigg( -(y - \hat{y}) \bigg) \times \bigg( \hat{y} \times (1- \hat{y}) \times l_1 \bigg)\\ &= (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \times l_1 \label{ref16} \tag{7} \\ \end{align} $$

Let,

\begin{align} \Delta l_{2} &= (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \label{ref17} \tag{8} \\ \end{align}

Then, using eq \eqref{ref17}, eq \eqref{ref16} reduces to:

$$ \begin{align} \frac{\partial L}{\partial W_2} &= \Delta l_{2} \times l_1 \\ & = \Delta l_{2} * l_1 \\ & = [l_1]^\intercal . \Delta l_{2} \label{ref18} \tag{9} \\ \end{align} $$

\([\nabla_{W_1} L]_{\scriptscriptstyle 3\times 1}\)

Now let us compute \(\nabla_{W_1} L\)

$$ \begin{equation} \nabla_{W_1} L = \frac{\partial L}{\partial W_1} \\ \nabla_{W_1} L = \begin{bmatrix} \frac{\partial L}{\partial w_{1}^{1}} \\ \frac{\partial L}{\partial w_{1}^{2}} \\ \frac{\partial L}{\partial w_{1}^{3}} \\ \end{bmatrix} \label{ref20} \tag{10} \end{equation} $$

First, let's compute only \(\frac{\partial L}{\partial w_{1}^{1}}\)

$$ \begin{align} \frac{\partial L}{\partial w_{1}^{1}} & = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial l_1} \times \frac{\partial l_1}{\partial w_1^1} \label{ref21} \tag{11}\\ \frac{\partial L}{\partial \hat{y}} &= \frac{1}{2} \times 2 \times (y - \hat{y})^{1} \times (-1) \label{ref22} \tag{12}\\ \frac{\partial \hat{y}}{\partial l_1} &= \big( \frac{1}{1 + e^{-[l_1] . [W_2]}} \big) \times \big(1- \frac{1}{1 + e^{-[l_1] . [W_2]}} \big) * w_2 \\ & = \sigma ([l_1] . [W_2]) \times (1- \sigma ([l_1] . [W_2])) * w_2 \dots && \text{using \eqref{ref11}} \label{ref23} \tag{13}\\ & = \hat{y} \times (1- \hat{y}) * w_2 \dots && \text{using \eqref{ref11}} \label{ref24} \tag{14}\\ \frac{\partial l_1}{\partial w_1^1} &= \big( \frac{1}{1 + e^{-[X] . [W_1]}} \big) \times \big(1- \frac{1}{1 + e^{-[X] . [W_1]}} \big) * x^1 \label{ref25} \tag{15}\\ & = l_1 \times (1- l_1) * x^1 \dots && \text{using \eqref{ref0}} \label{ref26} \tag{16}\\ \end{align} $$

Combining, eq \eqref{ref22}, \eqref{ref24}, and \eqref{ref26}:

$$ \begin{align} \frac{\partial L}{\partial w_{1}^{1}} & = \big(-(y - \hat{y}) \big) \times \big( \hat{y} \times (1- \hat{y}) \times w_2 \big) \times \big( l_1 \times (1- l_1) \times x^1 \big) \\ & = \big((\hat{y} - y)\hat{y}(1- \hat{y}) \times w_2 \big) \times \big( l_1 \times (1- l_1) \times x^1 \big) \\ & = \big(\Delta l_{2} \times w_2 \times l_1 \times (1- l_1) \big) \times x^1 \dots && \text{using \eqref{ref17}} \label{ref27} \tag{17}\\ \end{align} $$

Let,

\begin{align} \Delta l_{1} = \Delta l_{2} \times w_2 \times l_1 \times (1- l_1) \label{ref28} \tag{18} \end{align} Then, eq \eqref{ref27} reduces to: $$ \begin{align} \frac{\partial L}{\partial w_{1}^{1}} & = \Delta l_{1} \times x^1 \label{ref29} \tag{19} \\ \end{align} $$ Likewise, $$ \begin{align} \frac{\partial L}{\partial w_{1}^{2}} & = \Delta l_{1} \times x^2 \label{ref30} \tag{20} \\ \frac{\partial L}{\partial w_{1}^{3}} & = \Delta l_{1} \times x^3 \label{ref31} \tag{21} \\ \end{align} $$

Using eq \eqref{ref29}, \eqref{ref30}, and \eqref{ref31} in \eqref{ref20}:

$$ \begin{equation} \nabla_{W_1} L = \begin{bmatrix} \Delta l_{1} \times x^1 \\ \Delta l_{1} \times x^2 \\ \Delta l_{1} \times x^3 \\ \end{bmatrix} \label{ref32} \tag{22} \end{equation} $$ $$ \frac{\partial L}{\partial W_1} = \left[ \begin{array}{c} x^1 \\ x^2 \\ x^3 \end{array} \right]_{\scriptscriptstyle 3 \times 1}. \left[ \Delta l_{1} \right]_{\scriptscriptstyle 1 \times 1} $$ $$ \begin{align} \frac{\partial L}{\partial W_1} &= [X]^\intercal . \Delta l_{1} \\ \end{align} $$
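To make the chain of results concrete, here is a small NumPy sketch (my addition, with arbitrary example values) that evaluates eq (8), (18), (9) and the final expression for \(\nabla_{W_1} L\) for this 1-node hidden layer:

    import numpy as np

    def sigmoid(u):
        return 1 / (1 + np.exp(-u))

    X = np.array([[0.1, 0.4, 0.7]])   # [X] is 1x3 (arbitrary values)
    y = np.array([[1.0]])             # [y] is 1x1
    W_1 = np.random.random((3, 1))    # [W_1] is 3x1
    W_2 = np.random.random((1, 1))    # [W_2] is 1x1

    l_1 = sigmoid(X.dot(W_1))         # eq (1.1), 1x1
    y_hat = sigmoid(l_1.dot(W_2))     # eq (2.1), 1x1

    delta_l2 = (y_hat - y) * y_hat * (1 - y_hat)      # eq (8)
    delta_l1 = delta_l2.dot(W_2.T) * l_1 * (1 - l_1)  # eq (18), all terms are 1x1 here

    dL_dW2 = l_1.T.dot(delta_l2)      # eq (9), 1x1
    dL_dW1 = X.T.dot(delta_l1)        # X^T . delta_l1, 3x1, matches eq (22)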

Gradients - Part 3

Published:

1 layer network, multiple training examples (each example is a vector)

Multiple training examples correspond to the scenario of batch training. Each input is still a vector. Our neural net still has 1 layer. For simplicity, assume we have 4 examples, each having 3 components.

Weights

W, weight matrix is $$[W]_{\scriptscriptstyle 3 \times 1}$$

$$ \begin{equation} W=\begin{bmatrix} w_{1} \\ w_{2} \\ w_{3} \\ \end{bmatrix} \end{equation} $$

Input & Output definitions

Input X now is a matrix \( [X]_{\scriptscriptstyle 4 \times 3} \). \(X_i\) is the \(i^{th}\) sample. So, we now have 4 examples \(X_1 \ldots X_4\), each of which is a vector with 3 components. \(x_{i}^{j}\) is the \(j^{th}\) component of the \(i^{th}\) sample. So

$$ \begin{equation} X=\begin{bmatrix} x_{1}^{1} & x_{1}^{2} & x_{1}^{3} \\ x_{2}^{1} & x_{2}^{2} & x_{2}^{3} \\ x_{3}^{1} & x_{3}^{2} & x_{3}^{3} \\ x_{4}^{1} & x_{4}^{2} & x_{4}^{3} \\ \end{bmatrix} \end{equation} $$

\(\vec{y}\) is a vector.

$$ \begin{equation} y=\begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ \end{bmatrix} \end{equation} $$

$$y_i$$ = True label for $$i^{th}$$ example.

Likewise, $$\vec{\hat{y}}$$ is a vector where

$$ \begin{equation} \hat{y}=\begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \hat{y}_{3} \\ \hat{y}_{4} \\ \end{bmatrix} \end{equation} $$

$$\hat{y}_i$$ = predicted label for $$i^{th}$$ example. The value of $$\hat{y}_i$$ is computed as:

$$ \hat{y}_i = \frac{1}{1+e^{-(x_{i}^{1} \times w_1 + x_{i}^{2} \times w_2 + x_{i}^{3} \times w_3)}} \label{eq:ref10} \tag{0} $$

Loss

Like before, we will use half of the squared error loss, but in this case it is summed over all 4 predictions. Therefore,

$$ L = \sum\limits_{i=1}^{4} \frac{1}{2} (y_i - \hat{y}_i)^{2} $$

Gradients

Let's compute gradients.

$$\begin{equation} \nabla_{W} L = \frac{\partial L}{\partial W} \\ \nabla_{W} L = \begin{bmatrix} \frac{\partial L}{\partial w_{1}} \\ \frac{\partial L}{\partial w_{2}} \\ \frac{\partial L}{\partial w_{3}} \\ \end{bmatrix} \label{eq:ref11} \tag{1} \end{equation} $$

First, let's compute only \( \frac{\partial L}{\partial w_{1}} \)

$$ \begin{align} \frac{\partial L}{\partial w_1} &= \frac{\partial (\sum\limits_{i=1}^{4} \frac{1}{2} (y_i - \hat{y}_i)^{2})}{\partial w_1} \\ & = \sum\limits_{i=1}^{4} \frac{\partial (\frac{1}{2} (y_i - \hat{y}_i)^{2})}{\partial w_1} \\ & = \sum\limits_{i=1}^{4} \frac{\partial (\frac{1}{2} (y_i - \hat{y}_i)^{2})}{\partial \hat{y}_i} \times \frac{\partial \hat{y}_i}{\partial w_1} \tag{2} \label{eq:ref12} \end{align} $$ $$ \begin{align} \frac{\partial (\frac{1}{2} (y_i - \hat{y}_i)^{2})}{\partial \hat{y}_i} &= (-1) \times (y_i - \hat{y}_i) \tag{3} \label{eq:ref13} \end{align} $$ $$ \begin{align} \frac{\partial \hat{y}_i}{\partial w_1} &= \sigma(x_{i}^{1} \times w_1 + x_{i}^{2} \times w_2 + x_{i}^{3} \times w_3)(1 - \sigma(x_{i}^{1} \times w_1 + x_{i}^{2} \times w_2 + x_{i}^{3} \times w_3)) \times x_{i}^{1} \\ & = \sigma( X_i \cdot [W])(1 - \sigma( X_i \cdot [W])) \times x_{i}^{1} \tag{4} \label{eq:ref14} \end{align} $$

Using \eqref{eq:ref13} and \eqref{eq:ref14} in \eqref{eq:ref12}, we get:

$$ \begin{align} \frac{\partial L}{\partial w_1} &= \sum\limits_{i=1}^{4} -(y_i - \hat{y}_i) \cdot [\sigma(X_i \cdot [W])(1 - \sigma(X_i \cdot [W])) \times x_{i}^{1}] \label{eq:ref15} \tag{5} \end{align} $$ $$\begin{align} \frac{\partial L}{\partial w_1} &= (\hat{y}_1 - y_1) \cdot [\sigma(X_1 \cdot [W])(1 - \sigma(X_1 \cdot [W])) \times x_{1}^{1}] + \\ &\quad (\hat{y}_2 - y_2) \cdot [\sigma(X_2 \cdot [W])(1 - \sigma(X_2 \cdot [W])) \times x_{2}^{1}] + \\ &\quad (\hat{y}_3 - y_3) \cdot [\sigma(X_3 \cdot [W])(1 - \sigma(X_3 \cdot [W])) \times x_{3}^{1}] + \\ &\quad (\hat{y}_4 - y_4) \cdot [\sigma(X_4 \cdot [W])(1 - \sigma(X_4 \cdot [W])) \times x_{4}^{1}] \label{eq:ref16} \tag{6} \end{align} $$ $$ \begin{align} \Delta l_{1}^{i} &= (\hat{y}_i - y_i) \cdot [\sigma(X_i \cdot [W])(1 - \sigma(X_i \cdot [W]))] \quad \forall i \in \{1, \dots, 4\} \label{eq:ref17} \tag{7} \end{align} $$ Using \eqref{eq:ref17} in \eqref{eq:ref16}, we get: $$ \begin{align} \frac{\partial L}{\partial w_1} &= \Delta l_{1}^{1} \times x_{1}^{1} + \Delta l_{1}^{2} \times x_{2}^{1} + \Delta l_{1}^{3} \times x_{3}^{1} + \Delta l_{1}^{4} \times x_{4}^{1} \label{eq:ref18} \tag{8} \end{align} $$ $$ \begin{align} \frac{\partial L}{\partial w_2} &= \Delta l_{1}^{1} \times x_{1}^{2} + \Delta l_{1}^{2} \times x_{2}^{2} + \Delta l_{1}^{3} \times x_{3}^{2} + \Delta l_{1}^{4} \times x_{4}^{2} \label{eq:ref19} \tag{9} \end{align} $$ $$ \begin{align} \frac{\partial L}{\partial w_3} &= \Delta l_{1}^{1} \times x_{1}^{3} + \Delta l_{1}^{2} \times x_{2}^{3} + \Delta l_{1}^{3} \times x_{3}^{3} + \Delta l_{1}^{4} \times x_{4}^{3} \label{eq:ref20} \tag{10} \end{align} $$ $$ \begin{equation} \frac{\partial L}{\partial W} = \begin{bmatrix} \Delta l_{1}^{1} \times x_{1}^{1} + \Delta l_{1}^{2} \times x_{2}^{1} + \Delta l_{1}^{3} \times x_{3}^{1} + \Delta l_{1}^{4} \times x_{4}^{1} \\ \Delta l_{1}^{1} \times x_{1}^{2} + \Delta l_{1}^{2} \times x_{2}^{2} + \Delta l_{1}^{3} \times x_{3}^{2} + \Delta l_{1}^{4} \times x_{4}^{2} \\ \Delta l_{1}^{1} \times x_{1}^{3} + \Delta l_{1}^{2} \times x_{2}^{3} + \Delta l_{1}^{3} \times x_{3}^{3} + \Delta l_{1}^{4} \times x_{4}^{3} \\ \end{bmatrix} \label{eq:ref21} \tag{11} \end{equation} $$

Simplifying

$$ \frac{\partial L}{\partial W} = \begin{bmatrix} x_{1}^{1} & x_{2}^{1} & x_{3}^{1} & x_{4}^{1} \\ x_{1}^{2} & x_{2}^{2} & x_{3}^{2} & x_{4}^{2} \\ x_{1}^{3} & x_{2}^{3} & x_{3}^{3} & x_{4}^{3} \end{bmatrix} \cdot \left[ \begin{array}{c} \Delta l_{1}^{1} \\ \Delta l_{1}^{2} \\ \Delta l_{1}^{3} \\ \Delta l_{1}^{4} \end{array} \right] $$ $$ \frac{\partial L}{\partial W} = \begin{bmatrix} x_{1}^{1} & x_{2}^{1} & x_{3}^{1} & x_{4}^{1} \\ x_{1}^{2} & x_{2}^{2} & x_{3}^{2} & x_{4}^{2} \\ x_{1}^{3} & x_{2}^{3} & x_{3}^{3} & x_{4}^{3} \end{bmatrix} \cdot \left[ \begin{array}{c} \Delta l_{1} \end{array} \right] $$ $$ \begin{align} \frac{\partial L}{\partial W} &= [X^{T}] \cdot \Delta l_{1} \\ \end{align} $$
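A quick NumPy sketch (my addition; the data values are random placeholders) that computes this batched gradient exactly as \(X^T \cdot \Delta l_1\):

    import numpy as np

    def sigmoid(u):
        return 1 / (1 + np.exp(-u))

    X = np.random.random((4, 3))                  # 4 examples, 3 components each
    y = np.random.randint(0, 2, (4, 1))           # true labels, 4x1
    W = np.random.random((3, 1))                  # weight matrix, 3x1

    y_hat = sigmoid(X.dot(W))                     # predictions, 4x1
    delta_l1 = (y_hat - y) * y_hat * (1 - y_hat)  # eq (7), one row per example
    dL_dW = X.T.dot(delta_l1)                     # eq (11) in matrix form: X^T . delta_l1, 3x1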

Gradients - Part 2

Published:

Part 2 of computing gradients for training Neural Nets

1 layer network, 1 input (vector)

Our neural net still has 1 layer, but now the input is a vector.

Neural net with 1 layer, but input is vector

Input is \( (\vec{X}, y) \): \( \vec{X} \) is a vector, while \( y \) is a scalar.

Let \( X = [x^1 ~~x^2 ~~x^3] \) where \( x^i \) is the \( i^{th} \) component of \( \vec{X} \).

Thus, in matrix form, \( \vec{x} \) and \( y \) are represented as:

\( [X]_{1 \times 3} \) and \( [y]_{1 \times 1} \).

Let \( W \), the weight matrix, be:

$$ \begin{equation} W = \begin{bmatrix} w_{1} \\ w_{2} \\ w_{3} \\ \end{bmatrix} \end{equation} $$

Let \( \hat{y} \) be the predicted output. In matrix format, \( [\hat{y}]_{1 \times 1} \) is given by:

$$ \begin{align} \hat{y} & = \sigma([X] \cdot [W]) \label{eq:ref101} \tag{10.1} \\ & = \frac{1}{1 + e^{-[X] \cdot [W]}} \label{eq:ref102} \tag{10.2} \\ & = \frac{1}{1 + e^{-(x^1 w_1 + x^2 w_2 + x^3 w_3)}} \label{eq:ref103} \tag{10.3} \end{align} $$

For the loss function, we use half of the squared error loss:

$$ L = \frac{1}{2}(y - \hat{y})^{2} $$

Let's first compute the gradients:

$$ \begin{equation} \nabla_{W} L = \frac{\partial L}{\partial W} \\ \nabla_{W} L = \begin{bmatrix} \frac{\partial L}{\partial w_{1}} \\ \frac{\partial L}{\partial w_{2}} \\ \frac{\partial L}{\partial w_{3}} \\ \end{bmatrix} \label{eqref11} \tag{11} \end{equation} $$

So, let's compute \( \frac{\partial L}{\partial w_{1}} \):

$$ \begin{align} \frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w_1} \label{eqref12} \tag{12} \\ \frac{\partial L}{\partial \hat{y}} &= \frac{1}{2} \times 2 \times (y - \hat{y})^{1} \times (-1) \label{eqref13} \tag{13} \\ \frac{\partial \hat{y}}{\partial w_1} &= \left( \frac{1}{1 + e^{-[X] \cdot [W]}} \right) \times \left(1- \frac{1}{1 + e^{-[X] \cdot [W]}} \right) \times x^1 \dots \text{using \eqref{eq:ref102} \& \eqref{eq:ref103}} \label{eqref14} \tag{14} \\ & = \sigma ([X] \cdot [W]) \times (1- \sigma ([X] \cdot [W])) \times x^1 \dots \text{using \eqref{eq:ref101}} \label{eqref15} \tag{15} \\ & = \hat{y} \times (1- \hat{y}) \times x^1 \dots \text{using \eqref{eq:ref101}} \label{eqref16} \tag{16} \end{align} $$

Substituting \eqref{eqref13} and \eqref{eqref16} into \eqref{eqref12}, we get:

$$ \begin{align} \frac{\partial L}{\partial w_1} &= \left( (-1) \times (y - \hat{y}) \right) \times \left( \hat{y} \times (1 - \hat{y}) \times x^1 \right) \\ & = -(y - \hat{y}) \times \hat{y} \times (1 - \hat{y}) \times x^1 \\ & = (\hat{y} - y) \times \hat{y} \times (1 - \hat{y}) \times x^1 \end{align} $$

Thus, in general:

$$ \begin{align} \frac{\partial L}{\partial w_i} &= (\hat{y} - y) \times \hat{y} \times (1 - \hat{y}) \times x^i \label{eqref17} \tag{17} \end{align} $$

Using \eqref{eqref17} in \eqref{eqref11}:

$$ \begin{equation} \frac{\partial L}{\partial W} = \begin{bmatrix} (\hat{y} - y) \times \hat{y} \times (1 - \hat{y}) \times x^1 \\ (\hat{y} - y) \times \hat{y} \times (1 - \hat{y}) \times x^2 \\ (\hat{y} - y) \times \hat{y} \times (1 - \hat{y}) \times x^3 \\ \end{bmatrix} \label{eqref18} \tag{18} \end{equation} $$ $$\begin{equation} \frac{\partial L}{\partial W} = \begin{bmatrix} x^1 \\ x^2 \\ x^3 \\ \end{bmatrix} . \left[ (\hat{y} - y) \times \hat{y} \times (1 - \hat{y}) \right] \label{eqref19} \tag{19} \end{equation} $$

Let,

$$ \begin{align} \Delta l_{1} &= (\hat{y} - y) \times \hat{y} \times (1 - \hat{y}) \label{eqref20} \tag{20} \end{align} $$

Using \eqref{eqref20} in \eqref{eqref19}, we get:

$$ \begin{align} \frac{\partial L}{\partial W} &= [X^{T}] . \Delta l_{1} \\ \end{align} $$
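The same result in NumPy (my addition; the input values are arbitrary):

    import numpy as np

    def sigmoid(u):
        return 1 / (1 + np.exp(-u))

    X = np.array([[0.2, 0.5, 0.9]])   # [X] is 1x3 (arbitrary values)
    y = np.array([[1.0]])             # [y] is 1x1
    W = np.random.random((3, 1))      # [W] is 3x1

    y_hat = sigmoid(X.dot(W))                     # eq (10.1), 1x1
    delta_l1 = (y_hat - y) * y_hat * (1 - y_hat)  # eq (20), 1x1
    dL_dW = X.T.dot(delta_l1)                     # X^T . delta_l1, 3x1, matches eq (18)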

Gradients - Part 1

Published:

Part 1 of computing gradients for training Neural Nets

1 layer network, 1 training example (scalar)

Consider the simplest version of a neural net - 1 layer, 1 input node (scalar).

Input is \((x, y)\): \(x\) and \(y\) are both scalars. (Later on everything will be a matrix, so to keep the notation uniform we will abuse notation and express scalars as matrices of dimension \(1 \times 1\).) Thus, in matrix form, \(x\) and \(y\) are \([X]_{\scriptscriptstyle 1\times 1}\) and \([y]_{\scriptscriptstyle 1\times 1}\). Let \(W\) be the weight matrix; in this case, it is \([W]_{\scriptscriptstyle 1\times 1}\).

Let \( \hat{y} \) be the predicted output. Then,

\[ \hat{y} = \sigma (Wx) = \frac{1}{1 + e^{-[X] \cdot [W]}} \label{ref0} \tag{0} \]

Let loss be squared error loss. For ease of math, we take \( \frac{1}{2} \) of it.

\[ L = \frac{1}{2} (y - \hat{y})^{2} \]

Let's compute gradients,

\[ \nabla_{W} L = \frac{\partial L}{\partial W} \]

\[ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial W} \label{ref1} \tag{1} \]

\[ \frac{\partial L}{\partial \hat{y}} = \frac{1}{2} \times 2 \times (y - \hat{y})^{1} \times (-1) \label{ref2} \tag{2} \]

\[ \frac{\partial \hat{y}}{\partial W} = \left( \frac{1}{1 + e^{-[X] \cdot [W]}} \right) \times \left(1- \frac{1}{1 + e^{-[X] \cdot [W]}} \right) * X \dots \text{using [Eq C, Part-0]} \label{ref3} \tag{3} \]

\[ = \sigma (Wx) \times (1- \sigma (Wx)) * X \dots \text{using \(\eqref{ref0}\)} \]

\[ = \hat{y} \times (1 - \hat{y}) * X \dots \text{using \(\eqref{ref0}\)} \label{ref33} \]

Substituting \(\eqref{ref2}\) & \(\eqref{ref3}\) in \(\eqref{ref1}\), we get

\[ \frac{\partial L}{\partial W} = \left( (-1) \times (y - \hat{y}) \right) \times \left( \hat{y} \times (1- \hat{y}) \times x \right) \]

\[ = -(y - \hat{y}) \times \hat{y} \times (1- \hat{y}) \times x \]

\[ = (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \times x \label{ref4} \tag{4} \]

Let,

\[ \Delta l_{1} = (\hat{y} - y) \times \hat{y} \times (1- \hat{y}) \label{ref5} \tag{5} \]

Then, eq \(\eqref{ref4}\) reduces to:

\[ \frac{\partial L}{\partial W} = \Delta l_{1} \times x \]

\[ = \Delta l_{1} * X \]

\[ = [X^{T}] . \Delta l_{1} \label{ref6} \tag{6} \]
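In code, the whole Part 1 computation (my addition, with arbitrary scalar values) is just a few lines:

    import numpy as np

    def sigmoid(u):
        return 1 / (1 + np.exp(-u))

    x, y = 0.7, 1.0   # scalar input and label (arbitrary values)
    W = 0.3           # scalar weight

    y_hat = sigmoid(W * x)                        # eq (0)
    delta_l1 = (y_hat - y) * y_hat * (1 - y_hat)  # eq (5)
    dL_dW = delta_l1 * x                          # eq (6): X^T . delta_l1 is just a product here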

Gradients for Neural Nets

Published:

Training neural nets is all about computing gradients. In case you are new to this idea, refer to this awesome post by Andrej Karpathy. Briefly, deep down every ML problem is an optimization problem. We want to "learn" (find) the weights which will result in the least average loss. The way we do it is - start with arbitrary weights and keep adjusting them in small quantities until we get them right, i.e. arrive at a set of values for which the loss function has the least value. Gradients tell us by how much we should adjust each of the weights.

Not clear? Check these:

  • Video by Andrew NG
  • Blog by Sanjeev Arora

In this post we will focus on the maths that goes into computing these gradients - we will systematically derive them. The complexity of the calculations depends on 3 things:

  • Depth of the network
  • Number of training examples (1 or more)
  • Number of components in the input (1 = scalar, >1 = vector)

Throughout this post we assume:

  • There is no bias term.
  • `.` is matrix multiplication, `*` is element-wise product and `×` is normal (scalar) multiplication.
  • All activations are sigmoid a.k.a. logistic, defined as \( f(u) = \frac{1}{1+e^{-u}} \). If you plot it, it looks like this:

Sigmoid function

It is easy to see that it is smooth, differentiable and bounded between 0 and 1 [No? not straightforward - need to fix this].

\[ \begin{align} \frac{d}{dx}\sigma(x) &= \frac{d}{dx} \left[ \frac{1}{1+e^{-x}}\right] \label{refB} \tag{B}\\ &= \frac{d}{dx} (1+e^{-x})^{-1} \\ &= -(1+e^{-x})^{-2}(-e^{-x}) \\ &= \frac{e^{-x}}{(1+e^{-x})^2} \\ &= \frac{1}{(1+e^{-x})} \cdot \frac{e^{-x}}{(1+e^{-x})} \\ &= \frac{1}{(1+e^{-x})} \cdot \frac{1 + e^{-x} -1}{(1+e^{-x})} \\ &= \frac{1}{(1+e^{-x})} \cdot \left( 1 - \frac{e^{-x}}{(1+e^{-x})} \right) \\ &= \sigma(x)(1-\sigma(x)) \label{refC} \tag{C} \end{align} \] \[ \begin{align} \frac{d}{dx}\sigma(ax) = a(\sigma(ax))(1-\sigma(ax)) \label{refD} \tag{D} \end{align} \]

We will be using the above result a lot, so make sure you understand it. If it is not clear, have a look at this post.
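A quick numerical check of eq \eqref{refC} (my addition), comparing a finite-difference estimate of the derivative with \(\sigma(x)(1-\sigma(x))\):

    import numpy as np

    def sigmoid(u):
        return 1 / (1 + np.exp(-u))

    x = 0.5
    eps = 1e-6

    numerical = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
    analytic = sigmoid(x) * (1 - sigmoid(x))                       # eq (C)

    print(numerical, analytic)  # both are about 0.2350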

To compute the gradients, we will start with the simplest case and increase the complexity gradually. To keep things simple, we will complete it in 7 parts:

  • 1 layer network, 1 training example (scalar)
  • 1 layer network, 1 training example (vector)
  • 1 layer network, batch training (>1 training examples, where each is a vector)
  • 2 layer network with 1 node hidden layer, 1 training example (vector)
  • 2 layer network with 2 node hidden layer, 1 training example (vector)
  • 2 layer network, batch training (>1 training examples, where each is a vector)
  • Generalization and take home

Since we will be dealing with matrices, a key step in every equation is to check if all matrix dimensions are consistent.
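For example (my addition, with arbitrary shapes), a few NumPy `assert` statements make such dimension checks explicit:

    import numpy as np

    X = np.random.random((4, 3))    # 4 examples, 3 components each
    W_1 = np.random.random((3, 2))  # layer-1 weights
    W_2 = np.random.random((2, 1))  # layer-2 weights

    l_1 = 1 / (1 + np.exp(-X.dot(W_1)))      # should be 4x2
    y_hat = 1 / (1 + np.exp(-l_1.dot(W_2)))  # should be 4x1

    # check that every intermediate has the shape the derivation expects
    assert l_1.shape == (4, 2)
    assert y_hat.shape == (4, 1)
    assert X.T.shape[1] == l_1.shape[0]  # so X^T . delta_l1 is well-defined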