
Spice up your apps with Vision and Speech APIs from Microsoft Cognitive Services

Today I was invited to give a presentation at CodeCamp 2016 in Cluj on Microsoft Cognitive Services (previously known as Project Oxford) and on how easy it is to build a Universal Windows Platform application that uses the Computer Vision and Speech APIs. This is a deeply technical presentation about Cognitive Services and Universal Windows Platform development in C#. If you are new to these concepts or to programming in general, you can check this repository full of basic examples of programming in C# or this repository full of basic examples of Universal Windows Platform development.

You can find the whole source code for this blog post here.

Microsoft Cognitive Services in detail

Microsoft Cognitive Services let you build apps with powerful algorithms using just a few lines of code. They work across devices and platforms such as iOS, Android, and Windows, keep improving, and are easy to set up.

From the official documentation

[Image: the Microsoft Cognitive Services stack]

Here we see the full stack that Microsoft provides as Cognitive Services: Vision APIs (analyze images and give you relevant information about them), Speech APIs (create sound from text and text from sound), Language APIs (understand written material and intents), Knowledge APIs (structure data to better understand client needs and explore more possibilities) and Search APIs (give you smart, organized and on-topic data based on what you query).

Let’s talk code

We will be using Computer Vision, Emotion and Face from the Vision APIs and Bing Speech from the Speech APIs in this demo. I’ve forked my own versions of the Vision, Emotion and Face APIs because I wanted to include the actual code rather than use the provided NuGet packages (that is a personal choice 😎).

The solution is made of two projects – CodeCamp2016 (in which all the magic happens) and ProjectOxford (a class library project containing the SDKs for each API). If you analyze the class library project in depth, you will find that the SDKs themselves are little more than WebRequests written in an older style (the Event-based Asynchronous Pattern — if you are interested in this topic, you can check this blog post).
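
If you are curious what one of those calls actually looks like, here is a minimal sketch of the idea, reduced to a plain HTTP POST with HttpClient (the SDKs do it in the older event-based style instead, and the helper name and endpoint parameter here are illustrative, not taken from the repository):

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Illustrative only: every SDK call boils down to a POST of the image bytes
// with the subscription key sent in the Ocp-Apim-Subscription-Key header
private static async Task<string> PostImageAsync(string url, string apiKey, byte[] image)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

        var content = new ByteArrayContent(image);
        content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

        var response = await client.PostAsync(url, content);
        return await response.Content.ReadAsStringAsync();
    }
}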

The CodeCamp2016 project is split into three interfaced components: Storage, ImageSource and ImageProcessing. I’ve done this to make it easy, in the future, to change or extend the functionality of the application.

The IStorage interface is implemented by LocalPhotoStorage, which uses the UWP storage APIs and saves the picture from the media device into the current user’s Pictures folder.
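
A minimal sketch of what that contract might look like (the member names are my assumptions, not necessarily the ones from the repository):

using System.Threading.Tasks;
using Windows.Storage;
using Windows.Storage.Streams;

public interface IStorage
{
    Task<string> SaveAsync(IRandomAccessStream photoStream, string fileName);
}

public class LocalPhotoStorage : IStorage
{
    public async Task<string> SaveAsync(IRandomAccessStream photoStream, string fileName)
    {
        // Create a uniquely named file in the user's Pictures library
        var file = await KnownFolders.PicturesLibrary
            .CreateFileAsync(fileName, CreationCollisionOption.GenerateUniqueName);

        // Copy the captured photo into the new file
        using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
            await RandomAccessStream.CopyAndCloseAsync(
                photoStream.GetInputStreamAt(0), fileStream.GetOutputStreamAt(0));

        return file.Path;
    }
}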

The IImageSource interface is implemented by LocalCameraImageSource, which uses the UWP audio-video capture APIs.
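
Again as a rough sketch, assuming the interface hands back a stream of the captured photo (the member names are assumptions):

using System.Threading.Tasks;
using Windows.Media.Capture;
using Windows.Media.MediaProperties;
using Windows.Storage.Streams;

public interface IImageSource
{
    Task<IRandomAccessStream> CapturePhotoAsync();
}

public class LocalCameraImageSource : IImageSource
{
    private readonly MediaCapture mediaCapture = new MediaCapture();
    private bool initialized;

    public async Task<IRandomAccessStream> CapturePhotoAsync()
    {
        // Initialize the default camera on first use
        if (!initialized)
        {
            await mediaCapture.InitializeAsync();
            initialized = true;
        }

        // Capture a JPEG frame into an in-memory stream
        var stream = new InMemoryRandomAccessStream();
        await mediaCapture.CapturePhotoToStreamAsync(
            ImageEncodingProperties.CreateJpeg(), stream);

        stream.Seek(0);
        return stream;
    }
}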

The IImageProcessing interface is implemented by the ProjectOxford partial class, which keeps the code for each API separated and improves consistency. This class contains the actual service client properties (you guessed it, interfaced 🤓), which I instantiate in the constructor (I could have used Dependency Injection to remove this step — if you are interested in this topic you can check this blog post).
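
Here is a minimal sketch of how that wiring might look (the interface names and key parameters are assumptions; in the repository the keys come from AppSettings.resw, as described later):

public partial class ProjectOxford : IImageProcessing
{
    private readonly IEmotionServiceClient emotionClient;
    private readonly IFaceServiceClient faceClient;
    private readonly IVisionServiceClient visionClient;

    public ProjectOxford(string emotionKey, string faceKey, string visionKey)
    {
        // Each service client is instantiated here; with Dependency
        // Injection these could be passed in instead
        emotionClient = new EmotionServiceClient(emotionKey);
        faceClient = new FaceServiceClient(faceKey);
        visionClient = new VisionServiceClient(visionKey);
    }
}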

The ProjectOxford.Emotion partial class is a wrapper over the EmotionServiceClient: using IStorage and IImageSource, it saves the current photo and feeds it to the Emotion API, after which I take the highest emotion score, display the emotion to the user and feed it to the Speech API.

public async Task<string> RecognizeEmotion(string imagePath)
{
    if (!File.Exists(imagePath))
        return null;

    Emotion[] emotionResult;

    // Stream the saved photo to the Emotion API
    using (var stream = File.Open(imagePath, FileMode.Open))
        emotionResult = await emotionClient.RecognizeAsync(stream);

    // No detected face means no scores to rank
    if (emotionResult == null || emotionResult.Length == 0)
        return null;

    return GetHighestEmotion(emotionResult[0].Scores);
}
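
The GetHighestEmotion helper is not shown above; here is a minimal sketch of what it might do, assuming it simply ranks the scores the Emotion API returns (the repository version may differ):

using System.Collections.Generic;
using System.Linq;

private string GetHighestEmotion(Scores scores)
{
    // Map each emotion name to its confidence score
    var all = new Dictionary<string, float>
    {
        ["Anger"] = scores.Anger,
        ["Contempt"] = scores.Contempt,
        ["Disgust"] = scores.Disgust,
        ["Fear"] = scores.Fear,
        ["Happiness"] = scores.Happiness,
        ["Neutral"] = scores.Neutral,
        ["Sadness"] = scores.Sadness,
        ["Surprise"] = scores.Surprise
    };

    // Return the name of the emotion the API is most confident about
    return all.OrderByDescending(pair => pair.Value).First().Key;
}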

The ProjectOxford.Face partial class is a wrapper over the FaceServiceClient: using IStorage and IImageSource, it saves the current photo and feeds it to the Face API to detect a face and extract facial features, after which I check whether the face is familiar by feeding two photos into the Face API, display the result to the user and feed it to the Speech API.

public async Task<bool> RecognizeFace(string imagePath)
{
    // Previously detected faces to compare the current photo against
    var faceDatabase = await GetFaceDatabase();
    var current = await DetectFace(imagePath);

    foreach (var face in faceDatabase)
    {
        // Ask the Face API whether the two face IDs belong to the same person
        var result = await faceClient.VerifyAsync(face.FaceId, current.FaceId);

        if (result.IsIdentical)
            return true;
    }

    return false;
}
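
DetectFace is a small helper around the Face API’s detection endpoint; a minimal sketch, assuming it returns the first detected face:

using System.IO;
using System.Linq;
using System.Threading.Tasks;

private async Task<Face> DetectFace(string imagePath)
{
    using (var stream = File.Open(imagePath, FileMode.Open))
    {
        // Detect all faces in the photo and keep the first one
        var faces = await faceClient.DetectAsync(stream);
        return faces.FirstOrDefault();
    }
}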

The ProjectOxford.Vision partial class is a wrapper over the VisionServiceClient: using IStorage and IImageSource, it saves the current photo and feeds it to the Vision API to recognize text or the dominant foreground color (which I use as “get shirt color” 😎), after which I display the result to the user and feed it to the Speech API.

public async Task<string> RecognizeText(string imagePath)
{
    if (!File.Exists(imagePath))
        return null;

    OcrResults ocrResult;

    // Stream the saved photo to the Vision OCR endpoint,
    // auto-detecting language and orientation
    using (var stream = File.Open(imagePath, FileMode.Open))
        ocrResult = await visionClient.RecognizeTextAsync(stream, LanguageCodes.AutoDetect, true);

    return ocrResult.GetNaturalText();
}
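
GetNaturalText is an extension over OcrResults; a minimal sketch of the idea, flattening the regions, lines and words the OCR endpoint returns back into readable text (the repository version may differ):

using System.Linq;
using System.Text;

public static class OcrResultsExtensions
{
    public static string GetNaturalText(this OcrResults results)
    {
        var builder = new StringBuilder();

        // Each region contains lines, and each line contains words
        foreach (var region in results.Regions)
            foreach (var line in region.Lines)
                builder.AppendLine(string.Join(" ", line.Words.Select(word => word.Text)));

        return builder.ToString().Trim();
    }
}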

public async Task<string> GetDominantForegroundColor(string imagePath)
{
    // Ask the Vision API for the image's color analysis
    var analysis = await AnalyzeImage(imagePath);
    return analysis.Color.DominantColorForeground;
}
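
AnalyzeImage wraps the Vision API’s image analysis call; a minimal sketch, assuming it only asks for the Color visual feature (the repository version may request more features):

private async Task<AnalysisResult> AnalyzeImage(string imagePath)
{
    // Send the photo to the Vision API, requesting only color analysis
    using (var stream = File.Open(imagePath, FileMode.Open))
        return await visionClient.AnalyzeImageAsync(stream, new[] { VisualFeature.Color });
}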

The UI is done in XAML in the MainPage.xaml file. It basically has five buttons, a text box and a media element into which we load the image source. If you want to learn more about the XAML language or the Universal Windows Platform, you can check out this blog post and this repository. In MainPage.xaml.cs we have the implementation for each button, which is basically a call to the methods we defined in our interfaces. We also have the Speech API, implemented using SpeechSynthesizer, which creates a sound stream from a string.
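
The text-to-speech part fits in a few lines; a minimal sketch, assuming mediaElement is the MediaElement declared in MainPage.xaml:

using System.Threading.Tasks;
using Windows.Media.SpeechSynthesis;

private async Task SpeakAsync(string text)
{
    using (var synthesizer = new SpeechSynthesizer())
    {
        // Generate an audio stream from the given string
        var stream = await synthesizer.SynthesizeTextToStreamAsync(text);

        // Play it back through the MediaElement on the page
        mediaElement.SetSource(stream, stream.ContentType);
        mediaElement.Play();
    }
}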

[Image: demo of the application’s UI]

For your application to function, you will need API keys from the Cognitive Services portal for each API you want to use. I have created a resource file called AppSettings.resw in which I store all the necessary API keys and load them into the code without hardcoding anything. I’ve also enabled the Microphone, Pictures Library and Webcam capabilities in Package.appxmanifest.
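
Reading the keys back at runtime takes only a couple of lines; a minimal sketch, where the resource key name is an assumption:

using Windows.ApplicationModel.Resources;

// Load the resource file and pull out a key without hardcoding it
var resourceLoader = ResourceLoader.GetForCurrentView("AppSettings");
string visionApiKey = resourceLoader.GetString("VisionApiKey");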

In conclusion

As you may have observed, a bit of C# coding knowledge is required. For a better understanding of how powerful the C# language really is, you can check out this repository full of basic C# projects. If you want to dig into advanced C# topics, you can check out this repository. Stay tuned on this blog (and star the microsoft-dx organization) to immerse yourself in the beautiful world of “there’s an app for that”.

Good to have resource: [Channel9] Give Your Apps a Human Side

Good to have resource: [Channel9] Build smarter and more engaging experiences

Post image source: linkedin.com
