Microsoft's AI APIs add content moderation, speech recognition

New APIs for Microsoft's Cognitive Services cloud convert speech to text and text to speech, and provide tools to automatically moderate images, video, and text

If you want your apps to understand what someone’s saying or know if your user-content rules are being broken, Microsoft has you covered.

Microsoft is expanding its portfolio of Cognitive Services—in-the-cloud APIs that provide out-of-the-box versions of useful algorithms—to include two new services that go into general availability next month: the Content Moderator and Bing Speech APIs.

Talk to me, and I shall hear

Bing Speech converts audio into text and vice versa. It’s also able to apply contextual understanding to that speech or text. The Speech API’s demo page lets you try a limited sample of both text-to-speech and speech-to-text for yourself.
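To make that concrete, here is a minimal Python sketch of a speech-to-text call over REST. The token and recognition endpoints, the required audio format, and the DisplayText response field are assumptions drawn from Microsoft's documentation of the period, not details given in this article; treat the code as illustrative rather than definitive.

```python
# Hedged sketch: transcribe a short WAV clip with the Bing Speech REST API.
# The endpoint URLs and response fields below are assumptions; check the
# current Cognitive Services documentation before relying on them.
import requests

SUBSCRIPTION_KEY = "YOUR_COGNITIVE_SERVICES_KEY"  # issued from the Azure portal

# Step 1: exchange the subscription key for a short-lived access token.
token = requests.post(
    "https://api.cognitive.microsoft.com/sts/v1.0/issueToken",
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
).text

# Step 2: send a short WAV clip (16 kHz, 16-bit mono PCM) for recognition.
with open("command.wav", "rb") as audio:
    response = requests.post(
        "https://speech.platform.bing.com/speech/recognition"
        "/interactive/cognitiveservices/v1?language=en-US&format=simple",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
        },
        data=audio,
    )

result = response.json()
print(result.get("DisplayText"))  # e.g. "Turn off all the lights."
```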

Both processes show their limits pretty quickly, though. Text-to-speech still sounds somewhat robotic; there’s always the sense that the speaker is emphasizing the wrong syllables. And speech-to-text still seems best suited to processing short command phrases rather than transcribing longer passages. Google’s speech recognition API appears to be more accurate, although Microsoft offers competitive features such as real-time streaming of results, similar to Google’s Voice Typing function.

Intent Recognition, or the ability to return structured data about the captured speech rather than flat text, is another feature of the Speech API that Microsoft is touting as an improvement over existing speech recognition systems. This enables apps to “easily parse the intent of the speaker, and subsequently drive further action,” according to Microsoft, which calls the feature Language Understanding Intelligent Service (LUIS). Microsoft’s demo of LUIS includes the ability to parse command examples like “turn off all the lights” or “switch all lights to green” (for those of you with fancy multicolored LED bulbs).
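A rough sketch of what that structured output looks like, assuming the LUIS v2.0 REST endpoint: the app ID placeholder, the query-string parameters, and the topScoringIntent and entities fields are assumptions about the service's API, not something spelled out in this article.

```python
# Hedged sketch: ask LUIS to parse a transcribed command into an intent
# plus entities, instead of leaving it as flat text.
import requests

LUIS_APP_ID = "YOUR_LUIS_APP_ID"
LUIS_KEY = "YOUR_LUIS_SUBSCRIPTION_KEY"

response = requests.get(
    "https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/" + LUIS_APP_ID,
    params={"subscription-key": LUIS_KEY, "q": "turn off all the lights"},
)
result = response.json()

# Structured data instead of flat text: which intent, and which entities.
print(result["topScoringIntent"]["intent"])               # e.g. "HomeAutomation.TurnOff"
print([e["entity"] for e in result.get("entities", [])])  # e.g. ["lights"]
```

An app can then switch on the returned intent name to drive the follow-up action, which is the "parse the intent of the speaker, and subsequently drive further action" workflow Microsoft describes.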

Red light, green light, yellow light

With the Content Moderator API, Microsoft provides tools to help automate one of the more tedious and time-consuming jobs in creating services that accept user-submitted content. Content Moderator can check images, text, and video for “offensive and unwanted content that creates risks for businesses.”

Image moderation can check for “adult or racy content,” and can extract text from images by way of OCR—for example, to determine if meme-type images have offensive content. Both image and video moderation return simple “is/is not” checks for adult material, as well as confidence scores for more precise evaluation. Text moderation can check for profanity in more than 100 languages, as well as malware/phishing URLs. It can also return details about the original and corrected texts if needed.
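The "is/is not" checks and confidence scores look roughly like the following Python sketch against the Content Moderator v1.0 REST API. The endpoint paths and response field names (IsImageAdultClassified, Terms, AutoCorrectedText, and so on) are assumptions about that API surface rather than details confirmed by this article.

```python
# Hedged sketch: image and text moderation calls against Content Moderator.
import requests

KEY = "YOUR_CONTENT_MODERATOR_KEY"
BASE = "https://westus.api.cognitive.microsoft.com/contentmoderator/moderate/v1.0"
HEADERS = {"Ocp-Apim-Subscription-Key": KEY}

# Image moderation: point the service at an image URL and read back the
# boolean classification plus the confidence score behind it.
image = requests.post(
    BASE + "/ProcessImage/Evaluate",
    headers={**HEADERS, "Content-Type": "application/json"},
    json={"DataRepresentation": "URL", "Value": "https://example.com/upload.jpg"},
).json()
print(image.get("IsImageAdultClassified"), image.get("AdultClassificationScore"))
print(image.get("IsImageRacyClassified"), image.get("RacyClassificationScore"))

# Text moderation: screen a snippet for profanity, with autocorrection so the
# original and corrected versions can both be inspected.
text = requests.post(
    BASE + "/ProcessText/Screen",
    params={"language": "eng", "autocorrect": "true"},
    headers={**HEADERS, "Content-Type": "text/plain"},
    data="Some user-submitted comment to screen.",
).json()
print(text.get("Terms"))              # flagged terms, if any
print(text.get("AutoCorrectedText"))  # corrected version of the original text
```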

The process isn’t intended to be entirely automatic; Microsoft provides both a tool and an API to allow individuals and teams to moderate submitted content and apply custom tags and workflows to data. But the underlying moderation APIs are meant to zero in on the content that needs at least some human oversight and to filter out the things that don’t.
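Queuing something for that human oversight might look like the sketch below, assuming the review side is exposed through a "create review" call in the Content Moderator Review API. The URL pattern, the team-name placeholder, and the payload field names are all assumptions, not details taken from this article.

```python
# Hedged sketch: hand a borderline image off to human moderators, with a
# custom metadata tag the team can filter on in the review tool.
import requests

KEY = "YOUR_CONTENT_MODERATOR_KEY"
TEAM = "your-review-team-name"
URL = ("https://westus.api.cognitive.microsoft.com/contentmoderator"
       "/review/v1.0/teams/" + TEAM + "/reviews")

payload = [{
    "Type": "Image",
    "Content": "https://example.com/borderline-upload.jpg",
    "ContentId": "upload-42",
    "Metadata": [{"Key": "source", "Value": "web-upload"}],  # custom tag (hypothetical)
}]
created = requests.post(
    URL,
    headers={"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"},
    json=payload,
).json()
print(created)  # review IDs assigned by the service
```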

Another aspect of the API—which might prove even more useful than checking for objectionable content—is flagging submissions that have personally identifiable information. Images, for instance, can be run through a face-detection algorithm; anything that has an identifiable face in it can be flagged. Future versions of the service could provide a line of defense against doxxing, or having personal information broadcast maliciously by third parties.
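The personally-identifiable-information checks could be wired up along these lines; the FindFaces operation, the PII flag on text screening, and the response fields shown are assumptions about the Content Moderator v1.0 API rather than specifics from this article.

```python
# Hedged sketch: flag images containing faces and text containing PII.
import requests

KEY = "YOUR_CONTENT_MODERATOR_KEY"
BASE = "https://westus.api.cognitive.microsoft.com/contentmoderator/moderate/v1.0"
HEADERS = {"Ocp-Apim-Subscription-Key": KEY}

# Flag images that contain an identifiable face.
faces = requests.post(
    BASE + "/ProcessImage/FindFaces",
    headers={**HEADERS, "Content-Type": "application/json"},
    json={"DataRepresentation": "URL", "Value": "https://example.com/photo.jpg"},
).json()
if faces.get("Result"):  # assumed to be True when at least one face is detected
    print("Flag for review:", faces.get("Count"), "face(s) found")

# Flag text that contains email addresses, phone numbers, or street addresses.
pii = requests.post(
    BASE + "/ProcessText/Screen",
    params={"language": "eng", "PII": "true"},
    headers={**HEADERS, "Content-Type": "text/plain"},
    data="Contact me at jane.doe@example.com or 555-0100.",
).json()
print(pii.get("PII"))  # e.g. lists of detected emails, phone numbers, addresses
```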

Microsoft says Cognitive Services provides ready-to-use APIs and data models so that companies don’t have to build their own data sets or trained models. Just as important over time will be that it gives organizations convenient ways to build extensions to these services without writing applications from scratch that plug into Microsoft’s APIs on the back end. The workflow/tagging mechanism in Content Moderator hints at this; such systems could be customized for a specific environment through feedback from nontechnical users, not just developers.
