Computer Vision and Augmented Reality

Technology and solutions for Object Recognition and Augmented Reality applications on cloud and mobile platforms

Visual Operation Guidance


Technical workers often face challenging situations, struggling with equipment malfunctions on premises, while located far from a well-equipped lab and knowledgeable experts. Augmented-reality-based applications can replace tedious documentation search and phone conversations with colleagues or subject matter experts. We consider two corresponding modalities of such applications: self-guidance (fully automated AR services) and peer guidance (remote assistance by a colleague).

Self-guidance tool

AR enables the user to visualize information associated with a particular object or its component directly on their see-through display, such as a smartphone screen or AR glasses. The augmentation (e.g., a small clickable picture or text) stays at the relevant location even as the user (holding or wearing the device) moves around the object.

This technology opens up a wide range of possibilities. The most relevant use cases at present are in the industrial domain, as assistance to field technicians working at remote sites with complex machinery. Direct visual augmentations make it possible to build easy-to-use, step-by-step procedures that give the technician instructions directly at the physical scene, rather than via pictures in a document. Another domain is home appliances. Owners of recently purchased washers or sophisticated kitchenware could enjoy an application guiding them through usage or troubleshooting scenarios with the equipment.

We have developed an SDK for building such self-guidance applications for the iOS and Android platforms. Developers can easily adjust the provided skeleton to their needs, specializing the UI, the annotations, and the screen-to-screen flow of instructions.

Visual manual for technical equipment

Possible applications for augmented-reality-based assistance include visual replacements for equipment manuals, either printed or digital. A major drawback of existing manuals is the lack of a direct link between visible attributes of the equipment (buttons, notifications, handles, etc.) and the corresponding pages explaining their meaning or use.

An AR-based manual provides a great shortcut for the initial stage of getting help. Besides guiding the user to the relevant equipment locations, such a manual can adapt interactively to the situation by selecting the next steps according to the user's feedback.

Moreover, besides querying specific locations (What is it?), the user can ask to find a specific button or compartment (Where is it?). At each given moment, the AR system knows the user’s position relative to the equipment (the dashboard, in this case), so it can guide the user, using flashing arrows, to move the device to the required location.
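The "Where is it?" guidance can be illustrated with a minimal sketch. It assumes that the AR system has already projected the requested button or compartment into screen pixel coordinates (with y growing downward). The function name `guidance_arrow` and the four-direction arrow set are hypothetical and not part of the SDK described here.

```python
def guidance_arrow(target_xy, screen_wh):
    """Pick the arrow that tells the user where to move the device.

    target_xy: projected (x, y) position of the target in pixels (assumed input).
    screen_wh: (width, height) of the screen in pixels.
    Returns "none" when the target is already visible on screen.
    """
    x, y = target_xy
    w, h = screen_wh
    if 0 <= x <= w and 0 <= y <= h:
        return "none"
    # Offset from the screen center, normalized by the half-size of each axis,
    # so that the dominant direction is comparable on non-square screens.
    dx = (x - w / 2) / (w / 2)
    dy = (y - h / 2) / (h / 2)
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


print(guidance_arrow((1500, 300), (1000, 600)))
```

A real application would likely blend directions or animate the arrow; the sketch only shows how the known relative position translates into a movement hint.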

Peer guidance tool

In many cases, the issues faced by a field technician require live human support, provided by a domain expert located at the main office (or contacted via his mobile device). A standard phone call has limited utility when the technician does not have sufficient knowledge of the environment to describe the problem (and receive the solution) verbally. Often, such assistance must rely on visual details of the equipment, in a view shared by the technician and the expert.

We have developed an AR-based mobile application for Peer Guidance (PG), intended for remote support of field technicians. During the support session, the expert can point at various elements of the scene at which the technician directs his mobile device, and annotate them with types of actions and complementary text. The annotations maintain their 3D positions while the technician moves around the scene, comprehensively conveying the expert's instructions.
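To illustrate how an annotation can keep its 3D position while the device moves, here is a minimal sketch of re-projecting a fixed 3D anchor into the current camera view with a standard pinhole camera model. The source does not describe the tracking method; the function name, the world-to-camera pose convention, and the intrinsic values are assumptions made for illustration.

```python
def project_anchor(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D anchor into pixel coordinates for the current camera pose.

    rotation (3x3 nested lists) and translation (length 3) map world
    coordinates to camera coordinates: p_cam = R * p_world + t.
    Returns None when the anchor is behind the camera.
    """
    p_cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if p_cam[2] <= 0:
        return None
    u = fx * p_cam[0] / p_cam[2] + cx
    v = fy * p_cam[1] / p_cam[2] + cy
    return (u, v)


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
anchor = (0.0, 0.0, 2.0)  # annotation placed 2 m in front of the initial view

# The same anchor is drawn at different pixels as the camera pose changes.
first_view = project_anchor(anchor, IDENTITY, [0.0, 0.0, 0.0], 500, 500, 320, 240)
moved_view = project_anchor(anchor, IDENTITY, [-0.5, 0.0, 0.0], 500, 500, 320, 240)
print(first_view, moved_view)
```

Re-running the projection every frame with the latest estimated pose is what makes the annotation appear attached to the physical element.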

Augmented Reality content service

An intrinsic part of the Visual Operation Guidance platform is the AR content creation service, which enables the user to create representations of physical objects for IBM AR applications. Such a representation includes a 3D point cloud model, annotated anchor points, buttons, and more.

The service includes the following capabilities: 3D reconstruction from a video file (structure from motion); model cleaning; anchor points; and definition of AR data layers, including annotations and action buttons. Overall, the procedure for adding a new object is a fast sequence of manual processing steps. The service produces a data package ready for deployment on the user device.

Object recognition

Our group has developed object recognition technology in various frameworks, including cloud-based recognition services. Most of them are based on cutting-edge advances in deep learning technology.

The algorithm powering the logo and food recognition services (see below) is an ensemble of deep networks, including Faster R-CNN (used for region proposals, filtered by the classifier score) and a network computing an image embedding in a feature space equipped with a metric trained with the triplet loss. Object recognition, and especially custom learning, relies on similarity in this embedding.
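As a reminder of what training a metric with the triplet loss means, here is a minimal sketch of the common hinge form of the loss on Euclidean distances. The embeddings are toy 2D vectors and the margin value is an assumption; the source does not state the exact variant or hyperparameters used by the services.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-form triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the positive (same class) is closer to the anchor
    than the negative (different class) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Well-separated triplet: positive at distance 1, negative at distance 5.
satisfied = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Violating triplet: the roles of the two points are swapped.
violated = triplet_loss((0.0, 0.0), (3.0, 4.0), (0.0, 1.0))
print(satisfied, violated)
```

Minimizing this quantity over many triplets pulls images of the same object together in the embedding and pushes different objects apart, which is what makes similarity search in that space meaningful.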

Logo detection in the wild

We built the online Watson Cloud service for detecting logos of various companies appearing anywhere in an image. The service is useful for marketing analysis, as it allows current trends in specific regions and populations to be determined automatically.

The real-world appearance of logos often poses a challenge to recognition systems, which only have access to their ideal graphical representations. Moreover, the sought-after logo regions often occupy only a small fraction of the image. The power of deep learning helps overcome these issues, providing high-quality recognition for hundreds of brands.

In addition, the custom learning capability allows users to introduce a new brand into the system with just a few image examples of its logos.
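One simple way to picture learning a new brand from a few examples is nearest-prototype classification in an embedding space. The sketch below assumes the embeddings have already been computed by a trained network. Averaging the examples into a prototype and the class name `PrototypeClassifier` are illustrative assumptions, not a description of the service's internals.

```python
import math


class PrototypeClassifier:
    """Nearest-prototype classifier over precomputed embeddings (illustrative)."""

    def __init__(self):
        self.prototypes = {}

    def add_class(self, label, embeddings):
        """Register a new class from a handful of example embeddings."""
        dim = len(embeddings[0])
        count = len(embeddings)
        self.prototypes[label] = [
            sum(e[i] for e in embeddings) / count for i in range(dim)
        ]

    def predict(self, embedding, max_distance=float("inf")):
        """Return the closest class, or None if nothing is within max_distance."""
        best_label, best_dist = None, max_distance
        for label, proto in self.prototypes.items():
            dist = math.dist(embedding, proto)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label


clf = PrototypeClassifier()
clf.add_class("brand_a", [[0.0, 0.0], [0.0, 2.0]])  # two examples
clf.add_class("brand_b", [[10.0, 10.0]])            # a single example
print(clf.predict([0.5, 1.0]), clf.predict([9.0, 9.0]))
```

Adding a class requires no retraining of the network, which is why a few images can suffice when the embedding is already discriminative.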

Food recognition

Recognizing food items is a very complex problem due to the absence of well-defined classifications and the high variability in the appearance of food dishes. Nevertheless, a working solution can be set up using a good data set, commonly used food categories, and strong deep networks. In our efforts in this area, hundreds of food items are successfully recognized via a cloud-based service.

Moreover, this technology was successfully used in the IBM Research “What Did I Eat?” project, which presents a mobile application in the domain of personal health. The application enables users to log their food quickly in order to monitor their diet and improve their eating habits.

Instance recognition for thousands of objects using a single example per object

The technical specifications of our industrial projects place challenging requirements on object detection and recognition algorithms. For retail industries, we offer solutions that recognize multiple products, out of tens of thousands, in still images and video, while relying on training data comprising a single idealized (catalog) image of each product. This allows the effortless integration of our technology into a client’s infrastructure. The same algorithmic core is used for projects requiring 3D pose recognition and tracking of industrial objects. Our solution scales to hundreds of 3D scenes, while running on mobile devices (tablets).

Most of the current top-performing learning-based approaches, especially CNN-based deep learning methods, rely on large amounts of annotated data to train effectively. This poses a significant challenge when we need to recognize many thousands of visually similar (fine-grained) categories, for which only a few examples (or sometimes even just one) are available. This situation frequently arises in retail product recognition, which is inherently fine-grained, and where we usually have just a single studio image of a product to train on. Another example is detecting the pose of a single query image with respect to a large-scale 3D point-cloud model, where we have sparsely sampled partial views of the model for training (one training image per view) and are required to detect the pose for completely unseen views.

We developed an approach for the aforementioned limited training data, in a large-scale, fine-grained detection and recognition scenario. The method is designed to work for both image and video inputs and consists of three main components:

  1. A fast detection algorithm capable of simultaneously localizing and recognizing multiple instances of thousands of fine-grained categories within an unconstrained image (unknown scale, uncontrolled lighting, etc.), while spending less than a second per mega-pixel.
  2. A deep network for fine-grained refinement of the hypotheses returned by the detector, each accompanied with an accurate localization and a short list of potential categories.
  3. For video inputs: temporal integration that tracks the detection hypotheses, thus completing the detection gaps.
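The third component, completing detection gaps in video, can be sketched as follows. The source does not say how gaps are filled; linear interpolation between the surrounding detections of one tracked object, the function name, and the `max_gap` parameter are assumptions chosen for illustration.

```python
def fill_detection_gaps(track, max_gap=5):
    """Fill short gaps in the per-frame boxes of a single tracked object.

    track: list with one entry per frame, either a box (x1, y1, x2, y2)
           or None when the detector missed the object in that frame.
    Gaps of at most max_gap frames between two detections are filled by
    linear interpolation; longer gaps and leading/trailing gaps stay None.
    """
    filled = list(track)
    last = None  # index of the most recent frame with a detection
    for i, box in enumerate(track):
        if box is None:
            continue
        if last is not None and 1 < i - last <= max_gap + 1:
            for k in range(last + 1, i):
                a = (k - last) / (i - last)
                filled[k] = tuple(
                    (1 - a) * p + a * q for p, q in zip(track[last], box)
                )
        last = i
    return filled


track = [(0, 0, 10, 10), None, (20, 0, 30, 10)]
print(fill_detection_gaps(track))
```

The `max_gap` limit keeps the tracker from inventing boxes across long stretches in which the object may have actually left the view.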

For further details, see the paper "Fine-grained recognition of thousands of object categories with single-example training", L. Karlinsky, J. Shtok, Y. Tzur, and A. Tzadok, CVPR, Hawaii, USA, July 2017.

AR product catalog

We built a catalog of car modules and gadgets, enabling the user to get instant information on each module by capturing it with the app’s camera. The AR catalog is demonstrated on a large table full of BMW car gadgets at the IBM IoT center in Munich. The technology powering the AR catalog is a deep-learning-based recognition system designed for limited training data. A short video of each object suffices to introduce it to the catalog.

Refrigerator products detection

In a joint project with a home appliance company, we address another difficult problem: analyzing the contents of a refrigerator and producing a list of the products and packages inside. As in the product catalog activity, we developed a recognition system that requires only a few sample images per product.

Semantic segmentation

The purpose of semantic segmentation is to identify semantically meaningful regions in an image, such as people, cars, buildings, glass, etc. (i.e., to assign a label to each pixel). In this work, we produced results for the segmentation of human hands in egocentric images, while discriminating between left and right hands, as well as between the hands of the person wearing the camera (“my hands”) and those of the person facing him or her (“his hands”).
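Assigning a label to each pixel is typically the final step of a segmentation network: the network outputs one score per class per pixel, and the label map takes the highest-scoring class at each location. The sketch below shows only that last step on toy scores; the label names and the nested-list layout are assumptions for illustration.

```python
def label_map(scores, labels):
    """Turn per-class score maps into a per-pixel label map.

    scores[c][y][x] is the score of class c at pixel (y, x);
    labels[c] is the name of class c.
    """
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [
            labels[max(range(len(labels)), key=lambda c: scores[c][y][x])]
            for x in range(width)
        ]
        for y in range(height)
    ]


labels = ["background", "my left hand", "my right hand"]
scores = [
    [[0.90, 0.10]],  # background scores for a 1x2 image
    [[0.05, 0.70]],  # "my left hand" scores
    [[0.05, 0.20]],  # "my right hand" scores
]
print(label_map(scores, labels))
```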

Smart buildings

The smart building project is designed to provide building inhabitants with a single centralized tool for managing the various technical equipment found in the structure, such as illumination, air conditioning, fire alarm and control systems, etc. The system, built by our group, recognizes the specific technical item the user points at, and contacts the centralized building management system for related operations. These include opening a ticket for the item, controlling its state, and so on.

Hand detection and analysis

We developed a DL-based solution for hand detection and analysis, such as detection of a pointing finger, from egocentric video. This technology enables a user wearing AR glasses to communicate his intention by pointing at objects of interest or pressing virtual buttons.

Body joint detection

Detecting the joints of the human body is a difficult problem for an automated system, due to the large variability in pose and appearance. Our team has built DL-based approaches for these types of tasks, performing on a par with the state of the art. These techniques are also applicable to other types of articulated objects (e.g., hands).

Ground cover monitoring in the Scent project

Scent is a joint research program, initiated by the EU and involving a number of leading research bodies from different countries. IBM plays a leading role in the project, and our group is responsible for the core computer vision module performing ground cover classification.

The objective of the project is to set up a system for continuous monitoring of a few selected large areas, both rural and urban. The monitoring is performed by the public via a smartphone application, which people use to take photos of various locations in the designated areas. The images are uploaded to our recognition engine, which classifies their content according to a ground cover taxonomy. The engine, based on a deep network, recognizes trees, buildings, pastures, clean/cluttered storm drains, etc. These results help hydrologists build terrain maps used in flood modeling and prediction.

Solutions for retail business

Our object recognition technology enables analysis of the content and arrangement of dozens of retail products on a shelf from a single image. The tool recognizes products among thousands of possible retail items and verifies their placement according to a given planogram. In retail business, this translates into the automation of shelf management activities, centralized monitoring of shelf arrangement over hundreds of branches, and validation of product license agreements for suppliers. In the demo video below, we demonstrate the detection and tracking of multiple products occupying a retail shelf, using a mobile device.
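Verifying placement against a planogram can be pictured as a slot-by-slot comparison once products have been recognized and ordered by shelf position. The sketch below assumes both the detections and the planogram are already given as rows of product IDs listed left to right; this data layout and the function name are hypothetical simplifications, not the tool's actual format.

```python
def verify_planogram(detected, planogram):
    """Compare recognized shelf contents with the expected planogram.

    detected, planogram: lists of shelf rows, each row a left-to-right
    list of product IDs. Returns a list of mismatches as tuples
    (row, slot, expected_product, found_product); found_product is None
    when the slot is empty or was not detected.
    """
    issues = []
    for row, expected_row in enumerate(planogram):
        found_row = detected[row] if row < len(detected) else []
        for slot, expected in enumerate(expected_row):
            found = found_row[slot] if slot < len(found_row) else None
            if found != expected:
                issues.append((row, slot, expected, found))
    return issues


planogram = [["cola", "cola", "juice"], ["chips", "nuts"]]
detected = [["cola", "juice", "juice"], ["chips"]]
print(verify_planogram(detected, planogram))
```

An empty result means the shelf matches the planogram; each reported tuple points to a slot that needs attention.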

Multi-modal context-aware conversation engine

In joint work with the IoT and Wearables Group in IBM Research - Haifa, we built a conversation engine with multi-modal input. It includes object and pointing recognition, fully speech-based communication, and live telemetry readings. This tool is described here within the context of one of its manifestations, namely, as a self-guidance tool for performing technical tasks on a piece of equipment. Using this tool, the user can speak to the system and hear sentences in response; point a finger at individual parts of the equipment as part of the query; request full telemetry information relevant to the machine at hand; and more.

In contrast to the standard communication protocol with assistant chat bots, where the input query consists only of the user’s textual requests, the multi-modal input provides the conversation engine with the following information: the type of equipment, the individual equipment part currently pointed at by the user, and current and processed IoT telemetry values (average, maximal, etc.). This is enabled by the following components:

  • 3D object detection and recognition engine, which determines the type and position of the equipment faced by the user and its components
  • Pointing recognition engine, which determines to which part of the equipment the user is directing his index finger
  • Watson Speech-To-Text service, which interprets the user’s words into phrases
  • Watson Text-To-Speech service, which vocalizes the response obtained from the bot
  • IoT telemetry component, which aggregates the telemetry readings over time

This data allows the creation and maintenance of a rich context of the user’s environment and requests, and allows for natural conversation between the human and the machine. Access to technical databases allows the conversational tool to satisfy a wide range of user queries, from information requests such as “What is this part (I am pointing at)?”, through “Guide me how to replace this valve”, to “Order a replacement part for this gauge”.
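The multi-modal input described above can be pictured as a single record passed to the conversation engine alongside the recognized speech. The sketch below is a hypothetical data structure, not the engine's actual interface: the names `QueryContext` and `summarize_telemetry`, the field names, and the sensor name are all assumptions. It also shows the kind of aggregation (average, maximal) the telemetry component performs over time.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


def summarize_telemetry(readings: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Aggregate raw readings per sensor; sensors without readings are skipped."""
    return {
        sensor: {"average": mean(values), "maximal": max(values), "latest": values[-1]}
        for sensor, values in readings.items()
        if values
    }


@dataclass
class QueryContext:
    """One multi-modal query: speech plus visual and telemetry context."""
    utterance: str                      # text from speech recognition
    equipment_type: str                 # from 3D object recognition
    pointed_part: Optional[str] = None  # from pointing recognition
    telemetry: Dict[str, Dict[str, float]] = field(default_factory=dict)


context = QueryContext(
    utterance="What is this part?",
    equipment_type="pump",
    pointed_part="pressure gauge",
    telemetry=summarize_telemetry({"pressure_bar": [2.0, 4.0, 3.0]}),
)
print(context)
```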

The capabilities of the Watson conversation engine include maintaining the context (current state) of the user’s situation and conversation. Thus, the visual conversation tool “knows” what piece of equipment the user is facing at the moment, what part of the equipment he pointed at recently, and what his last queries were about. This enables smart processing of additional queries and visual information. For instance, if the user points at a pressure gauge but asks “what was the maximal current at this point?”, there is definitely room for clarification. If the user has asked for disassembly instructions for the equipment, it is appropriate to remind him that it is still plugged into the power supply (as IoT devices indicate). When the user asks to order spare parts, the consistency of the part’s location, the user’s reference to it, and the catalog number should be maintained. These are just a few examples of a large variety of helpful features enabled by the technology.
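Two of the examples above, the mismatched gauge question and the power reminder, can be sketched as simple rules over the maintained context. This is an illustrative toy, not how the Watson conversation engine is configured: the lookup table, the intent names, and the context keys are all hypothetical.

```python
# Hypothetical lookup: the physical quantity each equipment part measures.
PART_QUANTITIES = {"pressure gauge": "pressure", "ammeter": "current"}


def context_check(context, intent, quantity=None):
    """Return a clarification or reminder prompted by the context, or None.

    context: dict with optional keys "pointed_part" (str) and "powered" (bool).
    intent: "read_value" or "disassemble" (illustrative intent names).
    quantity: the quantity the user asked about, for "read_value" queries.
    """
    if intent == "disassemble" and context.get("powered"):
        return "Reminder: the equipment is still plugged in."
    if intent == "read_value":
        part = context.get("pointed_part")
        measured = PART_QUANTITIES.get(part)
        if measured is not None and quantity != measured:
            return f"The {part} measures {measured}, not {quantity}. Which did you mean?"
    return None


ctx = {"pointed_part": "pressure gauge", "powered": True}
print(context_check(ctx, "read_value", "current"))
print(context_check(ctx, "disassemble"))
```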

The diagram below shows the basic flow of the user’s interaction with the tool. The scheme is generic and can serve as a basis for many use cases in maintenance and troubleshooting procedures.

Multi-modal context-aware conversation diagram