Why computer vision APIs won’t do the trick for verticalized applications: Heuritech’s take on Fashion
Beyond the AI hype, significant new possibilities in the world of computer vision have arisen in the last few years. However, deploying computer vision solutions still requires expert vision knowledge, business understanding, solid engineering and smart processes. I’ll lay out the challenges of computer vision applied to a vertical such as fashion, and how we solved them at Heuritech.
The Fashion World: social media is a game-changer
The Fashion Industry has been undergoing a major transformation:
A few years ago, fashion trends would come from the top of the pyramid, while now, millions of people can potentially influence a brand’s reputation.
It has resulted in trends continuously popping up from everywhere, millions of new products being launched every day and even more content posted online… in the form of images.
This has put extra pressure on Fashion teams. Their eye and intuition are their superpower, but with millions of new signals every day, they need extra help to spot the relevant waves and catch them in time.
That is where the need for solid, tailored technology comes in: making sense of all the content posted every day in a relevant and actionable way. This is why we decided to pour our energy into building a powerful Computer Vision technology that precisely matches this need.
Could we have done it using today’s offerings? Let’s dive in.
Diving into the world of computer vision
There are generally 3 ways to go to tackle any computer vision problem at a business scale:
1. Use a generic API such as Google Cloud Vision, Amazon Rekognition, Clarifai, etc.
2. Build your own system, starting from Open Source algorithms and your own dataset;
3. Use a domain-specific service trained specifically for your problem.
It’s difficult today to assess the strengths and weaknesses of these 3 solutions, mainly because they are advertised well beyond what they actually do.
Most importantly, no solution suits all problems; rather, you should weigh each solution against the nature of your problem. We developed a very thorough methodology to understand precisely the nature of a problem and find the most relevant solution — stay tuned, as we’ll release the full methodology soon.
Let’s apply it to our problem: the analysis of Fashion Images and videos.
Computer vision for Fashion
Here is the list of technical criteria to understand the inputs and outputs of our system, as well as the technical constraints.
Qualifying inputs (images/videos) and outputs (elements we want to detect):
1. Class Granularity: In computer vision, we define a “Class” as an element or attribute to detect in an image. Here, we want to detect precise elements of Fashion: identify each garment, accessory, shape, attribute, color, pattern, style and even the exact product when it is identifiable.
2. Image diversity: Images/Videos may come from any user, and as such are not normalized. There are plenty of contexts, zooms and resolutions, lighting conditions, etc. This makes the problem much, much more difficult than it would be on standardized pictures.
3. Class (think object) variability: Many of the classes we want to detect, “handbag”, “floral texture”, etc. may take several forms: a handbag might be a tote bag or a backpack, which are very different visually.
4. Class (think object) deformations: All the objects we are detecting may be seen under many different angles, deformations, occlusions (i.e. the handbag is worn and partially hidden).
5. Class Evolvability: We have a very large set of classes, organised hierarchically, and constantly evolving, as we want to detect new products or new attributes.
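A hierarchical, constantly evolving set of classes can be sketched as a small taxonomy structure. This is purely illustrative code with hypothetical class names, not Heuritech’s actual taxonomy:

```python
# Minimal sketch of a hierarchical, evolvable class taxonomy.
# Class names below are illustrative examples, not a real product taxonomy.

class Taxonomy:
    def __init__(self):
        self.children = {}   # parent class -> set of child classes
        self.parent = {}     # child class -> parent class

    def add(self, name, parent=None):
        """Register a new class under an optional parent (5. Class evolvability)."""
        self.children.setdefault(name, set())
        if parent is not None:
            self.children.setdefault(parent, set()).add(name)
            self.parent[name] = parent

    def ancestors(self, name):
        """Walk up the hierarchy, e.g. 'tote bag' -> 'handbag' -> 'bag'."""
        out = []
        while name in self.parent:
            name = self.parent[name]
            out.append(name)
        return out

tax = Taxonomy()
tax.add("bag")
tax.add("handbag", parent="bag")
tax.add("tote bag", parent="handbag")
# A new class can be registered later without touching existing ones:
tax.add("backpack", parent="bag")

print(tax.ancestors("tote bag"))  # ['handbag', 'bag']
```

The hierarchy lets a detection of a fine class ("tote bag") automatically count toward its coarser parents ("handbag", "bag") when aggregating trends.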
What are our constraints, in technical terms?
6. Precision / recall: The output of our product and attribute detection should always be relevant (we need near-perfect precision), but it is OK if a few items are missed (we need good recall). To track trends and compare the performance of one product against another, we need to control recall precisely (i.e. reach the same recall level, for instance 90%, for each product).
Keep in mind that this could be very different for another problem. In image moderation, for instance, the system shouldn’t miss any transgressive image (near-perfect recall), but it’s OK if it rejects a few too many images (good precision).
7. Dataset availability and quality: There is no available good quality dataset corresponding to the different classes we want to detect. Unfortunately, this is almost always how it goes with Machine Learning.
8. Scaling, speed and deployment:
- We need to analyse 1M-10M images / videos a day.
- We don’t need to be realtime (we can process images each day and have the results the day after).
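Constraint 8 can be sized with back-of-envelope arithmetic. The per-image latency and per-GPU throughput below are assumptions for illustration, not measured figures:

```python
# Rough sizing for the 1M-10M images/day, non-realtime constraint.
# The per-image latency (and hence GPU count) is an assumed, illustrative value.
images_per_day = 10_000_000
seconds_per_day = 24 * 3600
required_throughput = images_per_day / seconds_per_day  # sustained images/second

per_image_seconds = 0.05                      # assumed GPU latency per detection pass
per_gpu_throughput = 1 / per_image_seconds    # 20 images/second per GPU
gpus_needed = required_throughput / per_gpu_throughput

print(f"~{required_throughput:.0f} img/s sustained, "
      f"~{gpus_needed:.1f} GPUs at {per_gpu_throughput:.0f} img/s each")
```

Because the pipeline is not realtime, this load can also be smoothed with batching and queueing rather than provisioned for peaks.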
This analysis gives us a much sharper assessment of our problem.
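The recall-control part of constraint 6 can be sketched as a per-class threshold calibration on a held-out validation set. The function and data below are illustrative, not our production code:

```python
# Sketch: calibrate a per-class decision threshold so each class reaches the
# same target recall (e.g. 90%), assuming a labeled held-out validation set.
import numpy as np

def threshold_for_recall(scores, labels, target_recall=0.9):
    """Return the highest score threshold whose recall >= target_recall.

    scores: model confidences for one class; labels: 1 if the class is present.
    Raising the threshold trades recall for precision.
    """
    pos_scores = np.sort(scores[labels == 1])[::-1]  # positives, descending
    n_pos = len(pos_scores)
    # Keep just enough positives above the threshold to reach the target recall.
    k = int(np.ceil(target_recall * n_pos))
    return pos_scores[k - 1]

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
# Toy setup: positives tend to score higher than negatives.
scores = np.where(labels == 1,
                  rng.uniform(0.4, 1.0, 1000),
                  rng.uniform(0.0, 0.6, 1000))

t = threshold_for_recall(scores, labels, target_recall=0.9)
recall = ((scores >= t) & (labels == 1)).sum() / (labels == 1).sum()
print(f"threshold={t:.3f}, recall={recall:.3f}")  # recall >= 0.9 by construction
```

Calibrating one threshold per class is what makes recall levels comparable across products, so a trend curve for one product can be read against another.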
Is our problem solved by current offers?
As images may come from any source, we absolutely need solutions that can handle high 2. Image diversity. We may then turn to APIs, but we encounter the following problems:
- Poor 1. Class Granularity and 5. Class evolvability: this makes these solutions irrelevant for fashion trend spotting. Knowing there’s a “dress” and a “tree” in a picture doesn’t bring any value; the business value only appears when we get to domain-specific (here Fashion, including specific styles, brands, patterns…) tagging.
- The 6. Precision / recall for coarse-grained classes is good, but recall is lacking for finer-grained classes. If you want to know more about the actual performance of these APIs, here is an excellent article about them.
- APIs operate at the full-image level, whereas we want to operate at the level of each garment and qualify it precisely. If an API returns ‘denim’, ‘jacket’ and ‘pants’, we wouldn’t know whether it’s the jacket or the pants that are made of denim.
- Super expensive: at large-scale analysis, such as a few million images per day, monthly bills can easily climb to several hundred thousand dollars.
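The image-level vs. object-level distinction above can be made concrete with a toy data structure. The schema and values are hypothetical, not an actual API response:

```python
# Illustrative contrast between image-level tags and object-level detections.
# The schema, bounding boxes and attribute values are invented for this example.

image_level = ["denim", "jacket", "pants"]  # cannot tell which item is denim

object_level = [
    {"object": "jacket",
     "bbox": [40, 30, 210, 260],
     "attributes": {"fabric": "denim", "color": "blue"}},
    {"object": "pants",
     "bbox": [60, 250, 200, 480],
     "attributes": {"fabric": "cotton", "color": "black"}},
]

# With object-level output, attribute questions become unambiguous:
denim_items = [d["object"] for d in object_level
               if d["attributes"]["fabric"] == "denim"]
print(denim_items)  # ['jacket']
```

Attaching attributes to localized objects, rather than to the whole image, is what makes per-product trend statistics possible in the first place.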
To illustrate these points, we ran the following tests using public APIs from Clarifai, Google, Microsoft and Amazon. We report only the most confident scores provided by these APIs, and exclude colors, as they only provide color analysis at the global image scale, not per object.
Comparison of general purpose APIs and Heuritech’s solution
Building your own system
As the public APIs do not match our needs, we might want to turn to open-source tools and quickly deploy a solution with them.
Open-source tools are blossoming, and we hear about them because they are well advertised by the internet giants. It’s now possible to quickly build a nice-looking demo with pretrained networks. If you have a competent team, it’s also becoming easier to train your own model specific to your problem. You’d need to create a high-quality dataset, usually with ~1000 images per class, and the output could be a nice-looking demo tailored to your problem.
However, appearances here are strongly deceiving.
Once you’ve played with the open-source tool provided by Facebook and built a nice demo, you have only done 1% of the work required to get a system that actually works in production at scale.
In addition to the engineering and research skills required to build such a system, major hindrances will come from 5. Class evolvability, 7. Dataset availability and quality and 8. Scaling, speed and deployment.
At Heuritech, we went through this hassle, and built a solution from scratch, specialized in Fashion, which identifies each part of the image with very high accuracy, and characterizes precisely these objects.
Technology at Heuritech
Our technology relies on a whole pipeline of deep learning algorithms more advanced than tagging, known as object detection and segmentation. This is critical to deal with the 2. Image diversity and 4. Class deformations.
State-of-the-art algorithms — we developed our own set of algorithms (both training methods and neural network architectures) which achieve superior performance in pre-training, open domain, and hierarchically structured outputs. This mainly improves the point 6. Precision / recall. We started to publish our methods in international conferences (ICCV 2017 [4, 5]), expect more to come!
Domain-specific dataset — accurately defining the different classes to detect, such as the different fabrics, textures, shapes and styles of clothes, and gathering data to train the models is a difficult task which requires both expert domain knowledge and computer vision knowledge. We built a whole process enabling experts to quickly reach agreement. This enabled us to achieve 5. Class evolvability by growing our library of recognised attributes and products to 2000+, adding more than 500 a month.
Active learning and knowledge distillation — the selection of relevant images to send to manual labeling is critical for good training performance. Rather than having armies of people labeling thousands of random images, we automatically select the images on which the current model is the most uncertain, and which will give the most information during training. This is coupled with several methods to make use of other available training signals (weak labeling [4], knowledge distillation [1]).
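Uncertainty-based selection, the core idea behind that active-learning loop, can be sketched in a few lines. This is an illustrative entropy-based version, not Heuritech’s exact criterion:

```python
# Sketch of uncertainty sampling for active learning: from a large unlabeled
# pool, pick the images whose predicted class distribution has the highest
# entropy, and send only those for manual labeling. Purely illustrative.
import numpy as np

def select_most_uncertain(probs, k):
    """probs: (n_images, n_classes) softmax outputs.
    Returns indices of the k images the model is least sure about."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[::-1][:k]

probs = np.array([
    [0.98, 0.01, 0.01],   # confident -> low entropy, no need to label
    [0.34, 0.33, 0.33],   # very uncertain -> label this one first
    [0.70, 0.20, 0.10],   # somewhat uncertain
])
print(select_most_uncertain(probs, k=1))  # [1]
```

Labeling only the most informative images is what keeps annotation budgets manageable while the class library keeps growing.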
Engineering and deploying — Our whole system is designed to scale: models are re-trained on a daily basis, new classes are added every week, and the whole process is versioned (training datasets, testing datasets, models, parameters, set of classes). The vision pipeline is deployed at scale, processing 2M+ images a day.
These are the main features enabling Heuritech to build a production ready visual analysis system.
We are also conducting cutting edge applied research focused on detection of trends (clustering of similar visual styles), detecting style influence, integrating both textual and visual information for detection.
If you’re interested in knowing more, please get in touch!
While generic APIs seem to be the future of AI for computer vision, today’s computer vision API offerings cover few use cases with strong added value. In fact, it will probably remain that way, because specific problems call for tailored solutions, where the business value comes from the fit between the computer vision system and the business needs.
Building your own solution, even when relying on open-source frameworks, takes a considerable amount of time, talent, and coordination between business and technical teams. Well, if you can afford 15 engineers and 8 machine learning PhDs for 3 years, this might be your best bet ;)
At Heuritech, we believe that only domain-specific solutions will bring substantial value to businesses. In that light, we believe several strong tech teams will arise and tackle different domains, while it’s unlikely that generic APIs will become experts in every specific field. Today, our domain is Fashion, and we aim to build the best vision system for Fashion. Only then will we apply our technology and processes to other, related domains.
Some of the world leaders in Fashion already use Heuritech to predict trends and to measure the performance of their products in the market.
Tomorrow, we will empower them to predict demand for their products, including, for the first time, what’s happening outside, beyond their own perspective.
The Author: Charles Ollion, Co-Founder and research director at Heuritech. Lecturer in Deep Learning at Paris-Saclay, Polytechnique and EPITA. Big thanks to Charlotte Fanneau for helpful insights. Please feel free to comment and notify me at email@example.com. Follow Heuritech here.
[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network.” stat 1050 (2015): 9.
[2] He, Kaiming, et al. “Mask R-CNN.” International Conference on Computer Vision (ICCV), 2017. https://research.fb.com/facebook-open-sources-detectron/
[3] Ollion, Charles, and Olivier Grisel. Open-source deep learning lectures and code labs, 2017. Polytechnique / Paris-Saclay Data Science. https://m2dsupsdlclass.github.io/lectures-labs/
[4] Corbiere, Charles, et al. “Leveraging Weakly Annotated Data for Fashion Image Retrieval and Label Prediction.” Fashion Workshop at the International Conference on Computer Vision (ICCV), 2017.
[5] Ben-younes, Hedi, et al. “MUTAN: Multimodal Tucker Fusion for Visual Question Answering.” The IEEE International Conference on Computer Vision (ICCV), 2017.