Starship is building a fleet of robots to deliver packages locally on demand. To achieve this, robots need to be safe, polite, and fast. But how do you achieve this with low computing resources and without expensive sensors such as LIDARs? It’s the engineering reality you have to face, unless you live in a universe where customers willingly pay $ 100 for a delivery.
To begin with, the robots start by detecting the world with radars, a multitude of cameras and ultrasound.
However, the challenge is that most of this knowledge is low level and not semantic. For example, a robot may sense that an object is ten meters away, but without knowing the object category, it is difficult to make safe driving decisions.
Machine learning through neural networks is surprisingly useful in converting this unstructured low-level data into higher-level information.
Starship robots mainly drive on sidewalks and cross streets when they need to. This poses a different set of challenges than self-driving cars. Traffic on automobile roads is more structured and predictable. Cars move along the lanes and don’t change direction too often while humans often stop abruptly, meander, may be accompanied by a dog on a leash, and do not signal their intentions with turn signals.
To understand the surrounding environment in real time, a central component of the robot is an object detection module – a program that captures images and returns a list of boxes of objects.
That’s great, but how do you write such a program?
An image is a large three-dimensional array made up of a myriad of numbers representing the intensities of pixels. These values change considerably when the image is taken at night instead of day; when the color, scale or position of the object changes, or when the object itself is truncated or occluded.
For some complex problems, teaching is more natural than programming.
In the robot software we have a set of trainable units, mainly neural networks, where the code is written by the model itself. The program is represented by a set of weights.
At the beginning, these numbers are initialized randomly and the program output is also random. Engineers present sample models of what they would like to predict and ask the network to improve the next time it sees a similar entry. By changing the weights iteratively, the optimization algorithm searches for programs that more and more accurately predict bounding boxes.
However, one needs to think deeply about the examples that are used to train the model.
- Should the model be penalized or rewarded when it detects a car in a reflection of window?
- What should he do when he detects an image of a human on a poster?
- Does a car trailer full of cars have to be annotated as an entity or must each of the cars be annotated separately?
These are all examples that occurred during the construction of the object detection module in our robots.
When learning a machine, big data is simply not enough. The data collected must be rich and varied. For example, using only uniformly sampled images and then annotating them would show many pedestrians and cars, but the model would lack examples of motorcycles or skaters to reliably detect these categories.
The team should specifically look for concrete examples and rare cases, otherwise the model would not progress. Starship operates in several different countries and varying weather conditions enrich the set of examples. Many people were surprised when the Starship delivery robots ran during the snowstorm “Emma” UK, however, airports and schools remained closed.
At the same time, annotating data takes time and resources. Ideally, it is better to train and improve models with less data. This is where architectural engineering comes in. We encode prior knowledge into architecture and optimization processes to narrow search space to programs that are more likely in the real world.
In some computer vision applications such as segmentation by pixel, it is useful for the model to know if the robot is on a sidewalk or a railway crossing. To provide a clue, we encode global clues at the image level into the architecture of the neural network; the model then determines whether or not to use it without having to learn it from scratch.
After data and architecture engineering, the model might work well. However, deep learning models require a significant amount of computing power, and this is a big challenge for the team as we cannot take advantage of the most powerful graphics cards on low cost delivery robots. battery powered.
Starship wants our shipments to be cheap, which means our equipment has to be cheap. This is the same reason Starship doesn’t use LIDAR (a detection system that works on the radar principle, but uses light from a laser) that would make the world easier to understand – but we don’t want that. our customers pay more. than what they need for delivery.
State-of-the-art object detection systems published in academic papers operate at around 5 frames per second [MaskRCNN], and real-time object detection papers do not report rates significantly above 100 FPS [Light-Head R-CNN, tiny-YOLO, tiny-DSOD]. In addition, these figures are shown on a single image; however, we need a 360 degree understanding (the equivalent of processing about 5 single images).
To provide perspective, Starship models run over 2000 FPS when measured on a consumer GPU and process a full 360-degree panoramic image in a single pass. This equates to 10,000 FPS when processing 5 single images with batch size 1.
Neural networks are better than humans at many visual issues, although they can still contain bugs. For example, a selection frame may be too wide, the trust too low, or an object may be hallucinated in a place that is actually empty.
Fixing these bugs is a challenge.
Neural networks are seen as black boxes that are difficult to analyze and understand. However, to improve the model, engineers must understand the failure cases and delve into the specifics of what the model has learned.
The model is represented by a set of weights, and one can visualize what each specific neuron is trying to detect. For example, the first layers of the Starship network activate in standard patterns such as horizontal and vertical edges. The next block of layers detects more complex textures, while the upper layers detect car parts and complete objects.
Technical debt takes on another meaning with machine learning models. Engineers are constantly improving architectures, optimization processes, and datasets. The model thus becomes more precise. However, changing the detection model for a better model does not necessarily guarantee the success of the overall behavior of a robot.
There are dozens of components that use the output of the object detection model, each requiring different precision and level of recall defined based on the existing model. However, the new model can act differently in a number of ways. For example, the exit probability distribution could be biased towards higher values or be wider. Even though the average performance is better, it can be worse for a specific group like big cars. To avoid these obstacles, the team calibrates the probabilities and checks the regressions on several sets of stratified data.
Monitoring trainable software components poses a different set of challenges than monitoring standard software. Few concerns are given regarding inference time or memory usage, as they are mostly constant.
However, moving the dataset becomes the primary concern – the distribution of data used to train the model is different from where the model is currently deployed.
For example, all of a sudden electric scooters are driving on the sidewalks. If the model did not take this class into account, the model will have trouble classifying it correctly. The information derived from the object detection module will be in disagreement with other sensory information, which will lead to the request for assistance from human operators and therefore a slowdown in deliveries.
Neural networks allow Starship robots to be safe on level crossings avoiding obstacles like cars and on sidewalks by understanding all the different directions humans and other obstacles may choose to go.
Starship robots achieve this using inexpensive hardware that poses a lot of engineering challenges, but makes robot deliveries a strong reality today. Starship robots make real deliveries seven days a week to cities around the world, and it’s gratifying to see how our technology is continually bringing more convenience to people in their lives.