An e-bike rider types “reset Kiox 300 display,” and the answer should land in a heartbeat: not as a 200-page handbook or a dozen near-miss FAQ links. The same expectation applies to mechanics updating brake firmware in a loud workshop and to sales reps looking up torque specs on weak showroom Wi-Fi. Bosch eBike Systems, an independent business division within the Bosch Group, serves millions of pages of manuals, release notes, and CAD drawings in twenty-seven languages. Roughly 5 percent of that content changes every month. But for Bosch eBike Systems, this wasn’t just about efficiency; it was about elevating the customer experience and ensuring seamless support for riders, dealers, and service partners worldwide. Meeting expectations like these forced us at Bosch Digital to leave plain keyword search behind and build a retrieval engine that understands intent across languages, keeps costs predictable, and still answers in under a second.
Let’s talk about why the old approach just couldn’t keep up. The world of bikes, and bike documentation, is wild with synonyms, part nicknames, and shifting terminology. “Display,” “NYON2,” or “BUI350” might all mean the same thing to a rider, but a bag-of-words search engine treats each as a stranger. Recall falls off a cliff unless you’re willing to hand-craft endless synonym lists.
Typographical quirks and voice-to-text slip-ups don’t help. Real-world queries show up as “Kioxx 300,” “réinitialiser kios,” or, thanks to muffled microphones, as voice-recognition garble like “reset chaos 300.” Exact-token searches? They just shrug and show “No results.” Embedding-based search, in contrast, is far more forgiving of noisy input.
Intent also gets lost in translation, especially for complex or constraint-laden queries. Someone might type, “Update brake firmware with no laptop” or “max torque under rain mode only.” Keyword search latches onto negated terms (“laptop”) and dredges up the wrong docs. Modern transformer models, by contrast, grasp what the user really meant and rank results accordingly.
Combine all these headaches (synonyms, noisy input, intent confusion, fast-changing multilingual content) and you’ve got the main reasons keyword search kept missing the mark. For Bosch Digital, moving to a vector-based, multilingual SmartSearch wasn’t an upgrade. It was survival.
Once we mapped out every pitfall of traditional keyword search, it was time to rethink the pipeline from the ground up. Today, every answer SmartSearch delivers takes a precise, three-step journey from raw document to ranked result, a journey engineered for speed, accuracy, and multilingual scale.
Step one: Crawling. Our self-developed Rust-based crawler zips through about 25 webpages per second, swiftly navigating huge documentation libraries while remaining polite enough never to trip rate limits: a digital librarian who reads fast but never ruffles feathers.
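To make the pacing concrete, here is a minimal sketch of that politeness pattern, in Python rather than the Rust our production crawler actually uses; the concurrency cap, delay value, and function names are illustrative, not our real implementation:

```python
import asyncio

import aiohttp

MAX_CONCURRENT = 25   # global requests in flight (illustrative)
PER_HOST_DELAY = 0.5  # seconds between hits to the same host (illustrative)

semaphore = asyncio.Semaphore(MAX_CONCURRENT)
host_locks: dict[str, asyncio.Lock] = {}

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    host = url.split("/")[2]
    lock = host_locks.setdefault(host, asyncio.Lock())
    async with semaphore:
        async with lock:  # one request per host at a time
            async with session.get(url) as resp:
                body = await resp.text()
            await asyncio.sleep(PER_HOST_DELAY)  # stay polite to this host
    return body

async def crawl(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))
```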
Step two: Chunking before embedding. HTML gets dissected to separate titles from contents, and semantically coherent topics are stitched together using LLMs. Then come the embeddings. Thanks to OpenAI’s Ada 002 model (with a hefty 1536 dimensions), every content chunk lands precisely in semantic space. If it quacks like “reset Kiox 300,” our system will surface answers, even when the exact wording is wildly different.
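In code, the embedding step boils down to a single batched API call. A minimal sketch (the chunk texts here are invented examples, not our corpus):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical chunks produced by the HTML-dissection and LLM chunking step.
chunks = [
    "Kiox 300: hold the power button for 10 seconds to restart the display.",
    "Kiox 300: Halten Sie die Einschalttaste 10 Sekunden gedrückt.",
]

resp = client.embeddings.create(
    model="text-embedding-ada-002",  # 1536-dimensional vectors
    input=chunks,
)
vectors = [item.embedding for item in resp.data]
print(len(vectors[0]))  # -> 1536
```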
Step three: Rank results using a hybrid approach. Semantic search isn’t always the best tool. Dense vectors live in a vector database, while BM25 keeps classic keyword search in the mix. At query time, we blend the two (70% semantic, 30% sparse), then run the finalists through a MiniLM cross-encoder for the decisive sort. The result? Answers typically appear in about 750 ms, with 95% delivered in under a second and a half, even during those infamous firmware-release stampedes.
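Conceptually, the blend is a weighted sum of normalized scores followed by a cross-encoder pass. A simplified sketch, using a public MiniLM checkpoint as a stand-in for the production reranker:

```python
from sentence_transformers import CrossEncoder

def fuse(dense: dict[str, float], sparse: dict[str, float],
         w_dense: float = 0.7, w_sparse: float = 0.3) -> dict[str, float]:
    """Weighted fusion of min-max-normalized dense and BM25 scores."""
    def norm(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    d, s = norm(dense), norm(sparse)
    return {doc: w_dense * d.get(doc, 0.0) + w_sparse * s.get(doc, 0.0)
            for doc in set(d) | set(s)}

# Public checkpoint shown as a stand-in for our MiniLM cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```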
But all this performance didn’t come without pain. Building SmartSearch meant ramming into hard limits: a 10-million-vector-per-collection cap, painful re-indexing every time we added metadata, storage bills bloated by 32-bit floats, and no elegant way to compress, quantize, or tier storage out to cheaper SSDs. Scale much beyond eight million vectors and everything slowed to a crawl.
SmartSearch forced us to evolve how we crawl, structure, represent, and rank, leaving the constraints of generic search infrastructure behind. The result is nimble, cost-effective, and fluent in every dialect your e-bike manuals throw at it.
With search bars, ten imperfect links might do. But for Bosch eBike Systems to deploy this as a conversational assistant for its global user base, there’s no room for error: the bot usually has only one shot. The very first retrieval must be laser-accurate, because every token we hand off to an LLM costs real money, and user trust evaporates if the bot’s opening statement misses the mark. Chat also explodes the data scale. Now we’re not just retrieving from documentation, but juggling huge conversational histories and real-time follow-ups. Hundreds of thousands of chat snippets, in the form of short-term and long-term memory, have to be stored, searched, and surfaced in milliseconds. Here, the cracks in our previous vector store yawned open: hard vector-count limits, glacial re-index times, zero support for quantization or built-in multi-stage queries, and an insistence on keeping all vectors on disk, bloating budgets and bottlenecking speed. Every shortcoming of the old architecture was amplified by chat’s relentless demand for cheaper, smarter, and more scalable retrieval.
Enter Qdrant. After pitting several vector databases against our most punishing workloads, Qdrant won hands-down. On a 25k-query multilingual test set, it delivered recall above 0.96 with quantization, kept p95 latency under 120 ms with 400 concurrent chats, and cut the storage costs for our 10M-vector dataset by 16x through quantization. Qdrant didn’t just handle chat’s challenges; it thrived on them. Suddenly, lightning-fast, chat-scale retrieval was not only possible, it was affordable.
Our first prototype spoke fluent relevance but was a glutton for storage. Every text chunk wrapped itself in an enormous 1536-dimensional Ada-002 vector: millions of high-precision floats devouring our SSDs by the rackful. Something had to give.
The breakthrough came with Jina Embeddings v3. Flip a flag and you get binary-quantized embeddings over a 1024-dimension vector; flip another flag and that 1024-dimension vector can be shrunk, with the power of Matryoshka Representation Learning, down to as few as 64 dimensions. After extensive internal testing on recall quality, we found the best performance-to-quality ratio at 256 dimensions. Overnight, the footprint dropped by ninety-eight percent, and search quality even crept up over Ada-002. In recent evaluations, this setup outperformed Ada-003 and left a few MTEB chart-toppers in the dust (we’ll evaluate the Qwen3 embedding model next). And thanks to our fine-tuned ModernBERT re-ranker, any minuscule loss vanishes completely.
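The mechanics are pleasantly simple: keep the first N components of a Matryoshka-trained vector, re-normalize, then binarize. A sketch of the idea (the random vector stands in for a real Jina v3 embedding):

```python
import numpy as np

def matryoshka_truncate(vec: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka-trained embedding
    and re-normalize, so cosine similarity stays meaningful."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

def binary_quantize(vec: np.ndarray) -> np.ndarray:
    """One bit per dimension: positive components become 1, the rest 0.
    256 dims pack into 32 bytes, versus 6 KB for a 1536-dim float32 vector."""
    return np.packbits(vec > 0)

full = np.random.randn(1024).astype(np.float32)  # stand-in for a Jina v3 vector
small = matryoshka_truncate(full)                # 256 floats
bits = binary_quantize(small)                    # 32 bytes
```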
Qdrant turns these slimmed vectors into lightning answers. Because it natively understands multi-stage retrieval, we now run a two-stage search: a blistering-fast 256-dimension recall phase fused with BM25, then a fine-tuned reranker based on ModernBERT for pinpoint precision. This is what an ultra-lean operation should look like.
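With Qdrant’s Query API, both stages travel in one request: a wide prefetch over the quantized vectors, then a tighter rescoring pass whose survivors go to the reranker. A sketch with an illustrative collection name and a stubbed query vector:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

query_vec = [0.0] * 256  # stand-in for the embedded, truncated query

hits = client.query_points(
    collection_name="smartsearch",  # illustrative name
    prefetch=models.Prefetch(
        query=query_vec,
        limit=200,  # stage 1: wide, cheap recall over quantized codes
        params=models.SearchParams(
            quantization=models.QuantizationSearchParams(rescore=False),
        ),
    ),
    query=query_vec,  # stage 2: rescore the survivors at full precision
    limit=20,         # candidates handed to the ModernBERT reranker
)
candidates = [p.payload.get("text") for p in hits.points]
```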
Most importantly, Qdrant’s tiered storage lets us keep hot shards in RAM and cold vectors chilling on SSD, cutting storage again for a total reduction of 5x while p95 latency stays well under 400 ms. Hybrid search? Dense scores blend seamlessly with BM25 in the same API call, so typo-riddled and perfectly spelled queries get equal love.
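On the configuration side, the hot/cold split is declared when the collection is created: full-precision vectors go to disk, quantized codes stay resident in RAM. A sketch with illustrative names, based on Qdrant’s documented options:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

client.create_collection(
    collection_name="smartsearch",  # illustrative name
    vectors_config=models.VectorParams(
        size=256,
        distance=models.Distance.COSINE,
        on_disk=True,  # cold full-precision vectors live on SSD
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,  # hot 1-bit codes stay in RAM for the fast pass
        ),
    ),
)
```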
The result: the answer to “reset Kiox 300” flashes onto a rider’s screen before the traffic light turns green. Lighter vectors today, headroom for even slimmer ones tomorrow, and no compromises in quality. This is SmartSearch at chat speed: fast, frugal, and fiercely precise, a perfect backbone for our assistant.
By now, our assistant could find relevant documents with impressive speed and accuracy, but it still stumbled where it mattered most: names. “My Kiox 300 flashes 503 after the v1.7.4-B update” and “Nyon freezes on boot” looked almost identical to a language model that didn’t really see products, error codes, or firmware versions, just a blur of nouns and verbs. Context got lost; precision suffered. And bringing in a multi-billion-parameter AI hammer for this problem was pure overkill.
The breakthrough came from an unexpected place: a doomscroll through LinkedIn. There it was: GLiNER, promising universal, lightweight NER (named-entity recognition). Few-shot learning, CPU-fast inference, and a footprint small enough (800 MB) to fit in our Docker image. GLiNER checked every box we didn’t even know we had.
It wasn’t just “easy”; it was transformative. With only a handful of annotated examples (just two for products, two for error codes, and two for firmware versions), GLiNER learned our entire domain in minutes. Inference was nearly instant: less than 30 ms per paragraph, even on a single laptop core.
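Using GLiNER really is that short. A sketch with a public checkpoint standing in for the model we actually ship:

```python
from gliner import GLiNER

# Public multilingual checkpoint, as a stand-in for our deployed model.
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

text = "My Kiox 300 flashes 503 after the v1.7.4-B update"
labels = ["product", "error code", "firmware version"]

for entity in model.predict_entities(text, labels):
    # Expected along the lines of:
    #   Kiox 300 -> product, 503 -> error code, v1.7.4-B -> firmware version
    print(entity["text"], "->", entity["label"])
```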
With labels persisting across chat turns, context sticks. So when a rider says, “Kiox 300 shows 503 after v1.7.4-B,” then follows up with, “Does it also hit CX Gen4?” the assistant keeps every product, error code, and firmware version straight. Each answer is routed with surgical precision: no more mistaking a Kiox for a Nyon, no more guesswork.
All thanks to a LinkedIn scroll, an 800 MB model, and a few lines of labeled text. Names matter. Now, finally, the assistant knows them cold.
Finding the right paragraph is one thing. For the Bosch eBike Systems assistant, tasked with supporting everything from simple inquiries to complex troubleshooting, carrying out a real-world task (filing a warranty claim, gathering the latest firmware links for three different drive units, or guiding a mechanic step by step through a “display reset” in chat) demands something more. A simple pipeline falls short: modern assistants need to reason, plan, coordinate, and act, not just retrieve.
This is where agentic workflows come in.
Instead of funneling every query through a single, monolithic language model (and hoping it never drops a detail), our platform orchestrates a team of specialized AI agents, each with a defined responsibility. Picture a user asking, “My Kiox 300 flashes error 503. Can you check whether my firmware is outdated, tell me how to fix it, and draft a message to support if that doesn’t work?” In the old days, that threw a tangle of ambiguous instructions at a black-box chatbot. Now, agentic workflows break the request into manageable, coordinated steps, each agent picking up what it does best.
The process begins with an orchestrator agent that parses user intent into subtasks: error-code lookup, firmware verification, troubleshooting-guide retrieval, and, if needed, support-ticket drafting. Each subtask is routed to a specialist agent, e.g. a custom reasoning workflow based on product variants and their corresponding information. These agents consult our retrieval backbone (built for precision, even with noisy queries), gather data, cross-check versions, and piece together the findings.
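A heavily simplified skeleton of that orchestration loop; the agent registry, subtask shape, and the stubbed specialist are all hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    kind: str      # e.g. "error_lookup", "firmware_check", "ticket_draft"
    payload: dict

# Registry mapping subtask kinds to specialist agents.
AGENTS: dict[str, Callable[[dict, list[str]], str]] = {}

def agent(kind: str):
    """Decorator registering a specialist agent for one subtask kind."""
    def register(fn: Callable[[dict, list[str]], str]):
        AGENTS[kind] = fn
        return fn
    return register

@agent("error_lookup")
def error_lookup(payload: dict, scratchpad: list[str]) -> str:
    # In production this consults the retrieval backbone; stubbed here.
    return f"top-ranked article for error code {payload['code']}"

def orchestrate(subtasks: list[Subtask]) -> list[str]:
    scratchpad: list[str] = []  # audit log: what was done, and why
    for task in subtasks:
        finding = AGENTS[task.kind](task.payload, scratchpad)
        scratchpad.append(f"{task.kind}: {finding}")
    return scratchpad

# e.g. orchestrate([Subtask("error_lookup", {"code": "503"})])
```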
The upshot? Agentic workflows let our assistant go beyond answering “what”: they let it handle “how” and “what’s next,” chaining knowledge, actions, and even human handover, seamlessly. Whether it’s a simple spec lookup, a multi-step troubleshooting procedure, or orchestrating real-world follow-ups, agentic workflows are the connective tissue behind our assistant’s leap from search box to conversational partner.
We’ve found that this modular, transparent approach doesn’t just improve speed; it brings new peace of mind. When something breaks, the scratchpad log shows exactly what was done (and why). If a process hits a wall, the orchestrator pivots, never leaving the user in limbo and never letting critical details fall through the cracks.
The result: tasks handled start to finish, user intent actually understood, and the confidence that, under the hood, every answer isn’t just the luck of a generative roll but the well-planned output of agents working in concert. That’s the agentic workflow in action: the step change from answers to real assistance.
Scars teach deeper than trophies, so here are the three that still itch (in all the right ways):
We once spent a solid week deduplicating near-identical paragraphs, cutting out boilerplate (“© 2021 Bosch eBike Systems. All rights reserved.”), and flattening FAQ echo chambers until they stopped swallowing fresh questions whole. The improvement in search quality? Bigger than any new encoder, model drop, or clever agent could manage, by a mile. Lesson learned: a clean, well-structured corpus is the cheapest upgrade you’ll never find on Hugging Face, and it makes every downstream agent that much sharper.
Binary quantization and dimension-slimming saved a small fortune on storage and inference. But we bolted these features on after launch, which meant re-encoding 10 million chunks while users were searching live: a gnarly headache nobody needs. Next time, the compression and size targets go on the first whiteboard, right up there with recall, latency, and now, agent-handoff compatibility. Diets work better before the group photo. And it’s not just storage: your embedding model, vector database, chunking strategy, and, yes, agent workflows and communication schemes all need to work together from the start.
LLMs are both a blessing and a budget breaker: latency, cost, and “intelligence” all become make-or-break variables in a multi-agent system. As workflows get agentic (planning, delegating, keeping state), the challenge shifts from “Can we answer this?” to “Can we coordinate this, auditably and efficiently?” Keep the data clean, plan your storage and compute diet early, and never skimp on people who can read between the lines and handle the edge cases. Everything else is just another line on a model card or, now, an agent manifest. This collaborative endeavor, made possible by the strategic investment and close partnership with Bosch eBike Systems, has truly reshaped how information is accessed and used within their ecosystem.
In the end, it’s the painful lessons, not just the pretty graphs, that shaped SmartSearch into the system it is now. And with each round of learning, our answers get a little faster, a little sharper, and maybe, one day, just a bit closer to perfect.

