At the end, the author thinks about adding Common Crawl data. Our ranking information, generated from our web graph, would probably be a big help in picking which pages to crawl.
I love seeing the worked out example at scale -- I'm surprised at how cost effective the vector database was.
This is really, really cool. I had wanted to run all my searches on it, and though that seems possible, I suspect it would cost me a bit of extra time per search. Still, I'll try running some of my searches against it and share my thoughts afterwards. It's a bit hit or miss: it almost lands you at the right spot, just not exactly.
For example, I searched for lemmy hoping to find the fediverse project, but it gave me their Liberapay page instead.
Please actually follow up on that Common Crawl promise, and maybe archive.org or other websites too. Billions are being spent in this AI industry; I just hope that, whether through funding or community crowdwork, you can actually succeed in creating such an alternative. People are honestly fed up with the current near-monopoly on search.
Wasn't Ecosia trying to roll out their own search engine? They should definitely take your help or have you on their team.
I just want a decentralized search engine, man. I understand that you want to make it sustainable and that's why you haven't open sourced it, but please: there is honestly so much money going into potholes that do nothing but make our society worse, and this project already works well enough and has insane potential...
Please open source it, and let's hope the community figures out some way of monetization or crowdfunding to actually make it sustainable.
That said, I haven't read the blog post in its entirety, since I was so excited that I just started using the search engine. But the article feels super in-depth, and this idea could definitely help others create their own proofs of concept, or finally build a decent open source search engine once and for all.
Not going to lie, this feels like a little magic, and I am all for it. The more I think about it, the more excited I get; I haven't been this excited about a project in actual months!
I know open source is tough, and I come from a third-world country, but this is so cool that I will donate as much as I can right now. It's not much, around $50, but it's coming from a guy who has never spent a single penny online. I beg you to open source it and use Common Crawl. Either way, I wish you all the best in your life and career, man.
The title should be “10x engineer creates Google in their spare time”
But seriously what an amazing write up, plus animations, analysis etc etc. Bravo.
It was also ironic to see AWS fall short on quite a few use cases here. Stuff to think about.
Not sure where you are based, but if you were in the EU and had no commercial intentions, you might want to consider adding the crawls from OpenWebSearch.eu, an EU-funded research project to provide an open crawl of a substantial part of the Web (they also collaborate with Common Crawl), its plain text and an index:
https://openwebsearch.eu/
It would be fantastic if someone could provide a not-for-profit, decent-quality Web search engine. This wasn't even in the realm of what I thought was possible for a single person to do. Incredible work!
It doesn't seem that far in distance from a commercial search engine? Maybe even Google?
50k to run is a comically small number. I'm tempted to just give you that money to seed it.
Very cool project!
Just out of interest, I sent a query I've had difficulties getting good results for with major engines: "what are some good options for high-resolution ultrawide monitors?".
The response from this engine for this query seems, at this point, to suffer from the same flaw I've seen in other engines: meta-pages "specialising" in broad rankings are preferred over specialist data about the specific sought-after item. It seems that the desire for a ranking weighs the most.
If I were to manually try to answer this query, I would start by looking at hardware forums and geeky blogs, pick N candidates, then try to find the specifications and quirks for all products.
Of course, it is difficult to generically answer if a given website has performed this analysis. It can be favourable to rank sites citing specific data higher in these circumstances.
As a user, I would prefer to be presented with the initial sources used for assembling this analysis. Of course, this doesn't happen because engines don't perform this kind of bottom-to-top evaluation.
It's incredible. I can't believe it but it actually works quite nicely.
If 10K $5 subscriptions can cover its cost, maybe a community run search engine funded through donations isn't that insane?
Mad respect. This is an incredible project, pulling together all these technologies. The crown jewel of a search engine is its ranking algorithm; I'm not sure how the LLM is being used in that regard here.
One effective old technique for ranking is to capture the search-to-click relationship of real users. It's basically training data from humans, mapping the search terms they entered to the links they clicked. With just a few clicks, ranking relevance goes way up.
Maybe feeding that data into a neural net would help ranking. It becomes a classification problem: given these terms, which links have the highest probability of being clicked? More people clicking on a link for a term would strengthen its weights.
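To sketch what I mean (a toy count-based click model rather than a real neural net; all URLs and queries here are made up):

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked_url) pairs gathered from real users.
clicks = [
    ("ultrawide monitor", "https://example.com/monitors"),
    ("ultrawide monitor", "https://example.com/monitors"),
    ("ultrawide monitor", "https://example.org/forum/thread-42"),
    ("bean stew recipe", "https://example.net/recipes/garbanzo"),
]

# Strengthen a (term, url) weight every time a user clicks that url for a query.
weights = defaultdict(float)
for query, url in clicks:
    for term in query.lower().split():
        weights[(term, url)] += 1.0

def rank(query, candidates):
    """Order candidate urls by the learned term->click weights."""
    terms = query.lower().split()
    return sorted(
        candidates,
        key=lambda url: sum(weights[(t, url)] for t in terms),
        reverse=True,
    )

urls = ["https://example.org/forum/thread-42", "https://example.com/monitors"]
print(rank("ultrawide monitor", urls)[0])  # the most-clicked page ranks first
```

A neural net generalizes this beyond exact (term, url) pairs, but the training signal is the same.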
Thank you for sharing! This is one of the coolest articles I have seen in a while on HN. I did some searches and I think the search results looked very useful so far. I particularly loved about your article that most of the questions I had while reading got answered in a most structured way.
I still have questions:
* How long do you plan to keep the live demo up?
* Are you planning to make the source code public?
* How many hours in total did you invest into this "hobby project" in the two months you mentioned in your write-up?
A vector-only search engine will fail for a lot of common use cases where the keywords do matter. I tried searching for `garbanzo bean stew` and got totally irrelevant bean recipes.
This was a great write-up!
Didn't you run into Cloudflare blocks? Many sites are using things like browser fingerprinting. I'd imagine this would be an issue with news sites particularly, as many of them will show the full content only to Google Bot, but not anyone else. Which I have long thought of as an underappreciated moat that Google has in the search market. I was surprised that this topic wasn't mentioned at all in your article. Was it not an issue, or did you just prefer to leave it out?
You also mentioned nothing about URL de-duplication. Things like "trailing slash or no trailing slash", "query params or no query params", "www or no www". Did you have your crawlers just follow all URLs as they encountered them, handling duplication only at the content level (e.g. using trigrams)? It sounds like that would be wasteful, as you might end up making requests to potentially 2x or more the number of URLs that you'd need to.
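For reference, this is the kind of canonicalization I mean, sketched in Python (the tracking-parameter list is just illustrative, and real crawlers have many more rules):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Common tracking parameters that don't change page content (illustrative list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL before enqueueing it, so trivial variants dedupe."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Drop the trailing slash (but keep the root path "/").
    if path.endswith("/") and path != "/":
        path = path.rstrip("/")
    # Strip tracking params and sort the rest for a stable ordering.
    params = sorted(
        (k, v) for k, v in parse_qsl(query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    )
    return urlunsplit((scheme.lower(), netloc, path, urlencode(params), ""))

print(normalize_url("https://WWW.Example.com/a/?utm_source=x&b=2&a=1"))
# → https://example.com/a?a=1&b=2
```

Deduping on the normalized form before fetching avoids ever issuing the redundant requests, instead of discovering the duplicates after download.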
Thanks.
One of the most insightful posts I’ve read recently. I especially enjoy the rationale behind the options you chose to reduce costs and going into detail on where you find the most savings.
I know the post primarily focuses on neural search, but I'm wondering whether you tried a hybrid BM25 + embeddings search, and if so, whether it led to any improvements. Also, what reranking models did you find most useful and cost-efficient?
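For context, one common way to do the hybrid part is reciprocal rank fusion (RRF) over the two result lists; here's a toy sketch with made-up document ids:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids.
    Each doc scores sum(1 / (k + rank)) over the lists that contain it."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a BM25 index and an embedding (ANN) index.
bm25_hits = ["doc_keyword", "doc_both", "doc_rare"]
vector_hits = ["doc_semantic", "doc_both", "doc_other"]

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused[0])  # → doc_both (returned by both retrievers, so it wins)
```

The nice property is that it only needs ranks, not scores, so the two retrievers' incompatible scoring scales never have to be calibrated against each other.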
That stack element is amazing
I wish more people showed their whole exploded stack like that, and in such an elegant way
Really well done writeup!
This is incredibly, incredibly cool. Creating a search engine that beats Google in quality in just two months and for less than a thousand dollars.
Really great idea about the federated search index too! YaCy has it but it's really heavy and never really gave good results for me.
"There was one surprise when I revisited costs: OpenAI charges an unusually low $0.0001 / 1M tokens for batch inference on their latest embedding model. Even conservatively assuming I had 1 billion crawled pages, each with 1K tokens (abnormally long), it would only cost $100 to generate embeddings for all of them. By comparison, running my own inference, even with cheap Runpod spot GPUs, would cost on the order of 100× more expensive, to say nothing of other APIs."
I wonder if OpenAI uses this as a honeypot to get domain-specific source data into its training corpus that it might otherwise not have access to.
The author claims "It should be far less susceptible to keyword spam and SEO tactics." However, anyone with a cursory knowledge of the limitations of embeddings/LLMs knows the hardest part is that there is no separation between the prompt and the content being queried (e.g. "ignore all previous instructions", etc.). It would not be hard to adversarially generate embeddings for SEO; in fact, it's almost easier, since you know the math underlying the algorithm you're fitting to.
This was super cool to read. I'm developing something somewhat similar, but for business search, and ran into a lot of the same challenges. Everyone thinks crawling, processing, and indexing data is easy, but doing it cost effectively at scale is a completely different beast.
Kudos wilsonzlin. I'd love to chat sometime if you see this. It's a small space of people that can build stuff like this e2e.
This is so cool. A question on the service mesh: is building your own typically the best way to do things?
I'm new to networking.
I've been doing a smaller version of the same idea for just the domain of job listings. Initially I looked at HNSW, but couldn't reason about how to scale it with predictable compute-time cost. I ended up using IVF because I am a bit memory-starved. I will have to take a look at CoreNN.
This raises the question for me: without an LLM, what is the approach to building a search engine? Google search used to be razor sharp; then it degraded in the late 2000s and early 2010s, and now it's meh. They filter out so much content for a billion different reasons, and the results are just not what they used to be. I've found better results from some LLMs like Grok (surprisingly), but I can't understand why what was once a razor-exact search engine like Google cannot find verbatim or near-verbatim quotes of content I remember seeing on the internet.
Getting a CORS error from the API - is the demo at https://search.wilsonl.in/ working for anyone else?
This is really well written, especially considering the complexity
> RocksDB and HNSW were sharded across 200 cores, 4 TB of RAM, and 82 TB of SSDs.
Was your experience with IVF-PQ not good? I've seen big recall drops compared to HNSW, but wow, it takes some hardware to scale.
Also did you try sparse embeddings like SPLADE? I have no idea how they scale at this size, but seems like a good balance between keyword and semantic searches.
I love this and think that your write-up is fantastic, thank you for sharing your work in such detail.
What are you thinking in terms of improving [and using] the knowledge graph beyond the knowledge panel on the side? If I'm reading this correctly, it seems like you only have knowledge panel results for those top results that exist in Wikipedia, is that correct?
Tried this search: What is an sbert embedding?
Google still gave me a better result: https://towardsdatascience.com/sbert-deb3d4aef8a4/
Nevertheless this project looks great and I'd love to see it continue to improve.
Is search.wilsonl.in still up? I do not get any search results.
As I've become more advanced in my career I've grown more frustrated with search engines for the same problems you described in your write up. This is a fantastic solution and such a refreshing way to use LLMs. I hope this project goes far!
Adding my kudos to the other commenters here: the polymath skill set necessary to take on something like this as a solo effort is remarkable. I was hoping for more detail on the issues found during requesting/parsing at a domain/page level.
Man! This is incredible. It gives me motivation to continue with my document search engine.
Such a big inspiration! One of the few times where I genuinely read and liked the work - didn't even notice how the time flew by.
Feels like it's more and more about consuming data & outputting the desired result.
Wow, looks like a tremendous commitment and depth of knowledge went into this one-man project. I couldn't even read the whole write up, I had to skim part of it. I'm super impressed.
This must be the best technical article I read on HN in months!
This is awesome, and the low cost is especially impressive. I rarely have the motivation after working on a side project to actually document all the decisions made along the way, much less in such a thorough way. Regarding your CoreNN library, Clearview has a blog post [1] on how they index 30 billion face embeddings that you may find interesting. They combine RocksDB with faiss.
[1] https://www.clearview.ai/post/how-we-store-and-search-30-bil...
I couldn't get the search working (there was some CORS error), but what a feat, and what a write-up. Wonderstruck!
how much did it cost?
This really inspired me, thanks so much for building and sharing this!
Impressive! Surely some company or startup would jump at continuing this project?
Search engines are really fascinating in how they work, this is impressive
Very nice project. Do you have plans to commercialize it next?
Brilliant write up - learnt a lot.
please, I want to pay for this. 10x better than Kagi, which I stopped paying for
good post, thanks for sharing
Incredibly cool. What a write-up. What an engineer.
Just wow, my greatest respect! Also an incredible write-up. I like the takeaway that an essential ingredient of a search engine is curated, well-filtered data (garbage in, garbage out). I feel like this has been a big lesson of LLM training too: it's better to work with less but much higher quality data. I'm curious how a search engine would perform where all content has been judged by an LLM.