Show HN: I put PubMed in a vector DB
Very cool!
Related: the NIST TREC (Text REtrieval Conference) has had several competitions over the years related to improving the searchability of medical data: https://www.trec-cds.org/
If you have novel ideas in this area, you should consider participating. https://trec.nist.gov/
This is very cool! 2 questions spring to mind:
1. How much did it cost to embed all those vectors and how many articles did you process? PMC is quite large.
2. Could you elaborate a little more on your approach to ranking articles? I'm familiar with semantic search via embeddings, but did you weight those with impact factors/citations? Like how does one even calculate that?
Anyhow, love the idea.
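On the "how does one even calculate that" part: one possible way (a rough sketch, not necessarily what the author did) is to blend the embedding similarity with a log-scaled citation count. The function names, the `alpha` weight, and the sample papers below are all made up for illustration.

```python
import math

def blended_score(similarity, citations, max_citations, alpha=0.8):
    """Combine a similarity score (assumed 0..1) with a log-scaled citation signal (0..1).

    alpha is a hypothetical tuning weight: 1.0 means pure semantic similarity.
    """
    if max_citations <= 0:
        citation_signal = 0.0
    else:
        # log1p dampens the effect of a few very highly cited papers
        citation_signal = math.log1p(citations) / math.log1p(max_citations)
    return alpha * similarity + (1 - alpha) * citation_signal

def rank(results, alpha=0.8):
    """Sort (title, similarity, citations) tuples by blended score, best first."""
    max_citations = max((c for _, _, c in results), default=0)
    return sorted(
        results,
        key=lambda r: blended_score(r[1], r[2], max_citations, alpha),
        reverse=True,
    )

# Hypothetical search hits: (title, cosine similarity, citation count)
results = [("Paper A", 0.82, 3), ("Paper B", 0.80, 950), ("Paper C", 0.60, 40)]
ranked = rank(results)
for title, sim, cites in ranked:
    print(title, sim, cites)
```

With `alpha=1.0` this falls back to plain similarity ranking, so the weight is easy to tune or switch off.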
Congrats on shipping!
I'm curious how the search result rankings work. It doesn't look like they're based on date or number of citations, but they seem to be deterministic (the order persists over multiple searches). I did a keyword search using one word.
Nice!
Out of curiosity what model(s) are you using to generate the embeddings?
What are you embedding exactly? Chunks of documents?
Very promising tool based on a couple of questions I asked it! What did the cleaning of the documents look like?
All 10 thumbs up!
Edit: One suggestion: in the results list, please make the headings links to the articles, too.
Maybe a stupid question, but how do you compare this against GPT-based search engines?
What storage did you go for, and what search approach?
Hey mate, should search by PMID work? E.g. 35982160 is the PMID for "Rare coding variation provides insight into the genetic architecture and phenotypic context of autism", but I'm not seeing this publication at all in the search results...