Show HN: I put PubMed in a vector DB

mpmisko | 96 points

Hey mate, should search by PMID work? For example, 35982160 is the PMID for "Rare coding variation provides insight into the genetic architecture and phenotypic context of autism", but I'm not seeing this publication at all in the search results...

bdangubic | 9 days ago

Very cool!

Related: the NIST TREC (Text REtrieval Conference) has had several competitions over the years related to improving the searchability of medical data: https://www.trec-cds.org/

If you have novel ideas in this area, you should consider participating. https://trec.nist.gov/

dpifke | 9 days ago

This is very cool! 2 questions spring to mind:

1. How much did it cost to embed all those vectors and how many articles did you process? PMC is quite large.

2. Could you elaborate a little more on your approach to ranking articles? I'm familiar with semantic search via embeddings, but did you weight those with impact factors/citations? How does one even calculate that?

Anyhow, love the idea.

lucas_crocker | 9 days ago

Congrats on shipping!

I'm curious how the search result rankings work. It doesn't look like they're based on date or number of citations, but they seem to be deterministic (they persist over multiple searches). I did a keyword search using one word.

rkwz | 9 days ago

Nice!

Out of curiosity what model(s) are you using to generate the embeddings?

kkielhofner | 9 days ago

What are you embedding exactly? Chunks of documents?

grumpopotamus | 9 days ago

Very promising tool, based on a couple of questions I asked it! What did the cleaning of documents look like?

madhatter999 | 8 days ago

All 10 thumbs up!

Edit: One suggestion: in the results list, please make the headings links to the articles, too.

mharig | 7 days ago

Maybe a stupid question, but how do you compare this against GPT-based search engines?

drycabinet | 8 days ago

What storage did you go for, and what search approach?

alex_duf | 9 days ago