Show HN: I put PubMed in a vector DB

mpmisko | 97 points

Hey mate, should search by PMID work? Like 35982160 is PMID for "Rare coding variation provides insight into the genetic architecture and phenotypic context of autism" - not seeing this publication at all in search results...

bdangubic | a year ago

Very cool!

Related: the NIST TREC (Text REtrieval Conference) has had several competitions over the years related to improving the searchability of medical data: https://www.trec-cds.org/

If you have novel ideas in this area, you should consider participating. https://trec.nist.gov/

dpifke | a year ago

This is very cool! 2 questions spring to mind:

1. How much did it cost to embed all those vectors and how many articles did you process? PMC is quite large.

2. Could you elaborate a little more on your approach to ranking articles? I'm familiar with semantic search via embeddings, but did you weight those with impact factors/citations? How does one even calculate that?

Anyhow, love the idea.

lucas_crocker | a year ago

Congrats on shipping!

I'm curious how the search result rankings work. It doesn't look like it's based on date or number of citations, but it seems to be deterministic (persists over multiple searches). I did a keyword search using one word.

rkwz | a year ago

Nice!

Out of curiosity what model(s) are you using to generate the embeddings?

kkielhofner | a year ago

What are you embedding exactly? Chunks of documents?

grumpopotamus | a year ago

Very promising tool based on a couple of questions I asked it! What did the cleaning of documents look like?

madhatter999 | a year ago

All 10 thumbs up!

Edit: One suggestion: in the results list, please make the headings links to the articles, too.

mharig | a year ago

Maybe a stupid question, but how do you compare this against GPT-based search engines?

drycabinet | a year ago

What storage did you go for, and what search approach?

alex_duf | a year ago