diehunde | 13 days ago
Have you considered using an LLM API? You could just send it the posting text and come up with a good prompt to extract the most likely required skills.
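A minimal sketch of what that could look like, assuming a generic text-in, text-out API. `call_llm` is a hypothetical stand-in for whatever client you end up using (it takes a prompt string and returns the model's reply as a string); the prompt wording and the JSON output format are my assumptions too, not something a particular provider requires.

```python
import json
from typing import Callable, List

# Assumed prompt wording; tune it for your postings and your model.
PROMPT_TEMPLATE = """You are extracting skills from a job posting.
Return ONLY a JSON array of short skill names, e.g. ["Python", "Excel"].
If no skills are mentioned, return [].

Job posting:
---
{posting}
---
"""


def build_prompt(posting: str) -> str:
    """Fill the posting text into the prompt template."""
    return PROMPT_TEMPLATE.format(posting=posting.strip())


def parse_skills(reply: str) -> List[str]:
    """Parse the reply defensively: models sometimes wrap the JSON in extra text."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    # Keep non-empty strings only, de-duplicated case-insensitively, order preserved.
    seen, skills = set(), []
    for item in items:
        if isinstance(item, str) and item.strip():
            key = item.strip().lower()
            if key not in seen:
                seen.add(key)
                skills.append(item.strip())
    return skills


def extract_skills(posting: str, call_llm: Callable[[str], str]) -> List[str]:
    """call_llm is hypothetical: any function mapping a prompt to reply text."""
    return parse_skills(call_llm(build_prompt(posting)))
```

Keeping the API call behind a plain callable means you can swap providers, or plug in a canned reply while you iterate on the prompt and the parsing.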
The #1 thing you need to think about is training data. (Going forward you are going to do a huge amount of manual work, like it or not; the important thing is that you do it efficiently.)
My take is that the PhraseMatcher is about the best thing you will find in spaCy, and I don't think any of the old-style embeddings will really help you. (Word vectors are a complete waste of time! If you don't believe me, fire up scikit-learn and see if you can train a classifier that can identify color words or emotionally charged words. Related words are closer in the embedding space than chance, but that doesn't mean you've got useful knowledge there.)
Look to the smallest and most efficient side of LLMs, maybe even BERT-based models. I do a lot of simple clustering and classification tasks with these sorts of models:
https://sbert.net/
I can train a calibrated model on 10,000 or so documents in three minutes using stuff from scikit-learn.
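Here is a rough sketch of that kind of pipeline, not necessarily the exact setup described above: embed each document with a small sentence-transformers model, then fit a calibrated classifier on the vectors. The model name `all-MiniLM-L6-v2`, the choice of logistic regression, and sigmoid calibration are all assumptions picked for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Turn texts into dense vectors. The model name is an assumption."""
    # Imported lazily so the training code below runs without this package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(list(texts)))


def train_calibrated(X, y, cv=5):
    """Fit a classifier whose predict_proba output is calibrated by cross-validation."""
    base = LogisticRegression(max_iter=1000)
    clf = CalibratedClassifierCV(base, method="sigmoid", cv=cv)
    clf.fit(X, y)
    return clf
```

With labeled postings you would call something like `clf = train_calibrated(embed(texts), labels)` and then read `clf.predict_proba(embed(new_texts))`. Calibrated probabilities are handy for deciding which examples to send back for manual review.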
Another approach to the problem is to treat it as segmentation: that is, you want to pick out a list of phrases like "proficiency in spreadsheets", and you can then feed those phrases through a classifier that turns each one into something like "Excel". Personally, I'm interested in running something like BERT first and then training an RNN to light up phrases of that sort or do other classification. The BERT models don't have a good grasp on the order of words in the document, but the RNN fills that gap.
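To make the two-stage idea concrete, here is a small sketch of the glue code, assuming the segmentation model emits one B/I/O tag per token (B starts a phrase, I continues it, O is outside). The tagger itself is not shown, and `toy_phrase_classifier` is a hypothetical keyword lookup standing in for a real trained phrase classifier.

```python
from typing import Callable, List, Sequence


def phrases_from_bio(tokens: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Collect the token runs marked B, I, I, ... into phrases."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open phrase: close whatever is open.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def toy_phrase_classifier(phrase: str) -> str:
    """Hypothetical stand-in for a trained classifier: simple keyword lookup."""
    keywords = {"spreadsheet": "Excel", "sql": "SQL"}
    lowered = phrase.lower()
    for keyword, skill in keywords.items():
        if keyword in lowered:
            return skill
    return "UNKNOWN"


def skills_from_tags(
    tokens: Sequence[str],
    tags: Sequence[str],
    classify: Callable[[str], str] = toy_phrase_classifier,
) -> List[str]:
    """Stage 1: segment phrases. Stage 2: map each phrase to a canonical skill."""
    return [classify(phrase) for phrase in phrases_from_bio(tokens, tags)]
```

The nice property of this split is that the two stages can be labeled and trained separately: tagging spans is one annotation task, and mapping a short phrase to a canonical skill is another, much smaller one.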