Saturday, November 16, 2019

Solr - Basic algorithm for TFIDF, LTR and common functions.

It's really interesting to understand how Solr, by default, returns results in a particular order.



Let's say you search for the keyword BbQ (B as a capital letter, b as a small letter, and Q as a capital letter). How do you get the results, why do a few results appear on top, and what options are available to change the order of the results?

So if you are curious to understand the whole flow, THIS BLOG IS FOR YOU :)


First, we should understand the Solr query flow.



Here is a high-level view of the existing Solr algorithm: it mainly uses term frequency and inverse document frequency as its base, with BM25 built on top of the same statistics.




Solr uses Lucene at its core, and the classic ranking model is known as the tf-idf model (recent Solr versions default to BM25, which is built on the same term statistics).


First, let's understand what this model is in general.


tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.

Term Frequency - The weight of a term that occurs in a document is simply proportional to the term frequency.


Inverse Document Frequency - The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.



Here is a list of all the function queries available in Solr -


A few useful functions are -



  1. docfreq(field,term) returns the number of documents that contain the term in the field.
  2. termfreq(field,term) returns the number of times the term appears in the field for that document.
  3. idf(field,term) returns the inverse document frequency for the given term, using the Similarity for the field.
  4. tf(field,term) returns the term frequency factor for the given term, using the Similarity for the field.
  5. norm(field) returns the "norm" stored in the index, the product of the index-time boost and the length normalization factor.
  6. maxdoc() returns the number of documents in the index, including those that are marked as deleted but have not yet been purged.
  7. numdocs() returns the number of documents in the index, not including those that are marked as deleted but have not yet been purged.
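These functions can be requested alongside each search hit by placing them in the fl (field list) parameter. Below is a minimal sketch in Python that just builds such a request URL; the host, collection name ("products"), field ("title"), and term ("jacket") are all placeholder assumptions, not values from this blog.

```python
from urllib.parse import urlencode

# Hypothetical local Solr endpoint and collection; adjust to your setup.
base = "http://localhost:8983/solr/products/select"

# Put function queries in fl to return per-document term statistics
# (aliased as tf_title, df_title, idf_title) next to id and score.
params = {
    "q": "title:jacket",
    "fl": "id,score,"
          "tf_title:termfreq(title,'jacket'),"
          "df_title:docfreq(title,'jacket'),"
          "idf_title:idf(title,'jacket')",
    "rows": 5,
}

url = base + "?" + urlencode(params)
print(url)
```

Opening this URL against a running Solr would show, for each matching document, how often "jacket" occurs in its title and how rare the term is across the index.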


For more reference - 



  1. https://lucene.apache.org/solr/guide/7_7/function-queries.html
  2. https://lucidworks.com/post/solr-relevancy-function-queries/


Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user.

Query Re-Ranking - Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B).
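The re-ranking described above is driven entirely by request parameters, using Solr's {!rerank} query parser. A small sketch of building such a request follows; the collection, fields, and weights are illustrative placeholders, and the code only constructs the query string rather than calling a live Solr.

```python
from urllib.parse import urlencode

# Sketch of Solr Query Re-Ranking: query A selects candidates cheaply,
# query B (referenced as $rqq) re-scores only the top N of them.
params = {
    # Query A: the simple matching query.
    "q": "jacket",
    # Query B: a more expensive query used only for re-scoring.
    "rqq": "{!edismax qf=title^10 description}jacket",
    # Re-rank the top 100 docs from A, adding B's score with weight 3.
    "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
    "fl": "id,score",
}

query_string = urlencode(params)
print(query_string)
```

Only the top 100 documents pay the cost of query B; everything below that cutoff keeps its original score and order.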

Here is the high-level flow diagram.


LIBSVM and LIBLINEAR are two popular open-source machine learning libraries. We can start with a simple development based on LIBLINEAR.

Before going further, we basically need to define the proper model, features, and a common feature store to implement the ML ranking.
An example below:

Steps to define the features, stores, and models:
After plugging in the above libraries, we can define a model and use it in search queries like this:
http://localhost:8983/solr/collectionname/query?q=test&rq={!ltr model=currentModel reRankDocs=100}&fl=id,score,[features store=nextFeatureStore]

Here `model=currentModel` selects the LTR model to apply, `reRankDocs=100` re-ranks only the top 100 documents, and `[features store=nextFeatureStore]` adds the extracted feature values to each returned document.
Sample store / common feature -

{
  "store": "commonFeatureStore",
  "name": "documentRecency",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!func}recip( ms(NOW,last_modified), 3.16e-11, 1, 1)"
  }
}

Sample model - note that in a LinearModel the keys under "weights" must match the feature names:

{
  "store": "commonFeatureStore",
  "name": "ModelA",
  "class": "org.apache.solr.ltr.model.LinearModel",
  "features": [
    {"name": "documentRecency"},
    {"name": "isBook"},
    {"name": "originalScore"}
  ],
  "params": {
    "weights": {
      "documentRecency": 1,
      "isBook": 0.1,
      "originalScore": 0.5
    }
  }
}
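Features and models like the ones above are uploaded to Solr's Learning-To-Rank store endpoints (schema/feature-store and schema/model-store). The sketch below just builds the JSON payloads and target URLs; the collection name is the same placeholder used earlier, and only a single recency feature is shown.

```python
import json

# Feature definitions are uploaded as a JSON list to the feature-store.
features = [
    {
        "store": "commonFeatureStore",
        "name": "documentRecency",
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"q": "{!func}recip(ms(NOW,last_modified),3.16e-11,1,1)"},
    },
]

# A linear model referencing that feature; weight keys match feature names.
model = {
    "store": "commonFeatureStore",
    "name": "ModelA",
    "class": "org.apache.solr.ltr.model.LinearModel",
    "features": [{"name": "documentRecency"}],
    "params": {"weights": {"documentRecency": 1.0}},
}

feature_url = "http://localhost:8983/solr/collectionname/schema/feature-store"
model_url = "http://localhost:8983/solr/collectionname/schema/model-store"

# These payloads would be PUT to the URLs above, e.g. with curl:
#   curl -XPUT <feature_url> -H 'Content-type:application/json' \
#        --data-binary @features.json
print(json.dumps(features, indent=2))
print(json.dumps(model, indent=2))
```

Once both uploads succeed, the model name can be referenced directly from the rq parameter as shown in the query earlier.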

For more details -
How are documents scored
By default, a "TF-IDF" based Scoring Model is used. The basic scoring factors:
  • tf stands for term frequency - the more times a search term appears in a document, the higher the score
  • idf stands for inverse document frequency - matches on rarer terms count more than matches on common terms
  • coord is the coordination factor - if there are multiple terms in a query, the more terms that match, the higher the score
  • lengthNorm - matches on a smaller field score higher than matches on a larger field
  • query clause boost - a user may explicitly boost the contribution of one part of a query over another.
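As a rough sketch, the factors above combine multiplicatively per matching term. The helpers below use Lucene's classic-similarity shapes (tf as a square root, idf as a dampened log, lengthNorm as an inverse square root), but this is a simplification for intuition - the real implementation also squares idf, encodes norms lossily, and applies query normalization.

```python
import math

def tf(freq):
    # Term frequency factor: grows with occurrences, but sub-linearly.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms get a higher factor.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Matches in shorter fields count for more.
    return 1.0 / math.sqrt(num_terms)

def term_score(freq, doc_freq, num_docs, field_len, boost=1.0):
    # Simplified single-term score: all factors multiplied together.
    return tf(freq) * idf(doc_freq, num_docs) * length_norm(field_len) * boost

# Four occurrences of a term found in 9 of 100 docs, in a 16-term field:
print(term_score(freq=4, doc_freq=9, num_docs=100, field_len=16))
```

Doubling the boost doubles the term's contribution, which is exactly the "query clause boost" lever mentioned above.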
Details can be found here - Solr Wiki for Ranking

A simple example -

TF - Term Frequency:

TF(w) = (number of times word w appears in a document) / (total number of words in the document)

IDF - Inverse Document Frequency:

IDF(w) = log(total number of documents / number of documents containing word w)

TF-IDF is the product of term frequency and inverse document frequency.

Sentence 1 - earth is the third planet from the sun
Sentence 2 - earth is the largest planet
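Plugging the two example sentences into the formulas above gives a quick sanity check (log base 10 is an arbitrary choice here):

```python
import math

# The two example sentences, tokenized by whitespace.
docs = [
    "earth is the third planet from the sun".split(),
    "earth is the largest planet".split(),
]

def tf(word, doc):
    # Occurrences of the word divided by the document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Log of total documents over documents containing the word.
    containing = sum(1 for d in docs if word in d)
    return math.log10(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "earth" appears in both sentences, so its idf (and tf-idf) is zero.
print(tfidf("earth", docs[0], docs))  # 0.0
# "sun" appears only in sentence 1, so it gets a positive weight there.
print(tfidf("sun", docs[0], docs))
```

Words shared by every document contribute nothing, while words unique to one document get the highest weight - exactly the behavior that makes TF-IDF useful for ranking.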




TF-IDF is effectively zero for stop words (their IDF is near zero because they appear in almost every document), and the stop-word list is configured here -




There is an open-source library available to implement TF-IDF - https://code.google.com/archive/p/tfidf/


There is a drawback in this algorithm. As discussed here, when we have many documents it's recommended to split them across multiple shards, and there are a few examples here of when you would decide to create multiple shards -

Particular to this example, let's say you search for the keyword - unique jacket

These two terms may have different TF-IDF values on different shards, which may affect the final outcome in the case of huge data.
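To make the drawback concrete: each shard computes IDF from its own local document counts, so the same term can score differently depending on which shard a document lives on. The sketch below uses made-up counts for "unique" and "jacket" purely for illustration.

```python
import math

def idf(num_docs, doc_freq):
    # Classic-style idf, computed from a single shard's local counts.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: "unique" happens to be rare on shard 1
# but common on shard 2, while "jacket" is spread evenly.
shard1 = {"num_docs": 1_000_000,
          "doc_freq": {"unique": 50, "jacket": 20_000}}
shard2 = {"num_docs": 1_000_000,
          "doc_freq": {"unique": 40_000, "jacket": 21_000}}

for term in ("unique", "jacket"):
    i1 = idf(shard1["num_docs"], shard1["doc_freq"][term])
    i2 = idf(shard2["num_docs"], shard2["doc_freq"][term])
    print(term, round(i1, 3), round(i2, 3))
```

With skewed term distributions, identical documents on different shards get different scores for "unique", so the merged result order no longer reflects a single global IDF.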

Let me know if you have any questions :)

Reference - 


Don't forget to check this blog for a quick Python example of TF-IDF.

I hope you have enjoyed these details. Please let me know if you have any questions.

