This post is the second part of a multi-part series on how to build a search engine –
Just a sneak peek at what the final output is going to look like –
In case you are in a hurry, you can find the full code for the project on my GitHub page.
In this post we will focus on configuring the Elasticsearch bit.
I have chosen the Wikipedia people dump as the dataset. It contains the wiki pages of a subset of the people on Wikipedia and consists of three columns – URI, name, and text. As the column names suggest, URI is the actual wiki link to the person's page, name is the person's name, and text is the field where the entire content of that person's page has been dumped. The text column is unstructured, as it contains free text.
We will mainly focus on how to build a search engine that searches through the text column and displays the corresponding name and URI.
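If you would like to take a quick look at the data first, a few lines of pandas will do. This is just a sketch – it assumes the dump has been saved locally as people_wiki.csv, so adjust the filename to match your copy:

import pandas as pd

# Load the Wikipedia people dump; the filename here is an assumption –
# point it at wherever you saved the dataset.
people = pd.read_csv('people_wiki.csv')

print(people.columns.tolist())  # expected: ['URI', 'name', 'text']
print(people.head(3))           # peek at the first few rows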
If you are already sitting on some other dataset that you want to build a search engine on, everything we are going to do still applies – you just need to change the column names to make it work!
So let's get started. The first thing we need to do is configure Elasticsearch so that we can load some data into it.
If you are going to move forward with the Wikipedia data, just save the script below in a .py file and give it a run. Caution: Elasticsearch must be up and running on your system (you can find the details in Part 1).
import requests

settings = '''
{
"settings" : {
"analysis" : {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 5,
"max_gram": 8
}
},
"analyzer" : {
"stem" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "stop", "porter_stem","trigrams_filter"]
},
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "4",
"max_gram" : "8"
}
}
}
},
"mappings" : {
"index_type_1" : {
"dynamic" : true,
"properties" : {
"text" : {
"type" : "string",
"analyzer" : "stem"
} ,
"name" : {
"type" : "string",
"analyzer" : "simple"
}
}
},
"index_type_suggest" : {
"properties" : {
"search_query" : {
"type" : "completion"
}
}
}
}
}
'''
url = 'http://localhost:9200/wiki_search'

# Delete the index if it already exists, so the script can be re-run cleanly
resp_del = requests.delete(url)
print(resp_del)

# Create the index with the settings and mappings defined above
resp = requests.put(url, data=settings)
print(resp)
Let's walk through some of the code. settings is a triple-quoted string in Python that holds the JSON configuration describing everything we want Elasticsearch to do to our data. (Note that the "string" field type and the per-type mappings used here belong to older, pre-5.x versions of Elasticsearch; on newer releases you would use "text" instead.)
The n-gram filter is for subset pattern matching. This means that if I search for "start", it will get a match on the word "restart", because "start" is a 5-character substring (a subset pattern match) of "restart".
Before indexing, we want to make sure the data goes through some pre-processing. This is defined by the filter chain in the line "filter" : ["standard", "lowercase", "stop", "porter_stem","trigrams_filter"]. The filters are applied in order –
- standard – the standard token filter (effectively a no-op in the Elasticsearch versions this config targets, kept for compatibility)
- lowercase – converts every token to lowercase
- stop – removes common English stopwords such as "the" and "is"
- porter_stem – reduces each token to its stem using the Porter stemming algorithm (e.g. "running" becomes "run")
- trigrams_filter – the n-gram filter defined above, which breaks each token into 5- to 8-character grams
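To see what this chain actually emits, you can run a sample word through the analyzer with Elasticsearch's _analyze API. A minimal sketch – the request-body form below works on Elasticsearch 2.x and later, while very old releases expect the analyzer and text as query parameters instead:

import json
import requests

# Run the word "restart" through our custom "stem" analyzer and print the
# tokens it produces – you should see 5- to 8-character grams such as
# "start" and "restart" among them.
body = json.dumps({"analyzer": "stem", "text": "restart"})
resp = requests.post('http://localhost:9200/wiki_search/_analyze',
                     data=body,
                     headers={'Content-Type': 'application/json'})
print(resp.json())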
"search_query" will be the property responsible for the autocomplete suggestions, which we will integrate with the UI later. Its type is "completion", which lets Elasticsearch know that this data will be retrieved as we type, and that it therefore has to be optimized for speed.
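Once data has been indexed into this field (which we will do in the next part), a suggestion request would look roughly like the sketch below. This uses the standalone _suggest endpoint of pre-5.x Elasticsearch, matching the mapping syntax above; on 5.x and later the same block goes under a top-level "suggest" key in a normal _search request. The prefix "alb" is just an example:

import json
import requests

# Ask the completion suggester for entries starting with "alb"
# (e.g. hoping for "Albert Einstein" once the data is indexed).
query = json.dumps({
    "my_suggestions": {
        "text": "alb",
        "completion": {"field": "search_query"}
    }
})
resp = requests.post('http://localhost:9200/wiki_search/_suggest', data=query)
print(resp.json())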
Finally, we create a new Elasticsearch index called "wiki_search". This defines the endpoint URL at which our UI will call Elasticsearch's RESTful service.
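To confirm that the index was created with our configuration, you can simply fetch it back – a 200 response echoing the analyzers and mappings means the settings were accepted:

import requests

# Retrieve the settings and mappings of the newly created index
resp = requests.get('http://localhost:9200/wiki_search')
print(resp.status_code)
print(resp.json())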
In the next part of this series on how to build a search engine, we will look at indexing the data, which will make our search engine practically ready.