
Solr Search - The Solr Query Process and How to Interpret Output




Welcome, I'm Paul, and today I'll demonstrate how the query process works so you can start searching faster with Apache Solr. I'll start with an overview of five steps, and then, as we head towards a website search application, I'll show you how we need to modify query parameters and refine results; that's what we'll do here. Our first step is to analyze query output from the films dataset. Second, describe the search process and its three key elements. Third, use the Schema API to review the fields created by the schemaless configuration. Fourth, describe each query parameter and fine-tune our search. And fifth, play with search queries in the Solr Admin UI and at the command line with curl. That is how we will learn to implement search queries in Solr. Now let's walk through each step.

This is my 218th video, and subscribers suggested that I make code examples and everything I say in videos available in print, so if this would help you, the first link in the description will take you to it.

OK, so for many of us, search is where it's at in 2017. Scores of developers are evaluating Solr for website search as a possible replacement for Google Site Search and Google Custom Search, or alternatives like Elasticsearch or Amazon CloudSearch. To make a thorough evaluation of Apache Solr search, many of us are building out test environments to learn functionality before moving on to a development environment, and this tutorial will go a long way to help you make a more informed choice.

In previous tutorials we have taken steps to get a dataset loaded and ready to field queries in the Apache Solr user interface in a browser. It is from here that we can start to see an application take shape. This is also where the mechanical part of getting the systems and data ready is replaced by search analytics, and where, for many, the fun begins. Here we will pick up where we left off in the last tutorial. From there we will walk through the process of search in Solr behind the scenes and get a better understanding of the configuration files impacting results. If you have yet to get your system up and going with this films dataset, I suggest going back a few videos to get caught up. And with that, well, let's get in and have some fun with Apache Solr search.

Moving on to step one, I want to remind you how we got here. We asked our index to return films where our search term in the q box was Spike Lee, with the goal of finding which films he directed. A quick word about where we left off with our films core and index: by clicking on Overview, we can see it contains those 1,100 documents, and it provides particulars about the server instance, data, and index directory locations. In our case these documents relate to records in an XML file we posted to the films core in the last tutorial. After we posted it, we clicked on Query and performed two searches. In the first, we used the default of *:* in the q box, which selects all records. So if we hit Execute Query here, we see the output in JSON format. The section under common lists what are called search parameters. The first and only parameter we used was q, which is the box for search terms, and we gave it *:* to return all records.

Now shift your focus to the right side of the Query tab. At the top is the request URL that, when used in a browser, will query the Solr server and bring up these results. So a website search application using this would provide results for you to render as you see fit, and clicking on the link will kick that query right into a web page if you want to try that out. The format looks like this, where the host name refers to the IP address of the Solr server; in most cases this would be localhost for a local installation. In my case I installed Solr 7 on a server and I'm communicating through SSH here in this window. The port number is 8983, which is a default, and it can be modified. The path points to the core we built using the bin/solr create command and is right there: /solr/films/select. The question mark calls a request handler named select, which is identified at the top left as qt, and q=*:* is the query parameter and search term, returning all documents with the default setting.
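To see the shape of that request outside the Admin UI, here is a minimal sketch, assuming a local Solr 7 install with the films core from this series; swap localhost for your server's IP address if Solr runs remotely:

  # select handler on the films core, matching all documents
  curl "http://localhost:8983/solr/films/select?q=*:*"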

The output has two sections at a minimum. First is a responseHeader, which provides basic information: the status, with zero here indicating no errors; QTime, for the query time; and params, listing the parameters of the query. The next section is called the response, and it provides summary information like numFound, for the total 1,100 documents. The start requested that the output start at the first record, 0, and show 10 rows, which it did, rather than dump out all 1,100. And these defaults, like almost everything in Solr, can be modified.
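In outline, that JSON response looks roughly like this; the exact values are from our films example and will vary on your system:

  {
    "responseHeader": {
      "status": 0,
      "QTime": 1,
      "params": { "q": "*:*" }
    },
    "response": {
      "numFound": 1100,
      "start": 0,
      "docs": [ ... ]
    }
  }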

Next we have the section for the first record, a film called .45, as indicated by the field called name. Remember, we imported a document with five fields: id, for a unique ID; name; directed_by, for one or several directors of the film; initial_release_date; and genre, one of several genres the film was classified under. And because this was a schemaless configuration, Solr created three fields on its own: _version_, genre_str, and directed_by_str, which are not really relevant to us now.

In our second query, our goal was to find movies directed by Spike Lee by inputting his name in the q parameter. Clicking on Execute Query returns 11 documents this time, as identified by numFound. Does that mean he directed 11 films? Here, let's see: the first one, 25th Hour, was directed by Spike Lee. The second, Bamboozled, was as well. The third, Adaptation, was directed by Spike Jonze. So, interesting, not a match, but we can understand that this is search, and it identified the first name as a match on Spike, right? What is going on with the fourth? It was directed by Lee Sang-il. And Ang Lee, John Lee, and you get the point: Solr is finding matches, but not perfect matches. But this is an open-ended search, so using these parameters we shouldn't expect that it will; we just want it to show relevant information. If we wanted an exact match, we could modify this query. So, similar to a search engine, it is returning documents near the top that are more relevant to your request, and that's good. Later we will see numeric scores on each of these films.

For step two, we will cover the three-part workflow that goes on behind the scenes for every search query in Solr. First is the request handler, and that passes off the information to a query parser, which moves on to a response writer. The request handler is a plug-in that organizes requests, mapping the select in the URL to the select request handler; there is another for update, which provides functionality to update an index, among others. From the Solr Admin UI, the Plugins / Stats tab provides a link to several installed by default.

The query parser comes next, and it interprets the parameters selected and the query terms, or what you are searching for and how the search is performed. Then it sends those requests to the index. In Solr 7 there are three widely used alternatives. The standard query parser, also named lucene, is the default in most cases and is best suited for structured data. dismax is an alternative that is well suited for searching unstructured data like you might find in a website search, and extended dismax (edismax) is another alternative for unstructured data. There are about 25 other parsers available for special needs, offering flexibility to create fields on the fly, give more weight to some parameters, and even geospatial queries, used for finding the nearest coffee shop, for example. Keep in mind that different parsers require different parameters, so what you see on the Query tab are those parameters that are common across all query parsers, which we will cover in step 4.
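You can pick the parser per request with the defType parameter. A hedged sketch, assuming the films core and the default field setup from this series:

  # route the same search through the dismax parser instead of the default lucene parser
  curl "http://localhost:8983/solr/films/select?defType=dismax&q=spike+lee"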

After the query parser submits the request to the index, additional transformations occur before results are returned, and these are performed by the response writer. Examples include how much data to include, additional filters or groupings to apply, and finally which format to present the data in. From the Query tab, if you click on the drop-down marked wt, you will see the six most common response writers of the 21 available, and this customizes the output for the eventual next destination for the data. For example, if you perform additional processing in Python, then the output would be customized for Python. The default here is JSON.
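As a quick sketch of how wt changes things, again assuming the local films core; wt=python is one of the built-in writers:

  # same query, serialized for Python consumers instead of the JSON default
  curl "http://localhost:8983/solr/films/select?q=*:*&wt=python"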

So for step three, let's look at the schema created automatically by Solr. If you recall, we didn't spend time analyzing the fields; instead we jumped right into our first query in the last tutorial. So let's do that here. A schema is an XML file that tells Solr how to ingest documents into the core, process them into fields, and spit out an index we hope is usable for our audience. In our films case, with a schemaless configuration, by default it automatically interpreted rules for field types, meaning text or numeric, and it also has rules for how to process punctuation, capitalized words, and web addresses as well. Solr will also create fields on the fly that combine other fields together to aid in search, using what are called copy fields, and we added one in the last tutorial. Before we look at the schema, we should bring back two points. First, the XML schema files can be managed by hand or with the help of two Solr tools: the Solr Admin UI or the Solr Schema API from the command line.

Second, recall that in the last tutorial we mentioned that the method used for editing the schema dictates the name of the schema file. If it is hand edited, it is called schema.xml; if it is managed using the tools, to prevent us from making mistakes, then it is called managed-schema, and the schemaless configuration dictated that we use the latter. So let's look at that. The managed-schema file is located in the directory called conf within the home of the core. Let's look at the installation directory

first, using pwd followed by ls -og. I usually keep this installation directory as my working directory so it's easy to access the bin/solr script straight from there, and if this looks confusing, don't worry; quickly you will memorize all of the file locations. Also, within the server directory sits the films core and all of its settings and data. The managed-schema file is quite long, about 500 lines, so instead of confusing ourselves by opening up the whole thing and searching for the fields, I find it easier to use the Solr Schema API, using the curl command to pull out just the parts we need. Let's try that first and focus on the fields Solr created when we posted documents to it.
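The request looks something like this; a sketch assuming a local install and the films core (use your server's IP address in place of localhost if needed):

  # list every field in the films core via the Schema API
  curl "http://localhost:8983/solr/films/schema/fields"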

OK, so this is the list of fields in JSON format, and let's make a few observations. First, we can see the output here also has a responseHeader, and the second section shows the fields. What we are interested in at this point is the five fields we imported, all at the bottom of the output. directed_by was given a field type of text_general, which seems logical. genre is also a text_general field. id is the field that has more going on: it was assigned a string field type; it's not multivalued, meaning each record has a unique ID; and it is indexed, required, and stored. So this has all of the characteristics of a unique identification number. initial_release_date was assigned the type pdates for a date field. And the name field, if you recall from the last tutorial, we assigned to text_general when we edited the schema using the Solr Admin UI. We did this because the first film in the films XML data file we posted to the index was named .45, and if we hadn't classified it as a text_general type, then Solr would have assigned it a numeric field type instead; so we know that worked here. It is also indexed, and stored here means that it can be reviewed in queries.

In our second modification, we set up a copy field, and we can use the Schema API to pull these as well. The field _text_ was the one we created, and the second two Solr created on its own. This explains the extra fields we saw in that output earlier, if you recall.
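Pulling the copy field rules looks much the same; again a sketch against the local films core:

  # list copy field rules, including the _text_ destination we added
  curl "http://localhost:8983/solr/films/schema/copyfields"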

The point here is to circle back and see what Solr did behind the scenes. We mentioned that the schemaless configuration is not meant for production but helps us get started quickly; for production we would want to fine-tune each of the fields and field types. This topic will take on more meaning down the road when we use unstructured data like you might see in a traditional website search application.

Now for step four, let's walk through the default query parameters common across query parsers, and for that let's keep it simple by using the Solr Admin UI.

defType: this selects the query parser, with the three most common being lucene, dismax, and edismax, and the default here is lucene.

q: this is where you enter the search terms, and the default is *:*.

fq: this is used to set a filter query, which creates a cache of potential sub-queries. So if your user will look at more granular information, it helps with speed to identify the fields ahead of time so the results are ready in cache. The default here is none.

sort: here you enter how you would like results sorted, with the common options being asc or desc for ascending and descending, and the default is by relevance score, descending.

start, rows: these parameters are used like a search in Google that provides the top 10 results by default and allows you to resubmit the query to find the next 10 results, so think of it as a way to paginate query results. The 0 starts at the first record and 10 shows 10 records, or rows, and that's the default.

fl: the field list parameter limits the results to a specific list of fields. In order for them to show up, the schema must have one of two settings for the field: stored=true or docValues=true. Multiple fields can be selected, separated by commas or spaces, and you can also return the score, giving a measure of the relevance of the search results. A star shows all stored fields, which is the default.

df: in the default field parameter you could enter a specific field you would like Solr to search. In our case the default search field is the copy field we created called _text_.

The Raw Query Parameters section is for advanced use, like query debugging, and the default there is none. And wt, we talked about that: it's the parameter that selects among the six popular response writers, of 21 total, and JSON is the default there. The checkboxes are fairly self-explanatory, with indent and debugQuery referring to the visual format and the items returned from the query.
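To make those concrete, here is a sketch of one request combining several of the common parameters against the films core; the genre value in the filter query is just illustrative:

  # paginate (start/rows), restrict fields (fl), filter (fq), and sort by score
  curl "http://localhost:8983/solr/films/select?q=spike+lee&fq=genre:Drama&start=0&rows=5&fl=name,score&sort=score+desc"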

You can also select the dismax and edismax query parsers here, in addition to specific functions for highlighting search text, faceting for grouping results into buckets, geospatial results for those navigational search queries, and spell checking capabilities as well. The best way to learn is with practice, of course, so in step 5 we will walk through and modify search parameters in the Solr Admin UI, and after that we'll try one straight from the command line.

For our first query, let's type Spike Lee in that q parameter again. We want only the first 5 results this time, so type 5 in the rows parameter box, and by default they will be returned in descending order based on the relevancy score of the search term. This time let's view the film name and the score by entering name,score in the fl parameter. Clicking on Execute Query presents the results in the default JSON format. So to summarize: we see the same number of documents found as before, 11; this time we have a max score of 11.27; and we will cover document relevancy and scoring in future tutorials, but this is helpful to see by how far the two films that Spike Lee actually directed are separated from those in which the name Spike or Lee was present in the directed_by field in our original dataset.
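The same request from the command line would look roughly like this, assuming the films core:

  # top 5 matches with film name and relevance score only
  curl "http://localhost:8983/solr/films/select?q=spike+lee&rows=5&fl=name,score"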

So in our second example, we will add the directed_by field to the fl parameter and use the wt drop-down to select the CSV output. This output is nice for those who are comfortable in a spreadsheet type of format, like those of us with a finance or statistics background, so you could easily kick data like this into Excel and analyze away. Also note that in this format it returns the three directors within double quotes for the last film, Basic Emotions. It is worth noting that the URL is updated with the code used to request this data from the Solr server, and we could use this to teach ourselves how to write more advanced queries.
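In curl form, that is just a wt change, sketched against the same core:

  # same fields, CSV output for spreadsheet work
  curl "http://localhost:8983/solr/films/select?q=spike+lee&rows=5&fl=name,score,directed_by&wt=csv"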

And because the Solr Admin UI Query tab only scratches the surface of what can be done at the command line, let's walk through one example from there and include a parameter that allows us to remove the response header from the output, and select the dismax query parser. The curl command is used to communicate with servers using a variety of protocols, and we will use it here to submit this search request to the Solr server directly. Let me also point you to the web page I mentioned earlier, so if you miss the code here, you can always find it there. And again, for a local installation use localhost instead of an IP address, and the port 8983 was our default, but it could be different. So, assuming we entered this properly, we get five films, minus the response header, in JSON format. Very good.
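The command I ran looks something like this; omitHeader=true drops the responseHeader block, and defType=dismax swaps in the dismax parser:

  # five films, dismax parser, no response header
  curl "http://localhost:8983/solr/films/select?q=spike+lee&rows=5&fl=name,score&defType=dismax&omitHeader=true"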

If you stick around, I will explain more about relevancy scores. This dataset is yours to play with, and I suggest adjusting the dials and diving right into the results. This is the best way to learn about the analytical process of search, and it will come in handy whether your goal is enterprise search or evaluating website search as a replacement for Google Custom Search, Amazon CloudSearch, or even an alternative search tool like Elasticsearch.

In the next few tutorials we will look at a dataset which will be unstructured and come from a website crawl, so more similar to what you would find on websites. This will require a more thorough exploration of the bin/post tool to perform the web crawl. Also, being unstructured, we will need to dive into field analysis and a topic we haven't discussed yet, which is how the index is built with analyzers, tokenizers, and filters. That will be a lot of fun, so stay tuned for that. And with that, you should have a nice base of knowledge about search to move on to tackle more advanced topics. You now know about query output, the overall search process, how it ties in with the schema and query parameters, and how to customize search to suit your needs.

Yes, there are a lot of aspects to creating a useful search tool, and I'm here to help if you need a customized solution, so please feel free to reach out to me. And that should be enough to get you going. Please comment with questions and feedback, and subscribe for more tips like this one. Thank you for your time today.