Learn to Store and Analyze Documents on Windows File System With SQL Server Semantic Search — Part 2
The focus of the article is on comparing documents that can be stored on Windows File System in one respect and the other respects their comparative analysis that can be performed with Semantic Search in SQL Server.
Additionally, the readers will learn how to store unstructured data by exploring File Table and creating MS Word documents on the fly (instantly) to be consumed by Semantic Search.
This part of the article is related to the use of Semantic Search on unstructured data for the extraction of basic level business-crucial information provided standard naming is in place.
Prerequisites
The readers need to go through Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 1 to proceed further with it.
This article assumes the readers are familiar with most (if not all) of the following:
- File Table Concepts
- Semantic Search Concepts
- How to use SQL Server Semantic Search
- Basics of Full-Text Search
This article also assumes that the following steps have already been completed.
- FILESTREAM Enabled Database EmployeesFilestreamSample has been set up.
- File Table has been set up.
- MS Word document Asif Permanent Employee and Peter Permanent Employee have been stored in a windows folder managed through File Table.
- Querying the database File Table shows the information about the stored documents.
Please refer to the article Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 1 to complete any of the above missing steps, which are mandatory to cover the walkthrough completely.
Performing Name-Based Documents Comparative Analysis
As we are preparing to perform documents comparative analysis based on their standard naming, at this point it is worth doing a quick check by querying the FILESTREAM Enabled database we set up in the first part of the article.
Run the following script against EmployeesFilestreamSample database:
-- View stored documents managed by File Table to check
SELECT stream_id
,[name]
,file_type
,creation_time
FROM EmployeesFilestreamSample.dbo.EmployeesDocumentStore
The results must show the stored documents as follows:
Semantic Search Checklist
We have successfully set up a FILESTREAM enabled database and created and stored two sample MS Word documents on File System using File Table.
However, that does not automatically qualify the documents for the Semantic Search usage scenario.
If you were following the earlier Semantic Search and Full-Text Search articles referenced in this article, then you must be well aware of why we cannot yet apply Semantic Search.
Semantic Search can be enabled in one of the following ways:
- If you have already setup Full-Text Search, then you can enable Semantic Search with a single step.
- You can directly set up Semantic Search, but that means you also have to take the steps required to set up Full-Text Search.
Full-Text Search Test Before Semantic Search Setup
If a Full-Text query works, then we just need to enable Semantic Search, because this means we have met most of the requirements.
To check that, you can also run a Full-Text query against the desired table as follows:
-- Searching word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
WHERE CONTAINS(name,'Employee')
The output is as follows:
This quick check shows us that we first need to fulfill the requirements of Full-Text Search and then enable Semantic Search.
Enabling Semantic Search for Use
We require at least the two following points before we can use Semantic Search to perform document comparative analysis:
- Unique Index.
- A Full-Text catalog.
- A Full-Text index.
Write the following T-SQL script to first create a unique index:
-- Create unique index required for Semantic Search
CREATE UNIQUE INDEX UQ_Stream_Id
ON EmployeesDocumentStore(stream_id)
GO
Then create a Full-Text catalog based on the newly created unique index followed by creating a Full-Text index as shown below:
-- Getting Semantic Search ready to be used with File Table
CREATE FULLTEXT CATALOG EmployeesFileTableCatalog WITH ACCENT_SENSITIVITY = ON;
CREATE FULLTEXT INDEX ON EmployeesDocumentStore
(
name LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_type LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_stream TYPE COLUMN file_type LANGUAGE 1033 STATISTICAL_SEMANTICS
)
KEY INDEX UQ_Stream_Id
ON EmployeesFileTableCatalog WITH CHANGE_TRACKING AUTO, STOPLIST=SYSTEM;
The results are as follows:
Full-Text Search Test After Semantic Search Setup
Let us now run the same Full-Text query to search for the word Employee in the stored documents:
-- Searching (after Semantic Search setup) word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
WHERE CONTAINS(name,'Employee')
The output is as follows:
It is all fine for Full-Text queries to work against the File Table while we are getting it ready for Semantic Search. As mentioned earlier, Semantic Search is an add-on to Full-Text Search, that’s why you get an added benefit of using Full-Text queries.
Add More MS Word Documents
Now, go to the EmployeesDocumentStore File Table and click Explore FileTable Directory:
Create and store a new document called Sadaf Contract Employee as follows:
Next, add the following text to the newly-created document after ensuring the first line is the title of the document:
Sadaf Contract Employee (title)
Sadaf is a very efficient business analyst who does contact based work. She is fully capable of handling business requirements and turning them into technical specifications for the developers to work on. She is a very experienced business analyst.
Add another document called Mike Permanent Employee:
Update the document with the following text:
Mike Permanent Employee (Title of the document)
Mike is a fresh programmer whose expertise include web development. He is a quick learner and happy to work on any project. He has strong problem solving skills but he has less business knowledge. He requires assistance from other developers or business analysts to understand the problem and to meet the requirements.
He is good when he works on small projects but he struggles if he is given big or complex project.
We now have four documents stored on Windows File System managed by File Table, which are to be consumed by Semantic Search (including Full-Text Search).
Although we have just stored four MS Word documents in the folder as a sample, you can imagine the importance of using Semantic Search when hundreds of such documents are maintained by a SQL Server database, and you need to query those documents to find valuable information.
Please note that the standard naming of the documents matters a lot for the successful implementation of this approach keeping in mind the scope (title based Semantic Search) of this article.
Simple Counting of Documents
We can now compare these documents to find out the differences and similarities between them based on their standard naming using Semantic Search.
For example, a simple query can tell us the total number of documents stored in Windows Folder:
-- Getting total number of stored documents
SELECT COUNT(*) AS Total_Documents FROM EmployeesDocumentStore
Permanent vs Contract-Based Employees Comparison
This time we are using Semantic Search to compare how many permanent and contract-based employees we have got in our sample organization:
-- Creating a summary table variable
DECLARE @Documents TABLE
(DocumentType VARCHAR(100),
DocumentsCount INT)INSERT INTO @Documents -- Storing total number of stored documents into summary table
SELECT 'Total Documents',COUNT(*) AS Total_Documents FROM EmployeesDocumentStoreINSERT INTO @Documents -- Storing total number of permananent employees documents stored into summary table
SELECT 'Total Permanent Employees',COUNT(*)
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Permanent'INSERT INTO @Documents --Storing total number of permananent employees documents stored
SELECT 'Total Contract Employees',COUNT(*)
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Contract'SELECT DocumentType,DocumentsCount FROM @Documents
The output is as follows:
Documents Keyphrase Score
Please remember the relative score is a very solid criterion to find out the relevance of key phrases in your documents, but it may get complicated as well.
Let us run a simple (document name-based) Semantic Search query to view the keyphrase and its relative score for each of the documents:
-- Getting keyphrase and relative score for all the documents
SELECT * FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
ORDER BY score
The output is as follows:
Add More Details to the Names of Documents
Let us add a little more details to the names of the documents by renaming the documents as follows:
- Asif Permanent Employee — Experienced Project Manager
- Mike Permanent Employee — Fresh Programmer
- Peter Permanent Employee — Fresh Project Manager
- Sadaf Contract Employee — Experienced Business Analyst
Finding Fresh Employees (Documents)
Let us now find the documents related to the fresh employees based on their (standard naming of ) titles:
-- Getting document name-based scoring to find fresh employees for a new project
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName
,keyphrase,score FROM semantickeyphrasetable(EmployeesDocumentStore, NAME) where keyphrase='fresh'
order by DocumentName desc
The results are as follows:
Finding Experienced Employees (Documents)
Now, suppose we want to quickly review all the experienced employees’ details for quite a complex project ahead.
You can find this out using the following Semantic Search query:
-- Getting document name-based scoring to find all experienced employees
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
where keyphrase='experienced' order by DocumentName
The output is as follows:
Finding all Project Managers (Documents)
Finally, if we would like to quickly go through the documents for all the project managers, then the following Semantic Search query needs to be issued:
-- Getting document name based scoring to find all project managers
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
where keyphrase='Project'
The results are shown below:
Continue reading to learn more about the limitation of Semantic Search and get a to-do list to apply both Full-Text Search and Semantic Search (for name-based queries) — https://codingsight.com/learn-to-store-and-analyze-documents-on-windows-file-system-with-sql-server-semantic-search-part-2/.
Thank you and stay tuned!