Learn to Store and Analyze Documents on Windows File System With SQL Server Semantic Search — Part 2

Image for post
Image for post

The focus of the article is on comparing documents that can be stored on Windows File System in one respect and the other respects their comparative analysis that can be performed with Semantic Search in SQL Server.

Additionally, the readers will learn how to store unstructured data by exploring File Table and creating MS Word documents on the fly (instantly) to be consumed by Semantic Search.

This part of the article is related to the use of Semantic Search on unstructured data for the extraction of basic level business-crucial information provided standard naming is in place.

Prerequisites

The readers need to go through Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 1 to proceed further with it.

This article assumes the readers are familiar with most (if not all) of the following:

This article also assumes that the following steps have already been completed.

  • FILESTREAM Enabled Database EmployeesFilestreamSample has been set up.
  • File Table has been set up.
  • MS Word document Asif Permanent Employee and Peter Permanent Employee have been stored in a windows folder managed through File Table.
  • Querying the database File Table shows the information about the stored documents.

Please refer to the article Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 1 to complete any of the above missing steps, which are mandatory to cover the walkthrough completely.

Performing Name-Based Documents Comparative Analysis

As we are preparing to perform documents comparative analysis based on their standard naming, at this point it is worth doing a quick check by querying the FILESTREAM Enabled database we set up in the first part of the article.

Run the following script against EmployeesFilestreamSample database:

-- View stored documents managed by File Table to check
SELECT stream_id
,[name]
,file_type
,creation_time
FROM EmployeesFilestreamSample.dbo.EmployeesDocumentStore

The results must show the stored documents as follows:

Image for post
Image for post

Semantic Search Checklist

We have successfully set up a FILESTREAM enabled database and created and stored two sample MS Word documents on File System using File Table.

However, that does not automatically qualify the documents for the Semantic Search usage scenario.

If you were following the earlier Semantic Search and Full-Text Search articles referenced in this article, then you must be well aware of why we cannot yet apply Semantic Search.

Semantic Search can be enabled in one of the following ways:

  1. If you have already setup Full-Text Search, then you can enable Semantic Search with a single step.
  2. You can directly set up Semantic Search, but that means you also have to take the steps required to set up Full-Text Search.

Full-Text Search Test Before Semantic Search Setup

If a Full-Text query works, then we just need to enable Semantic Search, because this means we have met most of the requirements.

To check that, you can also run a Full-Text query against the desired table as follows:

-- Searching word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
WHERE CONTAINS(name,'Employee')

The output is as follows:

Image for post
Image for post

This quick check shows us that we first need to fulfill the requirements of Full-Text Search and then enable Semantic Search.

Enabling Semantic Search for Use

We require at least the two following points before we can use Semantic Search to perform document comparative analysis:

  1. Unique Index.
  2. A Full-Text catalog.
  3. A Full-Text index.

Write the following T-SQL script to first create a unique index:

-- Create unique index required for Semantic Search
CREATE UNIQUE INDEX UQ_Stream_Id
ON EmployeesDocumentStore(stream_id)
GO

Then create a Full-Text catalog based on the newly created unique index followed by creating a Full-Text index as shown below:

-- Getting Semantic Search ready to be used with File Table
CREATE FULLTEXT CATALOG EmployeesFileTableCatalog WITH ACCENT_SENSITIVITY = ON;
CREATE FULLTEXT INDEX ON EmployeesDocumentStore
(
name LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_type LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_stream TYPE COLUMN file_type LANGUAGE 1033 STATISTICAL_SEMANTICS
)
KEY INDEX UQ_Stream_Id
ON EmployeesFileTableCatalog WITH CHANGE_TRACKING AUTO, STOPLIST=SYSTEM;

The results are as follows:

Image for post
Image for post

Full-Text Search Test After Semantic Search Setup

Let us now run the same Full-Text query to search for the word Employee in the stored documents:

-- Searching (after Semantic Search setup) word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
WHERE CONTAINS(name,'Employee')

The output is as follows:

Image for post
Image for post

It is all fine for Full-Text queries to work against the File Table while we are getting it ready for Semantic Search. As mentioned earlier, Semantic Search is an add-on to Full-Text Search, that’s why you get an added benefit of using Full-Text queries.

Add More MS Word Documents

Now, go to the EmployeesDocumentStore File Table and click Explore FileTable Directory:

Image for post
Image for post

Create and store a new document called Sadaf Contract Employee as follows:

Image for post
Image for post

Next, add the following text to the newly-created document after ensuring the first line is the title of the document:

Sadaf Contract Employee (title)
Sadaf is a very efficient business analyst who does contact based work. She is fully capable of handling business requirements and turning them into technical specifications for the developers to work on. She is a very experienced business analyst.
Image for post
Image for post

Add another document called Mike Permanent Employee:

Image for post
Image for post

Update the document with the following text:

Mike Permanent Employee (Title of the document)
Mike is a fresh programmer whose expertise include web development. He is a quick learner and happy to work on any project. He has strong problem solving skills but he has less business knowledge. He requires assistance from other developers or business analysts to understand the problem and to meet the requirements.
He is good when he works on small projects but he struggles if he is given big or complex project.
Image for post
Image for post

We now have four documents stored on Windows File System managed by File Table, which are to be consumed by Semantic Search (including Full-Text Search).

Although we have just stored four MS Word documents in the folder as a sample, you can imagine the importance of using Semantic Search when hundreds of such documents are maintained by a SQL Server database, and you need to query those documents to find valuable information.

Please note that the standard naming of the documents matters a lot for the successful implementation of this approach keeping in mind the scope (title based Semantic Search) of this article.

Simple Counting of Documents

We can now compare these documents to find out the differences and similarities between them based on their standard naming using Semantic Search.

For example, a simple query can tell us the total number of documents stored in Windows Folder:

-- Getting total number of stored documents
SELECT COUNT(*) AS Total_Documents FROM EmployeesDocumentStore
Image for post
Image for post

Permanent vs Contract-Based Employees Comparison

This time we are using Semantic Search to compare how many permanent and contract-based employees we have got in our sample organization:

-- Creating a summary table variable
DECLARE @Documents TABLE
(DocumentType VARCHAR(100),
DocumentsCount INT)
INSERT INTO @Documents -- Storing total number of stored documents into summary table
SELECT 'Total Documents',COUNT(*) AS Total_Documents FROM EmployeesDocumentStore
INSERT INTO @Documents -- Storing total number of permananent employees documents stored into summary table
SELECT 'Total Permanent Employees',COUNT(*)
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Permanent'
INSERT INTO @Documents --Storing total number of permananent employees documents stored
SELECT 'Total Contract Employees',COUNT(*)
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Contract'
SELECT DocumentType,DocumentsCount FROM @Documents

The output is as follows:

Image for post
Image for post

Documents Keyphrase Score

Please remember the relative score is a very solid criterion to find out the relevance of key phrases in your documents, but it may get complicated as well.

Let us run a simple (document name-based) Semantic Search query to view the keyphrase and its relative score for each of the documents:

-- Getting keyphrase and relative score for all the documents
SELECT * FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
ORDER BY score

The output is as follows:

Image for post
Image for post

Add More Details to the Names of Documents

Let us add a little more details to the names of the documents by renaming the documents as follows:

  1. Asif Permanent Employee — Experienced Project Manager
  2. Mike Permanent Employee — Fresh Programmer
  3. Peter Permanent Employee — Fresh Project Manager
  4. Sadaf Contract Employee — Experienced Business Analyst
Image for post
Image for post

Finding Fresh Employees (Documents)

Let us now find the documents related to the fresh employees based on their (standard naming of ) titles:

-- Getting document name-based scoring to find fresh employees for a new project
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName
,keyphrase,score FROM semantickeyphrasetable(EmployeesDocumentStore, NAME) where keyphrase='fresh'
order by DocumentName desc

The results are as follows:

Image for post
Image for post

Finding Experienced Employees (Documents)

Now, suppose we want to quickly review all the experienced employees’ details for quite a complex project ahead.

You can find this out using the following Semantic Search query:

-- Getting document name-based scoring to find all experienced employees
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
where keyphrase='experienced' order by DocumentName

The output is as follows:

Image for post
Image for post

Finding all Project Managers (Documents)

Finally, if we would like to quickly go through the documents for all the project managers, then the following Semantic Search query needs to be issued:

-- Getting document name based scoring to find all project managers
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
where keyphrase='Project'

The results are shown below:

Image for post
Image for post

Continue reading to learn more about the limitation of Semantic Search and get a to-do list to apply both Full-Text Search and Semantic Search (for name-based queries) — https://codingsight.com/learn-to-store-and-analyze-documents-on-windows-file-system-with-sql-server-semantic-search-part-2/.

Thank you and stay tuned!

Written by

Awesome blog focused on databases and Microsoft, .NET and cloud technologies. http://codingsight.com/

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store