Examine - Lucene index problem

Harm Holtackers 2 posts 22 karma points

Jun 13, 2014 @ 14:28

Hello,

I have a problem with the way Lucene indexes my Umbraco content.

The following HTML is indexed incorrectly:

test<br />test2 becomes testtest2 in the search results, because the <br /> is stripped.

Is it possible to replace the <br /> with \n instead of stripping it?

Best regards,

Harm Holtackers

Jamie Pollock 172 posts 846 karma points c-trib

Jun 13, 2014 @ 16:42

Hey Harm,
By default, HTML is stripped when the content is indexed. However, in the SearchResult Fields collection you'll find a __raw_<propertyAlias> field that contains the original HTML.

I recently discovered this myself on my own journey into Lucene/Examine. :)

I hope this answers your question.

Thanks,
Jamie

Harm Holtackers 2 posts 22 karma points

Jun 13, 2014 @ 16:54

Thanks Jamie,

I discovered this too, but the problem is that I do a fuzzy search on contentText, and when the words are concatenated some results don't show up.
It's strange that the words are concatenated when there is only a <br /> between them.

Best regards,

Harm Holtackers

Jamie Pollock 172 posts 846 karma points c-trib

Jun 13, 2014 @ 17:13

I guess you could add an event handler to work around this. I'm not suggesting this is the best solution, mind you; there might be a better one, as I'm fairly new to Lucene.

First, subscribe to your indexer's GatheringNodeData event, which lets you edit the data before it's indexed.

var nameOfYourIndexer = "MyCustomExternalIndexer";
ExamineManager.Instance.IndexProviderCollection[nameOfYourIndexer].GatheringNodeData += ExamineEvents_GatheringNodeData;

Then, in the event handler:

// Requires: using System.Linq; using System.Text.RegularExpressions; using Examine;
void ExamineEvents_GatheringNodeData(object sender, IndexingNodeDataEventArgs e) {
    GenerateSearchableHtmlContent(e);
}

private void GenerateSearchableHtmlContent(IndexingNodeDataEventArgs e) {
    var node = e.Node;
    var htmlContentPropertyAlias = "yourPropertyAlias";

    var htmlContentFromXmlNode = node.Descendants(htmlContentPropertyAlias).FirstOrDefault();
    if (htmlContentFromXmlNode != null && string.IsNullOrEmpty(htmlContentFromXmlNode.Value) == false) {
        // Replace closing tags and <br>, <br/> or <br /> with a space so adjacent words are not joined.
        var htmlWithSpacesForClosingAndLineBreakTags = Regex.Replace(htmlContentFromXmlNode.Value, @"(</[a-z][a-z0-9]*\s*>|<br\s*/?>)", " ", RegexOptions.IgnoreCase);

        // Strip the remaining tags.
        var strippedHtml = umbraco.library.StripHtml(htmlWithSpacesForClosingAndLineBreakTags);

        // Collapse runs of whitespace so the search field holds single-spaced text.
        var searchableText = Regex.Replace(strippedHtml, @"\s+", " ").Trim();

        e.Fields.Add("searchableSanitizedHtmlField", searchableText);
    }
}

I hope this helped. Note: I've not tested the code at all...
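
If you want to try the string handling outside Umbraco first, here is a rough, standalone sketch of the same idea. HtmlSanitizerSketch and ToSearchableText are made-up names, not part of Examine or Umbraco, and a plain regex stands in for umbraco.library.StripHtml, so the real library call may behave differently (for example, entities such as &amp; are not decoded here).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for experimenting outside Umbraco; not an Examine or Umbraco API.
public static class HtmlSanitizerSketch
{
    public static string ToSearchableText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Replace every tag with a space so the words on either side stay separate.
        // Assumption: this simple regex is a stand-in for umbraco.library.StripHtml.
        var withoutTags = Regex.Replace(html, @"<[^>]+>", " ");

        // Collapse runs of whitespace into a single space and trim both ends.
        return Regex.Replace(withoutTags, @"\s+", " ").Trim();
    }
}
```

With this, HtmlSanitizerSketch.ToSearchableText("test<br />test2") gives "test test2" rather than "testtest2". One trade-off: because every tag becomes a space, inline markup in the middle of a word (te<b>st</b>) would split that word in two, which is why the handler above only targets closing tags and line breaks before stripping the rest.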

Thanks,
Jamie
