
  • Harm Holtackers 2 posts 22 karma points
    Jun 13, 2014 @ 14:28

    Examine - lucene index problem

    Hello,

     

    I have a problem with the way Lucene indexes Umbraco content.

    The following HTML is indexed incorrectly:

    test<br />test2 becomes testtest2 in the search results, because the <br /> is stripped.

    Is it possible to replace the <br /> with \n instead of stripping it?

     

    Best regards,

    Harm Holtackers

  • Jamie Pollock 172 posts 846 karma points c-trib
    Jun 13, 2014 @ 16:42

    Hey Harm,
    By default, HTML is stripped by Lucene. However, in the SearchResult Fields collection you'll find a __raw_<propertyAlias> field, which contains the original HTML.

    I recently discovered this myself on my own journey into Lucene/Examine. :)

    I hope this answers your question.

    Thanks,
    Jamie

  • Harm Holtackers 2 posts 22 karma points
    Jun 13, 2014 @ 16:54

    Thanks Jamie,

    I discovered this too, but the problem is that I do a fuzzy search on contentText, and when the words are concatenated some results don't show up.
    It's strange that the words are concatenated when there is only a <br /> between them.

    Best regards,

    Harm Holtackers

  • Jamie Pollock 172 posts 846 karma points c-trib
    Jun 13, 2014 @ 17:13

    I guess you could add an event handler to alleviate the situation. I'm not suggesting this is the best solution, mind you. There might be a better one, as I'm fairly new to Lucene.

    First of all, subscribe to your indexer's GatheringNodeData event, which allows you to edit the data before it's indexed.

    var nameOfYourIndexer = "MyCustomExternalIndexer";
    ExamineManager.Instance.IndexProviderCollection[nameOfYourIndexer].GatheringNodeData += ExamineEvents_GatheringNodeData;
    

    Then, in the event handler:

    // Requires: using System.Linq; using System.Text.RegularExpressions; using Examine;
    void ExamineEvents_GatheringNodeData(object sender, IndexingNodeDataEventArgs e) {
        GenerateSearchableHtmlContent(e);
    }
    
    private void GenerateSearchableHtmlContent(IndexingNodeDataEventArgs e) {
        var node = e.Node;
        var htmlContentPropertyAlias = "yourPropertyAlias";
    
        var htmlContentFromXmlNode = node.Descendants(htmlContentPropertyAlias).FirstOrDefault();
        if (htmlContentFromXmlNode != null && string.IsNullOrEmpty(htmlContentFromXmlNode.Value) == false) {
            // Replace closing tags and <br>, <br/> or <br /> with a space so adjacent words stay separate.
            var contentWhereTheClosingTagAndLinebreakTagsAreRemovedAndReplacedWithAnAdditionalSpace = Regex.Replace(htmlContentFromXmlNode.Value, @"</[a-z][a-z0-9]*>|<br\s*/?>", " ", RegexOptions.IgnoreCase);
    
            // Strip whatever markup is left.
            var strippedHtml = umbraco.library.StripHtml(contentWhereTheClosingTagAndLinebreakTagsAreRemovedAndReplacedWithAnAdditionalSpace);
    
            // Collapse runs of whitespace into a single space.
            var htmlTrimmedForWhitespaceToEnsureNotTooMuchWhitespaceIsLeftInTheResultingSearchField = Regex.Replace(strippedHtml, @"\s+", " ").Trim();
    
            // Index the sanitized text as its own field; search against this field instead of the original.
            e.Fields.Add("searchableSanitizedHtmlField", htmlTrimmedForWhitespaceToEnsureNotTooMuchWhitespaceIsLeftInTheResultingSearchField);
        }
    }
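
    For anyone who wants to try the replace-then-strip idea from the handler above without an Umbraco install, here is a minimal sketch using only System.Text.RegularExpressions. The SearchableHtml class and its method names are made up for illustration, and the simple tag-removing regex is only an assumed stand-in for umbraco.library.StripHtml, which may behave differently on real markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchableHtml
{
    // Assumed stand-in for umbraco.library.StripHtml: removes anything that looks like a tag.
    private static string StripTags(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }

    // Stripping tags directly glues neighbouring words together.
    public static string NaiveStrip(string html)
    {
        return StripTags(html);
    }

    // Replace closing tags and <br> variants with a space, strip the rest,
    // then collapse the leftover whitespace.
    public static string ToSearchableText(string html)
    {
        var spaced = Regex.Replace(
            html, @"</[a-z][a-z0-9]*>|<br\s*/?>", " ", RegexOptions.IgnoreCase);
        var stripped = StripTags(spaced);
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

    With the input test<br />test2, NaiveStrip returns testtest2 (the behaviour described at the top of the thread), while ToSearchableText returns test test2, so a fuzzy search sees two separate words.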
    

    I hope this helped. Note: I've not tested the code at all...

    Thanks,
    Jamie
