Wednesday, October 16, 2024

Indexing Shared Data Items from Search Result Pages in Sitecore

We ran into problems indexing individual shared data items (e.g., cards from a listing page) as independent Sitecore Search items. Only some of the items were being indexed, even though the extractor logic looked correct.


Scenario

  • The search result page lists shared data items (not individual pages, so they don’t have distinct URLs).
  • The goal is to index each card on the page as an independent item in Sitecore Search with additional tags and logic for extra data values.
  • A flag was added to bypass the "Load More" functionality (?showAll=true) to make all cards visible on a single page for indexing.

Extractor Code

Here’s the extractor being used:


function extract(request, response) {
  const $ = response.body;
  const results = [];

  // Only run on the gallery listing page
  if (request.url.includes('image-gallery')) {
    $('div.row.layout').each((i, row) => {
      // Each matching column is one gallery card
      $(row).find('div.col-12.column.col-lg-4[gallery-layout="true"]').each((j, col) => {
        const dataAnchorTitle = $(col).attr('data-anchor-title');
        const imgSrc = $(col).find('div.gallery figure img').attr('src');

        // Skip cards that are missing a title or an image
        if (dataAnchorTitle && imgSrc) {
          results.push({ title: dataAnchorTitle, image: imgSrc, type: 'gallery' });
        }
      });
    });
  }

  return results;
}

Issues Observed

  1. Only One Item Indexed: Although the page contains multiple cards, only one was being indexed.
  2. Mandatory Fields: Missing mandatory fields in the index configuration could cause items to fail indexing.
  3. Order of Extractors: The sequence of extractors (JavaScript extractor vs. XPath extractor) might be causing conflicts.
  4. ID Generation: The ID for items was not being generated consistently, leading to indexing failures.

Resolution Steps

  1. Validate Extractor Logic:

    • Ensure the extractor captures all items on the page.
    • Hardcode mandatory fields temporarily to check if items get indexed correctly.
  2. Adjust Extractor Sequence:

    • Reorder the JavaScript Document Extractor to run before other extractors (e.g., XPath extractor) or remove unnecessary extractors.
  3. Generate IDs Programmatically:

    • Add logic to generate unique IDs for each item in the JavaScript extractor:

      const id = `${dataAnchorTitle}-${i}-${j}`;
      results.push({ id, title: dataAnchorTitle, image: imgSrc, type: 'gallery' });
  4. Re-Index and Validate:

    • Re-index with only the relevant extractor and verify results.
  5. Address Errors:

    • If errors like "heartbeat error" occur, retry indexing after resolving underlying connectivity or configuration issues.
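
The ID and mandatory-field steps above can be sketched as a small, standalone helper. This is only an illustration: slugify and buildGalleryDocuments are hypothetical names, the row/col properties stand in for the loop indices i and j, and the description fallback is a placeholder for whichever attributes are marked mandatory in your own index configuration. Slugging the title keeps spaces and punctuation out of the ID, and the Set guards against emitting the same ID twice.

```javascript
// Sketch only: slugify() and buildGalleryDocuments() are hypothetical helpers,
// and 'description' stands in for whatever your index marks as mandatory.

// Turn free text into a lowercase token that is safe to use inside an ID.
function slugify(text) {
  return String(text)
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, '-') // collapse spaces and punctuation into '-'
    .replace(/^-+|-+$/g, '');    // drop leading and trailing '-'
}

// Build one index document per card.
// `cards` is the raw data collected by the extractor:
//   [{ title, image, row, col, description? }, ...]
function buildGalleryDocuments(cards) {
  const seen = new Set();
  const documents = [];
  for (const card of cards) {
    if (!card.title || !card.image) continue; // skip incomplete cards
    const id = `${slugify(card.title)}-${card.row}-${card.col}`;
    if (seen.has(id)) continue; // never emit the same ID twice
    seen.add(id);
    documents.push({
      id,
      title: card.title,
      image: card.image,
      type: 'gallery',
      // Temporary fallback so a mandatory field is never empty
      description: card.description || card.title,
    });
  }
  return documents;
}

const docs = buildGalleryDocuments([
  { title: 'Summer Gala 2024', image: '/media/gala.jpg', row: 0, col: 1 },
  { title: 'Summer Gala 2024', image: '/media/gala.jpg', row: 0, col: 1 },
  { title: '', image: '/media/untitled.jpg', row: 0, col: 2 },
]);
console.log(docs.map((d) => d.id)); // [ 'summer-gala-2024-0-1' ]
```

Inside the extractor, the same function could be called on the array of cards collected in the .each() loops, with its return value used as the extractor's result.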

Outcome

After implementing these changes:

  • All cards on the page were indexed as independent items.
  • Proper field mappings and unique ID generation ensured data integrity.
  • The reordering of extractors resolved conflicts during the indexing process.

These steps should help streamline indexing shared data items from listing pages in Sitecore Search.

Page Editor Doesn’t Show Current or Shared Site Name in Data Source

 While working with the Page Editor, we noticed an inconsistency in how data source information is displayed compared to the Content Editor. In the Content Editor, data sources clearly indicate whether they belong to the "Current site" or a "Shared site." For example:

  • Carousels (Current site)
  • Carousels (Shared: Shared)
  • Media carousel 1
  • Data (Current site)

However, in the Page Editor, the same data sources are displayed without this distinction:

  • Carousels
  • Carousels
  • Media carousel 1
  • Data

This lack of clarity makes it challenging to differentiate between site-specific and shared data sources within the Page Editor.


Update from Sitecore

Sitecore Support confirmed that the Page Editor currently does not display the "Current site" or "Shared site" labels in the data source selection dialog, unlike the Content Editor.

To address this, Sitecore has created a Feature Request (PGS-2562) to enhance the Page Editor with similar functionality in a future update.


Next Steps

While we wait for the feature to be implemented:

  1. Cross-Verify in Content Editor: Use the Content Editor to confirm the origin of data sources until this feature is available in the Page Editor.
  2. Track Updates: Monitor the progress of the feature request (PGS-2562) with Sitecore Support.

This improvement will bring much-needed clarity and consistency to the Page Editor, simplifying workflows for content editors.

Tuesday, October 15, 2024

Persistent "Heartbeat Error" in Sitecore Search Indexing Jobs

Our Sitecore Search indexing jobs have been failing repeatedly with the error: "Job failed due to heartbeat error." The failures disrupt indexing in production and have become a significant bottleneck for search functionality.




Key Observations

  1. Threshold Limitation:

    • The failures often trace back to a failure threshold: if more than 30% of documents fail (e.g., due to 404 errors), the entire indexing job is discarded.
    • This threshold was confirmed by Sitecore Support as a hard limit with no current configuration option to adjust it.
  2. Unclear Error Reporting:

    • Errors like the "heartbeat error" provide no detailed context, making it difficult to determine whether the issue is due to resource limitations, 404 errors, or other failures.
  3. Sitemap Issues:

    • Delayed sitemap refreshes sometimes lead to invalid or outdated URLs being crawled, resulting in multiple 404 errors.
  4. Random Behavior:

    • In some cases, even with valid configurations and no 404s, jobs fail due to the heartbeat error without any clear cause.
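
To make the 30% threshold concrete, here is a small arithmetic sketch. It assumes the rule is strictly "more than 30%", as described above; whether a job at exactly 30% survives is not confirmed, and the function names are hypothetical. Integer arithmetic is used to avoid floating-point rounding at the boundary.

```javascript
// Assumption: a job is discarded when MORE than 30% of documents fail.
const FAILURE_THRESHOLD_PERCENT = 30;

// True when the failure ratio exceeds the threshold.
function wouldDiscardJob(totalDocuments, failedDocuments, thresholdPercent = FAILURE_THRESHOLD_PERCENT) {
  if (totalDocuments <= 0) return false; // nothing crawled, nothing to compare
  return failedDocuments * 100 > totalDocuments * thresholdPercent;
}

// Largest number of failures a crawl of this size can absorb.
function maxTolerableFailures(totalDocuments, thresholdPercent = FAILURE_THRESHOLD_PERCENT) {
  return Math.floor((totalDocuments * thresholdPercent) / 100);
}

console.log(maxTolerableFailures(200)); // 60
console.log(wouldDiscardJob(200, 60));  // false (exactly 30%)
console.log(wouldDiscardJob(200, 61));  // true  (30.5%)
```

For a 200-URL sitemap, 61 stale links returning 404 would be enough to discard the whole job under this assumption, which is why a delayed sitemap refresh can be so costly.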

Temporary Fixes Applied

  1. Adjusting Sitemap:

    • Updated and validated the sitemap to ensure all links are functional and up to date.
    • Excluded problematic URLs and broken links to reduce 404 errors.
  2. Simplified Extractor Configuration:

    • Removed unnecessary extractors to streamline the indexing process.
    • Combined document extractors to manage resources more efficiently.
  3. Retry Mechanism:

    • Reran jobs after addressing sitemap and threshold-related issues, which resolved some occurrences of the error.
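
The sitemap validation step can be scripted as a pre-flight check before triggering a crawl. The sketch below is not part of Sitecore Search; extractSitemapUrls and findBrokenUrls are hypothetical helpers, and the default fetch is the global one available in Node.js 18 and later. Some servers reject HEAD requests, so a GET fallback may be needed in practice.

```javascript
// Pull every <loc> value out of a sitemap XML string. A regex is enough for a
// quick pre-flight check; use a real XML parser for anything more demanding.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1].replace(/&amp;/g, '&')); // decode escaped ampersands
  }
  return urls;
}

// Request each URL in turn and return those that do not answer with a 2xx
// status. `fetchImpl` can be swapped for a stub when testing offline.
async function findBrokenUrls(urls, fetchImpl = fetch) {
  const broken = [];
  for (const url of urls) {
    try {
      const res = await fetchImpl(url, { method: 'HEAD' });
      if (!res.ok) broken.push({ url, status: res.status });
    } catch (err) {
      broken.push({ url, status: 'network error' });
    }
  }
  return broken;
}

const sampleSitemap =
  '<urlset>' +
  '<url><loc>https://example.com/a</loc></url>' +
  '<url><loc> https://example.com/b?x=1&amp;y=2 </loc></url>' +
  '</urlset>';
console.log(extractSitemapUrls(sampleSitemap));
// [ 'https://example.com/a', 'https://example.com/b?x=1&y=2' ]
```

Calling findBrokenUrls(extractSitemapUrls(xml)) on the live sitemap returns a promise for the list of failing URLs, which can then be compared against the total to see how close the crawl would come to the failure threshold.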

Points Discussed with the Sitecore Team

While investigating the persistent "heartbeat error," we raised the following questions and requests with the Sitecore team:

  1. Can the 30% Threshold Be Adjusted?

    • The current threshold causes the entire indexing job to fail if more than 30% of documents encounter issues (e.g., 404 errors). We inquired whether this threshold is configurable to accommodate different scenarios and prevent unnecessary failures.
  2. Improved Error Reporting:

    • We requested more descriptive error messages to clearly identify the root cause of failures, such as whether the issue is due to the 30% threshold, resource capacity, or another specific reason.
  3. Clarification on Heartbeat Error:

    • We asked whether the heartbeat error is solely tied to the threshold or if other factors, such as resource limitations or system configurations, could contribute to this issue.
  4. Sitemap Handling:

    • Given the potential delays in sitemap refreshes, we asked if there are recommendations for ensuring the crawler processes the most up-to-date sitemap without being affected by temporary 404 errors.
  5. Proactive Notifications:

    • We raised the need for proactive notifications for global incidents affecting crawlers, such as the ones reported in the Sitecore Status portal, to minimize downtime and ensure teams are informed promptly.
  6. Future Roadmap for Improvements:

    • We requested insights into Sitecore's plans for enhancing the search platform, including:
      • Allowing configurable thresholds.
      • Improving error handling and reporting.
      • Addressing known bugs or feature requests related to crawlers and extractors.