Thursday, February 8, 2024

Extracting HTML from PDFs in Sitecore Search: A Step-by-Step Implementation


Recently, I was tasked with implementing a solution to extract structured content from PDFs using Sitecore Search. PDFs, by nature, are not HTML-based, and since Sitecore Search document extractors only parse HTML, I needed a way to transform PDFs into an HTML-readable structure. Here’s how I approached and implemented this requirement effectively.


The Challenge

The primary goal was to extract key attributes (like titles, descriptions, and tags) from PDFs by leveraging their HTML structure. However, identifying the HTML structure of a PDF isn’t straightforward. I needed to:

  1. Extract the complete HTML structure of sample PDFs.
  2. Analyze their HTML patterns to configure accurate document extractors.

Implementation Steps

1. Creating a Dummy Attribute for HTML Extraction

To extract the entire HTML content of a PDF, I created a dummy attribute that would hold the HTML structure.

  • Entity: Content
  • Display Name: PDF to HTML
  • Attribute Name: pdf_to_html
  • Placement: Standard
  • Data Type: String

This attribute acted as a placeholder for the extracted HTML content, making it easy to reference later.

2. Setting Up a Temporary Web Crawler

To analyze the HTML structure of PDFs, I created a temporary advanced web crawler to process sample PDF URLs.

  • Source Configuration:

    • Source Name: Temp PDF to HTML
    • Connector Type: Web Crawler (Advanced)
    • Max Depth: Set to 0 to prevent the crawler from indexing linked URLs.
  • Request Triggers:
    I added URLs for a few representative PDFs (covering diverse structures such as text-heavy, image-heavy, and Q&A formats) as request triggers.

This temporary crawler ensured that only the selected PDFs were processed.

3. Configuring the JavaScript Document Extractor

To extract the HTML structure, I configured a JavaScript document extractor with the following logic:


// Sample document extractor function to get HTML from PDF content.
function extract(request, response) {
    $ = response.body;
    return [{
        'pdf_to_html': $('html').html(), // Extracts the HTML structure of the document
        'type': "pdf" // Mandatory attribute for defining the document type
    }];
}

This function uses the jQuery-style html() method to return the inner HTML of the document's root html element, and assigns it to the pdf_to_html attribute.

4. Viewing the Extracted HTML in Content Collection

After publishing and running the crawler, I inspected the indexed content to analyze the HTML structure.

  • Navigated to Content Collection in the Sitecore Search Customer Engagement Console (CEC).
  • Filtered by the source Temp PDF to HTML.
  • Opened individual content items to examine the PDF to HTML attribute, which displayed the full HTML structure of the PDF.

By repeating this process for multiple sample PDFs, I was able to identify patterns in the HTML content.
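
To compare samples side by side, it can help to summarize each extracted value instead of reading it raw. The following is a minimal sketch meant to be run locally on a pdf_to_html value copied out of the Content Collection. The tagHistogram helper and the sample string are hypothetical and are not part of Sitecore Search.

```javascript
// Hypothetical helper: counts opening tags in an HTML string so that
// recurring structures (for example many <p> or <div> elements) stand out.
function tagHistogram(html) {
    const counts = {};
    // Matches the tag name of every opening tag. Closing tags are skipped
    // because "/" is not a letter.
    const openingTag = /<([a-zA-Z][a-zA-Z0-9-]*)/g;
    let match;
    while ((match = openingTag.exec(html)) !== null) {
        const name = match[1].toLowerCase();
        counts[name] = (counts[name] || 0) + 1;
    }
    return counts;
}

// Made-up sample standing in for a copied pdf_to_html value.
const sample = '<head><title>Guide</title></head><body><div><p>One</p><p>Two</p></div></body>';
console.log(tagHistogram(sample));
```

Running the same helper over several samples gives a quick view of which elements appear consistently and which vary by PDF type.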

Key Insights and Learnings

  1. Understand HTML Structures Early:
    Scanning diverse PDFs early helped identify common patterns and edge cases, enabling accurate document extractor configurations.

  2. Iterative Testing:
    Viewing the extracted HTML in the Content Collection ensured that the configurations were capturing the desired data accurately.

  3. Temporary Sources for Exploration:
    Using a temporary crawler isolated the exploration process without impacting production content or configurations.

Final Outcome

With the HTML structures in hand, I configured specialized document extractors tailored to extract meaningful attributes like titles, descriptions, and tags from PDFs. The solution is now scalable to handle a variety of PDF types while ensuring consistent indexing in Sitecore Search.
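
As an illustration of what such a specialized extractor might look like, here is a minimal sketch. The selectors and the name and description attribute names are assumptions made for this example; the real ones depend on the patterns found in the pdf_to_html output and on the attributes configured for the Content entity.

```javascript
// Sketch only: the selectors and the 'name'/'description' attribute names
// are assumptions. Replace them with what the pdf_to_html output and your
// attribute configuration actually contain.
function extract(request, response) {
    var $ = response.body;
    // Assumed: the PDF title surfaces in a <title> element.
    var title = ($('title').text() || '').trim();
    // Assumed: prefer a description meta tag, else fall back to the first paragraph.
    var description = (
        $('meta[name="description"]').attr('content') ||
        $('p').first().text() ||
        ''
    ).trim();
    return [{
        'name': title,
        'description': description,
        'type': 'pdf' // Same type value as the exploratory extractor
    }];
}
```

The fallback chain keeps the extractor from returning undefined values when a given PDF lacks one of the assumed elements.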

Closing Thoughts

This experience highlights the flexibility and power of Sitecore Search’s document extractors. By leveraging tools like JavaScript extractors and advanced web crawlers, extracting structured content from non-HTML formats like PDFs becomes a manageable and efficient process.

Wednesday, February 7, 2024

Sitecore XM Cloud - Fixing URL Redirections in Next.js on Vercel Using vercel.json



When URLs don't redirect correctly, it can be frustrating. Recently, we faced this challenge in our Next.js app on Vercel. Let's walk through how we tackled it.

Our URLs containing spaces or special characters weren't redirecting properly on Vercel. This glitch disrupted our users' flow and needed fixing ASAP.

Exploring Solutions:

At first, we tried a JavaScript-based approach within our app code. It worked fine locally but didn't cut it on Vercel.

Leveraging vercel.json:

Next, we discovered Vercel's powerful vercel.json file. With it, we could define rules that map incoming URLs to new ones. Simple yet effective!

Crafting the Solution:

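We created a vercel.json file with rewrite rules. These rules tell Vercel to map any path containing an encoded space (%20) or a whitespace character to the same path with a hyphen in its place. Note that a rewrite serves the destination without changing the URL in the browser; if the address bar should change as well, the same source and destination pairs can go under the redirects key instead.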


Sample vercel.json Code:


{
  "rewrites": [
    {
      "source": "/(.*)%20(.*)",
      "destination": "/$1-$2"
    },
    {
      "source": "/(.*)\\s(.*)",
      "destination": "/$1-$2"
    }
  ]
}
 

Deployment and Validation:

After updating the vercel.json file, we deployed our Next.js app on Vercel. Through testing, we confirmed that our URLs now redirected smoothly. No more hiccups!
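
When testing, it is worth checking paths with more than one space. The sketch below approximates the first rule with a plain JavaScript regular expression; it is an assumption-laden illustration of the pattern only, not Vercel's actual matcher, and the function names are hypothetical. Because the first group is greedy, a single pass of this pattern replaces only the last encoded space.

```javascript
// Approximates the first rewrite rule with a plain JavaScript regex.
// Assumption: this mimics the pattern only; it is not Vercel's matcher.
function applySpaceRule(path) {
    return path.replace(/^\/(.*)%20(.*)$/, '/$1-$2');
}

// Hypothetical alternative that replaces every encoded or literal
// whitespace character in one pass.
function hyphenateAll(path) {
    return path.replace(/%20|\s/g, '-');
}

console.log(applySpaceRule('/about%20us'));        // /about-us
console.log(applySpaceRule('/our%20team%20page')); // /our%20team-page
console.log(hyphenateAll('/our%20team%20page'));   // /our-team-page
```

If the deployed rules behave like the single-pass approximation, URLs with several spaces may need extra rules or handling elsewhere, so they make good test cases.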

Conclusion:

In our journey to fix URL redirection issues, we learned the power of vercel.json. By leveraging its capabilities, we resolved our problem and improved our users' experience. It's a reminder that sometimes, simple solutions are the best.

Troubleshooting Deployment of Build Artifacts to Another Environment Using Gitflow in Sitecore XM Cloud



When using Sitecore XM Cloud with a Gitflow branching strategy, managing deployments across environments like Development, UAT, and Production can sometimes lead to issues. A common challenge is the inability to promote builds between environments, with the platform indicating that the environment is unavailable.



Scenario Overview

  1. Setup Details:

    • Development environment linked to the development branch.
    • UAT environment linked to the release/xx branch.
    • Production environment intended to receive builds promoted from the UAT environment.
  2. Issue:
    The Promote option becomes inactive, indicating the target environment is unavailable. This issue persists even when reverting branches to default configurations (e.g., linking all to master).

Root Cause

The Gitflow branching strategy assigns environments to specific branches, which can cause conflicts when attempting to promote builds between environments. The promotion process requires flexibility in branch linkage, and if a target environment is bound to a branch, the promotion may fail.

Solution

To resolve this, you must configure the environment to allow promotions without branch constraints:

  1. Update Environment Configuration:

    • For the target environment (e.g., Production), set the branch selection to None.
    • This detaches the environment from any branch, enabling seamless promotion of builds.

    Example:

    • Development: Linked to the development branch.
    • UAT: Linked to release/xx branch.
    • Production: Branch set to None.
  2. Retry Promotion:

    • After adjusting the branch configuration, attempt to promote the build from UAT to Production.
    • The promotion should now proceed without any issues.

Why This Works

By setting the branch selection to None for the target environment:

  • The environment no longer expects artifacts directly from a specific branch.
  • This allows the system to accept promotions as a build artifact rather than enforcing a direct branch linkage.

Best Practices for Gitflow in Sitecore XM Cloud

  1. Use Branching Strategically:

    • Link Development and UAT to their respective branches for testing and validation.
    • Set Production's branch to None for flexibility during promotions.
  2. Validate Build Configurations:

    • Ensure build artifacts are correctly generated and tagged for promotion.
  3. Monitor Environment Status:

    • Before promotions, confirm that both source and target environments are available and properly configured.
  4. Document Changes:

    • Keep a record of branch configurations and environment settings to avoid confusion in future deployments.

Conclusion

This approach ensures smooth promotions while maintaining the integrity of the Gitflow strategy in Sitecore XM Cloud. Adjusting the target environment’s branch configuration to None resolves promotion-related issues, enabling seamless transitions across Development, UAT, and Production environments.

How to Publish a Specific Version in Content Editor and Page Editor


When working in Sitecore, publishing a specific version of an item can be crucial, especially when managing content across multiple versions. In the Content Editor, Sitecore provides a direct option to publish a specific version of an item, streamlining the process for content managers.


To publish a specific version in Content Editor, simply:

  1. Navigate to the desired item.
  2. Open the 'Versions' tab and select the specific version you wish to publish.
  3. Use the 'Publish' options available to push that version live.

For the Experience Editor (formerly the Page Editor), the process is a bit different. Follow these steps:

  1. Open the page in the Experience Editor.
  2. Select the version you need. You must also explicitly make all other versions of the item unavailable.