Family, friends, and even colleagues from other departments often picture geoscientists with a rock in hand, sometimes in the office but mostly out in the field, a sort of Dr. Jones holding an ancient treasure. However, the reality is that a significant part of our work takes place among books, folders, diagrams, and, at best, conferences. Our role is to prepare and iterate on studies and reports for decision-makers, gathering as much information as possible from available knowledge sources, adding context, and drawing out key insights. One of the most challenging tasks in this process is understanding the reliability of the information found, establishing cause-effect relationships between a large number of project variables, and properly referencing each source of information. You may have learned a critical fact from a book years ago or during an overseas event, and now you must understand its value and document the implications of that information for the uncertainties of a new project, because decisions need to be supported by the best evidence, with understood risks and uncertainties. The conclusions in subsurface characterization projects are never black and white: there are all shades of grey in each parameter.

Preserving the Historic Knowledge from Paper Sources

Geologic knowledge has been accumulated over the last few hundred years in the form of collections of geologic samples and their descriptions. While multiple models of our understanding of the planet and the universe have been proposed, the original information and the history of how human analysis evolved are preserved in books, articles, hand drawings, photographs, maps, cross-sections, and videos. Anyone working with scientific publications, technical reports, public records, or bibliographic materials is aware of the challenges of handling physical documents. The first hurdle is accessing the document when it’s available only as part of a book or magazine. You must be able to physically reach the location during specific hours, have authorization to view the material, and hope an available copy exists. Once access is granted, searchability becomes the next challenge. The document's length, the language it's written in, and the presence or absence of an index of titles or authors all come into play. If you're fortunate, you’ll navigate these obstacles and locate the information you need. After that, you must capture the data, whether through a photocopy, handwritten notes, or even hand-drawn maps or diagrams. And this is just the start: if you own the original document or even a copy, you'll also face concerns about storage, preservation, access restrictions, and more.

A Tedious and Time-Consuming Task

Providing well-documented references for every piece of information in a report is a time-consuming and often tedious task. Geoscientists gather data from a wide range of sources, including public and private databases, conference proceedings, and industry-standard test reports, each produced at a specific time, using the technology available at that moment, and following either current or now outdated procedures. The credibility of a source, the soundness of the methodology, and the precision and accuracy of the instruments used in the research all contribute to the quality of a reference. Therefore, it’s not just about finding a document or the information within it, but also about evaluating its relevance to the study and determining how much trust can be placed in it. Additionally, valuable insights are sometimes lost, either because key information wasn't captured or because its relevance was misinterpreted.

Digitization is Just the First Step

Digitization, the conversion of physical documents into digital format, has been the primary and, until recently, the only solution to this challenge. It began with microfilm in the 20th century, which served as simple photographic reproductions of documents. Later, we transitioned from those images to digital formats. More recently, tools like OCR (Optical Character Recognition) have enabled us to convert these images into machine-readable text, with varying degrees of success. While it may seem like a simple step, digitization has revolutionized the way we handle information. It allows users to access data electronically via remote databases or services like email, create copies with a single click, and easily share, store, and preserve large volumes of information at a fraction of the cost, saving on space, security, and maintenance of the physical records. However, with the increased ease of access to information, the demand for proper referencing in studies also grew. This is where indexing became crucial, with multiple variables to be captured to describe the geography of interest, the discipline, and the experience of the author, to name a few. The challenge of handling vast amounts of information is knowing how to find what you need when you need it, and how to find it again. To address this, search tools were developed, leveraging indexation. Key details such as authors, publication year, publisher, and title, but also additional data like technological domains or geographic references, were linked to each document, and as a result, larger and larger databases were built.
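To make the idea of indexation concrete, here is a minimal sketch in Python of how metadata fields can be linked to documents and then searched. It is an illustration only, not the tooling used in any U3 project: the names `DocumentRecord`, `build_index`, and `search`, and all sample records, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRecord:
    """Bibliographic and domain metadata attached to one digitized document."""
    doc_id: str
    title: str
    authors: tuple
    year: int
    domains: tuple = ()
    locations: tuple = ()


def build_index(records):
    """Map each (field, value) pair to the set of document ids that carry it."""
    index = defaultdict(set)
    for rec in records:
        index[("year", rec.year)].add(rec.doc_id)
        for author in rec.authors:
            index[("author", author.lower())].add(rec.doc_id)
        for domain in rec.domains:
            index[("domain", domain.lower())].add(rec.doc_id)
        for location in rec.locations:
            index[("location", location.lower())].add(rec.doc_id)
    return index


def search(index, **criteria):
    """Return the sorted ids of documents matching every criterion (logical AND)."""
    result = None
    for field_name, value in criteria.items():
        key = (field_name, value.lower() if isinstance(value, str) else value)
        matches = index.get(key, set())
        result = matches if result is None else result & matches
    return sorted(result or [])


# Invented sample records, for illustration only.
records = [
    DocumentRecord("doc-001", "Hypothetical basin report", ("A. Author",), 1987,
                   ("stratigraphy",), ("Venezuela", "Maracaibo Basin")),
    DocumentRecord("doc-002", "Hypothetical field study", ("B. Writer",), 1994,
                   ("hydrology",), ("Venezuela",)),
    DocumentRecord("doc-003", "Hypothetical coastal survey", ("A. Author",), 1994,
                   ("stratigraphy",), ("Somalia",)),
]
index = build_index(records)

print(search(index, location="Venezuela"))                         # ['doc-001', 'doc-002']
print(search(index, location="venezuela", domain="stratigraphy"))  # ['doc-001']
```

An index like this answers "which documents carry this tag?" quickly, but it records only that a tag is present, not why the document mentions it.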

Indexation Without Context is Not Enough

Despite these advances, context was often missing, and cross-referencing related documents remained challenging. As Information Technology progressed, algorithms were developed to address these issues, enhancing search and retrieval capabilities. However, it still wasn't enough.

In geoscience, when reviewing literature, a single document often contains multiple geographical references. Sometimes, these refer to the same location at different scales, such as a country, a basin, and an oil field, or a continent, a mountain range, a valley, and a river. Other times, the references are to entirely different locations for comparison purposes. A similar challenge exists across various technical domains, such as paleogeography, hydrology, or mineralogy. Valuable insights and conclusions often come from lateral domains. In this context, simply counting how many times a word appears in a text does not adequately capture its significance or relevance.
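The limitation of plain word counts can be shown with a small sketch. It assumes a hypothetical gazetteer, `PARENT`, that links each place to the larger-scale place containing it; the place names are placeholders rather than real geography. Raw counts treat every name as unrelated, while rolling mentions up the hierarchy shows that a document about an oil field is also about its basin and its country.

```python
from collections import Counter

# Hypothetical gazetteer: each place points to its parent at the next larger scale.
PARENT = {
    "field x": "basin a",
    "basin a": "country p",
    "basin b": "country q",
}


def ancestors(place):
    """Return the chain of larger-scale places that contain `place`."""
    chain = []
    while place in PARENT:
        place = PARENT[place]
        chain.append(place)
    return chain


def raw_counts(mentions):
    """Plain term frequency: every name is counted in isolation."""
    return Counter(mentions)


def rolled_up_counts(mentions):
    """Credit each mention to the place itself and to every place containing it."""
    counts = Counter()
    for place in mentions:
        counts[place] += 1
        for parent in ancestors(place):
            counts[parent] += 1
    return counts


# Place names extracted from one imaginary document.
mentions = ["field x", "field x", "basin a", "basin b"]

print(raw_counts(mentions)["country p"])        # 0: the country is never named
print(rolled_up_counts(mentions)["country p"])  # 3: two field mentions plus one basin mention
```

Rolling up handles scale, but it still cannot tell whether "basin b" is the study area or only a comparison case; that distinction needs context from the surrounding text.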

Let’s pause here, and we’ll revisit this point shortly.

The Best Interface for the Greatest Experience.

A key point often overlooked when modernizing legacy tasks is not just how we approach the work but also the interfaces we use to perform it. Despite all the advancements discussed earlier, we still tend to envision the 2024 geoscientist as someone searching through documents on a desktop computer, equipped with powerful search tools. While the process has become less manual, many still find it difficult to imagine doing things differently.

This prompted us, during the conceptualization of U3 Explore, particularly for the U3 Venezuela project, to rethink how we access information (Figure 1). We believe that the most common and intuitive interface for geoscientists is a map. The ideal scenario is to access all relevant information for a specific location, fully indexed and contextualized, at the click of a button. For this reason, we chose to use a GIS (Geographical Information System) platform as our default interface, enabling geoscientists to work directly within a map-based environment. This approach has already proven successful in the U3 Exploration Project along the Somali Coast (reference).
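As a rough illustration of what "information at the click of a button" implies on the data side, the sketch below links each document to one or more geographic footprints and returns the documents whose footprint contains the clicked point, most local first. It is an assumption-laden simplification: footprints are plain bounding boxes in degrees rather than real GIS geometries, and `Footprint`, `documents_at`, and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """A rectangular area (in degrees) that a map or figure in a document covers."""
    doc_id: str
    label: str
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon, lat):
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)

    def area(self):
        # Area in square degrees: a crude proxy for scale, good enough for ranking.
        return (self.max_lon - self.min_lon) * (self.max_lat - self.min_lat)


def documents_at(footprints, lon, lat):
    """Return ids of documents covering the point, smallest footprint first."""
    hits = sorted((fp for fp in footprints if fp.contains(lon, lat)),
                  key=Footprint.area)
    seen, ordered = set(), []
    for fp in hits:
        if fp.doc_id not in seen:
            seen.add(fp.doc_id)
            ordered.append(fp.doc_id)
    return ordered


# Invented footprints; one document can be linked to several locations.
footprints = [
    Footprint("doc-001", "regional map", 0.0, 0.0, 10.0, 10.0),
    Footprint("doc-002", "field cross-section", 2.0, 2.0, 3.0, 3.0),
    Footprint("doc-001", "inset map", 2.0, 2.0, 4.0, 4.0),
    Footprint("doc-003", "comparison area", 20.0, 20.0, 25.0, 25.0),
]

print(documents_at(footprints, 2.5, 2.5))  # ['doc-002', 'doc-001']
print(documents_at(footprints, 8.0, 8.0))  # ['doc-001']
```

Ranking by footprint size is one possible design choice: it puts field-scale material ahead of regional overviews for the same click.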

Figure 1. Geoscience documents typically have multiple maps, cross sections, charts, and data tables that are not easily recognized and captured by common character recognition tools.

LLMs, RAG, and Graph RAG. The Secret Sauce.

In 2022, on the cusp of the Large Language Model (LLM) era, U3 initiated a pilot project that leveraged various AI tools to gather context and assess the relevance of old conference proceedings, drawing not just on the text but also on the images within each document. This approach allowed documents and images to be linked to multiple locations, creating various entry points for our indexed search.

As LLMs became publicly available, they brought the ability to extract context, insights, and summaries from documents, and to query individual documents or sets of documents within that context. As we fine-tuned the model, the technology advanced from using prompts or tokens to solutions like Retrieval Augmented Generation (RAG), where LLMs are fed information retrieved from a vectorized database built on insights rather than just keywords. More recently, Graph RAG has emerged, allowing the connections identified during the pilot phase to be represented as weighted relationships within a Knowledge Graph. This step is the most critical for successfully identifying trustworthy information for our business decisions.
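The retrieval half of RAG can be sketched in a few lines. This is a toy example under stated assumptions: in practice the vectors would come from an embedding model, whereas here they are hand-written three-dimensional placeholders, and `retrieve`, `build_prompt`, and the passages are hypothetical. The final call to an LLM is deliberately left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k stored passages whose vectors are closest to the query."""
    ranked = sorted(store,
                    key=lambda item: cosine(query_vec, item["vector"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Assemble the text handed to the LLM: retrieved sources first, then the question."""
    context = "\n".join(f"[{p['doc_id']}] {p['text']}" for p in passages)
    return ("Answer using only the sources below and cite their ids.\n"
            f"{context}\nQuestion: {question}")


# Toy vector store; the vectors are invented, not produced by a real embedding model.
store = [
    {"doc_id": "doc-001", "text": "Placeholder passage about reservoir porosity.",
     "vector": [0.9, 0.1, 0.0]},
    {"doc_id": "doc-002", "text": "Placeholder passage about river hydrology.",
     "vector": [0.1, 0.9, 0.0]},
    {"doc_id": "doc-003", "text": "Placeholder passage about reservoir permeability.",
     "vector": [0.8, 0.2, 0.1]},
]

query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded question
top = retrieve(query_vec, store, k=2)
prompt = build_prompt("Which sources discuss reservoir quality?", top)
print([p["doc_id"] for p in top])  # ['doc-001', 'doc-003']
```

Because each retrieved passage keeps its document id, the answer can be traced back to its sources, which is the referencing requirement discussed earlier.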

While this may sound straightforward, the 'secret sauce' lies in one key concept: the pre-existing contextual knowledge needed to link seemingly unrelated project streams into an integrated story with a conclusive outcome. The expertise of Subject Matter Experts (SMEs) and project integrators plays a crucial role. Their supervision of the process shapes the initial information selection, validates algorithm outputs and parameters, and adjusts the weights assigned to connectors in the Knowledge Graphs. With every new input, the process has to be repeated and the models readjusted, just as in human life, where we learn from experience.
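One way to picture the SME's role is as a review loop over weighted links. The sketch below is a hypothetical simplification, not the actual U3 implementation: the `KnowledgeGraph` class, the fixed adjustment step, and the 0-to-1 weight range are all assumptions made for illustration.

```python
class KnowledgeGraph:
    """Directed links between entities, each with a trust weight between 0 and 1."""

    def __init__(self):
        self.edges = {}  # (source, target) -> weight

    @staticmethod
    def _clamp(value):
        return max(0.0, min(1.0, value))

    def link(self, source, target, weight):
        """Add or overwrite a link, e.g. one proposed by an algorithm."""
        self.edges[(source, target)] = self._clamp(weight)

    def sme_review(self, source, target, approved, step=0.25):
        """Raise the weight when an expert confirms the link, lower it when rejected."""
        delta = step if approved else -step
        self.edges[(source, target)] = self._clamp(self.edges[(source, target)] + delta)

    def neighbours(self, source, min_weight=0.5):
        """Return (target, weight) pairs trusted enough to use, strongest first."""
        found = [(tgt, w) for (src, tgt), w in self.edges.items()
                 if src == source and w >= min_weight]
        return sorted(found, key=lambda pair: pair[1], reverse=True)


graph = KnowledgeGraph()
# Links proposed automatically, both starting at a neutral weight.
graph.link("doc-001", "basin a", 0.5)
graph.link("doc-001", "basin b", 0.5)

# An expert confirms the first link and rejects the second.
graph.sme_review("doc-001", "basin a", approved=True)
graph.sme_review("doc-001", "basin b", approved=False)

print(graph.neighbours("doc-001"))  # [('basin a', 0.75)]
```

Each new batch of documents would pass through the same review, so the weights keep shifting as the experts learn more.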

Conclusions

Contextual relationships in the analytical tools of U3D link different scales in subsurface evaluations, from basin scale to play scale to the selection of production or injection sites. Generative AI (GAI) tools can help us with the tedious and time-consuming tasks of sorting through large volumes of information and validating our findings. More than ever, experts are needed to validate the documents produced by GAI and their conclusions.

While ChatGPT suggestions for highly regulated legal or financial templates and reports have binary outputs, subsurface studies are different: the conclusions of a multidisciplinary geoscience analysis depend on many interrelated variables, and reaching them is an iterative process. The original subsurface model of a discovery never remains the same; it changes as more production information arrives and as new wells are drilled in the asset, in the basin, or in a similar play elsewhere in the world.

Historic knowledge of a basin can be preserved and made accessible in integrated studies through the proper use of Generative AI tools, with Large Language Models and Knowledge Graphs designed and managed with SME input.