NASA incorporates advanced machine learning and artificial intelligence (AI) into projects across a broad array of teams and missions, including state-of-the-art large language models (LLMs) and generative AI. My internship at NASA's Goddard Space Flight Center involved developing software applications that combined LLMs with cutting-edge tooling, allowing the programs to ingest, process, and transform extremely large data corpora in a variety of formats into concise, readable information.
The main objective was to improve the quality and accuracy of answers from LLMs when prompted with questions about sizable PDF documents provided by the user. A secondary objective was to give the software the ability to transform structured data (e.g., Excel spreadsheets) into an interactive knowledge graph, which posed its own challenges stemming from the embeddings and irregular header formatting often found in NASA Excel files.
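The core of the structured-data step can be sketched in a few lines: normalize the messy spreadsheet headers, then flatten each row into subject–predicate–object triples that a graph store can load. This is a minimal, illustrative sketch, not the project's actual code; the function names (`normalize_header`, `rows_to_triples`) and the sample rows are invented for the example, and a real pipeline would read the workbook with a library such as openpyxl and write the triples to Neo4j.

```python
def normalize_header(header):
    """Collapse irregular spreadsheet headers (extra spaces, mixed
    case) into consistent snake_case keys."""
    return "_".join(header.strip().lower().split())

def rows_to_triples(rows, subject_col):
    """Turn each spreadsheet row (a dict of header -> cell value) into
    (subject, predicate, object) triples suitable for a graph store."""
    triples = []
    for row in rows:
        clean = {normalize_header(k): v for k, v in row.items()}
        subject = clean.pop(normalize_header(subject_col))
        for predicate, obj in clean.items():
            if obj not in (None, ""):   # skip empty cells
                triples.append((subject, predicate, obj))
    return triples

# Hypothetical rows, shaped the way a spreadsheet export might yield them.
rows = [
    {"Mission Name": "OSIRIS-REx", "Launch  Year": "2016", "Target": "Bennu"},
    {"Mission Name": "Lucy", "Launch  Year": "2021", "Target": ""},
]
print(rows_to_triples(rows, "Mission Name"))
```

Each triple maps naturally onto a Cypher `MERGE` statement, which is why the flattening step comes before any graph-database work.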
At the heart of these projects were widely used software languages and tools such as Python, JavaScript, Excel, and SQL. On top of these technologies sat more advanced components and frameworks such as LangChain, Neo4j, GraphQL, Chainlit, and numerous cloud computing and containerization tools. With this architecture, the programs leveraged advanced LLMs enhanced with speech-to-text/text-to-speech capabilities and LightRAG. LightRAG is a state-of-the-art, graph-based retrieval-augmented generation (RAG) system that enables LLMs to accurately answer both broad and highly specific questions about very large documents. The project also required transforming text from PDFs into LLM-ingestible data, which often meant implementing sophisticated optical character recognition (OCR) technologies.
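The retrieval idea behind any RAG system, LightRAG included, can be shown with a toy example: split the document into chunks, score each chunk against the question, and prepend the best chunks to the LLM prompt. The sketch below is a deliberately simplified stand-in, using keyword overlap where production systems (and LightRAG's graph-based approach) use embeddings and entity graphs; all names and the sample document are invented for illustration.

```python
import re
from collections import Counter

def chunk(text, size=40):
    """Split document text into fixed-size word chunks (real systems
    chunk by tokens, with overlap, but the idea is the same)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(question, passage):
    """Crude lexical relevance: count question-word occurrences in the
    passage. Production RAG replaces this with embedding similarity."""
    q_words = set(re.findall(r"\w+", question.lower()))
    p_words = Counter(re.findall(r"\w+", passage.lower()))
    return sum(p_words[w] for w in q_words)

def retrieve(question, document, k=1):
    """Return the k most relevant chunks to prepend to the LLM prompt."""
    chunks = chunk(document)
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]

# Hypothetical document text, as OCR might extract it from a PDF.
doc = ("The spacecraft launched in 2016. " * 10
       + "Its mission target was the asteroid Bennu. " * 10)
question = "What was the mission target?"
context = retrieve(question, doc)[0]
prompt = f"Context:\n{context}\n\nQuestion: {question}"
```

The payoff is that the LLM answers from retrieved context rather than from memory alone, which is what makes accurate question-answering over very large documents feasible.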