👋 Hey,
"A good data model is like a map, guiding you through the complexities of your data landscape."
- Kent Graziano, Chief Technical Evangelist at Snowflake
In a world inundated with information, a good data model is a beacon of clarity. Just as a map illuminates uncharted territory, a well-crafted data model acts as a compass, guiding us through the intricate landscape of data. It empowers us to navigate complexity, unlock insights, and pave the way for innovation. Harness its potential and embark on a transformative journey of knowledge.
Let's dive into this week's DataPro#48, which is jam-packed with valuable resources on building and implementing top-notch data models. Our main focus will be on "Data Modeling with Snowflake," skillfully presented by Serge Gershkovich. Additionally, we will explore the use of generative AI and modern data architecture to unlock insights. We'll also take a closer look at OpenAI's Function calling and other API updates, as well as the fascinating topic of bootstrapping data labels with GPT-4. Furthermore, we'll delve into the convergence of the Unadjusted Langevin Algorithm and discover techniques for measuring drift in ML embeddings.
But wait, there's more! We'll explore the limits of prompt engineering and share what people learned by building VoxelGPT. We'll also delve into MLOps Tempo, exploring how strategic goals can accelerate iteration. In addition, we'll cover column generation in linear programming and provide a comprehensive guide to mastering the art of machine learning workflows, featuring Transformer, Estimator, and Pipeline. Prepare yourself for an immersive learning experience!
Would you be interested in connecting with our DataPro Newsletter Editor-in-Chief for a user interview, where you can share your ideas and feedback? We value your input and would love to customize the content modules according to your preferences. To get started, simply fill in your email ID in the survey below, and we'll be in touch with you soon!
Be sure to join our feedback program! Complete a call and claim free credit as a reward. On top of that, those who complete the survey below will also receive a FREE Packt ebook, "The Applied Artificial Intelligence Workshop," in PDF format. Let's make the DataPro Newsletter even better together! Don't miss out!
Share your Feedback!
Cheers,
Merlyn Shelley
Editor-in-Chief, Packt
facebookresearch/galactic: Galactic is a framework for RL in robotic mobile manipulation, simulating large-scale indoor environments.
facebookresearch/audiocraft: Audiocraft is a PyTorch library for audio generation research, featuring MusicGen, a cutting-edge controllable text-to-music model.
HazyResearch/TART: TART is a versatile package for training and deploying task-agnostic reasoning modules, enhancing in-context learning for classification tasks with any model and domain.
mbzuai-oryx/Video-ChatGPT: Video-ChatGPT is a video conversation model using LLMs and a pretrained visual encoder for meaningful video discussions.
Lightning-AI/lit-llama: Lit-LLaMA is an independent implementation of LLaMA pretraining, finetuning, and inference code, fully open source under the Apache 2.0 license.
Bring SageMaker Autopilot into your MLOps processes using a custom SageMaker Project: This post demonstrates how to achieve a repeatable process using low-code tools like Amazon SageMaker Autopilot. The approach automates and standardizes ML pipelines using SageMaker Projects, Data Wrangler, Autopilot, Pipelines, and Studio. The solution enables AutoML tasks in a standardized repository structure, allowing data scientists to customize the workflow and generate a pipeline template for integration with a SageMaker project.
Reinventing the data experience: Use generative AI and modern data architecture to unlock insights: The post presents a scenario where a company implements a modern data architecture with multiple databases and APIs. It leverages generative AI and LLMs in Amazon SageMaker to enhance productivity and enable fact-based querying without knowledge of underlying data channels. The solution integrates various models, databases, and tools like Snowflake, offering an intelligent and unified approach to gain insights from multiple data stores. More details and code can be found in the GitHub repository.
AWS Inferentia2 builds on AWS Inferentia1 by delivering 4x higher throughput and 10x lower latency: This post highlights the advancements in the second generation of AWS Inferentia, a purpose-built accelerator for deep learning inference. AWS Inferentia2 offers improved performance, reduced costs, and optimized distributed inference for large-scale generative AI models, delivering higher throughput, lower latency, and up to 50% better performance/watt compared to other inference-optimized EC2 instances.
As the analytical requirements of a data-driven organization are notoriously complex and constantly evolving, modeling must keep pace and accompany data teams from idea to execution.
Before we continue, we need to formally delineate three distinct concepts often used together in the service of modeling to make it simpler to refer to a specific tool in the modeling toolkit. The three components are listed here:
Natural language semantics: Terminology employed in communicating details of a model between people. These are agreed-upon words that employ pre-defined conventions to encapsulate more complex concepts in simpler terms.
Technical semantics: SQL is a domain-specific language used to manage data in a Relational Database Management System (RDBMS). Unlike a general-purpose language (for example, YAML or Python), domain-specific languages have a much smaller application but offer much richer nuance and precision.
Visual semantics: Through their simplicity, images can convey a density of information that other forms of language simply cannot. In modeling, diagrams combine the domain-specific precision of SQL with the nuance of natural language.
In a data warehouse scenario, the PERSON and ACCOUNT tables would not be defined from scratch. They would be extracted from the source in which they exist and loaded, bringing both structure and data into the process. Then, the analytical transformations begin in answer to the organization’s business questions. This is a process known as Extract Transform Load (ETL). The business requirement for ACCOUNT_TYPES_AGE_ANALYSIS in this example purposely excludes the source key fields from the target table, preventing the possibility of establishing any relational links.
The logic could then be constructed by joining PERSON and ACCOUNT, as shown here:
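CREATE TABLE account_types_age_analysis AS
SELECT
    a.account_type,
    -- Round the year difference to the nearest ten to get an age bucket.
    ROUND(DATEDIFF(years, p.birth_date, CURRENT_DATE()), -1) AS age_decade,
    COUNT(a.account_id) AS total_accounts
FROM account AS a
INNER JOIN person AS p
    ON a.person_id = p.person_id
-- Group by the first two selected columns: account_type and age_decade.
GROUP BY 1, 2;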
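To try the query's logic without a Snowflake account, here is a minimal sketch in Python using the standard-library sqlite3 module. The sample rows and the fixed reference date are made up for illustration. SQLite has no DATEDIFF and does not round to negative precision, so the sketch assumes a year-number subtraction and a divide-round-multiply step as stand-ins for the Snowflake functions.

```python
import sqlite3

# Assumption: a fixed date stands in for CURRENT_DATE() so results are repeatable.
REFERENCE_DATE = "2023-06-15"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, birth_date TEXT);
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person (person_id),
        account_type TEXT
    );
    -- Made-up sample data.
    INSERT INTO person VALUES (1, '1990-03-01'), (2, '1991-11-20'), (3, '1961-07-04');
    INSERT INTO account VALUES
        (10, 1, 'CHECKING'), (11, 1, 'SAVINGS'),
        (12, 2, 'CHECKING'), (13, 3, 'SAVINGS');
""")

rows = conn.execute(
    """
    SELECT
        a.account_type,
        -- Year difference, rounded to the nearest ten.
        CAST(ROUND((CAST(strftime('%Y', ?) AS INTEGER)
                  - CAST(strftime('%Y', p.birth_date) AS INTEGER)) / 10.0)
             AS INTEGER) * 10 AS age_decade,
        COUNT(a.account_id) AS total_accounts
    FROM account AS a
    INNER JOIN person AS p
        ON a.person_id = p.person_id
    GROUP BY 1, 2
    ORDER BY 1, 2
    """,
    (REFERENCE_DATE,),
).fetchall()

for account_type, age_decade, total_accounts in rows:
    print(account_type, age_decade, total_accounts)
```

Because the age is rounded to the nearest ten rather than floored, someone aged 36 would land in the 40 bucket. The result also carries no person_id or account_id, which mirrors the point that the target table cannot be linked back to its sources.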
Paired with the SQL logic used to construct it, the lineage graph gives a complete picture of the transformational relationship between sources and targets in an analytical/warehousing scenario. Read more here...
This excerpt is taken from the recently published book titled "Data Modeling with Snowflake," written by Serge Gershkovich and published in May 2023. To get a preview of the book's content, be sure to read the whole chapter available here or sign up for a 7-day free trial to access the complete Packt digital library. To explore more, click on the button below.
Discover Fresh Concepts, Keep Reading!
OpenAI’s Function calling and other API updates: OpenAI has announced several API updates. The Chat Completions API gains a new function calling capability, and updated versions of gpt-4 and gpt-3.5-turbo are more steerable. A new 16k context version of gpt-3.5-turbo has been released, priced at $0.003 per 1K input tokens. The cost of OpenAI's state-of-the-art embeddings model drops by 75%, and the price of gpt-3.5-turbo input tokens drops by 25% to $0.0015 per 1K tokens, making the popular model more affordable for developers. A deprecation timeline for the gpt-3.5-turbo-0301 and gpt-4-0314 models has been announced. All models maintain data privacy and security guarantees.
What I Learned Pushing Prompt Engineering to the Limit: The author and his team developed VoxelGPT, an open-source application integrating LLMs with FiftyOne's computer vision query language. VoxelGPT allows users to search image and video datasets using natural language queries and also provides information about FiftyOne. The code is available on GitHub, and VoxelGPT can be accessed for free at gpt.fiftyone.ai. The team's approach involved leveraging various LLMs, utilizing new tools, and employing prompt engineering techniques.
Bootstrapping Labels with GPT-4: The blog post explores how GPT-4, a powerful language model developed by OpenAI, can be utilized to streamline data labeling tasks. By leveraging GPT-4's contextual understanding and text generation capabilities, users can bootstrap labels for various tasks, particularly focusing on sentiment classification. This approach can reduce both time and cost involved in the labeling process, providing a starting point for labelers and allowing them to focus on adding further value to the data.
On the Convergence of the Unadjusted Langevin Algorithm: The Langevin algorithm, also known as Langevin Monte Carlo, is a method for sampling from probability distributions. It plays a crucial role in machine learning techniques such as diffusion models in generative AI and differentially private learning, where it helps train models with privacy guarantees. In this blog post, the author derives a convergence analysis of the algorithm, focusing on the case where the target distribution is Gaussian.
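For readers who want to see the Gaussian case in action, here is a minimal sketch of the unadjusted Langevin update for a standard normal target, using only the Python standard library. It is not taken from the blog post. The step size, seed, and function names are assumptions chosen for illustration. For this target the update is x ← (1 − h)x + √(2h)·ξ, so the stationary variance v satisfies v = (1 − h)²v + 2h, giving v = 1 / (1 − h/2) for 0 < h < 2. That is slightly larger than the true variance of 1, which illustrates the bias that comes from skipping the accept/reject correction.

```python
import math
import random


def ula_gaussian(step_size, n_steps, seed=0):
    """Run the unadjusted Langevin update on the N(0, 1) target; return all iterates."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        grad = x  # gradient of f(x) = x^2 / 2, the negative log-density of N(0, 1)
        noise = rng.gauss(0.0, 1.0)
        x = x - step_size * grad + math.sqrt(2.0 * step_size) * noise
        samples.append(x)
    return samples


def stationary_variance(step_size):
    """Variance of the chain's stationary law, solved from v = (1 - h)^2 v + 2h."""
    return 1.0 / (1.0 - step_size / 2.0)


step = 0.1  # assumption: any step size in (0, 2) keeps this chain stable
samples = ula_gaussian(step, n_steps=200_000)
kept = samples[1_000:]  # drop early iterates as burn-in
mean = sum(kept) / len(kept)
empirical_variance = sum((s - mean) ** 2 for s in kept) / len(kept)

print(f"empirical variance:   {empirical_variance:.3f}")
print(f"predicted stationary: {stationary_variance(step):.3f}")
```

Shrinking the step size moves the predicted variance toward 1, at the cost of slower mixing.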
MLOps Tempo: How Do Strategic Goals Create Faster Iteration? MLOps tempo refers to the speed at which an organization plans, develops, and deploys machine learning models. A faster tempo enables organizations to leverage machine learning more quickly, gaining a significant competitive advantage. Strategic goals play a crucial role in guiding the timing and coordination of resources for creating an ML solution. It is essential to leverage specific resources, such as funding, shared tools, specialists, and sponsors, at the appropriate time to avoid detracting from other projects.
How to Measure Drift in ML Embeddings: Detecting prediction and data drift can provide early warnings for machine learning models. Various methods exist for structured data, such as tracking descriptive statistics or using statistical tests. Unstructured data in NLP or LLM-powered applications, however, is typically represented numerically as embeddings. This article introduces five methods for embedding monitoring, including Euclidean and cosine distance, Maximum Mean Discrepancy, model-based drift detection, and numerical drift detection. These methods are implemented in the open-source Evidently Python library.
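As a rough illustration of the distance-based methods, the sketch below compares the average embedding of a reference batch with that of a current batch using Euclidean and cosine distance. It uses plain Python rather than the Evidently API, and the helper names and toy vectors are assumptions made for this example. Averaging each batch before comparing is one simple way to apply these distances, not necessarily the article's exact procedure.

```python
import math


def mean_vector(embeddings):
    """Average equal-length embedding vectors component by component."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy 2-D embeddings, made up for illustration.
reference = [[1.0, 0.0], [1.0, 0.0]]
current = [[0.0, 1.0], [0.0, 1.0]]

ref_mean = mean_vector(reference)
cur_mean = mean_vector(current)
euclid = euclidean_distance(ref_mean, cur_mean)
cosine = cosine_distance(ref_mean, cur_mean)

print(f"Euclidean distance: {euclid:.3f}")
print(f"Cosine distance:    {cosine:.3f}")
```

In practice, a drift alert would compare these distances against a threshold tuned on batches known to be stable.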
Column Generation in Linear Programming and the Cutting Stock Problem: The article explores the application of delayed column generation as a solution for linear programming problems with a large number of decision variables. It specifically focuses on the one-dimensional cutting stock problem, demonstrating the theoretical aspects of column generation and providing a Python implementation using the SciPy library. The complete code can be accessed in a corresponding GitHub repository.
Mastering the Art of Machine Learning Workflows: A Comprehensive Guide to Transformer, Estimator, and Pipeline: This comprehensive guide aims to simplify the process of data wrangling, feature transformation, and model training by introducing the practice of using Pipelines with Estimators and Transformers. By adopting this technique, developers can achieve elegant and efficient code while seamlessly integrating multiple steps in their data preprocessing and modeling journey. Pipelines allow the chaining of various transformers and estimators, ensuring a clear and automated flow from data preprocessing to model training and evaluation.
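To make the pattern concrete, here is a toy sketch of a transformer, an estimator, and a pipeline that chains them. The three classes are hypothetical and written from scratch for illustration. They loosely follow the common fit/transform/predict convention and are not the API of the library the guide covers.

```python
class ToyScaler:
    """Transformer: learns mean and standard deviation in fit(), applies them in transform()."""

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        self.std = (sum((v - self.mean) ** 2 for v in values) / n) ** 0.5
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


class ToyThresholdClassifier:
    """Estimator: learns a decision threshold in fit(), labels new data in predict()."""

    def fit(self, values, labels):
        positives = [v for v, y in zip(values, labels) if y == 1]
        negatives = [v for v, y in zip(values, labels) if y == 0]
        # Place the threshold midway between the two class means.
        pos_mean = sum(positives) / len(positives)
        neg_mean = sum(negatives) / len(negatives)
        self.threshold = (pos_mean + neg_mean) / 2
        return self

    def predict(self, values):
        return [1 if v > self.threshold else 0 for v in values]


class ToyPipeline:
    """Chains transformers and a final estimator behind a single fit/predict interface."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, values, labels):
        # Fit each transformer on the output of the previous one.
        for step in self.transformers:
            values = step.fit(values).transform(values)
        self.estimator.fit(values, labels)
        return self

    def predict(self, values):
        # Reuse the parameters learned during fit(); never refit on new data.
        for step in self.transformers:
            values = step.transform(values)
        return self.estimator.predict(values)


# Made-up one-feature training data.
train_x = [1.0, 2.0, 8.0, 9.0]
train_y = [0, 0, 1, 1]

model = ToyPipeline([ToyScaler()], ToyThresholdClassifier()).fit(train_x, train_y)
predictions = model.predict([0.0, 10.0])
print(predictions)
```

The benefit of the pattern is that preprocessing parameters learned on training data are applied unchanged at prediction time, which avoids accidentally refitting a scaler on new data.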
See you next time!