AI-Powered Data Science Workflows in Snowflake with the Posit Team Native App

Transcript

This transcript was generated automatically and may contain errors.

Hello everyone, and welcome to today's Workflows with Posit Team demo. My name is Ashley Bynum, and I'm a Solutions Architect at Posit. Today I'm excited to walk you through a real-world financial services workflow that demonstrates how data science teams can build, deploy, and monitor production machine learning systems using the Posit ecosystem integrated with Snowflake.

In this session, you'll see how to: first, stream real-time transaction data from Snowflake into a fraud detection pipeline; second, develop machine learning models using Python with XGBoost for fraud prediction; third, use Positron's AI-powered features, Positron Assistant and Databot, to accelerate development and data exploration; fourth, automatically score incoming transactions in real time; fifth, build interactive dashboards with Shiny for Python; and finally, deploy everything to Posit Connect for production monitoring.

This demo features AmeriFirst Banking, a fictional financial institution that processes thousands of transactions daily and needs to detect fraud in real time. We'll have a live Q&A at the end of this session, so feel free to drop questions in the chat as we go, and I'll address them at the end.

Before we dive in, I want to mention that all of the code from today's demo is available on GitHub. You can follow along or clone it later at github.com/posit-dev/workflows.

are statistically significant and may indicate fraudulent activity as legitimate customers rarely transact during these hours.

AI-powered workflow benefits

What you saw demonstrates how Positron transforms the data science workflow. The traditional approach would be to write code manually, debug errors by searching Stack Overflow, create visualizations one at a time, manually calculate statistics, and document findings separately. This takes hours.

With Posit AI, boilerplate code is written instantly and data is explored interactively with natural language. Visualizations are generated automatically. Insights are explained in plain English, and all work stays in one environment. The time savings are significant. Feature engineering that used to take hours now takes minutes. Data exploration that took hours now takes minutes. Debugging that took hours now takes minutes. The result is that data scientists spend more time on high-value work, such as model design and business insights, and less time on repetitive coding tasks.

Simulating real-time transaction data

Now let's look at how we're simulating real-time transaction data. I'm going to open stream simulator.py from our simulation folder. This script generates realistic banking transactions. First, it defines realistic transaction types: debit, credit, transfer, withdrawal, deposit, and payment. Second, it defines the channels where transactions occur: online banking, mobile app, ATM, branch, and point of sale. Third, it includes a list of merchants with their merchant category codes, like Amazon with MCC 5411 for grocery stores or Shell Gas with MCC 5541 for gas stations. And fourth, it generates realistic transaction amounts using a log-normal distribution. Most transactions are small, under $100, but occasionally we'll get large transactions of several thousand dollars.
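To make the simulator's structure concrete, here is a minimal sketch of a generator along the lines described. It is not the demo's actual script: the field names, the merchant names, and the log-normal parameters (mu=3.5, sigma=1.2) are assumptions chosen so that most amounts fall under $100 with an occasional value in the thousands.

```python
import random
from datetime import datetime, timezone

# Values below follow the walkthrough; exact strings and parameters are assumptions.
TRANSACTION_TYPES = ["debit", "credit", "transfer", "withdrawal", "deposit", "payment"]
CHANNELS = ["online_banking", "mobile_app", "atm", "branch", "point_of_sale"]
# (merchant name, merchant category code): 5411 = grocery stores, 5541 = gas stations.
MERCHANTS = [("Example Grocer", "5411"), ("Example Gas", "5541")]


def generate_transaction(rng: random.Random) -> dict:
    """Build one synthetic transaction. No fraud score is attached on purpose."""
    merchant_name, mcc = rng.choice(MERCHANTS)
    # Log-normal amounts: mu=3.5 puts the median near $33; sigma=1.2 leaves a long
    # right tail, so a small share of amounts reach the thousands.
    amount = round(rng.lognormvariate(3.5, 1.2), 2)
    return {
        "transaction_id": f"TXN-{rng.getrandbits(48):012x}",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "transaction_type": rng.choice(TRANSACTION_TYPES),
        "channel": rng.choice(CHANNELS),
        "merchant_name": merchant_name,
        "merchant_category_code": mcc,
        "amount": amount,
    }


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):  # the demo inserts three transactions per batch
        print(generate_transaction(rng))
```

In the demo the generated rows are inserted into Snowflake; this sketch only prints them, so it can be run without a database connection.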

Now here's something important. Notice what we're not including in this simulator: fraud scores. This is intentional. In a real system, raw transactions arrive without fraud indicators. The machine learning model scores them separately. This separation of concerns is critical for production systems. If we generated fraud scores here, we'd be mixing data ingestion with fraud detection, which makes the system harder to test and maintain.

Running the simulator, you can see that it's configured to insert three transactions every few seconds into our Snowflake transaction staging table. Look at the output. Each line shows the timestamp when the transaction was inserted, the transaction ID, the transaction amount in dollars, the channel, like mobile app or online banking, and the transaction type, like debit or transfer. Here's an international transaction. These are flagged because international transactions have a higher fraud risk. This simulator will keep running in the background, continuously inserting transactions into Snowflake. In a real system, this would be replaced by actual transaction data from your banking systems.

Training the fraud detection model

I'll leave this running and move on to show you how we train the fraud detection model. Looking at our fraud detection model, this script demonstrates a complete machine learning workflow. I'll walk you through the key components.

First, let's look at feature engineering. I'll scroll to the engineer fraud features function. We're creating features that capture fraud patterns. These features fall into three categories. First, time-based features. We extract the hour of day because late-night transactions between midnight and 5 a.m. are riskier. We also flag weekend transactions because fraud patterns differ on weekends versus weekdays.

Second, amount-based features. We flag high-value transactions over $5,000. We calculate transaction velocity: how many transactions has this customer made in the last 24 hours? And we can calculate rolling sums of transaction amounts over time windows. Third, behavioral features. We calculate z-scores that compare the current transaction amount to the customer's typical behavior. If someone who normally spends $50 suddenly makes a $5,000 purchase, that's a high z-score and potentially fraudulent.
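The three feature categories can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the demo's function: the name `engineer_features`, the dictionary keys, and the use of the population standard deviation for the z-score are all hypothetical choices.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def engineer_features(txn: dict, history: list) -> dict:
    """Derive fraud features for one transaction.

    txn: {"timestamp": datetime, "amount": float}
    history: earlier transactions for the same customer, same shape as txn.
    """
    ts = txn["timestamp"]
    window_start = ts - timedelta(hours=24)
    # Transactions by this customer in the 24 hours before the current one.
    recent = [h for h in history if window_start <= h["timestamp"] < ts]
    past_amounts = [h["amount"] for h in history if h["timestamp"] < ts]

    # z-score of the current amount against the customer's past amounts;
    # falls back to 0.0 when there is too little history to estimate spread.
    spread = pstdev(past_amounts) if len(past_amounts) >= 2 else 0.0
    zscore = (txn["amount"] - mean(past_amounts)) / spread if spread > 0 else 0.0

    return {
        # Time-based features
        "hour_of_day": ts.hour,
        "is_late_night": int(ts.hour < 5),      # midnight up to 5 a.m.
        "is_weekend": int(ts.weekday() >= 5),   # Saturday = 5, Sunday = 6
        # Amount-based features
        "is_high_value": int(txn["amount"] > 5000),
        "txn_count_24h": len(recent),
        "amount_sum_24h": sum(h["amount"] for h in recent),
        # Behavioral feature
        "amount_zscore": zscore,
    }
```

For a customer whose past amounts are around $50, a $6,000 transaction at 2 a.m. on a Saturday sets every flag and produces a very large z-score, which is the pattern the walkthrough describes.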

We also encode categorical variables. Transaction types are encoded as numbers. Debit is zero, credit is one, transfer is two, and so on. We do the same for channels. Online banking is zero, mobile app one, ATM two, branch three, point of sale four. We also include customer-level features. In a production system, we'd join this with the customer table to get credit scores, account tenure, and risk ratings. For this demo, we're using simplified default values, but the structure is there for real customer data.
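As a small illustration of the encoding step, the mappings can be written as lookup tables. The codes for debit, credit, transfer, and all five channels come from the walkthrough; the codes for withdrawal, deposit, and payment, the key spellings, and the -1 fallback for unseen categories are assumptions.

```python
# Codes 0-2 for transaction types and 0-4 for channels are stated in the walkthrough.
# Codes 3-5 are assumed to continue in the order the types were listed.
TRANSACTION_TYPE_CODES = {
    "debit": 0, "credit": 1, "transfer": 2,
    "withdrawal": 3, "deposit": 4, "payment": 5,
}
CHANNEL_CODES = {
    "online_banking": 0, "mobile_app": 1, "atm": 2, "branch": 3, "point_of_sale": 4,
}
UNKNOWN_CODE = -1  # hypothetical fallback for categories not seen in training


def encode_categoricals(txn: dict) -> dict:
    """Map the string categories of one transaction to integer codes."""
    return {
        "transaction_type_code": TRANSACTION_TYPE_CODES.get(
            txn["transaction_type"].lower(), UNKNOWN_CODE
        ),
        "channel_code": CHANNEL_CODES.get(txn["channel"].lower(), UNKNOWN_CODE),
    }
```

Integer codes like these impose an arbitrary order on the categories. Tree-based models such as XGBoost can usually split around that, which is one reason this simple scheme is common in practice.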

Looking at the model training section, we're using XGBoost, which is an excellent choice for fraud detection. After training, we evaluate the model on a held-out test set. We calculate several metrics. First, accuracy. What percentage of predictions are correct? Second, precision. Of the transactions we flagged as fraudulent, how many are actually fraud? Third, recall. Of all the actual fraud transactions, how many did we catch? High recall means we're not missing fraud. And fourth, the F1 score, the harmonic mean of precision and recall. This gives us a single number to optimize.
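The four metrics follow directly from the confusion-matrix counts. The sketch below computes them by hand so the definitions are visible; the function name is hypothetical, and the demo script may well use a library routine instead.

```python
def classification_metrics(y_true: list, y_pred: list) -> dict:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, passed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Because fraud is rare, accuracy alone can look strong even for a model that flags nothing, which is why precision, recall, and F1 are reported alongside it.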

Let me run the script so I can show you the model training in action. The script then connects to Snowflake, loads the training data, engineers features, and trains the model. You can see it's processing the data. Feature engineering is complete. Now it's training the XGBoost model. And it's done. The model trained in just a few seconds. In production with millions of transactions, this might take a few minutes longer, but it's still very fast.

With the fraud detection metrics in hand, we save the model to our models artifact. In production, you would schedule this training script to run weekly or monthly on Posit Connect. This allows you to retrain the model as fraud patterns evolve and new data becomes available.

With the model trained, we can now identify which new transactions need to be scored. We use a left join between the transaction staging table and the fraud scores table. The WHERE clause keeps only the transactions where the fraud scores key is null, meaning they haven't been scored yet. This ensures that we score each transaction once. We're not wasting compute resources rescoring transactions that already have predictions. We'll then write the predictions back to Snowflake. So what happens is that raw transactions come in, get scored, and predictions are written back to Snowflake where they can be queried by dashboards and other applications.
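The "left join, then keep the null keys" pattern is often called an anti-join. The sketch below demonstrates it against an in-memory SQLite database so it can run anywhere; the demo itself runs against Snowflake. The table and column names (`transaction_staging`, `ml_fraud_scores`, `transaction_id`, `fraud_probability`) are assumptions, not the demo's schema.

```python
import sqlite3

# Anti-join: staging rows with no matching row in the scores table.
UNSCORED_SQL = """
SELECT t.transaction_id, t.amount
FROM transaction_staging AS t
LEFT JOIN ml_fraud_scores AS s
  ON s.transaction_id = t.transaction_id
WHERE s.transaction_id IS NULL
ORDER BY t.transaction_id
"""


def make_demo_db() -> sqlite3.Connection:
    """Create two small tables: three transactions, one of them already scored."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE transaction_staging (transaction_id TEXT PRIMARY KEY, amount REAL);
        CREATE TABLE ml_fraud_scores (transaction_id TEXT PRIMARY KEY, fraud_probability REAL);
        INSERT INTO transaction_staging VALUES ('T1', 20.0), ('T2', 75.0), ('T3', 9000.0);
        INSERT INTO ml_fraud_scores VALUES ('T1', 0.01);
        """
    )
    return conn


def fetch_unscored(conn: sqlite3.Connection) -> list:
    """Return (transaction_id, amount) for transactions without a score yet."""
    return conn.execute(UNSCORED_SQL).fetchall()


def write_scores(conn: sqlite3.Connection, scored: list) -> None:
    """Write (transaction_id, fraud_probability) pairs back to the scores table."""
    conn.executemany(
        "INSERT INTO ml_fraud_scores (transaction_id, fraud_probability) VALUES (?, ?)",
        scored,
    )
    conn.commit()
```

After `write_scores` runs for the rows returned by `fetch_unscored`, a second call to `fetch_unscored` returns nothing, which is what keeps each transaction from being scored twice.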

The fraud detection dashboard

Now we'll see how business users interact with this fraud detection system. First, I'll open up my app.py file from my dashboards folder. This is my Shiny for Python application that provides real-time monitoring for fraud analysts and executives. We'll show some key features of the code before we run it. Let's look at how the dashboard connects to Snowflake. This function queries Snowflake to get recent transactions. Notice we're joining the transaction staging table with the ML fraud scores table using a left join. This means we get all transactions whether they've been scored yet or not. Transactions that haven't been scored will have a null value for fraud probability and prediction. This dashboard is configured to auto-refresh every few seconds, so as new transactions are scored, they automatically appear in the dashboard.

Users can filter the data by several dimensions. They can select a time range: last hour, last 24 hours, last 7 days, or last 30 days. They can filter by channel and transaction type, and they can set a fraud risk threshold, showing only transactions above a certain fraud probability. All of these filters are reactive, meaning when you change a filter, all of the charts and tables update instantly.
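The filtering logic behind those controls can be sketched as one plain function. This is a framework-free illustration, not the dashboard's code: the function name, the row keys, and the choice to hide unscored rows once a threshold is set are assumptions. In a Shiny for Python app, a function like this would typically be called from a reactive calculation so that outputs recompute when an input changes.

```python
from datetime import datetime, timedelta

# Time-range choices named in the walkthrough.
TIME_RANGES = {
    "Last hour": timedelta(hours=1),
    "Last 24 hours": timedelta(hours=24),
    "Last 7 days": timedelta(days=7),
    "Last 30 days": timedelta(days=30),
}


def filter_transactions(rows, now, time_range, channel=None,
                        transaction_type=None, min_fraud_probability=0.0):
    """Apply the dashboard-style filters to a list of transaction dicts."""
    cutoff = now - TIME_RANGES[time_range]
    kept = []
    for row in rows:
        if row["timestamp"] < cutoff:
            continue
        if channel is not None and row["channel"] != channel:
            continue
        if transaction_type is not None and row["transaction_type"] != transaction_type:
            continue
        prob = row.get("fraud_probability")  # None means not scored yet
        # With a threshold set, unscored rows are hidden along with low-risk ones.
        if min_fraud_probability > 0 and (prob is None or prob < min_fraud_probability):
            continue
        kept.append(row)
    return kept
```

Keeping the filter as a pure function of its inputs makes it easy to test on its own, separate from the user interface.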

Going to the dashboard, we have this now deployed on Connect, and looking at the live interface, we'll be able to see the real-time monitoring tab. Let me walk you through what we're seeing. At the top, we have the key performance indicators: total transactions and the average transaction amount. The chart also shows transaction volumes over time, so you can see the pattern of transactions flowing in.

Let me switch to the fraud detection tab. This is where fraud analysts spend most of their time. At the top, we see fraud-specific metrics. We've scored 3,215 transactions. The model has predicted 47 as potentially fraudulent. That's about 1.5%, which is typical for fraud rates. The average fraud probability across all transactions is 0.12. This histogram shows the distribution of fraud probabilities. Most transactions cluster near zero. They're clearly legitimate, but we have this long tail of higher-probability transactions that need to be reviewed.

These transactions with probabilities above 0.7 are the ones that fraud analysts investigate first. Here in the high-risk transactions table are the transactions that are flagged for high fraud probability. All of these kinds of transactions need immediate review. Large amount, international, late at night: all red flags.

In a real system, clicking on these transactions would show more details and allow analysts to approve or decline them. This chart shows fraud predictions accumulating over time. You can see the count increasing as our scoring engine processes more transactions.

What makes this dashboard powerful is that it's, first, real-time. It auto-refreshes every few seconds, pulling the latest data from Snowflake. Second, interactive. Users can filter and explore the data without writing any code. Third, production-ready. This same code can be deployed to Posit Connect and accessed by hundreds of users simultaneously. And fourth, pure Python. There's no JavaScript, no HTML, no CSS required. Everything you saw is written in Python using the Shiny framework.

Deploying to Posit Connect

So now I'll show you how we deploy this to production on Posit Connect. Deploying to Connect is incredibly simple. From Positron, I can deploy with a single command. I'll navigate to the dashboards folder and deploy with one-button push publishing. This will package up the application code and the requirements.txt file. It'll upload it to Posit Connect and install the Python packages in an isolated environment. And then it'll start the application.

For this demo, I've already deployed the application. In addition to the dashboard, we can also deploy the scoring script as a scheduled job. Here's our fraud scoring job. It's configured to run every five minutes, continuously scoring new transactions. I can see the execution history. Each run shows whether it succeeded or failed, how long it took, and how many transactions were scored. If a run fails, Connect can send email alerts to the data science team. This ensures we catch issues quickly before they impact the business.

Let me show you how the architecture works one more time. Now we see the complete production loop. First, transactions flow into Snowflake from banking systems. Second, the scoring job runs on Connect every five minutes, processing new transactions. Third, predictions are written back to Snowflake. Fourth, the dashboard queries Snowflake and displays real-time metrics. Fifth, fraud analysts investigate high-risk transactions. And sixth, feedback from analysts can be used to retrain and improve this model. This is a production-grade machine learning system running entirely on Posit and Snowflake. It's scalable, secure, and maintainable.

Key takeaways

Now let me summarize the key takeaways from today's demo. First, we demonstrated seamless Snowflake integration. Using Snowpark Container Services, we have native integration with automatic credential management in Workbench, and direct data access without ETL pipelines. Our data scientists can query Snowflake as easily as they query a local database.

Second, we showed how Positron's AI features accelerate development. Positron Assistant can help you write code, debug errors, and explain complex logic. Databot enables exploratory data analysis using natural language, generating visualizations and insights automatically. Together, these tools reduce development time by 60 to 70%.

Third, we built a complete end-to-end machine learning workflow. We performed feature engineering in Python, trained models with XGBoost, deployed a real-time scoring pipeline, and version-controlled everything. This entire workflow runs in production without manual intervention.

Fourth, we deployed to production on Posit Connect. Our Shiny dashboards serve business users with real-time insights, and scheduled jobs run the scoring engine continuously. Everything's secure, scalable, and monitored with detailed logs and metrics.

And fifth, we demonstrated proper separation of concerns. Raw data ingestion happens in the simulator. Machine learning scoring runs as a separate process. Visualization lives in the dashboard. Each component can be developed, tested, and deployed independently, making the system maintainable and scalable.

This workflow demonstrates three critical capabilities. First, speed to production. We went from notebook to production dashboards in hours, not weeks or months. Second, collaboration. Data scientists and analysts use the same tools and access the same data, eliminating handoffs and miscommunication. And third, governance. All code is version-controlled, all access is logged, all credentials are managed securely. This meets enterprise compliance requirements while maintaining developer productivity.

In real-world deployments, organizations using this architecture have achieved a 50% reduction in fraud detection latency, from hours to minutes; a 30% improvement in fraud detection accuracy through faster model iteration; a 70% reduction in infrastructure costs by eliminating redundant data pipelines; and a 90% reduction in deployment time from development to production.

Before we go to Q&A, let me share some resources. Again, the complete code from today's demo is available at github.com/posit-dev/workflows. All of these links are going to be in the GitHub repository. For documentation, we have the Snowflake Plus Posit integration guide, Shiny for Python tutorials, Posit Connect deployment best practices, and Posit AI features documentation. I encourage you to try this yourself: clone the repository, connect to your Snowflake account, and deploy it to Posit Connect. The readme has step-by-step instructions. If you don't have Posit Connect yet, you can request a demo or trial at posit.co. Now let's open it up for questions.
