Simplify Day 2 Operations on GCP — Active Assist

You have deployed your application, set up your Kubernetes cluster, or launched a new service. Now come the challenges of Day 2 operations: managing configuration drift, service reliability, scaling up, and more.

We use the term Day 2 operations quite a bit, but what does that mean exactly?

What are Day 2 operations?

Imagine you’re moving into a house. If Day 1 operations are moving into the house (installation), Day 2 operations are the “housekeeping” stage of a software’s life cycle: the care and feeding of the software, which maintains its overall stability and health in production.

We talk about three stages of operations: Day 0, Day 1, and Day 2. Day 0 is the “design” stage, where we figure out what resources and requirements are needed to get it all up and running.

Day 1 operations describe the “deployment” stage, where we actually install, set up, and configure our software.

Finally, Day 2 is the “maintenance” stage. Think of Day 2 as everything you do to make sure your software is cared for and healthy.

In addition to monitoring how your software is working, Day 2 operations might also entail routine tasks such as installing upgrades and updating systems.

The importance of planning for Day 2 operations

It’s important to have Day 2 operations in mind when laying the groundwork for your software in Day 0 and Day 1, especially for cloud-native technologies. Having the right maintenance tools in place earlier on will help your organization avoid issues in the future.

Some common challenges during Day 2 operations include trouble visualizing the performance of your software and difficulty integrating updates. Managing all the moving parts of your software is especially challenging with cloud-native systems such as Kubernetes, as they become increasingly complex at scale.

If you don’t have an easy way to view your system’s status and performance, it can ultimately spell the end of your new software initiative.

Active Assist

Active Assist refers to the portfolio of tools used in Google Cloud to generate insights and recommendations that help you optimize your cloud projects. The portfolio includes recommenders, which generate recommendations and insights, as well as analysis tools.

Recommenders currently generate recommendations that fall into six value categories that can help you optimize your cloud in a variety of ways.

Cost

Manage your costs wisely by acting on recommendations to delete unused or idle resources, downsize VMs to fit your workload needs, or use committed use discounts to save money.

Some of the sample solutions for the COST pillar are:

1. VM machine type recommender

Compute Engine provides machine type recommendations to help you optimize the resource utilization of your virtual machine (VM) instances. These recommendations are generated automatically based on system metrics gathered by the Cloud Monitoring service over the previous 8 days. Use these recommendations to resize your instance’s machine type to more efficiently use the instance’s resources. This feature is also known as rightsizing recommendations.
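Recommendations like these can also be read programmatically. As a minimal sketch, assuming the Recommender REST API (v1) and the recommender ID google.compute.instance.MachineTypeRecommender, the helper below builds the URL that lists rightsizing recommendations for one zone. The helper name, project ID, and zone are hypothetical placeholders, and authentication is left out.

```python
from urllib.parse import quote

# Assumed base endpoint of the Recommender REST API (v1).
RECOMMENDER_API = "https://recommender.googleapis.com/v1"

# Assumed recommender ID for VM rightsizing recommendations.
MACHINE_TYPE_RECOMMENDER = "google.compute.instance.MachineTypeRecommender"


def list_recommendations_url(
    project_id: str,
    location: str,
    recommender_id: str = MACHINE_TYPE_RECOMMENDER,
) -> str:
    """Build the URL for listing recommendations (hypothetical helper)."""
    parent = (
        f"projects/{quote(project_id, safe='')}"
        f"/locations/{quote(location, safe='')}"
        f"/recommenders/{quote(recommender_id, safe='')}"
    )
    return f"{RECOMMENDER_API}/{parent}/recommendations"


if __name__ == "__main__":
    # "my-project" and "us-central1-a" are placeholder values.
    print(list_recommendations_url("my-project", "us-central1-a"))
```

An authenticated GET request to the resulting URL would be the next step; that part depends on your credentials setup and is not shown here.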

2. Committed use discount recommender

The committed use discount (CUD) recommender helps you optimize the resource costs of the projects in your Cloud Billing account. Its recommendations are generated automatically based on historical usage metrics gathered by Cloud Billing. You can use these recommendations to purchase additional commitments and further optimize your Google Cloud costs.

Recommendations are available for both spend-based and resource-based commitments, for Cloud Billing accounts billed in US dollars (USD).

3. Idle VM recommender

Compute Engine provides idle VM recommendations to help you identify virtual machine (VM) instances that have not been used. These recommendations are generated automatically based on system metrics gathered by the Cloud Monitoring service over the previous 14 days. You can use idle VM recommendations to find and stop idle VM instances to reduce waste of resources and reduce your compute bill.

4. Cloud SQL overprovisioned instance recommender

The Cloud SQL overprovisioned instance recommender helps you detect instances that are unnecessarily large for a given workload. It then provides recommendations on how to resize such instances and reduce cost. The recommender analyzes the usage metrics of primary instances that are older than 30 days. For each instance, the recommender considers the CPU and memory utilization based on the values of certain metrics within the last 30 days. The recommender does not analyze read replicas.

If the peak utilization of the CPU, the memory, or both within the observation period is low, the instance is estimated to be overprovisioned. Recommendations for rightsizing such instances are generated every 24 hours when the estimated monthly cost savings are greater than or equal to $10.

The recommender uses conservative thresholds to ensure that it flags only instances that are significantly overprovisioned, which is usually a good indicator of waste. The recommender suggests a machine type that has at least 4 vCPUs and 26 GB of memory.
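The rules described above can be sketched in a few lines. This is an illustration only, not the recommender's actual implementation: the 30-day age limit and the $10 savings floor come from the text, while the utilization threshold, the class, and the function name are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical threshold: the text only says peak utilization must be "low".
LOW_PEAK_UTILIZATION = 0.25
# From the text: recommend only when monthly savings are at least $10.
MIN_MONTHLY_SAVINGS_USD = 10.0
# From the text: only instances older than 30 days are analyzed.
MIN_INSTANCE_AGE_DAYS = 30


@dataclass
class SqlInstance:
    name: str
    age_days: int
    is_read_replica: bool
    cpu_utilization: Sequence[float]     # samples in [0, 1], last 30 days
    memory_utilization: Sequence[float]  # samples in [0, 1], last 30 days
    estimated_monthly_savings_usd: float


def is_overprovisioned(instance: SqlInstance) -> bool:
    """Return True if the instance would be flagged under these assumptions."""
    # Read replicas and young instances are not analyzed.
    if instance.is_read_replica or instance.age_days <= MIN_INSTANCE_AGE_DAYS:
        return False
    # Without samples there is nothing to judge.
    if not instance.cpu_utilization or not instance.memory_utilization:
        return False
    low_cpu = max(instance.cpu_utilization) < LOW_PEAK_UTILIZATION
    low_memory = max(instance.memory_utilization) < LOW_PEAK_UTILIZATION
    enough_savings = (
        instance.estimated_monthly_savings_usd >= MIN_MONTHLY_SAVINGS_USD
    )
    return (low_cpu or low_memory) and enough_savings
```

Using the peak rather than the average keeps the check conservative: a single busy period is enough to keep an instance off the list.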

Security

Harden your security posture by applying recommended actions to reduce over-granted permissions, enable additional security features, and help with compliance and security incident investigations.

Some of the sample solutions for the SECURITY pillar are:

1. IAM recommender

Role recommendations help you identify and remove excess permissions from your principals, improving your resources’ security configurations. Role recommendations are generated by the IAM recommender. The IAM recommender is one of the recommenders that Recommender offers. Each role recommendation suggests that you remove or replace a role that gives your principals excess permissions. At scale, these recommendations help you enforce the principle of least privilege by ensuring that principals have only the permissions that they actually need. The IAM recommender identifies excess permissions using policy insights. Policy insights are ML-based findings about a principal’s permission usage.

Some recommendations are also associated with lateral movement insights. These insights identify roles that allow service accounts in one project to impersonate service accounts in another project.
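The idea of excess permissions behind role recommendations can be sketched as a set difference between what a role grants and what a principal actually used. The role names and permissions below are hypothetical, and the real IAM recommender relies on ML-based policy insights rather than this simple comparison.

```python
from typing import Dict, FrozenSet, Optional, Set

# Hypothetical role catalog, made up for this example.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "roles/example.editor": frozenset({
        "example.items.get",
        "example.items.list",
        "example.items.create",
        "example.items.delete",
    }),
    "roles/example.viewer": frozenset({
        "example.items.get",
        "example.items.list",
    }),
}


def excess_permissions(role: str, used: Set[str]) -> Set[str]:
    """Permissions the role grants that the principal never used."""
    return set(ROLE_PERMISSIONS[role]) - used


def suggest_replacement(role: str, used: Set[str]) -> Optional[str]:
    """Return the smallest role that still covers every used permission.

    Returns None when no strictly smaller role covers the usage.
    """
    current = ROLE_PERMISSIONS[role]
    candidates = [
        (len(perms), name)
        for name, perms in ROLE_PERMISSIONS.items()
        if used <= perms and len(perms) < len(current)
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    used_permissions = {"example.items.get"}
    print(excess_permissions("roles/example.editor", used_permissions))
    print(suggest_replacement("roles/example.editor", used_permissions))
```

A principal that used no permissions at all would be a candidate for removing the role entirely; the sketch leaves that decision to the caller.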

2. Firewall insights

Firewall Insights helps you understand the usage patterns of your firewall rules. You can use these insights to support decisions about removing or modifying firewall rules to simplify and secure your firewall configuration.

You can view the following insights on the Google Cloud console Firewall Insights page and in several other places in the Google Cloud console:

  • Shadowed firewall rules: help you identify firewall rules that overlap with existing rules.

  • Overly permissive rules: help you identify allow rules with no hits, unused attributes, or overly permissive IP address or port ranges.

  • Deny rules: give you details about deny rules that had hits during the configured observation period.
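To make the notion of a shadowed rule concrete, here is a simplified sketch. It assumes a rule is shadowed when another rule with higher or equal priority covers its whole source range and port range. The Rule class and function names are hypothetical, and real firewall rules carry more attributes (action, direction, protocol, targets) that this example ignores.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int      # lower number means higher priority
    source_range: str  # a single CIDR block, for simplicity
    port_start: int
    port_end: int


def covers(outer: Rule, inner: Rule) -> bool:
    """True if outer matches every address and port that inner matches."""
    outer_net = ipaddress.ip_network(outer.source_range)
    inner_net = ipaddress.ip_network(inner.source_range)
    return (
        outer_net.version == inner_net.version
        and inner_net.subnet_of(outer_net)
        and outer.port_start <= inner.port_start
        and inner.port_end <= outer.port_end
    )


def shadowed_rules(rules: List[Rule]) -> List[str]:
    """Names of rules fully covered by a rule of higher or equal priority."""
    shadowed = []
    for rule in rules:
        if any(
            other is not rule
            and other.priority <= rule.priority
            and covers(other, rule)
            for other in rules
        ):
            shadowed.append(rule.name)
    return shadowed
```

A shadowed rule never changes the outcome for any packet, which is why it is a natural candidate for removal.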

3. Cloud Run recommender

Recommender is a service that automatically provides recommendations and insights for using resources on Google Cloud, based on heuristic methods, machine learning, and current resource usage. Each recommendation includes a link you can click to put the recommendation into effect for your service.

You can use Recommender to increase security by optimizing the following:

  • Service accounts for a Cloud Run service, so that the service account has the minimal set of required permissions.

  • Security of passwords, API keys, and Google Application Credentials stored in environment variables.
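As a rough illustration of the environment variable check, the sketch below flags variable names that look like they hold passwords, API keys, or Google Application Credentials. The name patterns and the function are hypothetical; how the actual recommender detects such values is not described here.

```python
import re
from typing import Dict, List

# Hypothetical name patterns, chosen only for this example.
SENSITIVE_NAME = re.compile(
    r"(PASSWORD|PASSWD|API_?KEY|GOOGLE_APPLICATION_CREDENTIALS)",
    re.IGNORECASE,
)


def flag_sensitive_env_vars(env: Dict[str, str]) -> List[str]:
    """Return the sorted names of variables that look sensitive."""
    return sorted(name for name in env if SENSITIVE_NAME.search(name))


if __name__ == "__main__":
    # Placeholder configuration of a service.
    service_env = {
        "DB_PASSWORD": "example",
        "MAPS_API_KEY": "example",
        "PORT": "8080",
    }
    print(flag_sensitive_env_vars(service_env))
```

Matching on names alone will miss secrets stored under neutral names, so a check like this is a starting point rather than a guarantee.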

4. Account security recommender

The account security recommender prompts Google Cloud project owners to protect their account with their phone’s built-in security key. If a project owner has an eligible device associated with their account and isn’t using a third-party identity provider for single sign-on, a notification appears in the Google Cloud console’s Notifications menu. The project owner can then choose to enable their phone as a security key for their account, or dismiss the notification.

Performance

Improve the performance of your cloud resources and workloads through prediction and automation that take your infrastructure one step ahead of what your applications need next.

Some of the sample solutions for the PERFORMANCE pillar are:

1. Managed instance group machine type recommender

Compute Engine provides machine type recommendations for managed instance groups (MIGs) to help you improve workload performance and cost efficiency. Use these recommendations to determine whether you should resize the machine type of your instances to add or remove vCPU and memory resources.

Reliability

Increase the availability and reliability of your cloud resources and your workloads running on Google Cloud through health checks, autoscaling capabilities, and business continuity and disaster recovery options.

Some of the sample solutions for the RELIABILITY pillar are:

1. Compute Engine predictive autoscaling

You can configure autoscaling for a managed instance group (MIG) to automatically add or remove virtual machine (VM) instances based on increases or decreases in load. However, if your application takes a few minutes or more to initialize, adding instances in response to real-time changes might not increase your application’s capacity quickly enough. For example, if there’s a large increase in load (like when users first wake up in the morning), some users might experience delays while your application is initializing on new instances.

You can use predictive autoscaling to improve response times for applications with long initialization times and whose workloads vary predictably with daily or weekly cycles.

When you enable predictive autoscaling, Compute Engine forecasts future load based on your MIG’s history and scales out the MIG in advance of predicted load, so that new instances are ready to serve when the load arrives. Without predictive autoscaling, an autoscaler can only scale a group reactively, based on observed changes in load in real time. With predictive autoscaling enabled, the autoscaler works with real-time data as well as with historical data to cover both the current and forecasted load.
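The difference between reactive and predictive sizing can be sketched as follows. This is a simplified model, not Compute Engine's algorithm: it assumes load and per-instance capacity are expressed in the same unit (for example, requests per second), and the function name and parameters are hypothetical.

```python
import math


def recommended_size(
    current_load: float,
    forecast_load: float,
    capacity_per_instance: float,
    min_instances: int = 1,
) -> int:
    """Size the group for the larger of current and forecasted load."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed_load = max(current_load, forecast_load)
    return max(min_instances, math.ceil(needed_load / capacity_per_instance))


if __name__ == "__main__":
    # Reactive only: no forecast, so the group follows the current load.
    print(recommended_size(current_load=300, forecast_load=0,
                           capacity_per_instance=100))
    # Predictive: the forecast exceeds current load, so scale out early.
    print(recommended_size(current_load=300, forecast_load=900,
                           capacity_per_instance=100))
```

Taking the maximum of the two signals means a forecast can only add capacity ahead of time; it never scales the group below what the current load requires.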

2. Cloud SQL out-of-disk recommender

The Cloud SQL out-of-disk recommender proactively generates recommendations that help you reduce the risk of downtime that might be caused by your instances running out of disk space. You can apply these recommendations when a Cloud SQL instance is trending toward a storage limit.
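One simple way to reason about an instance "trending toward a storage limit" is to fit a straight line to recent disk usage and extrapolate. The sketch below does that with a least-squares fit over daily samples. It is an assumption made for illustration, not the recommender's actual forecasting method, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(
    daily_usage_gb: Sequence[float], capacity_gb: float
) -> Optional[float]:
    """Estimate days until the disk is full from one sample per day.

    Returns None when there are too few samples or usage is not growing.
    """
    n = len(daily_usage_gb)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    # Least-squares slope: growth in GB per day.
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb)
    )
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    remaining_gb = capacity_gb - daily_usage_gb[-1]
    return max(0.0, remaining_gb / slope)


if __name__ == "__main__":
    # Placeholder samples: usage grows by about 10 GB per day.
    print(days_until_full([10, 20, 30, 40], capacity_gb=100))
```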

3. Policy Troubleshooter

Policy Troubleshooter makes it easier to understand why a user has access to a resource or doesn’t have permission to call an API. Given an email, resource, and permission, Policy Troubleshooter examines all Identity and Access Management (IAM) policies that apply to the resource. It then reveals whether the principal’s roles include the permission on that resource and, if so, which policies bind the principal to those roles.

You can access Policy Troubleshooter using the Google Cloud console, the Google Cloud CLI, or the REST API.
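The core question Policy Troubleshooter answers can be sketched against a single allow policy in the bindings format that IAM policies use. The role definitions below are hypothetical, and the sketch ignores group membership, conditions, and policies inherited from parent resources, all of which the real tool has to take into account.

```python
from typing import Dict, FrozenSet, List

# Hypothetical role definitions, made up for this example.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "roles/example.viewer": frozenset({"example.items.get"}),
    "roles/example.editor": frozenset(
        {"example.items.get", "example.items.update"}
    ),
}


def explain_access(policy: dict, member: str, permission: str) -> List[str]:
    """Return the roles whose bindings grant the permission to the member.

    An empty list means no binding in this policy grants the permission.
    """
    granting_roles = []
    for binding in policy.get("bindings", []):
        role = binding["role"]
        is_bound = member in binding.get("members", [])
        has_permission = permission in ROLE_PERMISSIONS.get(role, frozenset())
        if is_bound and has_permission:
            granting_roles.append(role)
    return granting_roles


if __name__ == "__main__":
    example_policy = {
        "bindings": [
            {"role": "roles/example.viewer",
             "members": ["user:alice@example.com"]},
        ]
    }
    print(explain_access(example_policy, "user:alice@example.com",
                         "example.items.get"))
```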

4. Policy Analyzer

You can use the Policy Analyzer to find out which principals (users, service accounts, groups, and domains) have what access to which Google Cloud resources.

Manageability

Enhance your management experience on Google Cloud via simplification and automation so that you spend less time managing your cloud configuration and spend more time on innovating your digital businesses and delighting your customers.

Some of the sample solutions for the MANAGEABILITY pillar are:

1. Network Intelligence Center

Network Intelligence Center provides a single console for managing Google Cloud network visibility, monitoring, and troubleshooting.

a. Network Topology overview

Network Topology is a visualization tool that shows the topology of your Virtual Private Cloud (VPC) networks, hybrid connectivity to and from your on-premises networks, connectivity to Google-managed services, and the associated metrics. You can also view metrics and details of network traffic to other Shared VPC networks and inter-region traffic. Network Topology combines configuration information with real-time operational data in a single view. This view makes it easier to understand networking relationships between various workloads on Google Cloud and their current state, such as the traffic paths and throughput between virtual machine (VM) instances.

Network Topology lays out information in a graph format, where the nodes and lines represent entities and connections in your network.

b. Connectivity Tests overview

Connectivity Tests is a diagnostics tool that lets you check connectivity between network endpoints. It analyzes your configuration and, in some cases, performs live data plane analysis between the endpoints. An endpoint is a source or destination of network traffic, such as a VM, Google Kubernetes Engine (GKE) cluster, load balancer forwarding rule, or an IP address on the internet.

To analyze network configurations, Connectivity Tests simulates the expected forwarding path of a packet through your Virtual Private Cloud (VPC) network, Cloud VPN tunnels, or VLAN attachments. Connectivity Tests can also simulate the expected inbound forwarding path to resources in your VPC network.

For some connectivity scenarios, Connectivity Tests also performs live data plane analysis. This feature sends packets over the data plane to validate connectivity and provides baseline diagnostics of latency and packet loss. If the route is supported for the feature, each test that you run includes a live data plane analysis result.

c. Performance Dashboard overview

Performance Dashboard gives you visibility into the performance of the entire Google Cloud network, as well as into the performance of your project’s resources.

With these performance-monitoring capabilities, you can distinguish between a problem in your application and a problem in the underlying Google Cloud network. You can also investigate historical network performance problems.

Performance Dashboard also exports data to Cloud Monitoring. You can use Monitoring to query the data and get access to additional information.

d. Firewall Insights overview

Firewall Insights helps you understand and optimize your firewall rules. It provides insights, recommendations, and metrics about how your firewall rules are being used. Firewall Insights also uses machine learning to predict future firewall rules usage.

Firewall Insights lets you make better decisions during firewall rule optimization. For example, Firewall Insights identifies rules that it classifies as overly permissive. You can use this information to make your firewall configuration stricter.

e. Network Analyzer overview

Network Analyzer automatically monitors your VPC network configurations and detects misconfigurations and suboptimal configurations. It provides insights on network topology, firewall rules, routes, configuration dependencies, and connectivity to services and applications. It identifies network failures, provides root cause information, and suggests possible resolutions.

Network Analyzer runs continuously and triggers relevant analyses based on near real-time configuration updates in your network. If a network failure is detected, it tries to correlate the failure with recent configuration changes to identify root causes. Wherever possible, it recommends how to fix the issues.

2. Product suggestion recommender

The product suggestion recommender helps you to optimize your Cloud usage by providing you with product suggestions. This can help you improve performance and security, and manage your resources better. Based on best practices, it analyzes your current product usage within each project and determines any additional products that might optimize your usage. If the recommender identifies an opportunity to leverage a product within a project, a recommendation is generated for that project in the Recommendation Hub. All users who have the appropriate permissions can view the recommendations when logged in to the hub. Every product suggestion includes information about the recommendation, details on the product, and links to help you get started with the product.

Here is one example of a product suggestion recommendation:

  • A Cloud Logging suggestion for your Google Kubernetes Engine resources.

3. Policy Simulator

Policy Simulator reports the impact of a proposed change to an allow policy as a list of access changes. Each access change represents an access attempt from the last 90 days that would have a different outcome under the proposed allow policy than under the current allow policy.

Policy Simulator also lists any errors that occurred during the simulation, which helps you identify potential gaps in the simulation.
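The replay idea behind an access change can be sketched as follows: evaluate each past access attempt under both the current and the proposed policy, and keep only the attempts whose outcome differs. The policy model here (a mapping from member to granted permissions) and all names are simplifying assumptions; real allow policies bind members to roles.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class AccessAttempt(NamedTuple):
    member: str
    permission: str


# Simplified policy model: member -> set of granted permissions.
Policy = Dict[str, Set[str]]


def allowed(policy: Policy, attempt: AccessAttempt) -> bool:
    """True if the policy grants the attempted permission to the member."""
    return attempt.permission in policy.get(attempt.member, set())


def access_changes(
    current: Policy, proposed: Policy, attempts: List[AccessAttempt]
) -> List[Tuple[AccessAttempt, bool, bool]]:
    """Return (attempt, allowed_before, allowed_after) where outcomes differ."""
    changes = []
    for attempt in attempts:
        before = allowed(current, attempt)
        after = allowed(proposed, attempt)
        if before != after:
            changes.append((attempt, before, after))
    return changes
```

Because only logged attempts are replayed, access that nobody exercised during the window does not show up as a change, even if the proposed policy removes it.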

Sustainability

Assess, manage, and reduce the carbon footprint of your workloads running on Google Cloud with insights and simple-to-use tools.

Some of the sample solutions for the SUSTAINABILITY pillar are:

1. Unattended project recommender

The unattended project recommender analyzes usage activity on projects in your organization and provides recommendations that help you discover, reclaim or remove unattended projects.

In fast-moving organizations, it’s not uncommon for cloud resources, including entire projects, to occasionally be forgotten about. Such unattended resources can be difficult to identify and tend to result in unnecessary waste and security risks.

The recommender provides the following features to help you discover, reclaim, and shut down unattended projects:

  • Usage insights for every project (networking, API, project owner, service activity, and more).

  • Recommendations to turn down projects that have low usage activity.

  • Recommendations to assign a new owner to projects that have high usage activity but no active owner.
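The two kinds of recommendations listed above can be sketched as a small decision function. The usage metric, the threshold, and all names are hypothetical; the actual recommender combines several usage signals that this example reduces to a single count.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: API calls in the observation window.
LOW_USAGE_THRESHOLD = 5


@dataclass
class ProjectUsage:
    project_id: str
    api_calls: int
    has_active_owner: bool


def recommend(project: ProjectUsage) -> Optional[str]:
    """Return a recommendation label, or None if the project looks attended."""
    if project.api_calls < LOW_USAGE_THRESHOLD:
        return "TURN_DOWN"
    if not project.has_active_owner:
        return "ASSIGN_NEW_OWNER"
    return None


if __name__ == "__main__":
    # Placeholder projects.
    print(recommend(ProjectUsage("old-demo", 0, False)))
    print(recommend(ProjectUsage("busy-orphan", 1200, False)))
```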

Shutting down or reclaiming unattended projects can provide the following impact and benefits to your organization:

  • Reduction in security risks (SECURITY)

  • Reduction in unnecessary spending (COST)

  • Reduction in carbon footprint associated with your workloads (SUSTAINABILITY)

Conclusion

Day 2 operations aren’t just an important part of a software’s life cycle. They’re the bulk of your software’s life cycle, keeping the workloads running day in and day out.

Maintaining, monitoring, and upgrading software are all Day 2 operations. These tasks may not be immediately obvious when you test drive software, but they are critical for production workloads.

In this post, we’ve outlined just a few examples of Day 2 operations and some hurdles you might run into while operating software day to day.

Additionally, we’ve looked at some of the ways Google Cloud supports Day 2 operations through Active Assist.