Total Cost of Ownership: The Theory
Do you know the real cost of your features?
We usually focus on initial development costs and, at best, future maintenance costs. But the reality is much more complex.
The Total Cost of Ownership (TCO) model was created to address this complexity.
TCO is the overall cost of a product or service throughout its entire life cycle:
- Direct costs: easily measurable and directly tied to software development, deployment, and maintenance.
- Indirect costs: harder to quantify, but they significantly impact business operations and profitability over time.
Direct costs:
🔸 Initial Development: costs of planning, designing, coding, testing, and deploying the software.
🔸 Infrastructure & Hardware: the cost of on-premise or public cloud infrastructure, equipment, storage, and networking, shaped by deployment models, scalability patterns, and workload predictability.
🔸 Licensing & 3rd-party services: costs of third-party services (proprietary or open source), subscriptions, and collaboration platforms such as Confluence, Jira, and Zoom.
🔸 Compliance & Security: expenses for regulatory compliance, certifications, data protection, and cybersecurity measures.
🔸 Maintenance & Support: regular updates, performance optimizations, security patching, incident management, bug fixing, customer support, and helpdesk services.
🔸 Operations: everything required to keep the system up and running: personnel, data storage and backups, observability, reliability, usability.
Indirect costs:
🔸 Training & Documentation: user training, preparing documentation and keeping it up to date.
🔸 Scaling: costs to scale infrastructure and implement performance optimizations as the business grows.
🔸 Future Enhancements: costs of adding new features and expanding existing ones, plus technical debt management.
🔸 Downtime: the cost of a system outage or data loss. It's important to account for all potential damage: lost productivity, lost users, lost revenue.
🔸 Upgrades: regular production upgrades and codebase maintenance, plus additional costs when a technology's license changes or its support ends.
The main goal of the model is to provide a comprehensive view of all costs associated with a software product. This value can then be used for budgeting, ROI and business value calculations, and strategic decision-making.
#engineering #costoptimization
TCO: How to Sell Your Tech Debt
Now that we know what Total Cost of Ownership is, let's see whether it can be useful in practice.
Let's take technical debt as an example and imagine we need to get a budget approved for it.
Inputs:
✔️ The system consists of 15 services, including 1 core microservice with basic functionality shared across the other services.
✔️ At least 6 microservices have direct dependencies on the logic of this core service.
✔️ The core service has huge technical debt that slows down overall feature development. At least 5 features per quarter impact the core service.
✔️ 1 day of 1 engineer's work = $5,000 per month / 20 working days = $250
✔️ The team estimates around 100 working days for the refactoring: 100 × $250 ≈ $25K
Cost of ownership:
Initial development:
🔸 Average cost per feature: 10 days = 10 × $250 = $2,500
🔸 Development is 2x slower without the refactoring, so the average cost per feature becomes $5,000
🔸 Yearly extra development cost = $2,500 × 5 features/quarter × 4 quarters = $50K
Maintenance & Support
🔸 Average time to resolve a bug: 1.5 days
🔸 Bug-fixing time also doubles, so the average cost per bug becomes $250 × 3 days = $750 (an extra $375 per bug)
🔸 Yearly extra support cost = $375 × 10 bugs/month × 12 months = $45K
Downtime
🔸 Tech debt in the core service increases the deployment failure rate, incident duration, and rollback frequency (if you have real statistics from your project, use them in the calculation)
🔸 1 critical incident per year ≈ $30K
Upgrades:
🔸 Upgrades are riskier and require an extra hour of maintenance window
🔸 This caps maximum availability at 99.5% (which may not satisfy overall system requirements)
Summary:
🔸 The technical debt costs the company around $95K per year in extra development and support, brings additional incident risk, and reduces overall availability.
🔸 Cost of the refactoring: $25K
🔸 Payback time: ~3 months
🔸 Refactoring is more beneficial than just continuing to ship features without it.
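If you like to double-check such estimates in code, here is a minimal sketch in Go (the numbers are the assumptions from above, not real project data):

package main

import "fmt"

func main() {
    const (
        dailyRate        = 250.0  // $ per engineer-day (assumed)
        refactoringDays  = 100.0
        extraFeatureCost = 2500.0 // extra cost per feature caused by the debt
        featuresPerYear  = 5 * 4  // 5 per quarter
        extraBugCost     = 375.0  // extra cost per bug
        bugsPerYear      = 10 * 12
    )

    refactoringCost := refactoringDays * dailyRate
    yearlyOverhead := extraFeatureCost*featuresPerYear + extraBugCost*bugsPerYear

    fmt.Printf("Refactoring cost: $%.0f\n", refactoringCost)    // $25,000
    fmt.Printf("Yearly debt overhead: $%.0f\n", yearlyOverhead) // $95,000
    fmt.Printf("Payback: %.1f months\n", refactoringCost/yearlyOverhead*12) // ~3.2 months
}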
As you can see, the overall strategy is simple: minimize TCO and maximize business benefits.
The approach may look a bit heavy, but the common problem between technical experts and management is that they speak different languages.
Business talks in the language of money. It is not interested in engineering best practices, clean architecture, or the size of technical debt.
But we can use tools like TCO to unify our language, make communication much more productive and, of course, get the required budget.
#engineering #leadership #costoptimization
The Evolution of SRE at Google
Google not only pioneered SRE practices, it also keeps improving its SRE approach to keep its large-scale production up and running. SLOs/SLIs, error budgets, isolation strategies, and postmortems are well-known reliability tools.
But Google's team went beyond that and adopted systems theory and control theory. The main idea is to focus on the complex system as a whole rather than on individual elements and their failures.
The new approach is based on the System-Theoretic Accident Model and Processes (STAMP) framework. In complex systems, most accidents are the result of interactions between components that are all functioning well but collectively produce an unsafe state. STAMP shifts analysis from a traditional linear chain of failures (root cause) to a "control problem" (I think STAMP itself deserves a separate overview).
From a very basic perspective, it analyzes the system in terms of control-feedback loops to find issues in both the control path and the feedback path. The feedback path is usually less well understood, but just as important as the control path from a system safety perspective.
For example, imagine a "resizer" service that changes resource quotas according to resource usage. The logic responsible for calculating resource usage and delivering these values to the resizer is probably even more critical than the logic that changes the quota itself.
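To make the control/feedback framing concrete, here is a minimal, hypothetical sketch of such a loop in Go (the resizer logic, thresholds, and metric source are all invented for illustration): the quota decision is only as good as the usage feedback it receives.

package main

import (
    "errors"
    "fmt"
)

// Feedback path: reports observed resource usage for a service.
// If this path silently returns stale or wrong data, the controller
// below will issue "correct" actions based on an unsafe picture.
func observeUsage(service string) (float64, error) {
    // Hypothetical stand-in for a metrics pipeline.
    if service == "" {
        return 0, errors.New("no usage signal")
    }
    return 0.92, nil // 92% of the current quota
}

// Control path: decides a new quota from the feedback.
func resize(currentQuota, usageRatio float64) float64 {
    switch {
    case usageRatio > 0.85:
        return currentQuota * 1.2 // scale up
    case usageRatio < 0.30:
        return currentQuota * 0.8 // scale down
    default:
        return currentQuota
    }
}

func main() {
    usage, err := observeUsage("billing")
    if err != nil {
        // A STAMP-style analysis treats a broken feedback path as a hazard
        // in its own right, not just "the resizer worked as designed".
        fmt.Println("feedback path failed:", err)
        return
    }
    fmt.Println("new quota:", resize(100, usage))
}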
The Google SRE team performs such analysis on a regular basis for its global services. Even the first results revealed a set of scenarios that could potentially produce hazards in the future. But since they are "predicted" before real incidents happen, improvements can be planned without rush, like any other feature.
Overall, I think the new approach is a really great example of systems theory in practice. Moreover, I've discovered there's a whole area of reliability theory behind this that I plan to explore in more detail in the future.
#engineering #reliability
Source: USENIX — "The Evolution of SRE at Google"
STAMP Framework
As I wrote previously, I think the STAMP framework deserves its own overview.
The framework was introduced by MIT professor Nancy Leveson in the book "Engineering a Safer World" in 2011.
STAMP (System-Theoretic Accident Model and Processes) is a functional model of controllers that interact with each other through control actions and feedback:
- The system is modeled as a control system.
- The control system consists of hierarchical control-feedback loops.
- The control system enforces safety constraints and prevents accidents.
STAMP comes with 2 main methodologies: Systems-Theoretic Process Analysis (STPA) and Causal Analysis Using System Theory (CAST).
Systems-Theoretic Process Analysis (STPA) is a hazard analysis technique performed at the design stage (proactive analysis):
🔸 Define the purpose. For example, meet RPO/RTO requirements, prevent data loss, meet GDPR requirements, etc.
🔸 Model the control structure. Describe and document interactions between key components for this type of hazard.
🔸 Identify unsafe control actions. For example, deploying a new version before testing, failing to scale up during high load, routing traffic to an unhealthy backend, etc.
🔸 Identify loss scenarios. Find scenarios that could lead to unsafe actions. For example, a failed update causes a service outage, broken autoscaling leads to dropped users, etc.
🔸 Define safety constraints. Create controls and design changes to prevent unsafe control actions. For example, any deployment must have a rollback strategy, service load must be monitored and alerts must fire if resource usage exceeds 80%, etc.
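As an illustration only (the controller, action, and constraint below are invented), STPA outputs can be captured as simple structured records and tracked like any other engineering artifact:

package main

import "fmt"

// Minimal, hypothetical STPA artifacts for a deployment pipeline.
type UnsafeControlAction struct {
    Controller string // who issues the action
    Action     string // the control action
    Context    string // when it becomes unsafe
    Hazard     string // the loss it can lead to
}

type SafetyConstraint struct {
    Statement string
    Mitigates string // which unsafe action it addresses
}

func main() {
    uca := UnsafeControlAction{
        Controller: "CD pipeline",
        Action:     "deploy new version",
        Context:    "before integration tests have passed",
        Hazard:     "service outage for dependent systems",
    }
    constraint := SafetyConstraint{
        Statement: "every deployment must have an automated rollback strategy",
        Mitigates: uca.Action,
    }
    fmt.Printf("UCA: %+v\nConstraint: %+v\n", uca, constraint)
}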
Causal Analysis Using System Theory (CAST) is an accident investigation method performed after an incident has occurred (reactive analysis):
🔸 Collect information about the incident.
🔸 Model the control structure.
🔸 Analyze each component involved in the loss. Determine why the component didn't prevent the incident.
🔸 Identify control structure flaws: communication and coordination, safety management, culture, environment, etc.
🔸 Create an improvement program. Prepare recommendations for changes to prevent a similar loss in the future.
To sum up, the STAMP framework suggests enforcing safety constraints instead of just trying to prevent system failures. What I really like about this approach is that it allows reliability to be incorporated into the system design itself.
P.S. If the topic sounds interesting to you, the "Engineering a Safer World" book is available for free from MIT Press.
#engineering #reliability #systemdesign
Three Layers of Architectural Debt
Technical debt is a very popular topic in software development. Architectural debt is less popular, but I have seen more and more interest in it over the last year.
Today I want to share the article Architectural debt is not just technical debt by Eoin Woods. The author highlights the importance of identifying and managing architectural debt, as it can impact the whole organization.
The author defines architectural debt as "structural decisions that come back to bite you six months later". It's not very scientific, but it gives a good feel for what it is.
According to the article, this debt can be split into 3 layers by its impact: the application layer, the business layer, and the strategy layer (please refer to the attached picture with the model in the next post).
Application Layer
It's the level of a particular service, its integrations and technologies. Problems here are easy to detect, and issues directly impact delivery time and day-to-day operations.
Business Layer
It's the level of organizational structure and team topologies, defined ownership and stewardship. Poorly designed structures produce heavy communication flows (Conway's Law, remember?) that can impact the overall system architecture, produce duplicated functionality, and create conflicts of interest between teams. Issues here multiply the issues on the operational side.
Strategy Layer
Debt at this level may impact the whole organization. A single strategic misstep creates a cascade of misalignment that amplifies at each level:
Strategy debt (wrong capability decisions) -> Wrong Business Assumptions -> Faulty Requirements -> Technical Issues -> Operational Chaos
The responsibility of an architect here is to raise a red flag, describe the debt with AS-IS and TO-BE states, and explain to the business the risks of not handling it.
One more important idea from the article is that modern architecture cannot be the responsibility of one person or a small group of people. To be successful, architecture should be a shared activity of understanding and learning, guided by common principles. And at that point it's more about company culture than technical knowledge:
Trust and curiosity allow principles to live and decisions to evolve. This turns architecture from a static artefact into an ongoing activity.
#architecture
Source: frederickvanbrabant.com — "Architectural debt is not just technical debt"
ML Nested Learning Approach
Google recently published quite interesting research: "Introducing Nested Learning: A new ML paradigm for continual learning".
So let's check what it is about and why it may be interesting for the ML community.
Modern LLMs tend to lose efficiency on old tasks while learning new ones. This effect is called catastrophic forgetting (CF). Researchers and data scientists spend a significant amount of time inventing architectural tweaks or better optimizations to deal with it.
The authors see the root cause of this issue in treating the model's architecture and the optimization algorithm as two different things.
Nested Learning treats an ML model as a system of interconnected, multi-level learning problems that are optimized simultaneously. In that approach, the model's architecture and the rules used to train it are fundamentally the same concept.
The overall idea is heavily based on the concept of associative memory: the ability to map and recall one thing based on another (like recalling a name when you see someone's face):
🔸 The training process is modeled as an associative memory. The model learns to map a given data point to the value of its local error, which serves as a measure of how "surprising" or unexpected that data point was.
🔸 Key architectural components (like transformers) are also formalized as simple associative memory modules that learn the mapping between tokens in a sequence.
Proof-of-concept tests show superior performance in language modeling and better long-context memory management than existing models (you can check the measurements in the article or in the full text of the research).
ML keeps evolving. What's interesting is that the best architecture ideas come from systems theory and attempts to copy human brain behavior.
#ai #architecture
What Changes as You Grow
When I was first an engineer and then a team lead, I used to think that my managers and senior architects always knew what to do and how to do it. I could ask them for direction, bring them problems, and ask for help (I've been lucky to have really great managers during my career). And, of course, that was because they are super smart, experienced, and kind of "grown-up".
What I eventually realized is that at every level there are people just like us, with their own problems and fears, who also may not know what to do. Moreover, they make mistakes too.
The key difference is the ability to keep working in situations with a high level of uncertainty: good leaders stay focused, accept risks, take responsibility, choose a way forward, and engage the right people.
And here we can really help our leaders with the right details and expertise, new ideas, and suggestions. Believe me, it will be really appreciated. I say this as someone who's now expected to always know what to do for my team.
#selfreflection #offtop
Architectural Debt
Let's continue the topic of architectural debt. The previous post focused on the impact and importance of the debt itself, but said nothing about what to do with it.
To fix that, I suggest reading Technical Debt vs. Architecture Debt: Don't Confuse Them. The article may not fully answer this question, but it provides actionable recommendations on how to measure architectural debt and what strategies can be applied to reduce it.
Architectural debt indicators:
🔸 Duplicated functionality: Count how many systems perform overlapping functions.
🔸 Integration complexity: Measure the number of point-to-point connections vs API gateways, enterprise service buses (ESBs), or event-driven models.
🔸 Principle violations: Track how many systems lack defined owners, documented interfaces, or compliance with internal architectural standards.
🔸 Latency chains: Calculate end-to-end data flow time across multiple hops.
🔸 Configuration management completeness: Measure the percentage of applications with ownership, life cycle, and dependency fields filled in.
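Most of these indicators can be computed from data you probably already have, such as a service catalog or CMDB export. As a toy sketch (the service names and data shape are invented), counting point-to-point connections could look like this:

package main

import "fmt"

// Hypothetical catalog entry: which services each service calls directly.
var directCalls = map[string][]string{
    "orders":   {"billing", "inventory", "notifications"},
    "billing":  {"ledger", "notifications"},
    "frontend": {"orders", "billing", "inventory", "search"},
}

func main() {
    pointToPoint := 0
    for svc, deps := range directCalls {
        pointToPoint += len(deps)
        fmt.Printf("%s -> %d direct integrations\n", svc, len(deps))
    }
    // A crude indicator: total point-to-point links that bypass a gateway, ESB, or event bus.
    fmt.Println("total point-to-point connections:", pointToPoint)
}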
What can be done:
🔸 Officially define architecture debt. The first step to fixing a problem is admitting it. :)
🔸 Build metrics and dashboards.
🔸 Practice architecture observability. Track system dependencies, integration bottlenecks, and principle compliance in near-real time.
🔸 Run architecture reviews.
🔸 Manage debt as a portfolio. Not all debt needs immediate repayment. Like managing a project portfolio, organizations should prioritize the debt by business impact.
🔸 Link debt to business KPIs.
As you can see, there is no rocket science here: the standard make the problem evident -> measure -> improve cycle. I think the article is a good starting point to analyze whether you have architectural debt in your organization and prepare the first steps to work with it.
#architecture
AI Impact on Developer's Productivity
Over the past year, almost everyone has predicted that AI will replace developers. New tools appear every week, and the hype continues to grow. But where are these tools in real life? Do they really help developers be more productive?
Anthropic recently published research on how AI tools have transformed development inside their own company. Anthropic is well known for its Claude Code agent, so it's quite interesting to see how AI tools impact development in an AI company (they should definitely know how to do it effectively, right?).
Key points:
🔸 AI is mostly used for fixing bugs and understanding code.
🔸 Engineers reported a ~50% productivity boost (a subjective estimate).
🔸 27% of assisted work consists of tasks that wouldn't have been done otherwise: minor issues, small refactorings, nice-to-have tools, additional tests, documentation.
🔸 Only 20% of real work can be delegated to the AI assistant. Moreover, engineers tend to delegate tasks that can be easily verified and reviewed.
🔸 Everyone is becoming more "full-stack": for example, a backend developer can handle simple frontend tasks.
🔸 Claude Code has become the first place to ask questions, reducing mentorship and collaboration.
🔸 Many engineers show deep uncertainty about their future and what development will look like in a few years.
So, according to the survey, AI assistants can significantly help with routine tasks and act as a knowledge base for the code. But there is still not enough trust to delegate complex tasks or architectural decisions to them.
#ai #engineering
Personal Goals & Well-Being
Do you already have plans to start doing something in January?
December is traditionally a time to sum up the year and start planning the next achievements.
So today I want to share Gallup's key elements of well-being, which can help define the areas for personal long-term goals. The Gallup institute conducted a huge study to identify the aspects of human life that we can actually influence to make our lives better:
🔸 Career: You like what you do every day.
🔸 Social: You have meaningful friendships in your life.
🔸 Financial wellbeing: You manage your money well.
🔸 Physical: You have energy to get things done.
🔸 Community: You like where you live.
For each area, you can define several goals for the year. To make them real, decompose the goals into concrete steps (a plan) and activities to start with (it's better to add them to the calendar immediately).
For many years I focused only on career and finance: more expertise, more experience, interesting tasks to solve, enough money to feel safe. As a result, this year I'm dealing with various health issues.
So I've learned my lesson, and for next year I'm preparing a separate plan for the other areas of well-being, especially the physical one.
Take care and be balanced!
#softskills #productivity
Dear Friends, Happy New Year! ✨
I wish you motivation that doesn't burn out, career growth in the direction you want, and progress you can actually be proud of.
Interesting challenges, reasonable deadlines, clean architecture and teams you enjoy working with.
I hope you stay healthy, have enough energy, and keep your closest people nearby.
Take care, rest well, and have a great 2026.
Warm wishes,
Nelia
Stanford Engineering: Transformers & LLMs
The New Year holidays are over, and it's the perfect time to start learning something new.
Stanford Engineering has fully opened the course CME 295: Transformers & LLMs, which explains LLMs' core components, their limitations, and how to use them effectively in real-world applications.
The course instructors are engineers with work experience at Uber, Google, and Netflix, so they really know what they are talking about.
Topics covered in the course:
- Transformers architecture
- Decoding strategies & MoEs
- LLM fine-tuning & optimizations
- Results evaluation & Reasoning
- RAG & Agentic workflows
To really understand a topic, I need to know how everything is organized under the hood and what the core architectural principles are. That's why I really like such courses: they provide a structured and systematic view of the topic with all the necessary theory.
#ai #engineering
A2UI Protocol
Google has introduced a new protocol for AI: A2UI. It allows agents to generate rich user interfaces that can be displayed in different host applications. Lit, Angular, and Flutter renderers are currently supported; others are on the roadmap.
The main idea is that LLMs can generate a UI from a catalog of predefined widgets and send it as a message to the client.
The workflow looks as follows:
🔸 User sends a message to an AI agent
🔸 Agent generates A2UI messages describing the UI (structure + data in JSON Lines format)
{"surfaceUpdate":
{"surfaceId": "booking",
"components": [
{"id": "root", "component": {"Column": {"children": {"explicitList": ["header", "guests-field"]}}}},
{"id": "header", "component": {"Text": {"text": {"literalString": "Confirm Reservation"}, "usageHint": "h1"}}},
{"id": "guests-field", "component": {"TextField": {"label": {"literalString": "Guests"}, "text": {"path": "/reservation/guests"}}}}
]}}πΈ Messages stream to the client application
🔸 Client renders it using native components (Angular, Flutter, React, etc.)
🔸 User interacts with the UI, sending actions back to the agent
🔸 Agent responds with updated A2UI messages
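As a rough sketch of the client side (the Go types below are my own simplification of the JSON above, not the official A2UI schema), the declarative payload can simply be unmarshaled and dispatched to native widgets:

package main

import (
    "encoding/json"
    "fmt"
)

// Simplified, hypothetical mirror of the surfaceUpdate message above.
type Envelope struct {
    SurfaceUpdate SurfaceUpdate `json:"surfaceUpdate"`
}

type SurfaceUpdate struct {
    SurfaceID  string      `json:"surfaceId"`
    Components []Component `json:"components"`
}

type Component struct {
    ID        string                     `json:"id"`
    Component map[string]json.RawMessage `json:"component"` // widget type -> its properties
}

func main() {
    msg := `{"surfaceUpdate":{"surfaceId":"booking","components":[
      {"id":"header","component":{"Text":{"text":{"literalString":"Confirm Reservation"},"usageHint":"h1"}}}]}}`

    var env Envelope
    if err := json.Unmarshal([]byte(msg), &env); err != nil {
        panic(err)
    }
    // A real renderer would map each widget type to a native component here.
    for _, c := range env.SurfaceUpdate.Components {
        for widget := range c.Component {
            fmt.Printf("render %q as a %s widget\n", c.ID, widget)
        }
    }
}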
According to the article, the main benefits of the protocol are:
🔸 Security: no LLM-generated code; only a declarative description is passed to the client.
🔸 LLM-friendly: a flat structure that is easy for LLMs to generate incrementally.
🔸 Framework-agnostic: separation of the UI structure from the UI implementation.
Right now the project is in early public preview. It looks promising, especially once the REST protocol is supported (currently only A2A and AG-UI are supported).
#ai #news
Make Your Docs Ready for AI
Several years ago, we wrote documents for humans. Now we write documents for AI.
And, to be honest, machines require much more structured text than humans do. Poorly structured content leads to incorrect chunking and low-quality answers when the content is retrieved as RAG context.
Common recommendations for modern docs:
🔸 Use a structured format: Markdown, HTML, AsciiDoc.
🔸 Provide metadata: important dates, tags, context, and the document's goal.
🔸 Define a glossary: describe key terms and abbreviations.
🔸 Organize content hierarchy: create a clear structure using descriptive headings and subheadings.
🔸 Use lists: prefer bulleted or numbered lists over comma-separated enumerations.
🔸 Include text descriptions for visual information: duplicate important details from diagrams, charts, and screenshots in text.
🔸 Use a simple layout: avoid complex visuals and tables; prefer simple headings, lists, and paragraphs.
🔸 Keep related information together: design content so that each section contains enough context to be understood independently.
🔸 Describe external references: when referencing external concepts, provide brief context and explanation.
As you can see, the recommendations are quite simple. Moreover, it feels like I became a machine a long time ago: I've hated long unstructured paragraphs and preferred well-structured documents for years.
#ai #documentation
AI-Ready Repos: AGENTS.md
Structuring project documentation helps build a good knowledge base, but it's not enough to work effectively with the codebase. In practice, agents also need extra instruction files: AGENTS.md and skill files (SKILL.md). Let's start with AGENTS.md: what it is for and how to cook it properly.
AGENTS.md is a markdown file that provides context, instructions, and guidelines for AI coding agents working with the repo.
Its content is added to the initial prompt (system context) when an LLM session is created. What matters is that this is a standard widely adopted by different agents such as Cursor, Codex, Claude Code, and many others.
Common structure:
🔸 Project overview: project description, tech stack with particular versions, key folders and dependencies.
🔸 Commands: list of build and test commands with required flags and options.
🔸 Code Style: describe the preferred code style.
🔸 Testing: commands to run different types of tests and linters.
🔸 Boundaries: do's and don'ts (e.g., never touch secrets, env configs).
🔸 Extra: PR guidelines, git workflow details, deployment instructions, etc.
Common recommendations:
🔸 Keep it short (~150 lines)
🔸 Continuously update it as the code changes
🔸 Be specific; prefer samples over descriptions
🔸 Improve it iteratively by adding what really works and removing what doesn't
🔸 Use nested AGENTS.md files in large codebases. The agent reads the file closest to the code it is working on.
Sample:
# Tech Stack
- Language: Go 1.24+
- API: gRPC
- Database: PostgreSQL 18
- Message Queue: RabbitMQ 4.2, Apache Kafka 4.1.x
- Observability: OpenTelemetry, Jaeger, Prometheus, Grafana
- Security: JWT, OAuth2, TLS
- Deployment: Docker, Kubernetes, Helm
# Build & Test Commands
- Build: `go build -o myapp`
- Test: `go test`
# Boundaries
- Never touch `/charts/secrets/` files.
- Avoid adding unnecessary dependencies.
# PR Submission
## Title Format (MANDATORY)
Issue No: User-facing description
Samples from open-source projects:
- RabbitMQ Cluster Operator
- Kubebuilder
- Airflow
- Headlamp
Additionally, I recommend reading How to write a great agents.md: Lessons from over 2,500 repositories from the GitHub blog. I didn't quite get how they measured the effectiveness of the analyzed instructions, but the overall recommendations can still be helpful.
#ai #agents #engineering #documentation
Fighting the Calendar Chaos
The more responsibilities you have, the more meetings you get. Unfortunately, it's a typical story.
At some point there are so many meetings in the calendar that it's not clear how to manage the day.
To bring it back to a manageable state, I use the following tips:
🔸 Book slots in the calendar for my own working tasks
🔸 Decline meetings that I think I don't need to attend (no clear agenda, already delegated, etc.)
🔸 Use color-coded categories to classify meetings. I mostly sort them by importance and urgency, for example:
- Black: urgent and/or important, I must attend.
- Red: not urgent but important, need to attend.
- Orange: not so important, make sense to attend if I have time (e.g., some regular meetings or where I believe my team can make a decision without me).
- Gray: not needed, for my info that meeting is scheduled.
You can use whatever categories are convenient for you, but I don't recommend having more than 5: it's hard to keep them all in your head. Additionally, categorizing meetings helps you review the week and check whether there is enough time for important tasks. If not, it's a sign to reflect and change something.
#softskills #tips #productivity
AI-Ready Repos: Skills
Previously we checked how to cook AGENTS.md; today we'll look at another important part of agent configuration: skills.
An agent skill is a detailed workflow description or checklist for performing a specific task. Technically, a skill is a folder containing a `SKILL.md` file, scripts, and additional resources. Most importantly, it's a standard already supported by different AI agents.
Skills structure in a repo:
skills/
└── my-skill/
    ├── SKILL.md      # Required: instructions + metadata
    ├── scripts/      # Optional: executable code
    ├── references/   # Optional: documentation
    └── assets/       # Optional: templates, resources
SKILL.md is a standard md file used to define and document agent capabilities. It consists of:
- Metadata (~100 tokens): `name` (the skill name) and `description` (when to use the skill).
- Instructions (< 5000 tokens recommended): the main part of the file that is loaded when the skill is activated.
The overall flow is as follows:
1. Discovery: At startup the agent loads the name and description of each available skill.
2. Activation: When a task matches a skill's description, the agent reads the full `SKILL.md` instructions into context.
3. Execution: The agent follows the instructions, loading referenced files or executing bundled code if needed.
The main idea is to load instructions lazily to prevent prompt sprawl and context rot, where LLMs lose focus on the specific task and start making mistakes.
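To illustrate the lazy-loading idea, here is a toy sketch (not how any particular agent implements it; paths and matching logic are invented): only names and descriptions live in the catalog, and the full SKILL.md is read on activation.

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// Loaded at startup: cheap metadata only (~100 tokens per skill).
type SkillMeta struct {
    Name, Description, Dir string
}

// Activation: read the full SKILL.md into context only when the task matches.
func activate(meta SkillMeta) (string, error) {
    body, err := os.ReadFile(filepath.Join(meta.Dir, "SKILL.md"))
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    skills := []SkillMeta{
        {Name: "release-notes", Description: "prepare release notes for a tagged version", Dir: "skills/release-notes"},
    }
    task := "please prepare release notes for v1.4"

    for _, s := range skills {
        // Naive matching for illustration; real agents let the LLM decide.
        if strings.Contains(task, "release notes") {
            instructions, err := activate(s)
            if err != nil {
                fmt.Println("skill present in catalog but not readable:", err)
                return
            }
            fmt.Printf("activated %q with %d bytes of instructions\n", s.Name, len(instructions))
        }
    }
}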
What I like about skills is that a skill is an artifact with a clear specification and development lifecycle. It not only captures knowledge about tools and procedures, but also makes AI outputs more testable and predictable.
#ai #agents #engineering #documentation
The Words You Use
Have you ever been in meetings where people say the same thing in different words and still don't understand each other? Or maybe you've been in another situation: you explain something, but the audience doesn't get it; then someone else repeats the same idea and, surprisingly, it clicks.
Strangely enough, it all comes down to the words you use.
Words reflect the picture of the world in someone's head.
To make communication effective, pay attention to what a person says and how they say it, what definitions they use and what terms they prefer. If this is someone you work with regularly, you can go deeper and understand what is important to that person and why.
Then start using the same terms, metaphors, and definitions. If you fit into a person's picture of the world, it becomes much easier to share your ideas with them. You start speaking the same language with the same vocabulary and stop wasting time on pointless arguing.
#tips #communications
Can you see the forest for the trees?
How often have you seen developers stuck in code review discussing some method optimization or "code excellence"? Spending hours or even days trying to make code perfect? Did that really help build a well-architected solution?
Developers often get stuck in small details and completely lose sight of the bigger picture. That's a very common mistake I see in engineering teams. As a result, there are perfect classes or functions and a complete mess in the overall structure.
A proper review should always start with a bird's-eye view:
🔸 Component structure: Are changes implemented in the components/services that are actually responsible for this logic?
🔸 Module structure: Are changes in the modules you expect them to be in (public vs private, pkg vs internal, etc.)?
🔸 Public contracts: Review how your APIs will be used by other parties. Are they clear, convenient, easy to use, and easy to extend?
🔸 Naming: Are module, class, and function names clear and easy to understand? Do they duplicate existing entities?
🔸 Data model: Is the domain modeled correctly? Does the model follow the single responsibility principle?
🔸 Testing: Are the main cases covered? What about negative scenarios? Is there a proper failure-handling approach?
In most cases, there's no point in reviewing code details until the items above are finalized. The code will likely be rewritten, maybe even more than once.
That's why specific lines of code should be the last thing to check.
Details are cheap to fix.
Structure and contracts are not.
#engineering #codereview