<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Jiachen Yu</title>
    <link>https://www.yujiachen.com/blog/</link>
    <description>Jiachen Yu</description>
    <language>en</language>
    <lastBuildDate>Sun, 08 Feb 2026 00:00:00 GMT</lastBuildDate>
    <atom:link href="https://www.yujiachen.com/rss-en.xml" rel="self" type="application/rss+xml" />
  <item>
    <title>Thoughts on the Video Quality Evaluation</title>
    <link>https://www.yujiachen.com/thoughts-on-the-video-quality-evaluation/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/thoughts-on-the-video-quality-evaluation/</guid>
    <pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<p>Last year, I led the team behind OpusClip’s LLM-as-a-Judge system. Since we already published a post about <a href="https://medium.com/opus-engineering/a-scalable-llm-as-a-judge-framework-for-video-quality-evaluation-74612034bd1e">video quality evaluation</a>, I can share a short recap here. I am currently working on a separate evaluation track for AgentOpus, so I will close with a related question that I find personally interesting.</p>
<h2 id="what-we-did-last-year">What We Did Last Year</h2>
<p>You can check <a href="https://medium.com/opus-engineering/a-scalable-llm-as-a-judge-framework-for-video-quality-evaluation-74612034bd1e">the blog</a> for full details; below is a short summary.</p>
<p>Our goal was to build a video quality judge system that can score video quality across different rubrics.</p>
<h3 id="building-the-first-rubric">Building the First Rubric</h3>
<p>The first step was data collection. We collected 300 samples from both internal and external sources. With a target agreement rate of 80% and a 95% confidence level (±5% margin of error), the minimum required sample size is about 246. <sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
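<p>As a sanity check, that sample-size figure can be computed directly from the formula in the footnote; this is a minimal sketch using the stated parameters (z = 1.96, p = 0.8, e = 0.05):</p>

```python
import math

# Minimum sample size for estimating an agreement rate:
# n = z^2 * p * (1 - p) / e^2
z = 1.96   # 95% confidence level
p = 0.80   # expected agreement rate
e = 0.05   # margin of error

n = z**2 * p * (1 - p) / e**2
min_samples = math.ceil(n)  # about 246
```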
<p>At the same time, we defined rubrics for the judge. We set up a number of rubrics under the hook, content, visual, and audio categories. The first rubric we annotated manually was hook engagement. I asked everyone on the team, as well as external experts, to annotate videos. It was important that human annotators achieve an 80% agreement rate first.</p>
<p>The annotation result for each video is simple: Does the video meet the rubric? The result is 0 (does not meet), 1 (partially meets), or 2 (meets). As the annotation progressed, we needed to rebalance the dataset to ensure the number of 0/1/2 samples remained roughly equal.</p>
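<p>Rebalancing can be as simple as counting labels and topping up the underrepresented classes; the labels below are made up for illustration:</p>

```python
from collections import Counter

# 0 = does not meet, 1 = partially meets, 2 = meets (hypothetical labels)
labels = [0, 0, 1, 2, 2, 2, 1, 0, 2, 2]

counts = Counter(labels)
target = max(counts.values())
# How many more samples of each class to collect for a balanced set
needed = {cls: target - counts.get(cls, 0) for cls in (0, 1, 2)}
```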
<p>Once we had the annotation results, we tested different prompts on Gemini 2.5 Pro (and later Gemini 3 Pro). The prompt that aligned most with human annotations was selected as the “judge” for the current rubric.</p>
<h3 id="scaling-to-more-rubrics">Scaling to More Rubrics</h3>
<p>Once we knew how to build the judge for one rubric, scaling to others was straightforward. We sped up the annotation process via LLM pre-annotation, reducing the number of annotators needed for high-agreement samples. We also built an internal agent to iterate on prompts across different rubrics automatically.</p>
<p>In the end, we had an LLM-as-a-Judge system that gave quality scores for videos. A video’s quality score equals the sum of its per-rubric results (each 0, 1, or 2), yielding a score range of 0 to 2N for N rubrics.</p>
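<p>The aggregation is just a sum over per-rubric results; a minimal sketch with hypothetical rubric names:</p>

```python
# Each rubric result is 0 (does not meet), 1 (partially meets), or 2 (meets).
# Rubric names here are hypothetical.
rubric_results = {"hook_engagement": 2, "content_clarity": 1, "audio_quality": 2}

quality_score = sum(rubric_results.values())
max_score = 2 * len(rubric_results)  # 2N for N rubrics
```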
<p>We also cross-validated the judge by testing it on new samples and calculating the correlation between export rate and judge score. The results show that a higher judge score indicates a higher export rate.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/thoughts-on-the-video-quality-evaluation/en/image_1.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/thoughts-on-the-video-quality-evaluation/en/image_1.png" alt="" class="article-image" width="1080" height="602" />
</picture>
</p>
<p><em>Figure 1. Judge score vs. export rate on a holdout set. Each point represents a score bucket. The trend is monotonic: higher judge scores are associated with higher export rates.</em></p>
<h3 id="results">Results</h3>
<p>We used this system to curate clipping strategies produced by another system, and it improved our online export rate. The system also worked effectively on our B2B customer clips, improving other teams’ business metrics.</p>
<p>These results gave me confidence in rubric-based quality judging. At the same time, they raised a question I find personally interesting: can an evaluation infer a plausible execution path from the final result? I am still exploring this direction, but it seems promising for making agent evaluation more interpretable.</p>
<h2 id="what-am-i-curious-about-this-year">What Am I Curious About This Year?</h2>
<p>The core question is: can an evaluation do more than assign a score — can it also explain <em>why</em> the agent failed?</p>
<p>This matters for two reasons:</p>
<ol>
<li>If the agent is too weak (in other words, practically useless), a bare evaluation score is of little help, because it gives us no effective way to improve the agent itself.</li>
<li>On the other hand, if we can infer an execution path from a result, we can transfer that knowledge back into the agent, so that the agent ends up with the same knowledge as the evaluator.</li>
</ol>
<p>OpenAI has <a href="https://github.com/openai/skills/tree/main/skills/.experimental/codex-readiness-integration-test">a very interesting approach</a>: they ask Codex CLI to evaluate its own performance. What I learned from this is that we should try to put the evaluator and the agent at the same level, so that when the evaluator improves the agent, the agent can also improve the evaluator.</p>
<p>With that, we can build a data flywheel — a self-improving bootstrap for the agent. Then we can iterate on its performance quickly by feeding more data and more cases into the system.</p>
<p>I am still early in exploring this direction, but the underlying question feels worth asking: if your evaluator could hand the agent a concrete diagnosis instead of just a score, how would that change the way you build and iterate on agent systems?</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>n = z² · p · (1 − p) / e², with z = 1.96, p = 0.8, e = 0.05. This assumes approximately independent samples and uses raw agreement as the primary metric. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
  </item>
  <item>
    <title>How Engineers and PMs Ship LLM Features Together</title>
    <link>https://www.yujiachen.com/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/</guid>
    <pubDate>Mon, 15 Sep 2025 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<blockquote>
<p>Originally published on Medium via <strong>OpusClip Engineering</strong>; reposted here on my blog.<br>
Original: <a href="https://medium.com/opus-engineering/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together-147a36ab4089">Bridging the Gap: How Engineers and PMs Ship Winning LLM Features Together</a></p>
</blockquote>
<h1 id="how-engineers-and-pms-ship-winning-llm-features-faster-3-technical-decisions">How Engineers and PMs Ship Winning LLM Features Faster: 3 Technical Decisions</h1>
<p><strong>TL;DR:</strong></p>
<ul>
<li><strong>Prompts belong in configs, not code:</strong> Enable rapid iteration without deployments</li>
<li><strong>Variables go last:</strong> Save 90% on costs through KV-cache optimization</li>
<li><strong>Separate semantics from schema</strong>: PMs own meaning, engineers own structure</li>
</ul>
<p>The best LLM features aren’t built in silos. When engineers and PMs at <strong>OpusClip</strong> started collaborating on prompt architecture, <strong>iteration cycles dropped from days to minutes</strong>, <strong>API costs fell by 10x</strong>, and <strong>production prompts became more reliable</strong>. Here are the three technical decisions that made the biggest difference.</p>
<h3 id="pm-to-prompt-distance">PM to Prompt Distance</h3>
<p><strong>What it is:</strong> The number of hops — and the amount of interpretation — between your product requirement and the exact text/settings the model actually receives.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/en/image_1.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/en/image_1.png" alt="" class="article-image" width="1024" height="1572" />
</picture>
</p>
<p><a href="https://www.linkedin.com/posts/manus-im_product-to-prompt-distance-is-fast-becoming-activity-7349444736048320512-zjjS?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACdAQCkBye_l2SHKxHq1FoCJdVQJdLz5Boc">Tao Zhang first defines this metric in Manus.</a></p>
<p>Here’s what typically happens:</p>
<ol>
<li><strong>PM writes</strong>: “The assistant should be professional but approachable”</li>
<li><strong>Spec translates</strong>: “Use formal language with occasional casual phrases”</li>
<li><strong>Engineer implements</strong>: “You are a professional assistant. Maintain formal tone while being friendly.”</li>
<li><strong>Runtime adds context</strong>: “You are a professional assistant. Maintain formal tone while being friendly. Current user: {user_name}. Previous context: {history}”</li>
</ol>
<p>Each step adds interpretation and delay.</p>
<p><strong>Why it matters:</strong></p>
<ul>
<li><strong>Speed of iteration.</strong> Fewer hops = faster experiments.</li>
<li><strong>Quality &amp; intent fidelity.</strong> Each handoff (PM → spec → UI copy → template → runtime prompt) adds interpretation risk.</li>
<li><strong>Observability.</strong> When prompts are hidden in code, it’s hard to debug.</li>
</ul>
<p><strong>How to reduce distance:</strong> Ask your engineering teammates to keep prompts in dynamic configs or a prompt-management platform, not hard-coded in the codebase.</p>
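<p>A minimal sketch of what prompts-in-config can look like; the config name and fields here are assumptions, not a specific platform’s API:</p>

```python
import json

# Prompts live in config (here an inline JSON string standing in for a
# config file or prompt-management platform); names and fields are hypothetical.
PROMPT_CONFIG = json.loads("""
{
  "support_agent": {
    "model": "gpt-4o-mini",
    "temperature": 0.3,
    "template": "You are a customer support agent for {company}."
  }
}
""")

def render_prompt(name: str, **variables) -> str:
    # PMs edit the config entry; code only fills in the variables.
    cfg = PROMPT_CONFIG[name]
    return cfg["template"].format(**variables)
```

With this split, changing tone or wording means editing a config entry, not shipping a deploy.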
<h3 id="kv-cache">KV‑Cache</h3>
<p><strong>What it is:</strong> KV-Cache (Key-Value Cache) is like a smart notebook that lets LLMs remember their previous calculations. Without it, every time your chatbot generates a new word, it would need to re-read the entire conversation from scratch.</p>
<p>Imagine you’re having a conversation with a chatbot. Each time you send a new message, the chatbot needs to understand the entire conversation history to provide a relevant response. Without a KV-Cache, the model would have to re-read and re-process the whole conversation from the beginning every single time it generates a new word. This is incredibly inefficient and slow, leading to a frustratingly laggy user experience, especially with longer conversations or longer prompts.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/en/image_2.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/en/image_2.png" alt="" class="article-image" width="1024" height="276" />
</picture>
</p>
<p>KV-Cache has two major benefits:</p>
<ul>
<li><strong>Faster Response Times (Same response qualities but faster speeds):</strong> By avoiding redundant computations, the KV-Cache allows the LLM to generate responses much more quickly. For users, this means less waiting and a more fluid, natural interaction with your AI feature.</li>
<li><strong>Reduced Computational Costs (Same money but more users):</strong> Re-processing less data means using less computational power. This directly translates to lower operational costs for running your LLM, making your product more scalable.</li>
</ul>
<p><strong>Why PMs should care:</strong> modern LLM providers now charge 10x less for cached tokens than new tokens.</p>
<blockquote>
<p>For example, GPT-5 charges $1.25 for 1M input tokens, but it only charges $0.125 for 1M cached tokens.</p>
</blockquote>
<p><strong>How to leverage KV-cache:</strong> Always put variables at the end of your prompts.</p>
<p>❌ <strong>Bad prompt structure</strong> (minimal caching):</p>
<pre class="hljs"><code>User: {{user_name}}  
Question: {{user_question}}  
Conversation history: {{chat_history}}  
  
You are a customer support agent for TechCorp.  
Guidelines:  
- Be empathetic and professional  
- Check our knowledge base before answering  
- Escalate billing issues to human agents  
- Always verify account details first
</code></pre>
<p>✅ <strong>Good prompt structure</strong> (maximum caching):</p>
<pre class="hljs"><code>You are a customer support agent for TechCorp.  
Guidelines:  
- Be empathetic and professional    
- Check our knowledge base before answering  
- Escalate billing issues to human agents  
- Always verify account details first  
  
User: {{user_name}}  
Question: {{user_question}}  
Conversation history: {{chat_history}}
</code></pre>
<p>The static instructions get cached across all requests, while only the dynamic user content changes. For a support bot handling 10,000 daily conversations, this restructuring alone could save $200–300 or more per day.</p>
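<p>A back-of-envelope version of that saving, using the GPT-5 prices quoted above and an assumed 20,000-token cacheable prefix per request:</p>

```python
PRICE_PER_M = 1.25          # $ per 1M fresh input tokens (GPT-5 example above)
CACHED_PRICE_PER_M = 0.125  # $ per 1M cached input tokens

cacheable_tokens = 20_000   # assumed static instructions + shared context per request
requests_per_day = 10_000

fresh_cost = cacheable_tokens * requests_per_day * PRICE_PER_M / 1e6
cached_cost = cacheable_tokens * requests_per_day * CACHED_PRICE_PER_M / 1e6
daily_savings = fresh_cost - cached_cost  # dollars per day under these assumptions
```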
<h3 id="structured-output-response-schema">Structured Output / Response Schema</h3>
<p><strong>What it is:</strong> Instead of returning free‑form text, the model returns <strong>well‑formed JSON</strong>. You define a schema — fields, types, enums — and the model adheres to it.</p>
<p><strong>This is counterintuitive for many PMs: you don’t need to describe the format in your prompt at all.</strong></p>
<ul>
<li><strong>Structured output is an API-level feature</strong>, not a prompt trick. You turn it on <strong>in code</strong> by registering a schema/response format; then the runtime enforces it.</li>
<li><strong>Once it’s configured, don’t re-specify the format in the prompt.</strong> Tell the model <em>what</em> to fill, not <em>how to format</em> it.</li>
<li><strong>Consumer chatbot UIs (e.g., ChatGPT-style apps) generally don’t expose this.</strong> They’re optimized for human-readable text, not machine-parsable payloads.</li>
</ul>
<p><strong>We have to show some code here to explain it:</strong></p>
<pre class="hljs"><code>import OpenAI from &quot;openai&quot;  
import { z } from &quot;zod&quot;  
import { zodTextFormat } from &quot;openai/helpers/zod&quot;  
  
// 1) Your schema stays in code (Zod)  
const RelevancyItem = z.object({  
  clipId: z.string(),  
  relevant: z.boolean(),  
  relevantReason: z.string(),  
  advertisement: z.boolean(),  
})  
const RelevancyArray = z.array(RelevancyItem)  
type Relevancy = z.infer&lt;typeof RelevancyArray&gt;  
  
// 2) Build the semantics-only prompt (no format instructions)  
const prompt = promptTemplate.join(&quot;\n&quot;).replace(INPUT_REPLACE, input)  
  
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })  
  
// 3) Call Responses API and let the helper parse &amp; validate for you  
const response = await openai.responses.parse({  
  model: &quot;gpt-4o-mini&quot;,  
  input: [  
    { role: &quot;system&quot;, content: &quot;Judge search-trend relevancy using the definitions provided.&quot; },  
    { role: &quot;user&quot;, content: prompt },  
  ],  
  text: {  
    // zodTextFormat drives structured output and runtime validation  
    format: zodTextFormat(RelevancyArray, &quot;relevancy_results&quot;),  
  },  
})  
  
// 4) Already parsed &amp; validated:  
const results: Relevancy = response.output_parsed  
  
// 5) Post-filter on the validated, typed fields  
const relatedResults = results.filter(  
  (r) =&gt; r.relevant &amp;&amp; !r.advertisement  
)
</code></pre>
<p><strong>⚠️ Conflict Notice (VERY IMPORTANT)</strong></p>
<p>Once a <strong>schema/responseFormat</strong> is set in code, <strong>do not describe output formatting in the prompt</strong>. Mixing prompt-format rules with the engineer-defined schema creates two sources of truth and measurably hurts reliability:</p>
<ul>
<li><strong>Validation failures &amp; retries</strong> → higher latency/cost; occasional data loss if coerced.</li>
<li><strong>Instruction dilution</strong> → worse task quality (model juggles format vs. content).</li>
<li><strong>Downstream breakage</strong> → typed logic fails on “pretty” but invalid payloads.</li>
</ul>
<p><strong>Please remember:</strong> <em>Schema owns format; prompt owns semantics.</em> Keep prompts about what each field should contain (definitions, decision rules), <strong>not</strong> how to format.</p>
<p>❌ <strong>Your prompt should NOT say</strong>:</p>
<pre class="hljs"><code>Analyze these search results and return a JSON object with:  
- videoId: the video identifier  
- isRelevant: boolean indicating if it matches  
- relevanceReason: explanation string  
- isPaid: boolean for sponsored content  
Format as valid JSON with these exact field names.
</code></pre>
<p>✅ <strong>Your prompt SHOULD say</strong>:</p>
<pre class="hljs"><code>Analyze these search results for relevance to the user's query.  
  
For relevance assessment:  
- Consider semantic match, not just keyword overlap  
- Educational content is preferred over entertainment  
- Recent content (last 6 months) is more relevant  
  
For paid content detection:  
- Look for &quot;Sponsored&quot;, &quot;Ad&quot;, or &quot;#ad&quot; markers  
- Check if the channel name includes &quot;Official&quot; or &quot;Brand&quot;  
  
Provide clear reasoning for why content is or isn't relevant.
</code></pre>
<p>Notice: all semantics, zero formatting. The schema handles structure; your prompt handles meaning.</p>
<p>One common pitfall worth mentioning for PMs: changing field names in prompts without coordinating with engineering. If your engineer’s schema says videoId but your prompt mentions video_id or clip_id, you’re creating confusion that degrades performance.</p>
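<p>A lightweight guard can catch this drift automatically; this sketch (schema field names borrowed from the Zod example above, suspect variants made up) flags prompt text that mentions names outside the schema:</p>

```python
# Field names the schema actually defines (from the Zod example above)
SCHEMA_FIELDS = {"clipId", "relevant", "relevantReason", "advertisement"}

# Variants people tend to type by accident; this list is illustrative
SUSPECT_NAMES = ["videoId", "video_id", "clip_id", "isRelevant", "isPaid"]

def stray_field_names(prompt: str) -> list:
    """Return suspect names the prompt mentions that are not in the schema."""
    return [n for n in SUSPECT_NAMES if n in prompt and n not in SCHEMA_FIELDS]
```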
<h3 id="putting-it-all-together">Putting It All Together</h3>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/en/image_3.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/bridging-the-gap-how-engineers-and-pms-ship-winning-llm-features-together/en/image_3.png" alt="" class="article-image" width="1024" height="512" />
</picture>
</p>
<ol>
<li><strong>Low PM-to-prompt distance</strong> lets you iterate quickly on prompt improvements</li>
<li><strong>Optimized KV-cache</strong> makes those iterations cheaper to test at scale</li>
<li><strong>Proper structured output</strong> ensures reliable, parseable responses regardless of prompt changes</li>
</ol>
<h3 id="next-steps">Next Steps</h3>
<ol>
<li><strong>Audit your current setup</strong>: Where do your prompts live? Are variables at the end? Are you mixing format and content instructions?</li>
<li><strong>Start one conversation</strong>: Pick the highest-impact improvement and discuss with your engineering team. Most engineers appreciate PMs who understand these constraints.</li>
<li><strong>Measure the impact</strong>: Track iteration speed, cost per request, and error rates before and after changes.</li>
</ol>
<p><strong>Questions?</strong> Drop them in the comments. Our team loves talking about efficient engineering and PM collaboration.</p>
<h3 id="join-our-team">Join Our Team</h3>
<p>If these practical takeaways resonate with you and you’re passionate about solving complex technical challenges at scale, we’d love to hear from you.</p>
<p>Check out our open positions: <a href="https://www.opus.pro/careers">opus.pro/careers</a></p>
]]></content:encoded>
  </item>
  <item>
    <title>How to 1-on-1</title>
    <link>https://www.yujiachen.com/how-to-1-on-1/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/how-to-1-on-1/</guid>
    <pubDate>Sat, 15 Feb 2025 00:00:00 GMT</pubDate>
    <category>Essays</category>
<content:encoded><![CDATA[<h1 id="why-i-need-to-do-this">Why do I need to do this?</h1>
<p>It is your responsibility to drive your own career growth.
No one knows what you want unless you say it out loud, so our company requires your manager to help you with that through weekly 1-on-1s.</p>
<p>A more practical reason is that your relationship with your manager largely determines your work experience.</p>
<h1 id="mindsets">Mindsets</h1>
<h2 id="1-your-manager-is-one-of-your-resources">1. Your manager is one of your resources</h2>
<table>
<thead>
<tr>
<th>I’m one of my manager’s resources</th>
<th>My manager is one of my resources</th>
</tr>
</thead>
<tbody>
<tr>
<td>Don’t care about the company’s operations.</td>
<td>Discuss your dissatisfaction (if any) with the company’s performance in this 2-month cycle, and ask how the cofounders or teams plan to improve the situation.</td>
</tr>
<tr>
<td>Execute tasks assigned by your manager without asking questions.</td>
<td>Ask your manager to explain the task or project. If you think a project doesn’t have a good impact, tell your manager why and ask them to stop it.</td>
</tr>
<tr>
<td>Don’t talk about promotion or performance; wait for your manager to bring up promotion.</td>
<td>Ask about the standard for promotion and how to achieve it. Ask what preparations are needed for the next calibration.</td>
</tr>
<tr>
<td>The manager is right.</td>
<td>The manager may be wrong.</td>
</tr>
</tbody>
</table>
<h2 id="2-talk-about-things-you-don-t-know-about-each-other">2. Talk about things you don’t know about each other</h2>
<table>
<thead>
<tr>
<th>Talk about things you and your manager already know</th>
<th>Talk about things my manager doesn’t know or I need to know</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reporting work again right after you covered it in the daily standup.</td>
<td>Talk about the next thing or project you are planning to do and ask your manager to add it to the next 2-month OKRs. If you find there are too many tasks to deliver on time, make sure your manager knows you will miss the timeline.</td>
</tr>
<tr>
<td>Saying everything is fine, even when you have received bad feedback about your team.</td>
<td>Pass on valuable feedback about your team when you receive it from others.</td>
</tr>
<tr>
<td>Talking about the impact of a project that you just wrote up in a PSA.</td>
<td>Review the project you just launched and ask your manager for feedback.</td>
</tr>
<tr>
<td>Don’t talk too much.</td>
<td>Talk as much as you can.</td>
</tr>
</tbody>
</table>
<h1 id="other-tips">Other tips</h1>
<p><strong>Whenever you have some thoughts that you want to talk about in the next 1-on-1 meeting, send it to the meeting group and pin it.</strong></p>
<p><strong>Cancel a 1-on-1 meeting if you feel it’s unnecessary.</strong></p>
<p><strong>Don’t overuse 1-on-1. If a topic is public or needs to be public, talk about it in a public group</strong>.</p>
<p><strong>Don’t “manage up”</strong></p>
<blockquote>
<p>Another typical loophole of “manage up” is that it often incurs unnecessary one-on-one meetings. You may not agree with me on this but that’s alright. So what do I mean by “unnecessary”? Put it this way — if someone can explicate an issue with a text message of around 100 characters, they wouldn’t need to set up a one-on-one with me. In fact, I would rather they post it in the group chat so that I wouldn’t have to pass on the information. But the reality is that I’m quite often dragged into one-on-one meetings on such issues.
Why do people prefer one-on-one dialogues in person? I’ve found two main reasons. Firstly, a one-on-one dialogue creates information asymmetry that helps to “manage up” as it prevents potential critics from a third person. But if you post the same issue in a group chat, it’s very easy to see different opinions. Actually I don’t think most issues require one-on-one communication so long as it’s not classified. Secondly, one-on-one dialogues in person or over phone calls help to navigate negotiation strategies constantly. For example, if the person observes that you are pissed off by what he/she said earlier, they tend to pull back or tone down a bit.
There are many other cases of “manage up”. I used to hear this — a PR employee probed into the WeChat and Toutiao channels that his/her boss subscribe to and posted PR contents on those channels for the boss to notice.
I’m not sure how you feel about this. But I would feel that CEOs and leaders are quite troubled if they were to soak in an “manage up” environment where information is “retouched”. Suppose the information that directs to you is specially angled, or what we call “SEO-ed (search engine-optimized) information” at ByteDance, you will have to verify the information via other channels. That would be very counter-efficient.</p>
<p>From: &lt;<a href="https://sourcecodecap.com/code-class-post/bytedances-zhang-yiming-bring-outside-in-and-avoid-managing-up-how-to-protect-the-comp/">ByteDance’s Zhang Yiming: “Bring Outside in” and Avoid “Managing up”— How to Protect the Company from Diseconomies of Scale</a>&gt;</p>
</blockquote>
]]></content:encoded>
  </item>
  <item>
    <title>Simple vs Easy</title>
    <link>https://www.yujiachen.com/simple-vs-easy/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/simple-vs-easy/</guid>
    <pubDate>Sun, 19 Jan 2025 00:00:00 GMT</pubDate>
    <category>Essays</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<ul>
<li>
<p>An example of “simple” is a manual juicer—its principle is straightforward and easy to understand. An example of “easy” is an electric juicer—it’s very easy to use, but its internal mechanism is complex.</p>
</li>
<li>
<p>“Simple” in this context means something you can depend on because it’s straightforward. “Easy” means something that appears effortless from the outside, but is actually full of complex structures and uncertainty inside.</p>
</li>
<li>
<p>Another example of “simple”: Duan Yongping figured out that Apple was a good stock, then held it for over a decade without selling. During this time, he didn’t need to do anything else—he could play golf every day and still get 20x returns. Because Apple has high profits, a moat, and a user base that keeps expanding, the stock price will definitely rise in the long run. “Simple” means once you think it through, you don’t need to worry about anything else. When he explained his holding rationale to Warren Buffett, it took just one or two sentences. After that, there was nothing more to discuss about Apple.</p>
</li>
<li>
<p>A corresponding example of “easy”: Yesterday, a certain cryptocurrency skyrocketed. If someone had gone all-in or even leveraged before the surge, they could have made thousands or even tens of thousands times their investment overnight. Sounds easy, but the complexity in execution is enormous—you could easily get liquidated. And the uncertainty of chasing trends is huge. Even if someone came to you right now and told you exactly what to do to make money, how many people would believe them?</p>
</li>
<li>
<p>As for myself: I prefer “simple” things. Simple things are simple because they follow principles and don’t change based on the number of participants or their attitudes. I don’t like “easy” things—they’re hard to replicate, and even if you can, they’re far less reliable than “simple” things. Complex systems change based on the number and attitudes of participants, consuming time and energy with no guaranteed returns.</p>
</li>
</ul>
]]></content:encoded>
  </item>
  <item>
    <title>Learnings from ByteDance</title>
    <link>https://www.yujiachen.com/learnings-from-bytedance/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/learnings-from-bytedance/</guid>
    <pubDate>Thu, 02 Jan 2025 00:00:00 GMT</pubDate>
    <category>Essays</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<ol>
<li><strong>No boundaries</strong>: The market is a company’s boundary; demand is a product’s boundary; impact is work’s boundary.</li>
<li><strong>Efficiency above all</strong>: The core of competition is ROI, not cost level.</li>
<li><strong>Look forward</strong>: Competition only drives up costs. Revenue doesn’t come from competitors—it comes from users.</li>
<li><strong>Escape gravity</strong>: Every company has a sphere of competence. If another company’s gravitational pull is stronger than yours, don’t operate within that sphere.</li>
<li><strong>Supply and demand are the core of markets</strong>: The happiest situation in business is when one group needs your resources, while another group is willing to provide other high-value resources to exchange for capabilities from you. That’s why multi-sided platforms beat two-sided ones, and two-sided platforms beat tools. A tool is merely a localized means to an end.</li>
<li><strong>Focus on facts</strong>: Fantasy doesn’t help reshape reality. Reality has evolved based on the laws of physics—it is necessarily rational.</li>
</ol>
]]></content:encoded>
  </item>
  <item>
    <title>New Things vs Old Things</title>
    <link>https://www.yujiachen.com/new-things-vs-old-things/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/new-things-vs-old-things/</guid>
    <pubDate>Thu, 26 Dec 2024 00:00:00 GMT</pubDate>
    <category>Essays</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<p>It all started with a question I was asked today.</p>
<hr>
<h1 id="1">1</h1>
<p>Today, a recruiter at my company came to ask me: “Do you think OpenAI is a phenomenal success or just a coincidence (with low technical barriers)?” A candidate had told her that OpenAI’s technical barriers aren’t that high—they were just the first to take the plunge. Looking at it now, OpenAI’s progress has indeed slowed down, and from what I know, other companies are catching up fast, even surpassing OpenAI in some models. But in my mind, OpenAI remains special. I believe OpenAI’s value lies not in its commercial success, but in its technology. Besides, while OpenAI may not be extremely successful commercially, it certainly can’t be called a failure.</p>
<p>Those who followed OpenAI early on would know that GPT and GPT-2 were seen as mere “toys” by outsiders, while Google’s BERT was considered the strongest language model at the time. When GPT-3 came out, people found it interesting but still thought it was useless. It wasn’t until GPT-3.5 that people saw the possibility of AGI. Before GPT-3.5, OpenAI had been persisting with the GPT architecture for over four years. Even with substantial funding from Musk, Altman, and others, it was incredibly difficult for a startup to keep investing in a direction that might not succeed.</p>
<p>In my view, OpenAI’s persistence is what truly differentiates it from other companies. Other companies build large language models because of commercial necessity; OpenAI builds them because it believes this path can create a future with AGI. This is also why what other companies do feels “meaningless”—after OpenAI released GPT, if a newcomer wants to contribute to humanity, they shouldn’t choose to work on large language models. Unless you have insights vastly different from the mainstream, this field has already been thoroughly explored by OpenAI. A newcomer should independently think about what other important fields are being overlooked by the market and remain unexplored. Those are where new value and technology will be created. In other words, <strong>as a newcomer, you should prioritize doing new things rather than rushing to do old things</strong>.</p>
<h1 id="2">2</h1>
<p>Some might disagree with the above—just do whatever makes money, right? But I believe that markets without competition are where the profits are. In the crowded LLM space, unless a clear winner emerges, everyone ends up working at cost.</p>
<p>Take China’s “AI Four Dragons” (SenseTime, Megvii, Yitu, CloudWalk) as an example. Their technology looked hard to replicate, but their offerings were highly homogeneous. Eventually, after using their solutions for a while, big companies turned to building in-house, leaving the Four Dragons to pursue ToB and ToG routes. Of course, computer vision also has the problem of limited commercial value, but that’s another story.</p>
<p>Another example is Xiaomi. Xiaomi always enters markets after demand has been validated, which is why they make “heartfelt” high-value products. There’s nothing wrong with this business model, and Xiaomi is a company with great social responsibility. But the reality is that Xiaomi’s hardware profit margins are extremely low, leaving little room for innovation.</p>
<p>So circling back—is building LLMs profitable? Personally, I think it’s a money-losing business. The only exception seems to be ByteDance’s Doubao, because ChatGPT can’t serve China, and Chinese users need a good LLM app.</p>
<h1 id="3">3</h1>
<p>What the big tech companies are doing with LLMs today is the “old thing” that OpenAI already did. What OpenAI did with GPT years ago, with multimodality in the past two years, and recently with post-training—those are the “new things.” I personally prefer doing “new things,” for the following reasons.</p>
<p><strong>First, doing new things is meaningful.</strong> OpenAI’s years of persistence on AGI led to today’s AI application explosion. New technology creates new demands and resources, while old things mostly just redistribute existing resources.</p>
<p><strong>Second, maximizing returns.</strong> If a new thing succeeds, you become the industry leader with substantial commercial profits, and abundant capital allows you to do many things. In contrast, profits from old things are always limited because markets and user demands gradually solidify, and competitors want a piece of your pie. The success rate of new things is actually quite low—failure is the more common outcome. But even failure can bring significant returns. In the 1990s, a company called General Magic tried to create a portable touchscreen internet device—something that looks very much like an iPhone today. As you might expect, 1990s technology couldn’t produce an iPhone. General Magic failed, but many of its employees went on to have successful careers. Tony Fadell left General Magic and eventually joined Apple, leading the design and development of the iPod. Kevin Lynch led the development of the Apple Watch. Andy Rubin created an operating system called Android, and his startup was later acquired by Google.</p>
<p><strong>Third, leaders have a huge advantage in copying others.</strong> While the “new things” leaders create will be copied by followers into “old things,” leaders can also copy the “micro-innovations” of followers. Examples abound—from Tencent to Apple to OpenAI. All followers of these companies are merely helping leaders validate demands they didn’t have time to validate. Once validated, leaders can copy at minimal cost, and even take over the followers’ original customers.</p>
<p><strong>Fourth, being misunderstood is actually a huge advantage.</strong> Because short-term profit-seekers can’t see you, you can easily filter out employees who don’t share your vision. Because what you’re doing hasn’t become a trend, you can acquire resources from suppliers at very low cost. And because there’s no competition, you won’t have too much time pressure—the rhythm and feel of doing “new things” will be much better.</p>
<h1 id="4">4</h1>
<p>If we agree that doing “new things” is better, what practices can we adopt?</p>
<ol>
<li>
<p><strong>Career:</strong> Prefer companies that do “new things.” This way, you not only gain more personal growth but also have a small chance of getting significant financial returns.</p>
</li>
<li>
<p><strong>Investing:</strong> Prefer companies that do “new things.” These companies have obvious characteristics—they focus on things others don’t want to do, so their stock prices are low. But once these things succeed, they’ll have great commercial value, and stock prices will rise significantly. However, judging things and companies isn’t just talk—I don’t recommend spending too much time or money on investing.</p>
</li>
<li>
<p><strong>Daily work:</strong> Shift your evaluation criteria from “completeness” to “innovation.” I used to have a “good student” mentality: I could get 80% of the results in 20% of the time, but I’d still spend 100% of the time to get 100% of the results. The extra 20% is often trivial or unimportant. Now, for me, quickly delivering results and validating the value of ideas is more important. If your 80% is important enough, many people will be willing to do the remaining 20%.</p>
</li>
</ol>
<hr>
<p>What you don’t know is always more valuable than what you already know.</p>
]]></content:encoded>
  </item>
  <item>
    <title>Shape Your Mac</title>
    <link>https://www.yujiachen.com/shape-your-mac/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/shape-your-mac/</guid>
    <pubDate>Sun, 14 Jul 2024 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<h1 id="foreword">📖 Foreword</h1>
<p>Have you ever furnished your own house? To be honest, I haven’t, since I always rent. However, I do arrange things in my home. Right now, I’m lying on my sofa and writing this post. To make it more engaging, I decided to include a picture of my current setup.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_1.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_1.png" alt="" class="article-image" width="1080" height="810" />
</picture>
</p>
<p>I moved to Shanghai less than two months ago, but I’ve already made this place feel familiar. This has allowed me to live happily and somewhat “efficiently” — meaning the room’s arrangement perfectly suits my daily routine.</p>
<p>It’s the same with a computer, especially a work computer. We don’t own our work computers; they belong to the company. But since we spend most of our screen time on them, it’s important to make sure they make us feel happy and efficient.</p>
<h1 id="basic-settings">🔧 Basic Settings</h1>
<h2 id="dotfiles">🗂️ Dotfiles</h2>
<p><a href="https://github.com/yujiachen-y/dotfiles">https://github.com/yujiachen-y/dotfiles</a></p>
<p>Dotfiles can be considered the metadata of your computer. Personally, I use <a href="https://github.com/yujiachen-y/dotfiles/blob/main/macos/Brewfile">a Brewfile</a> to manage my Mac’s dependencies. As long as an app can be installed via Homebrew, you should add it to your Brewfile. In fact, only a few apps cannot be installed using Homebrew.</p>
<p>I also store my <a href="https://github.com/yujiachen-y/dotfiles/blob/main/zsh/.zshrc">.zshrc</a> and <a href="https://github.com/yujiachen-y/dotfiles/blob/main/.vimrc">.vimrc</a> files in Dotfiles, so that I don’t need to configure my zsh and vim repeatedly.</p>
<p>Additionally, it’s common practice to store <a href="https://github.com/yujiachen-y/dotfiles/blob/main/macos/system_settings.sh">Mac settings</a> in dotfiles. However, you still need to go to System Settings to configure some options that can’t be changed via the command line. BTW, every time I set up a Mac, there are often UI changes and new options in System Settings. So the manual setup helps me discover what changes Apple has made lol.</p>
<h2 id="command-line-config">💻 Command Line Config</h2>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_2.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_2.png" alt="" class="article-image" width="1080" height="719" />
</picture>
</p>
<p>I use <a href="https://www.warp.dev/">Warp</a> as my terminal emulator. However, I believe the differences between Terminal.app, iTerm2, and Warp are not significant. For me, three aspects of Warp stand out:</p>
<ol>
<li>
<p><strong>Suggestions</strong>: I find them better than <a href="https://github.com/zsh-users/zsh-autosuggestions">zsh-autosuggestions</a>.</p>
</li>
<li>
<p><strong>AI Copilot</strong>: In my opinion, all AI copilot products can be replaced by directly asking ChatGPT. (Update: Warp has agent mode too.)</p>
</li>
<li>
<p><strong>Built-in Shell Prompt</strong>: Although it’s not as flexible or functional as <a href="https://github.com/romkatv/powerlevel10k">p10k</a>, I’m considering reverting to my old <a href="https://github.com/yujiachen-y/dotfiles/blob/main/zsh/.p10k.zsh">.p10k.zsh</a> setup.</p>
</li>
</ol>
<p>Have you ever checked how many lines of code are in your <code>.zshrc</code> file? What do those lines do? I checked mine, and it’s only <a href="https://github.com/yujiachen-y/dotfiles/blob/3a77776963e3fee4a3df70c7e29230d40c1c2419/zsh/.zshrc">15 lines</a>. Typically, a <code>.zshrc</code> file is full of <a href="https://github.com/ohmyzsh/ohmyzsh"><code>oh-my-zsh</code></a> configurations, but I only keep a few essential plugins and delete the unnecessary ones. Here are some <code>oh-my-zsh</code> plugins worth mentioning:</p>
<ul>
<li>
<p><strong><a href="https://github.com/ohmyzsh/ohmyzsh/blob/master/plugins/dotenv/README.md">dotenv</a></strong>: Automatically loads your .env files into environment variables.</p>
</li>
<li>
<p><strong><a href="https://github.com/ohmyzsh/ohmyzsh/blob/master/plugins/git/README.md">git</a></strong>: I use commands like <code>gaa</code> (<code>git add --all</code>), <code>gcmsg</code> (<code>git commit --message</code>), and <code>gcn!</code> (<code>git commit --verbose --no-edit --amend</code>) every day.</p>
<ul>
<li>A git tip: use <code>gcn!</code> to avoid cluttering a PR with noisy fix-up commits.</li>
</ul>
</li>
<li>
<p><strong><a href="https://github.com/ohmyzsh/ohmyzsh/blob/master/plugins/z/README.md">z</a></strong>: Access your most visited directories with very few keystrokes.</p>
</li>
</ul>
<p>As mentioned earlier, I use Warp for its built-in shell prompt, which looks like this:</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_3.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_3.png" alt="" class="article-image" width="1080" height="719" />
</picture>
</p>
<p>However, it’s not very flexible and cannot be used in other terminal emulators. If you use Warp’s shell prompt, you might encounter issues in VSCode or other text editors and IDEs. <a href="https://github.com/warpdotdev/Warp/issues/257#issuecomment-1274198741">Embedding Warp into VSCode is particularly challenging</a>.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_4.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_4.png" alt="" class="article-image" width="1080" height="719" />
</picture>
</p>
<p>I recommend using <a href="https://github.com/romkatv/powerlevel10k">p10k</a> as your shell prompt. It’s highly flexible and can be used across different terminal emulators since it’s not tied to any specific one.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_5.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_5.png" alt="" class="article-image" width="765" height="516" />
</picture>
</p>
<p>I want to mention 2 of my favorite macOS commands here: <a href="https://ss64.com/mac/pbpaste.html">pbpaste</a> and <a href="https://ss64.com/mac/pbcopy.html">pbcopy</a>. There are already some useful examples on the manual pages.</p>
<h2 id="system-settings">⚙️ System Settings</h2>
<blockquote>
<p>As mentioned earlier, some settings below can be incorporated into your dotfiles.</p>
</blockquote>
<p>Implementing these settings will enhance your productivity:</p>
<ul>
<li><strong>Prevent Automatic Sleeping</strong></li>
</ul>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_6.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_6.png" alt="" class="article-image" width="1080" height="767" />
</picture>
</p>
<ul>
<li><strong>Three Finger Drag</strong></li>
</ul>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_7.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_7.png" alt="" class="article-image" width="1080" height="767" />
</picture>
</p>
<p><a href="https://youtu.be/-Fy6imaiHWE" title="Share link">How to select or drag using three fingers on your MacBook track pad</a></p>
<ul>
<li><strong>Reduce Keyboard Delay Time</strong>: Set your keyboard delay time to the shortest possible setting. As coders, we can’t afford to waste time waiting for keyboard delays.</li>
</ul>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_8.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_8.png" alt="" class="article-image" width="1080" height="767" />
</picture>
</p>
<ul>
<li><strong>Disable Double-Space Period</strong>: This feature can be annoying, so it’s best to turn it off.</li>
</ul>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_9.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_9.png" alt="" class="article-image" width="1080" height="767" />
</picture>
</p>
<ul>
<li><strong>Swap Caps Lock and Command Keys</strong>: The keys you use most frequently should be more accessible. Swap the Caps Lock key with the Command key.</li>
</ul>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_10.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_10.png" alt="" class="article-image" width="1080" height="796" />
</picture>
</p>
<ul>
<li><strong>Disable Press and Hold</strong>: This setting can only be disabled via the command line. If you use <a href="https://opusclip.larksuite.com/wiki/Uu5NwIeBFilqHRk2dhSu5MNYsUg#LlCwdefVKo7DO3xvxUquBptVs9f">Vim</a>, you might not need to disable this setting, as pressing and holding a key is not considered good practice in Vim.</li>
</ul>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_11.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_11.png" alt="" class="article-image" width="464" height="170" />
</picture>
</p>
<pre class="hljs"><code>defaults write -g ApplePressAndHoldEnabled -bool <span class="hljs-literal">false</span>
</code></pre>
<h1 id="apps">📱 Apps</h1>
<blockquote>
<p>As I mentioned earlier, all apps listed here should be managed by your Brewfile.</p>
</blockquote>
<h2 id="raycast">🌟 Raycast</h2>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_12.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_12.png" alt="" class="article-image" width="1080" height="701" />
</picture>
</p>
<p><a href="https://www.raycast.com/store">Raycast is an everything store for Mac shortcuts.</a> Here are some shortcuts I often use:</p>
<table>
<thead>
<tr>
<th>Clipboard History</th>
<th>Search Emoji</th>
<th>Window Management</th>
<th>Music Control</th>
<th>Kill Process</th>
<th>Search Browser Tabs</th>
<th>Reminder</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_13.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_13.png" alt="" class="article-image" width="1080" height="661" />
</picture>
</td>
<td>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_14.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_14.png" alt="" class="article-image" width="1080" height="661" />
</picture>
</td>
<td>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_15.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_15.png" alt="" class="article-image" width="1080" height="661" />
</picture>
</td>
<td>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_16.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_16.png" alt="" class="article-image" width="1080" height="661" />
</picture>
</td>
<td>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_17.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_17.png" alt="" class="article-image" width="1080" height="661" />
</picture>
</td>
<td>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_18.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_18.png" alt="" class="article-image" width="1080" height="661" />
</picture>
</td>
<td>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_19.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_19.png" alt="" class="article-image" width="1080" height="661" />
</picture>
</td>
</tr>
</tbody>
</table>
<p>Raycast is super useful with your custom hotkeys.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_20.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_20.png" alt="" class="article-image" width="1080" height="702" />
</picture>
</p>
<h2 id="alttab">🌟 AltTab</h2>
<p>On a Mac, you can use Command + Tab to switch between apps. However, the built-in app switcher doesn’t allow you to switch between windows within the same application, which can be inconvenient if you need to frequently switch between windows of a single app.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_21.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_21.png" alt="" class="article-image" width="1080" height="701" />
</picture>
</p>
<p><a href="https://alt-tab-macos.netlify.app/">AltTab</a> solves this problem by allowing you to switch between windows across different Mac desktops. This way, you don’t need to use a mouse to switch between windows and desktops.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_22.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/shape-your-mac/en/image_22.png" alt="" class="article-image" width="1080" height="701" />
</picture>
</p>
<h1 id="backup">💾 Backup</h1>
<blockquote>
<p>There are two types of people:</p>
<ul>
<li>
<p>Those who do backups</p>
</li>
<li>
<p>Those who will do backups</p>
</li>
</ul>
<p>Any data you own that you haven’t backed up is data that could be gone at any moment, forever. Here we will cover some good backup basics and the pitfalls of some approaches.</p>
<h2 id="3-2-1-rule"><strong>3-2-1 Rule</strong></h2>
<p>The <a href="https://www.us-cert.gov/sites/default/files/publications/data_backup_options.pdf">3-2-1 rule</a> is a generally recommended strategy for backing up your data. It states that you should have:</p>
<ul>
<li>
<p>at least <strong>3 copies</strong> of your data</p>
</li>
<li>
<p><strong>2</strong> copies in <strong>different mediums</strong></p>
</li>
<li>
<p><strong>1</strong> of the copies being <strong>offsite</strong></p>
</li>
</ul>
<p>The main idea behind this recommendation is not to put all your eggs in one basket. Having 2 different devices/disks ensures that a single hardware failure doesn’t take away all your data. Similarly, if you store your only backup at home and the house burns down or gets robbed, you lose everything; that’s what the offsite copy is there for. Onsite backups give you availability and speed; offsite copies give you resiliency should a disaster happen.</p>
<p>From <a href="https://missing.csail.mit.edu/2019/backups/">Backups</a></p>
</blockquote>
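<p>The 3-2-1 rule quoted above can be checked mechanically. Here is a toy Python sketch; the <code>Copy</code> class and the medium names are illustrative, not part of any real backup tool:</p>

```python
from dataclasses import dataclass

# Toy model of one backup copy; the fields mirror the 3-2-1 criteria.
@dataclass
class Copy:
    medium: str    # e.g. "internal-ssd", "external-hdd", "cloud"
    offsite: bool

def satisfies_3_2_1(copies: list) -> bool:
    return (
        len(copies) >= 3                              # at least 3 copies
        and len({c.medium for c in copies}) >= 2      # on 2 different mediums
        and any(c.offsite for c in copies)            # 1 copy offsite
    )

plan = [
    Copy("internal-ssd", offsite=False),  # working copy
    Copy("external-hdd", offsite=False),  # local backup
    Copy("cloud", offsite=True),          # offsite backup
]
print(satisfies_3_2_1(plan))      # True
print(satisfies_3_2_1(plan[:2]))  # False: only 2 copies, none offsite
```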
<h1 id="reflecting-thoughts">💭 Reflecting thoughts</h1>
<ul>
<li>
<p>Routinely review your workflow and try to improve your efficiency.</p>
</li>
<li>
<p><a href="https://www.catb.org/~esr/writings/taoup/html/ch01s06.html#id2878263">Fold knowledge into data, so program logic can be stupid and robust.</a></p>
</li>
<li>
<p><a href="https://www.catb.org/~esr/writings/taoup/html/ch01s06.html#id2878666">Programmer time is expensive; conserve it in preference to machine time.</a></p>
</li>
<li>
<p><a href="https://mcluhangalaxy.wordpress.com/2013/04/01/we-shape-our-tools-and-thereafter-our-tools-shape-us/">We shape our tools and thereafter our tools shape us</a></p>
</li>
</ul>
]]></content:encoded>
  </item>
  <item>
    <title>The Permission Model Myth</title>
    <link>https://www.yujiachen.com/the-permission-model-myth/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/the-permission-model-myth/</guid>
    <pubDate>Sun, 25 Feb 2024 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<h1 id="basic-concepts">Basic Concepts</h1>
<p>Depending on context, “permission” can mean different things. This article focuses on authorization in computing. Related concepts include:</p>
<ul>
<li><strong>Authentication (AuthN)</strong>: The process of confirming the identity of a person or entity. In computer security, this typically involves verifying whether a user or system is truly who it claims to be. Authentication can be performed through various methods, including passwords, biometrics, smart cards, or digital certificates. Authentication is the first step in the access control process—only after successful identity verification will the system consider granting access permissions. Related technical protocols and frameworks include OAuth, OpenID, and SAML.</li>
<li><strong>Authorization (AuthZ)</strong>: After a user’s identity has been verified, the system needs to decide which resources the user can access and what operations they can perform. Authorization is the process of defining and managing access permissions, determining what data or resources a verified user can view, use, modify, or delete. Its goal is to ensure that only users with appropriate permissions can access specific resources or data. Related authorization models include Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC), and others.</li>
<li><strong>Access Control</strong>: A broader concept that encompasses both authentication and authorization. It refers to the various methods and technologies used to restrict access to and use of systems and data.</li>
</ul>
<h1 id="use-cases">Use Cases</h1>
<p>Below are some access-control scenarios commonly encountered in engineering work.</p>
<h2 id="file-system-permissions">File System Permissions</h2>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_1.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_1.png" alt="" class="article-image" width="1080" height="214" />
</picture>
</p>
<p>The <code>drwx------</code> in the image is the mode string for a file-system entry. The first character indicates the file type (<code>-</code> for regular files, <code>d</code> for directories), followed by three groups of three characters representing read, write, and execute permissions for owner, group, and others. Some entries also include <code>@</code> or <code>+</code>. <code>@</code> means the file or directory has extended attributes (not discussed here). <code>+</code> means an Access Control List (ACL) is present. ACLs provide finer-grained rules beyond standard rwx bits, with per-user/per-group configuration. Using <code>ls -le</code>, we can inspect these ACL entries:

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_2.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_2.png" alt="" class="article-image" width="1080" height="349" />
</picture>
</p>
<p>For more details, see <code>man chmod</code>. Two observations:</p>
<ul>
<li>The classic Unix permission model is coarse-grained, supporting only three operations: read, write, and execute. Operating systems therefore added ACL mechanisms on top as a patch.</li>
<li>macOS ACL rule format is a triplet of: <code>(user, allow/deny, operation)</code>.

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_3.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_3.png" alt="" class="article-image" width="1080" height="810" />
</picture>
Screenshot from macOS 14.2 manual</li>
</ul>
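<p>The mapping between permission bits and the mode string described above can be reproduced with Python’s standard <code>stat</code> module, which is a convenient way to experiment without touching real files:</p>

```python
import stat

# 0o040700: directory bit (0o040000) plus rwx for the owner only.
print(stat.filemode(0o040700))  # drwx------

# 0o100644: regular-file bit plus rw- for owner, r-- for group and others.
print(stat.filemode(0o100644))  # -rw-r--r--

# The low nine bits split into three rwx groups: owner, group, others.
mode = 0o750
owner = (mode >> 6) & 0o7   # 7 -> rwx
group = (mode >> 3) & 0o7   # 5 -> r-x
other = mode & 0o7          # 0 -> ---
print(owner, group, other)  # 7 5 0
```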
<h2 id="database-permissions">Database Permissions</h2>
<p>Database systems provide permissions (and permission combinations), allowing administrators to grant and audit access for users or user groups.</p>
<p>The screenshots below show one example of granting and inspecting permissions:

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_4.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_4.png" alt="" class="article-image" width="1080" height="114" />
</picture>
Granting: Granting SELECT permission on the ‘lark.message’ table to the user ‘admin’.

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_5.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_5.png" alt="" class="article-image" width="1080" height="232" />
</picture>
Viewing a user’s global permissions: The ‘admin’ user has no global permissions.

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_6.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_6.png" alt="" class="article-image" width="1080" height="152" />
</picture>
Viewing a user’s permissions on a specific table: The ‘admin’ user has SELECT permission on the ‘lark.message’ table.</p>
<p>From the examples above, we also observe:</p>
<ol>
<li>MySQL authorization operates at multiple levels—a user can have global permissions and table-level permissions.</li>
<li>MySQL’s permission control also follows the <code>(user, allow/deny, operation)</code> triplet pattern.</li>
</ol>
<h1 id="access-control-models">Access Control Models</h1>
<p>Several commonly used access control models exist in the industry: Discretionary Access Control (DAC), Mandatory Access Control (MAC), Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC), and Policy-Based Access Control (PBAC).</p>
<blockquote>
<p>Note: While these access control models represent a general consensus, there is currently no unified industry standard for determining which access control model a real-world solution belongs to. Therefore, the descriptions below serve mainly as conceptual explanations rather than definitive definitions and distinctions.</p>
</blockquote>
<h2 id="dac">DAC</h2>
<p>DAC is the most basic access control model, allowing resource owners to control access to their resources. In the DAC model, users can assign access permissions to other users based on their own judgment.</p>
<p>The file system permissions and database permissions examples above are actually DAC systems—they directly specify “who can perform what operations,” such as the file system’s <code>User 1 allow read</code> and the database’s <code>GRANT SELECT ON lark.message TO 'admin'@'%';</code>.</p>
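<p>The <code>(user, allow/deny, operation)</code> triplet shared by both examples can be captured in a few lines. This is a toy evaluator, not how macOS or MySQL actually implement it: here entries are checked in order, the first match wins, and anything unmatched is denied by default:</p>

```python
from typing import List, Tuple

# Each entry is a (user, "allow"/"deny", operation) triplet, echoing the
# macOS ACL and MySQL GRANT examples. First matching entry decides.
ACL = List[Tuple[str, str, str]]

def check(acl: ACL, user: str, operation: str) -> bool:
    for entry_user, effect, entry_op in acl:
        if entry_user == user and entry_op == operation:
            return effect == "allow"
    return False  # default deny when no entry matches

acl = [
    ("alice", "deny",  "write"),
    ("alice", "allow", "read"),
    ("admin", "allow", "select"),  # cf. GRANT SELECT ... TO 'admin'
]
print(check(acl, "alice", "read"))   # True
print(check(acl, "alice", "write"))  # False (explicit deny)
print(check(acl, "bob",   "read"))   # False (default deny)
```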
<h2 id="rbac">RBAC</h2>
<p>RBAC is an access control model based on user roles. It allows system administrators to assign users to different roles according to organizational functions and responsibilities, with each role having a set of predefined access permissions. RBAC simplifies permission management because administrators only need to manage the relationship between roles and permissions, rather than each user’s individual permissions.</p>
<p>The following article describes how to configure RBAC in SAP:</p>
<p><a href="https://community.sap.com/t5/human-capital-management-blogs-by-sap/sap-commissions-implementing-authorization-with-user-roles-rbac/ba-p/13554527">SAP Commissions - Implementing Authorization With User Roles (RBAC)</a></p>
<p>In many RBAC implementations, user groups (or roles) bind users to predefined permissions.

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_7.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_7.png" alt="" class="article-image" width="1080" height="650" />
</picture>
</p>
<p>RBAC has the advantages of simplified management and flexible configuration, but in practice it also exhibits the following drawbacks:</p>
<ol>
<li><strong>Role explosion</strong>: In complex organizations, a large number of roles may need to be created to cover all access requirements, which can make role management complex and difficult to maintain.</li>
<li><strong>Complex initial setup</strong>: Correctly implementing an RBAC system may require significant upfront planning and configuration, especially when migrating existing permissions to the RBAC model.</li>
<li><strong>Limited flexibility</strong>: While RBAC improves permission management efficiency, in some cases it may limit flexibility for individual users’ specific needs.</li>
<li><strong>Performance issues</strong>: In large systems with many users, roles, and permission rules, permission checks may cause performance problems.</li>
<li><strong>Management challenges</strong>: As organizations grow, managing and updating roles and their permissions can become complex, especially without automated tooling support.</li>
</ol>
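<p>Drawbacks aside, the core of RBAC is a single level of indirection: users map to roles, and roles map to permissions. A minimal sketch, with role and permission names that are purely illustrative:</p>

```python
# Roles bundle predefined permissions; administrators manage these
# two mappings instead of per-user permission lists.
ROLE_PERMISSIONS = {
    "viewer": {"report:read"},
    "editor": {"report:read", "report:write"},
    "admin":  {"report:read", "report:write", "user:manage"},
}

USER_ROLES = {
    "alice": {"editor"},
    "bob":   {"viewer"},
}

def has_permission(user: str, permission: str) -> bool:
    # A user holds a permission if any of their roles grants it.
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )

print(has_permission("alice", "report:write"))  # True
print(has_permission("bob",   "report:write"))  # False
```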
<h2 id="abac">ABAC</h2>
<p>ABAC is a more flexible and dynamic access control model that determines access permissions based on attributes of the requester (such as age, position, etc.), resource attributes, and environmental conditions (such as time). ABAC provides more fine-grained access control, supports more complex security policies, and is suitable for scenarios requiring highly customized access control.</p>
<p>The main advantage of ABAC over RBAC is flexibility at the individual record level. RBAC grants are group-based, so members in one group usually share permissions. In a healthcare system, if each nurse should only see records for assigned patients, pure RBAC may require one role/group per nurse. With ABAC, one rule like <code>record.assignedNurse == currentUser</code> can express this directly.</p>
<p>A common pattern is hybrid: use RBAC to assign rule sets to users, then use ABAC within those rules to evaluate user/resource/environment attributes for the final decision.</p>
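<p>The nurse example and the hybrid pattern can be sketched together. The names below (<code>Record</code>, <code>USER_RULES</code>, the nurse identifiers) are hypothetical; the point is that one attribute rule replaces a per-nurse role, while an RBAC-style mapping still decides which rules apply to whom:</p>

```python
from dataclasses import dataclass

@dataclass
class Record:
    patient: str
    assigned_nurse: str

# An ABAC rule is a predicate over subject and resource attributes.
# This one encodes record.assignedNurse == currentUser from the text.
def assigned_nurse_rule(user: str, record: Record) -> bool:
    return record.assigned_nurse == user

# Hybrid pattern: an RBAC-like mapping assigns rule sets to users;
# the ABAC rules then make the per-record decision.
USER_RULES = {
    "nurse_wang": [assigned_nurse_rule],
}

def can_view(user: str, record: Record) -> bool:
    return any(rule(user, record) for rule in USER_RULES.get(user, []))

r = Record(patient="p1", assigned_nurse="nurse_wang")
print(can_view("nurse_wang", r))  # True
print(can_view("nurse_li", r))    # False
```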
<h2 id="real-world-case-study">Real-World Case Study</h2>
<p><a href="https://www.youtube.com/watch?v=ZUmzELJ2UcM&amp;list=PLnobS_RgN7JZxK1wjUvQ84jMFqRZoJXbD&amp;index=1">https://www.youtube.com/watch?v=ZUmzELJ2UcM&amp;list=PLnobS_RgN7JZxK1wjUvQ84jMFqRZoJXbD&amp;index=1</a></p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_8.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_8.png" alt="" class="article-image" width="1080" height="608" />
</picture>
</p>
<p>The video demonstrates 4 types of access control:</p>
<ol>
<li>IP and login time determine whether a user can access the organization.</li>
<li>User permission groups determine whether a user can access objects.</li>
<li>Roles and reporting lines determine whether a user can access specific records.</li>
<li>User permission groups determine whether a user can access specific fields on a record.</li>
</ol>
<p>We can see that different access control models exist at different levels of the system, which is one of the key challenges in abstracting permission systems.</p>
<h1 id="industry-solutions">Industry Solutions</h1>
<p>Many early solutions did not clearly separate authentication (AuthN) from authorization (AuthZ). In my observation, RBAC is often the first-class model in these frameworks: once identity is user-centric, downstream authorization also tends to be user-centric.</p>
<ul>
<li><a href="https://github.com/spring-projects/spring-security">Spring Security</a>: No control panel; provides various APIs but doesn’t support configuration files.</li>
<li><a href="https://github.com/apache/shiro">Apache Shiro</a>: No control panel; supports configuration files but doesn’t support ABAC.</li>
</ul>
<p>These two frameworks, like most early open-source solutions, support access control models only in a fairly basic way, so they are outside the scope of this article. In recent years, two solutions have provided more comprehensive support for access control models:</p>
<ul>
<li><a href="https://github.com/casbin/casbin">Casbin</a>: Open source, positioned as an SDK, supports various access control models, integrates with existing systems, and defines permissions through configuration.</li>
<li><a href="https://research.google/pubs/zanzibar-googles-consistent-global-authorization-system/">Zanzibar</a>: Closed source, positioned as an authorization service, also supports various access control models with finer control granularity, and permissions are configured through API calls.</li>
</ul>
<h2 id="casbin">Casbin</h2>
<p>Here’s an example of creating RBAC with Casbin.</p>
<p>Model file:</p>
<pre class="hljs"><code>[<span class="hljs-string">request_definition</span>]
<span class="hljs-string">r</span> <span class="hljs-string">=</span> <span class="hljs-string">sub,</span> <span class="hljs-string">act,</span> <span class="hljs-string">obj</span>

[<span class="hljs-string">policy_definition</span>]
<span class="hljs-string">p</span> <span class="hljs-string">=</span> <span class="hljs-string">sub,</span> <span class="hljs-string">act,</span> <span class="hljs-string">obj</span>

[<span class="hljs-string">role_definition</span>]
<span class="hljs-string">g</span> <span class="hljs-string">=</span> <span class="hljs-string">_,</span> <span class="hljs-string">_</span>
<span class="hljs-string">g2</span> <span class="hljs-string">=</span> <span class="hljs-string">_,</span> <span class="hljs-string">_</span>

[<span class="hljs-string">policy_effect</span>]
<span class="hljs-string">e</span> <span class="hljs-string">=</span> <span class="hljs-string">some(where</span> <span class="hljs-string">(p.eft</span> <span class="hljs-string">==</span> <span class="hljs-string">allow))</span>

[<span class="hljs-string">matchers</span>]
<span class="hljs-string">m</span> <span class="hljs-string">=</span> <span class="hljs-string">r.sub</span> <span class="hljs-string">==</span> <span class="hljs-string">p.sub</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">g(p.act,</span> <span class="hljs-string">r.act)</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">g2(p.obj,</span> <span class="hljs-string">r.obj)</span>
</code></pre>
<p>Policy file:</p>
<pre class="hljs"><code><span class="hljs-string">p,</span> <span class="hljs-string">alice,</span> <span class="hljs-string">sub-reader,</span> <span class="hljs-string">sub1</span>
<span class="hljs-string">p,</span> <span class="hljs-string">bob,</span> <span class="hljs-string">rg-owner,</span> <span class="hljs-string">rg2</span>

<span class="hljs-string">//</span> <span class="hljs-string">subscription</span> <span class="hljs-string">role</span> <span class="hljs-string">to</span> <span class="hljs-string">subscription</span> <span class="hljs-string">action</span> <span class="hljs-string">mapping</span>
<span class="hljs-string">g,</span> <span class="hljs-string">sub-reader,</span> <span class="hljs-string">sub-read</span>
<span class="hljs-string">g,</span> <span class="hljs-string">sub-owner,</span> <span class="hljs-string">sub-read</span>
<span class="hljs-string">g,</span> <span class="hljs-string">sub-owner,</span> <span class="hljs-string">sub-write</span>

<span class="hljs-string">//</span> <span class="hljs-string">resourceGroup</span> <span class="hljs-string">role</span> <span class="hljs-string">to</span> <span class="hljs-string">resourceGroup</span> <span class="hljs-string">action</span> <span class="hljs-string">mapping</span>
<span class="hljs-string">g,</span> <span class="hljs-string">rg-reader,</span> <span class="hljs-string">rg-read</span>
<span class="hljs-string">g,</span> <span class="hljs-string">rg-owner,</span> <span class="hljs-string">rg-read</span>
<span class="hljs-string">g,</span> <span class="hljs-string">rg-owner,</span> <span class="hljs-string">rg-write</span>

<span class="hljs-string">//</span> <span class="hljs-string">subscription</span> <span class="hljs-string">role</span> <span class="hljs-string">to</span> <span class="hljs-string">resourceGroup</span> <span class="hljs-string">role</span> <span class="hljs-string">mapping</span>
<span class="hljs-string">g,</span> <span class="hljs-string">sub-reader,</span> <span class="hljs-string">rg-reader</span>
<span class="hljs-string">g,</span> <span class="hljs-string">sub-owner,</span> <span class="hljs-string">rg-owner</span>

<span class="hljs-string">//</span> <span class="hljs-string">subscription</span> <span class="hljs-string">resource</span> <span class="hljs-string">to</span> <span class="hljs-string">resourceGroup</span> <span class="hljs-string">resource</span> <span class="hljs-string">mapping</span>
<span class="hljs-string">g2,</span> <span class="hljs-string">sub1,</span> <span class="hljs-string">rg1</span>
<span class="hljs-string">g2,</span> <span class="hljs-string">sub2,</span> <span class="hljs-string">rg2</span>
</code></pre>
<p>Request: <code>alice, rg-read, rg1</code> -&gt; <code>true</code>.</p>
<p>Reasoning flow: first match the subject. Alice’s policy entry is <code>alice, sub-reader, sub1</code>.

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_9.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_9.png" alt="" class="article-image" width="1080" height="573" />
</picture>

Then match the action. The requested action is <code>rg-read</code>, while Alice’s role is <code>sub-reader</code>. Through <code>g, sub-reader, rg-reader</code> and <code>g, rg-reader, rg-read</code>, we infer that <code>sub-reader</code> implies <code>rg-read</code>.

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_10.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_10.png" alt="" class="article-image" width="1080" height="593" />
</picture>

Finally, match the resource. The requested object is <code>rg1</code>, and Alice’s mapped resource is <code>sub1</code>; <code>g2, sub1, rg1</code> links the two.

<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_11.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/the-permission-model-myth/en/image_11.png" alt="" class="article-image" width="1080" height="583" />
</picture>
</p>
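<p>The walk above can be sketched in plain Go. This is a toy re-implementation of the <code>g</code>/<code>g2</code> lookups for this one policy file, not Casbin's actual matcher engine:</p>

```go
package main

import "fmt"

// g edges from the policy file: each role maps to the roles and
// actions it directly implies.
var g = map[string][]string{
	"sub-reader": {"sub-read", "rg-reader"},
	"sub-owner":  {"sub-read", "sub-write", "rg-owner"},
	"rg-reader":  {"rg-read"},
	"rg-owner":   {"rg-read", "rg-write"},
}

// implies walks g transitively: does role `have` grant `want`?
func implies(have, want string) bool {
	if have == want {
		return true
	}
	for _, next := range g[have] {
		if implies(next, want) {
			return true
		}
	}
	return false
}

// g2 edges: each subscription maps to the resource groups it contains.
var g2 = map[string][]string{"sub1": {"rg1"}, "sub2": {"rg2"}}

func contains(parent, child string) bool {
	if parent == child {
		return true
	}
	for _, c := range g2[parent] {
		if contains(c, child) {
			return true
		}
	}
	return false
}

func main() {
	// Policy entry: alice, sub-reader, sub1.
	// Request:      alice, rg-read, rg1.
	ok := implies("sub-reader", "rg-read") && contains("sub1", "rg1")
	fmt.Println(ok) // true
}
```

<p>The chain <code>sub-reader → rg-reader → rg-read</code> satisfies the action check, and <code>sub1 → rg1</code> satisfies the resource check, so the request is allowed.</p>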
<h2 id="zanzibar">Zanzibar</h2>
<p>This article won't cover Zanzibar in detail; the following resource provides a comprehensive introduction:</p>
<p><a href="https://zanzibar.academy/">Zanzibar: A Global Authorization System - Presented by Auth0</a></p>
<p>One key difference is architecture: Zanzibar is a standalone service, while Casbin is an SDK. A standalone service needs its own relationship-data storage plus caching and consistency management. Zanzibar is generally more full-featured, but that also implies higher operational cost and potentially tougher integration than an embedded SDK.</p>
<h1 id="the-challenge-of-abstraction">The Challenge of Abstraction</h1>
<p>Permission systems do not have a universal standard like TCP/IP, mainly because requirements and implementations vary widely across domains. TCP/IP provides a common interoperability layer for networks, while authorization systems are deeply coupled to business logic, organizational structures, and security constraints that differ by product. Key reasons include:</p>
<ol>
<li><strong>Diverse business requirements</strong>: Different types of applications have different business models and security needs. For example, a financial industry permission management system needs to consider complex compliance and audit requirements, while a social media platform’s permission system may focus more on user privacy and content management. This diversity makes it difficult to establish a unified standard covering all scenarios.</li>
<li><strong>Organizational structure differences</strong>: Different organizations’ structures, policies, and management processes also affect permission management implementation. Large enterprises may need a complex role hierarchy and fine-grained permission control, while small teams may only need a simple permission model.</li>
<li><strong>Rapid technological development</strong>: The rapid advancement of IT technology (including software development frameworks, database technology, cloud services, etc.) means new permission management methods and tools are constantly emerging, making it difficult to maintain a long-term effective unified standard.</li>
<li><strong>Balancing security and flexibility</strong>: Permission systems need to find a balance between security and flexibility. Different application scenarios may prioritize these two aspects differently, leading to different permission management strategies.</li>
<li><strong>Difficulty of standardization</strong>: Although some permission management concepts (such as RBAC) are widely accepted and used, extending these concepts to a universal standard covering all possible use cases is extremely difficult. Attempting to create such a standard could result in something overly complex and unwieldy, unable to adapt to rapidly changing technology and business requirements.</li>
</ol>
<p>Therefore, while some degree of standardization is possible in certain areas (such as authentication and encryption technology), the specific implementation of permission management systems often needs to be customized based on specific application scenarios and requirements.</p>
<h2 id="extension-zero-trust-network-model">Extension: Zero Trust Network Model</h2>
<p>Zero Trust leans more toward identity verification (AuthN), but many of its principles are closely related to AuthZ design. So this article ends with a brief Zero Trust overview.</p>
<p>Zero Trust is a cybersecurity model whose core principle is “never trust, always verify.” This model requires strict identity verification for all users and devices attempting to access network resources, regardless of whether they are inside or outside the network. The Zero Trust model’s starting point is to no longer assume the internal network is safe, but instead to consider that threats can come from anywhere, and therefore every access request needs to be verified and authorized.</p>
<p>The core principles of the Zero Trust model include:</p>
<ol>
<li><strong>Principle of least privilege</strong>: Users and devices receive only the minimum permissions needed to complete their tasks, limiting access to sensitive information and systems.</li>
<li><strong>Continuous verification</strong>: The system continuously monitors and verifies the trust status of users and devices, never assuming they are secure at any point.</li>
<li><strong>Explicit verification</strong>: All access attempts must undergo authentication and authorization, regardless of user or device location.</li>
<li><strong>Multi-factor authentication (MFA)</strong>: Adding security layers by requiring two or more identity-proving factors to reduce the risk of password compromise.</li>
<li><strong>Micro-segmentation</strong>: Dividing the network into small, managed segments to reduce the potential attacker’s range of movement and limit access to sensitive data.</li>
<li><strong>Risk-based adaptive policies</strong>: Dynamically adjusting access control policies based on factors such as user behavior, device security status, access time, and location.</li>
</ol>
]]></content:encoded>
  </item>
  <item>
    <title>Golang Generics</title>
    <link>https://www.yujiachen.com/golang-generics/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/golang-generics/</guid>
    <pubDate>Fri, 26 May 2023 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<p>Generics let us write reusable, type-safe logic over a family of types. In languages such as Java, C++, and C#, generics have long been a core feature. Go intentionally launched without generics, which kept the language simpler but made some abstractions awkward. As Go evolved, the community introduced generics to better support reusable data structures and common algorithms in production code.</p>
<p>An example:</p>
<pre class="hljs"><code><span class="hljs-keyword">import</span> <span class="hljs-string">&quot;golang.org/x/exp/constraints&quot;</span>

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">GMin</span>[<span class="hljs-title">T</span> <span class="hljs-title">constraints</span>.<span class="hljs-title">Ordered</span>]<span class="hljs-params">(x, y T)</span></span> T {
    <span class="hljs-keyword">if</span> x &lt; y {
        <span class="hljs-keyword">return</span> x
    }
    <span class="hljs-keyword">return</span> y
}

</code></pre>
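<p>A runnable variant of the same function, substituting the stdlib <code>cmp.Ordered</code> constraint (Go 1.21+) for <code>constraints.Ordered</code> so the sketch is dependency-free:</p>

```go
package main

import (
	"cmp"
	"fmt"
)

// Same shape as GMin above, but constrained by the stdlib
// cmp.Ordered (Go 1.21+) instead of golang.org/x/exp/constraints.
func GMin[T cmp.Ordered](x, y T) T {
	if x < y {
		return x
	}
	return y
}

func main() {
	fmt.Println(GMin(2, 3))              // type argument int is inferred
	fmt.Println(GMin("go", "generics"))  // works for any ordered type
	fmt.Println(GMin[float64](2.5, 1.5)) // explicit type argument
}
```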
<p>Go generics are similar in spirit to other languages, but different in several important details. This article starts from Go’s type-parameters proposal and uses implementation snippets plus examples to explain key behavior, with the goal of helping you use generics more effectively in real code.</p>
<blockquote>
<p>Main reference: <a href="https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md">https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md</a></p>
</blockquote>
<h1 id="terminology">Terminology</h1>
<ul>
<li><strong>parameter</strong> (formal parameter): A placeholder declared in a function definition, e.g., <code>a</code> and <code>b</code> in <code>func Add(a, b int) int</code>. In this article, “type parameter” refers to this concept; for example, <code>T</code> in <code>func Add[T any](a, b T) T</code>.</li>
<li><strong>argument</strong> (actual argument): A concrete value passed at a call site, e.g., <code>a</code> and <code>2</code> in <code>sum := Add[int](a, 2)</code>. In this article, “type argument” refers to this concept; for example, <code>int</code> in <code>Add[int]</code>.</li>
<li><strong>function</strong>: Refers to a function in Go, such as <code>func Add(a, b int) int</code>. <a href="https://go.dev/ref/spec#Function_types">https://go.dev/ref/spec#Function_types</a></li>
<li><strong>method</strong>: Refers to a struct method in Go, such as <code>func (s *Struct) Get() string</code>. <a href="https://go.dev/ref/spec#Method_declarations">https://go.dev/ref/spec#Method_declarations</a></li>
<li><strong>operation</strong>: Can be understood as operators supported by Go’s built-in types (see below). The Go spec uses the term “operator.” <a href="https://go.dev/ref/spec#Operators">https://go.dev/ref/spec#Operators</a></li>
</ul>
<pre class="hljs"><code>Expression = UnaryExpr | Expression binary_op Expression .
UnaryExpr  = PrimaryExpr | unary_op UnaryExpr .

binary_op  = &quot;||&quot; | &quot;&amp;&amp;&quot; | rel_op | add_op | mul_op .
rel_op     = &quot;==&quot; | &quot;!=&quot; | &quot;&lt;&quot; | &quot;&lt;=&quot; | &quot;&gt;&quot; | &quot;&gt;=&quot; .
add_op     = &quot;+&quot; | &quot;-&quot; | &quot;|&quot; | &quot;^&quot; .
mul_op     = &quot;*&quot; | &quot;/&quot; | &quot;%&quot; | &quot;&lt;&lt;&quot; | &quot;&gt;&gt;&quot; | &quot;&amp;&quot; | &quot;&amp;^&quot; .

unary_op   = &quot;+&quot; | &quot;-&quot; | &quot;!&quot; | &quot;^&quot; | &quot;*&quot; | &quot;&amp;&quot; | &quot;&lt;-&quot; .

</code></pre>
<h1 id="overview">Overview</h1>
<p>The Go type parameters proposal summarizes several key points (<a href="https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#summary">https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#summary</a>):</p>
<ul>
<li>Functions and types can have type parameters, and those parameters are constrained by interface types. (Methods are not included here; methods cannot declare their own type parameters.)</li>
<li>Constraints define which concrete types are allowed as type arguments, and which methods those arguments must provide.</li>
<li>Constraints also determine which operations are valid on type parameters inside generic code.</li>
<li>At call sites, type inference may infer missing type arguments, so explicit arguments are not always required. For example, with <code>func Add[T any](a, b T) T</code>, we can often write <code>sum := Add(a, 2)</code> instead of <code>sum := Add[int](a, 2)</code>.</li>
</ul>
<p>Next, we’ll explain Go generics design and usage from these perspectives:</p>
<ul>
<li>The specific definition of type constraints</li>
<li>How type inference is implemented</li>
<li>Some other related topics</li>
<li>Selected type inference code</li>
</ul>
<h1 id="type-constraints">Type Constraints</h1>
<p>Go type constraints can be seen as the “type” of type parameters. For example, in the following, <code>T</code> is the type parameter and <code>Stringer</code> is the type constraint—i.e., the type of <code>T</code>.</p>
<pre class="hljs"><code><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">Stringify</span>[<span class="hljs-title">T</span> <span class="hljs-title">Stringer</span>]<span class="hljs-params">(s []T)</span></span> (ret []<span class="hljs-type">string</span>) {
        <span class="hljs-keyword">for</span> _, v := <span class="hljs-keyword">range</span> s {
                ret = <span class="hljs-built_in">append</span>(ret, v.String())
        }
        <span class="hljs-keyword">return</span> ret
}

</code></pre>
<p>The type of a type constraint itself is <code>interface</code>. Before Go 1.18, an interface was a collection of methods. In the example above, <code>Stringer</code> describes that T should be a type with the method <code>func (T) String() string</code>. But if an interface can only be a collection of methods, the following function cannot be implemented:</p>
<pre class="hljs"><code><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">Add</span>[<span class="hljs-title">T</span> <span class="hljs-title">constraints</span>.<span class="hljs-title">Integer</span>]<span class="hljs-params">(a, b T)</span></span> T {
        <span class="hljs-keyword">return</span> a + b
}

</code></pre>
<p>How can <code>constraints.Integer</code>, as an interface, express support for the <code>+</code> operation? The actual definition in the <a href="https://pkg.go.dev/golang.org/x/exp/constraints">constraints</a> package is:</p>
<pre class="hljs"><code><span class="hljs-keyword">type</span> Integer <span class="hljs-keyword">interface</span> {
        Signed | Unsigned
}

<span class="hljs-keyword">type</span> Signed <span class="hljs-keyword">interface</span> {
        ~<span class="hljs-type">int</span> | ~<span class="hljs-type">int8</span> | ~<span class="hljs-type">int16</span> | ~<span class="hljs-type">int32</span> | ~<span class="hljs-type">int64</span>
}

<span class="hljs-keyword">type</span> Unsigned <span class="hljs-keyword">interface</span> {
        ~<span class="hljs-type">uint</span> | ~<span class="hljs-type">uint8</span> | ~<span class="hljs-type">uint16</span> | ~<span class="hljs-type">uint32</span> | ~<span class="hljs-type">uint64</span> | ~<span class="hljs-type">uintptr</span>
}

</code></pre>
<p>As we can see, the <code>Integer</code> interface doesn’t define any methods. This brings us to the concept of underlying types.</p>
<h2 id="underlying-type">Underlying Type</h2>
<blockquote>
<p>Reference: <a href="https://go.dev/ref/spec#Underlying_types">https://go.dev/ref/spec#Underlying_types</a></p>
</blockquote>
<p>To make constraints more expressive, generics expanded interfaces from “sets of methods” to “sets of types.” The key difference is built-in operators. In Go, operators such as <code>&lt;</code> and <code>==</code> cannot be overloaded, so user-defined methods alone cannot describe operator support. With underlying types, a named type can reuse operators supported by its underlying type. As a result, interfaces can now express both method sets and type sets (for example, with syntax like <code>~underlying_type</code>).</p>
<p>The rules for determining a type’s underlying type are:</p>
<ol>
<li>Types that are their own underlying type:
<ol>
<li>boolean</li>
<li>numeric</li>
<li>string</li>
<li>type literal
<ol>
<li>ArrayType</li>
<li>StructType</li>
<li>PointerType</li>
<li>FunctionType</li>
<li>InterfaceType</li>
<li>SliceType</li>
<li>MapType</li>
<li>ChannelType</li>
</ol>
</li>
</ol>
</li>
<li>When none of the above apply, a type’s underlying type is the underlying type of the type it was created from. For example:</li>
</ol>
<pre class="hljs"><code><span class="hljs-keyword">type</span> (
    A1 = <span class="hljs-type">string</span>
    A2 = A1
)

</code></pre>
<p>Here A1 follows rule 1, so its underlying type is string. A2 follows rule 2, and its underlying type is also string.</p>
<p>Looking back at the definition of <code>constraints.Integer</code>, it represents the set of all types whose underlying type is <code>int</code>, <code>int8</code>, <code>int16</code>, <code>int32</code>, <code>int64</code>, <code>uint</code>, <code>uint8</code>, <code>uint16</code>, <code>uint32</code>, <code>uint64</code>, or <code>uintptr</code>. All these types support the <code>+</code> operation, so the <code>Add</code> function can be defined.</p>
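<p>The <code>~</code> terms are what make <code>Add</code> work for named types, not just the built-ins. A self-contained sketch, with the <code>constraints.Integer</code> type set inlined to avoid the <code>golang.org/x/exp</code> dependency:</p>

```go
package main

import "fmt"

// Inlined copy of the constraints.Integer type set, so this sketch
// needs no dependency on golang.org/x/exp.
type Integer interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr
}

func Add[T Integer](a, b T) T {
	return a + b
}

// Ticks is a named type whose underlying type is int64; the ~int64
// term admits it even though Ticks itself is not int64.
type Ticks int64

func main() {
	fmt.Println(Add(1, 2))                // T = int
	fmt.Println(Add(Ticks(40), Ticks(2))) // T = Ticks
}
```

<p>Without the <code>~</code>, the type set would contain only the built-in types themselves, and <code>Add(Ticks(40), Ticks(2))</code> would not compile.</p>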
<h2 id="difference-between-using-type-parameters-and-using-interfaces-directly">Difference Between Using Type Parameters and Using Interfaces Directly</h2>
<blockquote>
<p>Reference: <a href="https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#values-of-type-parameters-are-not-boxed">https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#values-of-type-parameters-are-not-boxed</a></p>
</blockquote>
<p>Because constraints are interfaces, Go generics can look similar to interface-based programming. A key difference is that generic functions can preserve and return concrete types, rather than forcing everything through <code>interface{}</code>.</p>
<p>Using the <code>Add</code> example, if we remove the type parameter, it becomes:</p>
<pre class="hljs"><code><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">Add</span><span class="hljs-params">(a, b <span class="hljs-keyword">interface</span>{})</span></span> <span class="hljs-keyword">interface</span>{} {
        n, _ := a.(<span class="hljs-type">int</span>)
        m, _ := b.(<span class="hljs-type">int</span>)
        <span class="hljs-keyword">return</span> n + m
}

</code></pre>
<p>With this non-generic implementation, the function returns <code>interface{}</code> rather than a caller-specific concrete type. The caller then needs a type assertion to recover the concrete type:</p>
<pre class="hljs"><code><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
        a, b := <span class="hljs-number">1</span>, <span class="hljs-number">2</span>
        c := Add(a, b)
        d, ok := c.(<span class="hljs-type">int</span>)
        ...
}

</code></pre>
<p>Also recall how Go interfaces are represented: an interface value carries both dynamic type information and the underlying value. That extra boxing/unboxing path may add overhead, which generics can often avoid.</p>
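<p>The contrast, made concrete with a minimal inline constraint (standing in for a real constraints package): the generic version returns the caller's concrete type, so the call site needs no assertion and no interface boxing occurs.</p>

```go
package main

import "fmt"

// Generic counterpart of the interface{} version above: the result
// keeps the caller's concrete type, so no type assertion is needed
// and no boxing into interface{} happens on the way in or out.
func Add[T ~int | ~float64](a, b T) T {
	return a + b
}

func main() {
	c := Add(1, 2)      // c is int, not interface{}
	d := Add(1.5, 2.25) // d is float64
	fmt.Println(c+1, d*2) // arithmetic works directly, no assertion
}
```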
<h1 id="type-inference">Type Inference</h1>
<p>Not every call to a generic function requires explicit type arguments. Under certain conditions, the compiler can infer missing type arguments. Note that inference itself does not complete all semantic checks; some checks happen afterward.</p>
<p>The following sections introduce 3 key concepts of type inference, followed by the complete type inference process.</p>
<h2 id="type-unification">Type Unification</h2>
<blockquote>
<p>Reference: <a href="https://go.dev/ref/spec#Type_unification">https://go.dev/ref/spec#Type_unification</a></p>
</blockquote>
<h3 id="description">Description</h3>
<p><strong>Input</strong></p>
<ol>
<li>A mapping P -&gt; A, where P is a type parameter and A is a known type argument. For example, for <code>func Add[T any](a, b T) T</code>, a possible mapping is T -&gt; int.</li>
<li>Two types, which may or may not contain type parameters.</li>
</ol>
<p><strong>Output</strong></p>
<ol>
<li>Given known mappings, determine whether the two input types can be unified.</li>
</ol>
<h3 id="operations">Operations</h3>
<ol>
<li>For types without type parameters: the type must be equivalent to the comparison type, otherwise unification fails.
<ol>
<li>Two identical types are naturally equivalent.</li>
<li>If both types are channel types, they can be considered equivalent if they are identical after ignoring channel direction.</li>
<li>If two types have the same underlying types, they can also be considered equivalent.</li>
</ol>
</li>
<li>For types with type parameters: after abstracting over type parameters, the structure must still align; otherwise unification fails.
<ol>
<li>For example, <code>[]map[T1]T2</code> and <code>[]T3</code> are structurally consistent—<code>T3</code> can be substituted with <code>map[T1]T2</code>. Similarly, <code>[]map[T1]bool</code> and <code>[]map[string]T2</code> are structurally consistent.</li>
<li>For example, <code>[]map[T1]T2</code> and <code>int</code>, <code>struct{}</code>, <code>[]struct{}</code> etc. cannot possibly be structurally consistent.</li>
</ol>
</li>
<li>If matching succeeds and the type contains type parameters, we learn a new P’ -&gt; A’ mapping, which is added to the existing mappings.</li>
</ol>
<h2 id="function-argument-type-inference">Function Argument Type Inference</h2>
<blockquote>
<p>References:</p>
<ul>
<li><a href="https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#function-argument-type-inference">https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#function-argument-type-inference</a></li>
<li><a href="https://go.dev/ref/spec#Function_argument_type_inference">https://go.dev/ref/spec#Function_argument_type_inference</a></li>
</ul>
</blockquote>
<h3 id="description-2">Description</h3>
<ol>
<li>When calling a function with type parameters, if the caller doesn’t pass type arguments, infer the type arguments from the actual arguments.</li>
</ol>
<h3 id="implementation">Implementation</h3>
<ol>
<li>Get a set of <code>(parameter, argument)</code> pairs from the caller’s actual arguments.</li>
<li>First ignore combinations where the <code>argument</code> has no type (i.e., constants, which have their own type inference rules). For typed <code>(parameter, argument)</code> pairs, perform type unification on their corresponding types and continuously update the mapping P -&gt; A.</li>
<li>Next, handle constant <code>(parameter, argument)</code> pairs. If a parameter’s corresponding type parameter was already inferred in step 2, ignore it. If not, treat the constant argument as its default type and perform type unification.</li>
<li>When all <code>(parameter, argument)</code> pairs have been processed, inference is complete. If any processing fails along the way, inference fails.</li>
</ol>
<p>Here’s an example illustrating the above steps:</p>
<pre class="hljs"><code><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">scale</span>[<span class="hljs-title">Number</span> ~<span class="hljs-title">int64</span>|~<span class="hljs-title">float64</span>|~<span class="hljs-title">complex128</span>]<span class="hljs-params">(v []Number, s Number)</span></span> []Number {
        ...
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
        <span class="hljs-keyword">var</span> vector []<span class="hljs-type">float64</span>
        scaledVector := scale(vector, <span class="hljs-number">42</span>)
        ...
}

</code></pre>
<p>When function argument type inference begins, we get two <code>(parameter, argument)</code> pairs:</p>
<ol>
<li><code>(v []Number, vector []float64)</code></li>
<li><code>(s Number, 42)</code></li>
</ol>
<p>First, perform type unification on <code>(v []Number, vector []float64)</code>, yielding the mapping <code>Number -&gt; float64</code>.</p>
<p>Since we’ve already inferred the mapping <code>Number -&gt; float64</code>, the <code>(s Number, 42)</code> pair doesn’t need type unification.</p>
<p>If there were no mapping <code>Number -&gt; float64</code>, the type of 42 in <code>(s Number, 42)</code> would be treated as the default type <code>int</code>, and the mapping would be <code>Number -&gt; int</code>.</p>
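<p>The <code>scale</code> example, with a filled-in body (the original elides it), can be run to confirm the inference order:</p>

```go
package main

import "fmt"

// A filled-in body for the scale example above.
func scale[Number ~int64 | ~float64 | ~complex128](v []Number, s Number) []Number {
	out := make([]Number, len(v))
	for i, x := range v {
		out[i] = x * s
	}
	return out
}

func main() {
	vector := []float64{1, 2, 3}
	// Number -> float64 is unified from vector first, so the untyped
	// constant 42 is used as float64 instead of its default type int.
	fmt.Println(scale(vector, 42)) // [42 84 126]
}
```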
<h2 id="constraint-type-inference">Constraint Type Inference</h2>
<blockquote>
<p>Reference: <a href="https://go.dev/ref/spec#Constraint_type_inference">https://go.dev/ref/spec#Constraint_type_inference</a></p>
</blockquote>
<h3 id="description-3">Description</h3>
<ol>
<li>Based on defined type parameter constraints, infer other unknown type parameters from a known type parameter.</li>
<li>For example, given a function <code>func Double[S ~[]E, E constraints.Integer](s S) S</code>, called as <code>Double([]int{1, 2, 3})</code>, we can infer <code>E -&gt; int</code> from the type constraint <code>S ~[]E</code> and <code>S -&gt; []int</code>.</li>
</ol>
<h3 id="implementation-2">Implementation</h3>
<ol>
<li>Iterate through all type parameters:
<ol>
<li>If a type parameter already has a corresponding argument, perform unification on their underlying types. In the <code>Double</code> example, the underlying type of <code>S</code> is <code>[]E</code>, so we unify <code>[]E</code> and the known argument <code>[]int</code>, inferring <code>E -&gt; int</code>.</li>
<li>If a type parameter doesn’t have a corresponding argument, but its type constraint has only one type, infer that the type parameter’s argument is the constraint type.</li>
</ol>
</li>
<li>In known mappings, if we have <code>P -&gt; A</code> and <code>Q -&gt; B</code> and <code>A</code> contains <code>Q</code>, substitute <code>Q</code> in <code>A</code> with <code>B</code>. For example, in <code>func Copy[T any, P *T](value T, dst P)</code>, given <code>T -&gt; int</code> and <code>P -&gt; *T</code>, we can infer <code>P -&gt; *int</code>.</li>
<li>Repeat step 2 until no type parameter <code>P</code> can be found that is contained in some type argument <code>A</code>.</li>
</ol>
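<p>Step 2 can be seen end to end in a small program (the function <code>Set</code> is made up for illustration):</p>

```go
package main

import "fmt"

// Set writes value through p. The caller never names P: P -> *int comes
// from the first argument, the constraint P *T then gives T -> int, and
// the substitution step rewrites P -> *T into P -> *int.
func Set[T any, P *T](p P, value T) {
	*p = value
}

func main() {
	var x int
	Set(&x, 42) // all type arguments inferred
	fmt.Println(x) // 42
}
```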
<h2 id="type-inference-execution-steps">Type Inference Execution Steps</h2>
<blockquote>
<p>Reference: <a href="https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L33">https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L33</a></p>
</blockquote>
<p>Based on comments in the compiler code, inference proceeds in these steps:</p>
<ol>
<li>Perform function argument type inference using type arguments.</li>
<li>Perform constraint type inference.</li>
<li>Perform function argument type inference on remaining untyped arguments.</li>
<li>Perform a final round of constraint type inference.</li>
</ol>
<p>An example:</p>
<pre class="hljs"><code><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> <span class="hljs-string">&quot;fmt&quot;</span>
<span class="hljs-keyword">import</span> <span class="hljs-string">&quot;golang.org/x/exp/constraints&quot;</span>

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">Multiple</span>[<span class="hljs-title">S</span> ~[]<span class="hljs-title">E</span>, <span class="hljs-title">E</span>, <span class="hljs-title">X</span> <span class="hljs-title">constraints</span>.<span class="hljs-title">Integer</span>]<span class="hljs-params">(s S, x X)</span></span> S {
        <span class="hljs-keyword">for</span> i := <span class="hljs-keyword">range</span> s {
                s[i] *= E(x)
        }
        <span class="hljs-keyword">return</span> s
}

<span class="hljs-keyword">type</span> IntVector []<span class="hljs-type">int</span>

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
        vector := IntVector{<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>}
        vector = Multiple(vector, <span class="hljs-number">3</span>)
        fmt.Printf(<span class="hljs-string">&quot;%v\n&quot;</span>, vector)
        <span class="hljs-comment">// output: [0 3 6 9 12]</span>
}

</code></pre>
<p>The type inference steps for <code>Multiple</code>:</p>
<ol>
<li>Perform type inference on typed function arguments, i.e., on <code>(s S, vector IntVector)</code>, yielding: <code>S -&gt; IntVector</code>.</li>
<li>Perform constraint type inference. <code>S</code>’s constraint is <code>~[]E</code> (core type <code>[]E</code>), and <code>IntVector</code>’s underlying type is <code>[]int</code>, so unify <code>[]E</code> and <code>[]int</code>, yielding <code>E -&gt; int</code>.</li>
<li>Perform type inference on untyped function arguments, i.e., on <code>(x X, 3)</code>. Take the default value <code>int</code> for constant <code>3</code>, yielding <code>X -&gt; int</code>.</li>
<li>Perform constraint type inference again, but since all parameter types are known, it terminates early.</li>
</ol>
<h1 id="other-topics">Other Topics</h1>
<p>A few related notes.</p>
<h2 id="the-generics-dilemma">The Generics Dilemma</h2>
<blockquote>
<p>Reference: <a href="https://research.swtch.com/generic">https://research.swtch.com/generic</a></p>
</blockquote>
<p>Adding generics to a programming language inevitably increases complexity for at least one of three parties:</p>
<ul>
<li><strong>The programmer</strong>: C takes this approach. The language leaves generics out, so programmers simulate them with macros or <code>void *</code>, duplicating code by hand and debugging the resulting problems themselves.</li>
<li><strong>Compilation</strong>: C++ takes this approach—types are resolved at compile time and templates are instantiated into concrete code. This increases compilation time and binary size.</li>
<li><strong>Runtime</strong>: Java takes this approach. Generics are implemented by type erasure: type parameters are compiled to <code>Object</code>, primitive values are boxed, and casts are inserted at runtime, which costs execution speed and type expressiveness.</li>
</ul>
<p>As discussed above, Go mainly chooses to pay more at compile time: type information needed for generic function instantiation is resolved during compilation.</p>
<h2 id="using-t-instead-of-t">Using <code>[T]</code> Instead of <code>&lt;T&gt;</code></h2>
<p>Many developers’ first impression of generics comes from the <code>&lt;T&gt;</code> syntax in C++ or Java. <a href="https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#why-not-use-the-syntax-like-c_and-java">The proposal explains</a> that since <code>&lt;</code> and <code>&gt;</code> are also used as comparison operators, distinguishing whether <code>&lt;</code> and <code>&gt;</code> represent comparison or type parameters would create additional burden. So <code>[T]</code> was chosen instead.</p>
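<p>With brackets, explicit instantiation parses unambiguously even in expression position. <code>Min</code> below is a made-up function for illustration:</p>

```go
package main

import "fmt"

// Min uses the bracket syntax. A hypothetical Min<int>(a, b) would be
// harder to parse, since < and > also appear in comparison expressions.
func Min[T int | float64](a, b T) T {
	if a < b {
		return a
	}
	return b
}

func main() {
	fmt.Println(Min[int](3, 5)) // explicit instantiation: 3
	fmt.Println(Min(2.5, 1.5))  // inferred: 1.5
}
```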
<h2 id="why-don-t-go-generics-support-methods">Why Don’t Go Generics Support Methods?</h2>
<blockquote>
<p>This section is somewhat brief. For more detailed explanations and examples, see: <a href="https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#no-parameterized-methods">https://go.googlesource.com/proposal/+/HEAD/design/43651-type-parameters.md#no-parameterized-methods</a></p>
</blockquote>
<p>In Go, structs can use type parameters, but a struct’s methods are not allowed to have type parameters. The primary reason is Go’s interface characteristics. As mentioned above, interfaces can express “collections of methods”—meaning an interface can represent all structs that implement its defined methods. If methods supported generics, we’d have interface definitions like:</p>
<pre class="hljs"><code><span class="hljs-keyword">type</span> Phone <span class="hljs-keyword">interface</span> {
        Call[N PhoneNumber](n N)
        Download[A App](a A)
}

</code></pre>
<p>Since Go has no explicit <code>implements</code> or <code>extends</code> declaration between structs and interfaces, adding method-level type parameters would require much heavier inference to decide interface conformance, with significant compiler complexity and cost.</p>
<h2 id="supporting-pointer-methods">Supporting Pointer Methods</h2>
<p>When we define a generic function <code>F[T C]</code> and constraint <code>C</code> requires methods, passing a type <code>X</code> will fail if those required methods exist on <code>*X</code> rather than on <code>X</code>.</p>
<p>A concrete example:</p>
<pre class="hljs"><code><span class="hljs-keyword">type</span> Setter <span class="hljs-keyword">interface</span> {
        Set(<span class="hljs-type">string</span>)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">FromStrings</span>[<span class="hljs-title">T</span> <span class="hljs-title">Setter</span>]<span class="hljs-params">(s []<span class="hljs-type">string</span>)</span></span> []T {
        result := <span class="hljs-built_in">make</span>([]T, <span class="hljs-built_in">len</span>(s))
        <span class="hljs-keyword">for</span> i, v := <span class="hljs-keyword">range</span> s {
                result[i].Set(v)
        }
        <span class="hljs-keyword">return</span> result
}

<span class="hljs-keyword">type</span> Settable <span class="hljs-type">int</span>

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(p *Settable)</span></span> Set(s <span class="hljs-type">string</span>) {
        i, _ := strconv.Atoi(s) <span class="hljs-comment">// real code should not ignore the error</span>
        *p = Settable(i)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">F</span><span class="hljs-params">()</span></span> {
        <span class="hljs-comment">// INVALID</span>
        nums := FromStrings[Settable]([]<span class="hljs-type">string</span>{<span class="hljs-string">&quot;1&quot;</span>, <span class="hljs-string">&quot;2&quot;</span>})
        <span class="hljs-comment">// Here we want nums to be []Settable{1, 2}.</span>
        ...
}

</code></pre>
<p>In the example above, <code>result</code>’s type is <code>[]Settable</code>, but <code>Settable</code> doesn’t support the <code>Set</code> method—<code>*Settable</code> does. So <code>result[i].Set(v)</code> can’t be called normally.</p>
<p>The solution:</p>
<pre class="hljs"><code><span class="hljs-keyword">type</span> Setter2[B any] <span class="hljs-keyword">interface</span> {
        Set(<span class="hljs-type">string</span>)
        *B <span class="hljs-comment">// non-interface type constraint element</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">FromStrings2</span>[<span class="hljs-title">T</span> <span class="hljs-title">any</span>, <span class="hljs-title">PT</span> <span class="hljs-title">Setter2</span>[<span class="hljs-title">T</span>]]<span class="hljs-params">(s []<span class="hljs-type">string</span>)</span></span> []T {
        result := <span class="hljs-built_in">make</span>([]T, <span class="hljs-built_in">len</span>(s))
        <span class="hljs-keyword">for</span> i, v := <span class="hljs-keyword">range</span> s {
                <span class="hljs-comment">// The type of &amp;result[i] is *T which is in the type set</span>
                <span class="hljs-comment">// of Setter2, so we can convert it to PT.</span>
                p := PT(&amp;result[i])
                <span class="hljs-comment">// PT has a Set method.</span>
                p.Set(v)
        }
        <span class="hljs-keyword">return</span> result
}

</code></pre>
<p>This pattern explicitly distinguishes value type <code>T</code> from pointer type <code>PT</code>, and encodes their relationship in <code>Setter2[B any]</code>. We then convert <code>&amp;result[i]</code> to <code>PT</code>, so calling <code>Set</code> succeeds.</p>
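<p>Putting the pieces together, the call site only needs to name <code>T</code>; <code>PT</code> is inferred from <code>Setter2[T]</code> by constraint type inference:</p>

```go
package main

import (
	"fmt"
	"strconv"
)

type Setter2[B any] interface {
	Set(string)
	*B // non-interface type constraint element
}

func FromStrings2[T any, PT Setter2[T]](s []string) []T {
	result := make([]T, len(s))
	for i, v := range s {
		p := PT(&result[i]) // *T is in Setter2[T]'s type set
		p.Set(v)
	}
	return result
}

type Settable int

func (p *Settable) Set(s string) {
	i, _ := strconv.Atoi(s) // real code should not ignore the error
	*p = Settable(i)
}

func main() {
	// T is Settable; constraint type inference yields PT -> *Settable.
	nums := FromStrings2[Settable]([]string{"1", "2"})
	fmt.Println(nums) // [1 2]
}
```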
<h2 id="using-generics-in-practice">Using Generics in Practice</h2>
<p>With the above knowledge of defining generic functions through type constraints and understanding of type inference, let’s discuss practical applications of generics:</p>
<ul>
<li>Container operations.</li>
<li>General-purpose data structures.</li>
<li>General-purpose operation logic. In my own code, I call a data-gateway service that returns a fixed structure. I used generics to wrap request/response handling so data from different sources can be converted into target structs with less boilerplate.</li>
</ul>
<p>If you have other ways of using generics in your work or have recommendations for useful generic libraries, feel free to share in the comments :)</p>
<h1 id="type-inference-code">Type Inference Code</h1>
<h2 id="type-unification-2">Type Unification</h2>
<blockquote>
<p>Reference: <a href="https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/unify.go">https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/unify.go</a></p>
</blockquote>
<p>Recursively checks whether the types <code>x, y Type</code> can be unified; <code>p *ifacePair</code> tracks pairs of interface types already being compared, to avoid infinite recursion. If <code>x</code> or <code>y</code> is a type parameter with no inferred type yet, it can be matched against the other side and inference proceeds.</p>
<pre class="hljs"><code><span class="hljs-comment">// nify implements the core unification algorithm which is an</span>
<span class="hljs-comment">// adapted version of Checker.identical. For changes to that</span>
<span class="hljs-comment">// code the corresponding changes should be made here.</span>
<span class="hljs-comment">// Must not be called directly from outside the unifier.</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(u *unifier)</span></span> nify(x, y Type, p *ifacePair) (result <span class="hljs-type">bool</span>) {

        ......

        <span class="hljs-comment">// Cases where at least one of x or y is a type parameter.</span>
        <span class="hljs-keyword">switch</span> i, j := u.x.index(x), u.y.index(y); {
        <span class="hljs-keyword">case</span> i &gt;= <span class="hljs-number">0</span> &amp;&amp; j &gt;= <span class="hljs-number">0</span>:
                <span class="hljs-comment">// both x and y are type parameters</span>
                <span class="hljs-keyword">if</span> u.join(i, j) {
                        <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>
                }
                <span class="hljs-comment">// both x and y have an inferred type - they must match</span>
                <span class="hljs-keyword">return</span> u.nifyEq(u.x.at(i), u.y.at(j), p)

        <span class="hljs-keyword">case</span> i &gt;= <span class="hljs-number">0</span>:
                <span class="hljs-comment">// x is a type parameter, y is not</span>
                <span class="hljs-keyword">if</span> tx := u.x.at(i); tx != <span class="hljs-literal">nil</span> {
                        <span class="hljs-keyword">return</span> u.nifyEq(tx, y, p)
                }
                <span class="hljs-comment">// otherwise, infer type from y</span>
                u.x.set(i, y)
                <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>

        <span class="hljs-keyword">case</span> j &gt;= <span class="hljs-number">0</span>:

                ......

        }

        ......

        <span class="hljs-keyword">switch</span> x := x.(<span class="hljs-keyword">type</span>) {

        ......

        <span class="hljs-keyword">case</span> *Slice:
                <span class="hljs-comment">// Two slice types are identical if they have identical element types.</span>
                <span class="hljs-keyword">if</span> y, ok := y.(*Slice); ok {
                        <span class="hljs-keyword">return</span> u.nify(x.elem, y.elem, p)
                }

        <span class="hljs-keyword">case</span> *Struct:
                <span class="hljs-comment">// Two struct types are identical if they have the same sequence of fields,</span>
                <span class="hljs-comment">// and if corresponding fields have the same names, and identical types,</span>
                <span class="hljs-comment">// and identical tags. Two embedded fields are considered to have the same</span>
                <span class="hljs-comment">// name. Lower-case field names from different packages are always different.</span>
                <span class="hljs-keyword">if</span> y, ok := y.(*Struct); ok {
                        <span class="hljs-keyword">if</span> x.NumFields() == y.NumFields() {
                                <span class="hljs-keyword">for</span> i, f := <span class="hljs-keyword">range</span> x.fields {
                                        g := y.fields[i]
                                        <span class="hljs-keyword">if</span> f.embedded != g.embedded ||
                                                x.Tag(i) != y.Tag(i) ||
                                                !f.sameId(g.pkg, g.name) ||
                                                !u.nify(f.typ, g.typ, p) {
                                                <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
                                        }
                                }
                                <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>
                        }
                }

        ......

        <span class="hljs-keyword">default</span>:
                <span class="hljs-built_in">panic</span>(sprintf(<span class="hljs-literal">nil</span>, <span class="hljs-literal">true</span>, <span class="hljs-string">&quot;u.nify(%s, %s), u.x.tparams = %s&quot;</span>, x, y, u.x.tparams))
        }

        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
}

</code></pre>
<h2 id="function-argument-type-inference-2">Function Argument Type Inference</h2>
<h3 id="typed-arguments-are-directly-unified">Typed Arguments Are Directly Unified</h3>
<blockquote>
<p>Reference: <a href="https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L250">https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L250</a></p>
</blockquote>
<pre class="hljs"><code>        <span class="hljs-comment">// indices of the generic parameters with untyped arguments - save for later</span>
        <span class="hljs-keyword">var</span> indices []<span class="hljs-type">int</span>
        <span class="hljs-keyword">for</span> i, arg := <span class="hljs-keyword">range</span> args {
                par := params.At(i)
                <span class="hljs-comment">// If we permit bidirectional unification, this conditional code needs to be</span>
                <span class="hljs-comment">// executed even if par.typ is not parameterized since the argument may be a</span>
                <span class="hljs-comment">// generic function (for which we want to infer its type arguments).</span>
                <span class="hljs-keyword">if</span> isParameterized(tparams, par.typ) {
                        <span class="hljs-keyword">if</span> arg.mode == invalid {
                                <span class="hljs-comment">// An error was reported earlier. Ignore this targ</span>
                                <span class="hljs-comment">// and continue, we may still be able to infer all</span>
                                <span class="hljs-comment">// targs resulting in fewer follow-on errors.</span>
                                <span class="hljs-keyword">continue</span>
                        }
                        <span class="hljs-keyword">if</span> targ := arg.typ; isTyped(targ) {
                                <span class="hljs-comment">// If we permit bidirectional unification, and targ is</span>
                                <span class="hljs-comment">// a generic function, we need to initialize u.y with</span>
                                <span class="hljs-comment">// the respective type parameters of targ.</span>
                                <span class="hljs-keyword">if</span> !u.unify(par.typ, targ) {
                                        errorf(<span class="hljs-string">&quot;type&quot;</span>, par.typ, targ, arg)
                                        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
                                }
                        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> _, ok := par.typ.(*TypeParam); ok {
                                <span class="hljs-comment">// Since default types are all basic (i.e., non-composite) types, an</span>
                                <span class="hljs-comment">// untyped argument will never match a composite parameter type; the</span>
                                <span class="hljs-comment">// only parameter type it can possibly match against is a *TypeParam.</span>
                                <span class="hljs-comment">// Thus, for untyped arguments we only need to look at parameter types</span>
                                <span class="hljs-comment">// that are single type parameters.</span>
                                indices = <span class="hljs-built_in">append</span>(indices, i)
                        }
                }
        }

</code></pre>
<h3 id="untyped-arguments-are-assigned-constant-default-values-then-unified">Untyped Arguments Are Assigned Constant Default Values Then Unified</h3>
<blockquote>
<p>Reference: <a href="https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L297">https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L297</a></p>
</blockquote>
<pre class="hljs"><code>        <span class="hljs-comment">// Use any untyped arguments to infer additional type arguments.</span>
        <span class="hljs-comment">// Some generic parameters with untyped arguments may have been given</span>
        <span class="hljs-comment">// a type by now, we can ignore them.</span>
        <span class="hljs-keyword">for</span> _, i := <span class="hljs-keyword">range</span> indices {
                tpar := params.At(i).typ.(*TypeParam) <span class="hljs-comment">// is type parameter by construction of indices</span>
                <span class="hljs-comment">// Only consider untyped arguments for which the corresponding type</span>
                <span class="hljs-comment">// parameter doesn&#x27;t have an inferred type yet.</span>
                <span class="hljs-keyword">if</span> targs[tpar.index] == <span class="hljs-literal">nil</span> {
                        arg := args[i]
                        targ := Default(arg.typ)
                        <span class="hljs-comment">// The default type for an untyped nil is untyped nil. We must not</span>
                        <span class="hljs-comment">// infer an untyped nil type as type parameter type. Ignore untyped</span>
                        <span class="hljs-comment">// nil by making sure all default argument types are typed.</span>
                        <span class="hljs-keyword">if</span> isTyped(targ) &amp;&amp; !u.unify(tpar, targ) {
                                errorf(<span class="hljs-string">&quot;default type&quot;</span>, tpar, targ, arg)
                                <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
                        }
                }
        }

</code></pre>
<h2 id="constraint-type-inference-2">Constraint Type Inference</h2>
<blockquote>
<p>Reference: <a href="https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L468">https://github.com/golang/go/blob/go1.18/src/cmd/compile/internal/types2/infer.go#L468</a></p>
</blockquote>
<h3 id="core-type-processing-for-type-parameters">Core Type Processing for Type Parameters</h3>
<p>In the first phase of constraint type inference, a new concept called “core type” is introduced. We won’t go into much detail here: roughly, it is the single underlying type shared by all the types in a constraint’s type set, when such a type exists. Using core types and known type arguments, some inferences can be completed.</p>
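<p>The <code>single &amp;&amp; !core.tilde</code> branch below can be triggered by a type parameter that never appears in the arguments but whose constraint names exactly one type. <code>Tag</code> is a made-up function for illustration:</p>

```go
package main

import "fmt"

// U's constraint has a single type (string) and no tilde, so even
// though no argument mentions U, constraint type inference sets
// U = string (the single && !core.tilde case).
func Tag[T any, U string](v T) (T, U) {
	var u U = "tag"
	return v, u
}

func main() {
	v, u := Tag(42) // T -> int from the argument, U -> string from the constraint
	fmt.Println(v, u)
}
```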
<pre class="hljs"><code>                <span class="hljs-keyword">for</span> i, tpar := <span class="hljs-keyword">range</span> tparams {
                        <span class="hljs-comment">// If there is a core term (i.e., a core type with tilde information)</span>
                        <span class="hljs-comment">// unify the type parameter with the core type.</span>
                        <span class="hljs-keyword">if</span> core, single := coreTerm(tpar); core != <span class="hljs-literal">nil</span> {
                                <span class="hljs-comment">// A type parameter can be unified with its core type in two cases.</span>
                                tx := u.x.at(i)
                                <span class="hljs-keyword">switch</span> {
                                <span class="hljs-keyword">case</span> tx != <span class="hljs-literal">nil</span>:

                                        ......

                                        <span class="hljs-keyword">if</span> !u.unify(tx, core.typ) {
                                                <span class="hljs-comment">// TODO(gri) improve error message by providing the type arguments</span>
                                                <span class="hljs-comment">//           which we know already</span>
                                                <span class="hljs-comment">// Don&#x27;t use term.String() as it always qualifies types, even if they</span>
                                                <span class="hljs-comment">// are in the current package.</span>
                                                tilde := <span class="hljs-string">&quot;&quot;</span>
                                                <span class="hljs-keyword">if</span> core.tilde {
                                                        tilde = <span class="hljs-string">&quot;~&quot;</span>
                                                }
                                                check.errorf(pos, <span class="hljs-string">&quot;%s does not match %s%s&quot;</span>, tpar, tilde, core.typ)
                                                <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, <span class="hljs-number">0</span>
                                        }

                                <span class="hljs-keyword">case</span> single &amp;&amp; !core.tilde:
                                        <span class="hljs-comment">// The corresponding type argument tx is unknown and there&#x27;s a single</span>
                                        <span class="hljs-comment">// specific type and no tilde.</span>
                                        <span class="hljs-comment">// In this case the type argument must be that single type; set it.</span>
                                        u.x.set(i, core.typ)

                                <span class="hljs-keyword">default</span>:
                                        <span class="hljs-comment">// Unification is not possible and no progress was made.</span>
                                        <span class="hljs-keyword">continue</span>
                                }

                                ......

                        }
                }

</code></pre>
<h3 id="mapping-simplification">Mapping Simplification</h3>
<p>The second phase of constraint type inference repeatedly substitutes the known mappings into one another until nothing changes, e.g. rewriting <code>P -&gt; *T</code> into <code>P -&gt; *int</code> once <code>T -&gt; int</code> is known.</p>
<pre class="hljs"><code>                smap := makeSubstMap(tparams, types)
                n := <span class="hljs-number">0</span>
                <span class="hljs-keyword">for</span> _, index := <span class="hljs-keyword">range</span> dirty {
                        t0 := types[index]
                        <span class="hljs-keyword">if</span> t1 := check.subst(nopos, t0, smap, <span class="hljs-literal">nil</span>); t1 != t0 {
                                types[index] = t1
                                dirty[n] = index
                                n++
                        }
                }

</code></pre>
]]></content:encoded>
  </item>
  <item>
    <title>Notes on A Philosophy of Software Design</title>
    <link>https://www.yujiachen.com/software-design-book-notes/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/software-design-book-notes/</guid>
    <pubDate>Sun, 09 Oct 2022 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<p>About six months ago, I borrowed <em>A Philosophy of Software Design</em> from a friend who hadn’t finished it yet, and then left it sitting unread myself. During the National Day holiday, I finished it over three on-and-off days. The most useful part for me was the revised edition’s pushback against some views in <em>Clean Code</em>. Software engineering is still a young discipline: there are best practices that many people agree on, but almost nothing is universally correct. I had also seen many of this book’s ideas in other books, sometimes in a stronger form. For example, this book explains why inheritance can be problematic but spends less time on alternatives, while <em>Effective Java</em> explicitly argues to “prefer composition over inheritance,” and <em>The Pragmatic Programmer</em> also discusses extensibility.</p>
<p>The book’s discussions on complexity are very helpful for software development, while the other parts are more ordinary—experienced developers can skim them quickly.</p>
<h2 id="definition-of-complexity">Definition of Complexity</h2>
<p>Summarized by a formula: software complexity is the sum of each component’s complexity weighted by the fraction of development time spent working in that component. That is, to reduce overall complexity, we can either lower individual component complexity or take a holistic view and place highly complex logic in modules that are touched less often.</p>
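<p>In the book’s notation, where <code>c_p</code> is the complexity of part <code>p</code> and <code>t_p</code> is the fraction of development time spent working in that part, this reads:</p>

```latex
C = \sum_{p} c_p \cdot t_p
```

<p>Either factor can be attacked: simplify a part (lower <code>c_p</code>), or move complexity into parts rarely worked on (lower its <code>t_p</code>).</p>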
<p>Growing complexity brings 3 problems:</p>
<ol>
<li>Increased modification complexity—one change requires changes everywhere.</li>
<li>Heavier cognitive burden—needing to understand too many concepts before knowing how to make changes.</li>
<li>Unknown unknowns—not knowing what knowledge is needed to make a change safely, let alone how to acquire it. An engineer may make an incorrect change, or miss a required one, with no way of knowing.</li>
</ol>
<p>Tangled dependencies between code lead to problems 1 and 2, while missing critical information (e.g., inconsistencies between code and documentation) leads to problems 3 and 2.</p>
<h2 id="strategic-programming-vs-tactical-programming">Strategic Programming vs Tactical Programming</h2>
<ul>
<li><strong>Strategic programming</strong>: Prioritizes overall software design. Minimum viability and shortest time are not considered in isolation (though reasonable development costs and working code are important parts of design).</li>
<li><strong>Tactical programming</strong>: Produces minimally viable code in the shortest possible programming time.</li>
</ul>
<p>Note that “strategic” here still operates within agile cycles, not the waterfall model. In waterfall, projects are too large with execution times and feedback cycles too long, which is unfavorable for designing better software architecture.</p>
<p>The book says that investing 10%-20% more time beyond tactical programming can transform it into strategic programming. I disagree—strategic programming generates a lot of throwaway work from various comparisons and research. Overall, I believe <strong>rigorous</strong> strategic programming costs about twice as much as <strong>bare-minimum</strong> tactical programming. However, I very much agree with one illustration in the book: with strategic programming, progress and time spent have a nearly linear relationship, while with tactical programming, achieving one unit of progress requires nearly exponential development time investment. This is because strategic programming leverages good architecture to hide complexity, making each unit of progress as independent as possible from existing code. Tactical programming, on the other hand, often means every piece of new code must account for existing old code, making the burden increasingly heavy.</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/software-design-book-notes/en/image_1.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/software-design-book-notes/en/image_1.png" alt="" class="article-image" width="640" height="391" />
</picture>
</p>
<p>The 10-20% time mentioned in the book reminds me of:</p>
<ul>
<li>Google’s 20% free work time.</li>
<li>Asana dedicates one week per quarter specifically to software refactoring. (Accounting for vacation and the fact that engineers don’t spend all their time coding, this is also close to 10%?) I read about this in <em>The Effective Engineer</em>—I helped review the Chinese edition, so go buy it!</li>
</ul>
<p>The book argues that good code attracts better developers, and that some startups believe tactical programming lets them move faster and they can hire better developers to fix the code later once the business takes off—but this approach is inadvisable. I think this is certainly true for a technology company, but for a business-driven company, it may depend on whether the business itself actually needs strong technical support. Some businesses genuinely don’t.</p>
<h2 id="modules-should-be-deep">Modules Should Be Deep</h2>
<p>A module is “deep” when it hides substantial implementation complexity behind a small and convenient interface.</p>
<p><img src="https://img2.doubanio.com/view/thing_review/l/public/p7997103.jpg" alt="" /></p>
<p>The book gives a fascinating example: GC in Go and Java. The implementation is very complex, but GC means the language does not need to expose manual memory-management interfaces to users. Even when we add substantial internal complexity, good modularization can still reduce the number of exposed interfaces.</p>
<p>The GC example feels similar to the server-compute/client-render pattern: centralizing complex computation in one place means clients do not each need to maintain their own copy of that logic.</p>
<p>The author also notes that some modern developers tend to make functions and classes smaller and smaller, which can also make them shallow (not deep). This is worth keeping in mind during everyday development.</p>
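<p>The deep-module idea can be sketched in miniature (the class name and the caching example are mine, not the book’s):</p>
<pre class="hljs"><code>class Store:
    """A toy "deep" module: one small method hides loading and caching."""

    def __init__(self, loader):
        self._loader = loader  # hidden: how values are produced
        self._cache = {}       # hidden: how values are cached

    def fetch(self, key):
        # The entire public interface. Callers never see the cache,
        # just as Go and Java callers never see the GC's internals.
        if key not in self._cache:
            self._cache[key] = self._loader(key)
        return self._cache[key]
</code></pre>
<p>Swapping the plain dict for an LRU cache later would not change the interface at all, which is exactly what makes the module deep rather than shallow.</p>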
<h2 id="information-hiding-and-leaking">Information Hiding (and Leaking)</h2>
<p>This chapter argues that different modules should own orthogonal (unrelated) concerns. I summarized the traits emphasized by the examples:</p>
<ul>
<li><strong>Low coupling</strong>
<ul>
<li>After breaking a large function into multiple smaller ones, those smaller functions should not depend on each other. Pay special attention to call order: whether a function executes correctly should not depend on specific preceding functions having run.</li>
<li>It’s unreasonable for both a file-read module and a file-write module to carry file-format parsing knowledge. Consider merging them into one read-write module, or extracting file-format parsing into a separate module.</li>
</ul>
</li>
<li><strong>High cohesion</strong>
<ul>
<li>Code generating an HTTP response should not first set the HTTP version and then delegate the response to other modules. Instead, responsibility for setting protocol details should be encapsulated in the relevant modules.</li>
</ul>
</li>
<li><strong>Hide implementation</strong>
<ul>
<li>A class that internally uses a map should not expose that map directly. Otherwise external code can mutate internal state, and replacing the map with another implementation later becomes much harder.</li>
</ul>
</li>
</ul>
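<p>A minimal sketch of the last point (the class and method names are my own invention):</p>
<pre class="hljs"><code>class WordCounter:
    """Hides its internal dict so callers cannot mutate it directly."""

    def __init__(self):
        self._counts = {}  # internal state, never handed out

    def add(self, word):
        self._counts[word] = self._counts.get(word, 0) + 1

    def count(self, word):
        # Return a plain value, not the container that stores it, so the
        # dict can later be replaced without breaking external code.
        return self._counts.get(word, 0)
</code></pre>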
<h2 id="the-more-general-the-simpler">The More General, the Simpler</h2>
<p>(My loose translation—the original heading is “General-Purpose Modules are Deeper.” In this book, “deep” refers to modules that hide significant complexity behind a simple interface.)</p>
<p>My understanding is: once complexity is reduced to a certain point, it is often transferred rather than eliminated. Distributing complexity between interface and implementation determines how often users encounter that complexity. Usually, the simpler and more general the interface, the more complex the implementation, and thus the deeper the module.</p>
<p>The author also emphasizes that designing general interfaces does not mean over-designing. A general interface may accommodate future needs, but those needs may never arrive, or may even break the existing interface. So we can preserve interface generality while implementing only current requirements. If new needs emerge later, we can extend internals behind the same interface and make the module deeper.</p>
<h2 id="different-layers-different-abstractions">Different Layers, Different Abstractions</h2>
<p>Different layers should have different abstractions. If different layers share the same abstraction, it often indicates that the code at those layers isn’t deep enough.</p>
<p>For example, the classic pass-through method: in a call chain A -&gt; B -&gt; C, if B only forwards requests to C, then A should call C directly instead of going through B. Adding code always increases complexity, so we must evaluate whether new code brings enough benefit. In this case, B adds complexity without benefit and increases future maintenance costs.</p>
<p>The book also discusses decorators and pass-through variables.</p>
<p>Regarding decorators, Java and Python provide relatively ergonomic language support. In other languages, the author suggests considering alternatives to decorators:</p>
<ul>
<li>Add the new functionality directly into the decorated method or object.</li>
<li>If the decorator adds functionality for special cases while the decorated method handles general cases, consider whether these special cases can be handled elsewhere.</li>
<li>Add the new functionality into other existing decorators.</li>
</ul>
<p>Pass-through variables are values that must be threaded through every method in a call chain, such as Go’s <code>ctx context.Context</code>. The author suggests using something like thread-local storage to keep context in an instance, but I don’t think there is a universally good solution today. A potentially cleaner approach might be thread-scoped dependency injection (for example, via frameworks like Guice): delegate thread-local read/write mechanics to the framework, let it inject available values automatically, and expose only remaining values to callers. I have not used this approach in production yet.</p>
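<p>As a rough analogue of the thread-local idea in Python (my sketch; the book does not prescribe a library), <code>contextvars</code> lets a value set once at the boundary be read deep in the call chain without a pass-through parameter:</p>
<pre class="hljs"><code>import contextvars

# Request-scoped value, set once at the entry point instead of being
# threaded through every method signature in the chain.
request_id = contextvars.ContextVar("request_id", default="unknown")

def handle(rid):
    request_id.set(rid)  # set once at the boundary...
    return inner()

def inner():
    return log_line()    # ...no pass-through parameter needed here

def log_line():
    return "request=" + request_id.get()
</code></pre>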
<h2 id="in-defense-of-long-functions">In Defense of Long Functions</h2>
<p>The book mentions Robert Martin’s <em>Clean Code</em>, which advocates that functions should be as short as possible.</p>
<p>However, the author disagrees that all functions should be as short as possible; his guideline is instead: “Each method should do one thing and do it completely.” If a function is hard to decompose into shorter independent units, or if the extracted functions still depend on shared context, that kind of decomposition can increase complexity and maintenance cost.</p>
<h2 id="comments">Comments</h2>
<h3 id="should-we-write-comments">Should We Write Comments?</h3>
<p>Here the author again explicitly disagrees with <em>Clean Code</em>, which treats comments as a “necessary evil” and argues that comments often indicate failure to write expressive code.</p>
<p>The author argues that comments and code are complementary: together they reduce complexity, and missing comments can increase it. For example, comments can reduce the need for excessively long function names (the book’s example: <code>isLeastRelevantMultipleOfNextLargerPrimeFactor</code>) and can avoid forced decomposition into many short but interdependent functions.</p>
<p>Function comments let callers use code without reading implementations, which is itself a form of abstraction. Comments can also capture design information that cannot be fully expressed in code alone.</p>
<h3 id="comments-should-describe-non-obvious-parts-of-code">Comments Should Describe Non-Obvious Parts of Code</h3>
<p>The author discourages writing information in comments that can be directly derived from reading the code. The book’s counterexample:</p>
<pre class="hljs"><code>ptr_copy = get_copy(obj)   # Get pointer copy
if is_unlocked(ptr_copy):  # Is obj free?
    return obj             # Return current obj
</code></pre>
<p>Comments can be classified into these categories, each with different standards:</p>
<ul>
<li><strong>Lower-level comments</strong>: Help developers understand certain details in the code more precisely.</li>
<li><strong>Higher-level comments</strong>: Help developers intuitively understand what the code does without reading it.</li>
<li><strong>Interface documentation</strong>: Doesn’t describe implementation details, but lets developers know how to use the corresponding interface. Good interface documentation represents good abstraction; if interface documentation must describe internal implementation, the underlying implementation may be too shallow.</li>
<li><strong>Implementation comments</strong>: Describe what the code does and why, but not how, because the code itself already contains that information.</li>
</ul>
<p>For cross-module designs shared by multiple modules, the author suggests writing notes in shared structural definitions, or adding a <code>designNotes</code> file to the source code.</p>
<h3 id="write-comments-first-code-second">Write Comments First, Code Second</h3>
<p>The author believes writing comments after finishing code isn’t a good habit, because writing comments after all code is done increases the resistance to writing them. Also, after finishing code, some critical information may have already been discarded by the brain and is difficult to recover.</p>
<p>This is similar to how I write reading notes immediately after finishing a book—it’s the most efficient approach.</p>
<p>So the author advocates writing comments first, then code.</p>
<p>A common counterargument is that code usually goes through several rounds of changes before it stabilizes, so writing comments only at the end seems more efficient. But the author argues that repeatedly modifying code is more expensive than iterating on comments first. Writing comments in advance helps stabilize structure; if the design keeps changing, revise the comments first, where iteration is cheaper.</p>
]]></content:encoded>
  </item>
  <item>
    <title>Don&apos;t Be a Programmer</title>
    <link>https://www.yujiachen.com/dont-be-a-programmer/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/dont-be-a-programmer/</guid>
    <pubDate>Sun, 01 May 2022 00:00:00 GMT</pubDate>
    <category>Essays</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<p>Late at night, taking a hot shower alone, hiding behind the bathroom curtain, as if hiding in a greenhouse—or perhaps a petri dish. Not wanting to stop the water, not wanting to leave this enclosed space. I feel so much pressure. At times like these, I often do things to relieve stress and anxiety, but the process of relief wastes time, which only adds to my anxiety. Being a programmer often means facing such situations. The imagined life is filled with glamorous phrases like “financial freedom,” but in reality, we’re stuck in the daily grind.</p>
<p>I suddenly remembered a phrase I once read: “Think more about how to build the product, less about how to be a product manager.” I think this can be extended to programming: “Think more about how to write programs, less about how to be a programmer.” That’s right—you don’t need to become a programmer, spending most of your day staring at a screen with swollen eyes. Your body gradually becoming out of shape, wanting to exercise but never finding the time. An increasingly rigid, mechanical way of thinking, and less and less contact with people. The ideal is beautiful, but reality is so disappointing. After all, programmers of the previous era had so many opportunities for financial freedom. So many stories of wealth creation, even world-changing tales, happened to programmers. But that era has passed. We can only watch its afterglow as the sun slowly sets.</p>
<p>Recently, while preparing to change jobs, I discovered that the interview prep materials are full of things I don’t know. It’s hard not to feel anxious. And I noticed that even summarizing interview prep has become someone’s thriving side business and personal brand. Envy and jealousy—these are negative emotions programmers often have. You wonder why that person is better than you; why they can work so hard while you can’t; why some things come naturally to them but not to you. The longing for opportunities and uncertainty makes this industry exciting. It inflates your jealousy endlessly, making you believe that through effort you can gain infinitely, even making you feel that the distance between yourself and the world, between yourself and your dreams, is constantly shrinking. But is that really true? We always have all kinds of pressure. Society has developed quite a stereotypical image of programmers. How I wish there wasn’t so much overtime. How I wish life could be easy while also becoming a great developer, showing off online and being envied by others.</p>
<p>With such desires, the work of a programmer can feel cold and abstract. Exceptional comprehension, even wisdom, yet losing so much rich emotion. Life becomes increasingly monotonous, even childish.</p>
<p>Why did I choose to become a programmer? After finishing my shower, I looked at the night sky outside the window and asked myself. I think it was the inspiration and sentiment from various startup stories, and also a love for abstraction. A program is essentially a simpler language, yet with this language we can describe all kinds of problems and solutions in the world. Language is so important—just as I’m writing this article in Chinese right now. Language—when we discuss programming, what are we really discussing? We discuss programming itself: how code abstracts reality, how to design, what makes a good or bad design. But we’re not discussing programmers.</p>
<p>Can you be a programmer and still love life? Of course you can. I don’t know how, but everyone has their own solution. Though I’m also anxious about my future—I don’t know how many more years I can stay in this industry, when I’ll become obsolete, what my income will be, what kind of life I’ll live, or how many layoffs await me. But this uncertainty brings so much excitement. Disappointment and excitement rise and fall like tides. This is what makes this industry special—risk and reward always come together.</p>
<p>Embrace more uncertainty. Embrace true passion. Use language to reorganize life. Love code, love programming, love structure and design patterns—but don’t become what people call a programmer.</p>
]]></content:encoded>
  </item>
  <item>
    <title>Key Iterations: Trustworthy Online Controlled Experiments</title>
    <link>https://www.yujiachen.com/key-iterations-trustworthy-online-controlled-experiments/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/key-iterations-trustworthy-online-controlled-experiments/</guid>
    <pubDate>Fri, 24 Dec 2021 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<p>This article is a summary with excerpts from <em>Trustworthy Online Controlled Experiments</em> (the Chinese edition is titled “Key Iterations”). More specifically, these are notes from an Airbnb reader on a technical book whose authors and translators are also from Airbnb. Please point out any errors; comments are welcome.</p>
<blockquote>
<p>If all you have is a hammer, everything looks like a nail. —Abraham Maslow</p>
</blockquote>
<h1 id="preparing-for-tests">Preparing for Tests</h1>
<h2 id="to-succeed-fail-more">To Succeed, Fail More</h2>
<p>The cognitive shift brought by experiments is also related to the gap between expectations and reality.</p>
<p>If you think something will happen and it does, you won’t learn much. If you think something will happen but it doesn’t, you’ll learn something important. If you originally thought something was insignificant but it produced surprising or breakthrough results, you’ll learn something extremely valuable.</p>
<p>Teams develop a product feature because they believe it’s useful. However, in many domains, most ideas fail to improve key metrics. At Microsoft, only one-third of tested ideas improved target metrics. In already highly optimized product areas like Bing and Google, success is even harder—test success rates are only 10%-20%. Bing’s team of hundreds of ranking algorithm engineers has an annual goal of improving a single OEC metric by 2%.</p>
<h2 id="testing-is-tactical-decision-making-is-strategic">Testing Is Tactical, Decision-Making Is Strategic</h2>
<p>When testing bold ideas, the way experiments are run and evaluated also changes.</p>
<ul>
<li><strong>Experiment duration</strong>: Bold ideas may need longer to show effects. A small temperature increase causes no visible change in a room, but once the melting point is reached, the ice begins to melt.</li>
<li><strong>Number of ideas tested</strong>: Each experiment tests only one specific tactic—a component of the overall strategy. Individual experiment failures don’t mean the overall strategy is flawed. But if many tactics evaluated through controlled experiments fail, it’s likely a strategy problem. Using experiments to validate strategy is extremely costly; strategy should primarily be informed by existing data.</li>
</ul>
<p>One interpretation of OKR is: define the strategic O (Objective), then derive a series of tactics from the KRs (Key Results) to achieve those KRs and ultimately feed back into O.</p>
<h2 id="what-are-significant-results">What Are Significant Results?</h2>
<p>Not all statistically significant results have practical meaning. Take revenue per user as an example: how large a difference is important from a business perspective? In other words, what level of change is <strong>practically significant</strong>? Building this substantive boundary is important—it helps understand whether a difference is worth the cost of the corresponding change. If your website, like Google and Bing, has billions of dollars in revenue, a 0.2% change is practically significant. By comparison, a startup might consider even 2% growth too small, as they’re pursuing 10% or greater improvements.</p>
<p>In 2012, every 10-millisecond performance improvement at Bing (1/30th of a blink) was enough to justify the cost of hiring a full-time engineer for a year. By 2015, that number had dropped to 4 milliseconds.</p>
<h1 id="setting-up-tests">Setting Up Tests</h1>
<h2 id="variant-assignment">Variant Assignment</h2>
<p>How to assign experiments to users:</p>
<ul>
<li><strong>Single-layer method</strong>: Bucket users—say into 1,000 buckets—and use two user buckets per experiment.</li>
<li><strong>Parallel experiments</strong>:
<ul>
<li>Divide the code architecture into layers, with only one experiment per layer.</li>
<li>Don’t worry about experiments interfering with each other; assume all experiments are orthogonal. If non-orthogonal experiments exist, rely on the execution team to discover and resolve conflicts.</li>
</ul>
</li>
</ul>
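<p>The single-layer method can be sketched as follows (bucket counts and function names are mine, not the book’s):</p>
<pre class="hljs"><code>import hashlib

N_BUCKETS = 1000  # hash each user id into 1,000 stable buckets

def bucket(user_id):
    # A stable hash keeps each user in the same bucket across sessions.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

def variant(user_id, control_bucket, treatment_bucket):
    # Each experiment reserves two buckets: one control, one treatment.
    b = bucket(user_id)
    if b == control_bucket:
        return "control"
    if b == treatment_bucket:
        return "treatment"
    return "not in experiment"
</code></pre>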
<h2 id="metric-classification">Metric Classification</h2>
<ul>
<li><strong>Goal metrics</strong>, also called <strong>success metrics</strong> or <strong>North Star metrics</strong>, reflect what the organization ultimately cares about. Being able to clearly articulate your goal in words is important, because the translation from goal to metric is usually imperfect. Your goal metric may only be an approximation of what you truly care about and needs iterative improvement over time. Helping people understand this limitation—and the distinction between the metric and the goal statement—is crucial for keeping the company on the right track.</li>
<li><strong>Driver metrics</strong>, also called <strong>signpost metrics</strong>, <strong>proxy metrics</strong>, <strong>indirect metrics</strong>, or <strong>predictive metrics</strong>, are generally shorter-term than goal metrics, changing faster and more sensitively. Driver metrics reflect a mental causal model—a hypothesis about how to make the organization more successful, i.e., assumptions about the drivers of success, rather than what success itself looks like.</li>
<li><strong>Guardrail metrics</strong> ensure we make the right trade-offs on the path to success and don’t violate important constraints.</li>
</ul>
<h2 id="metric-selection">Metric Selection</h2>
<p>Metric selection, beyond statistical validity, also depends on the designer’s value choices (“people are ends, not means”). Metrics represent how you want the system to understand data and what you care about. For example, use P95/P99 to focus on extreme values, or mean and median for overall trends.</p>
<h3 id="proxy-metrics">Proxy Metrics</h3>
<p>Some subscription services renew on an annual basis. Unless you’re willing to run a year-long experiment, it’s hard to measure the impact on renewal rates. In such cases, we can’t use renewal rate as the experiment metric and instead need to find proxy metrics, such as service usage, which can indicate user satisfaction early and ultimately influence renewal rates.</p>
<h3 id="metric-normalization">Metric Normalization</h3>
<p>Even if you want to increase total revenue, it’s not recommended to use total revenue as the metric, since it depends on the number of users in each variant. Even with equal allocation, actual user counts may differ due to randomness. We recommend normalizing key metrics by actual sample size—hence <strong>revenue per user</strong> is a good overall evaluation criterion.</p>
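<p>A toy illustration of why normalization matters (all numbers invented):</p>
<pre class="hljs"><code># Unequal group sizes distort totals even with equal traffic allocation.
control_revenue, control_users = 10500.0, 1000
treatment_revenue, treatment_users = 10450.0, 950

# Totals suggest control "won"...
totals_favor_control = control_revenue - treatment_revenue  # 50.0

# ...but revenue per user shows treatment actually performed better.
control_rpu = control_revenue / control_users        # 10.50
treatment_rpu = treatment_revenue / treatment_users  # 11.00
</code></pre>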
<h3 id="duality">Duality</h3>
<p>Sometimes it’s simpler to precisely measure what you <em>don’t</em> want rather than what you do want—such as user dissatisfaction or unhappiness. Establishing causal relationships from observational data is difficult, but a carefully conducted observational study can help disprove false hypotheses.</p>
<h3 id="preventing-sub-metric-conflicts">Preventing Sub-Metric Conflicts</h3>
<p>Carefully examine each sub-metric under a major metric—sub-metrics may conflict with each other. For example, search engine queries per user = sessions per user × unique queries per session.</p>
<p>A session starts when a user begins their first query and ends when the user has no activity on the search engine for 30 minutes.</p>
<ul>
<li><strong>Sessions per user</strong> is most likely a positive metric—the more users like the search engine, the more frequently they use it, increasing sessions per user.</li>
<li><strong>Unique queries per session</strong> should decrease, which conflicts with the overall goal. A decrease in unique queries per session means users need fewer steps to solve their problem—but it could also mean users are abandoning their queries. We should ideally pair this metric with a query abandonment metric.</li>
</ul>
<h2 id="pitfalls">Pitfalls</h2>
<h3 id="beware-of-extreme-results">Beware of Extreme Results</h3>
<p>When we see surprisingly positive results (e.g., major improvements to key metrics), we tend to construct a narrative around them, share, and celebrate. When results are surprisingly negative, we tend to find some limitation or minor flaw in the study and dismiss it. Experience tells us that many extreme results are more likely caused by instrumentation (e.g., logging) errors, data loss (or duplication), or calculation errors.</p>
<h3 id="simpson-s-paradox">Simpson’s Paradox</h3>
<p>Two groups of data that satisfy a certain property when examined separately may lead to the opposite conclusion when combined.</p>
<p><a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">Simpson’s paradox - Wikipedia</a></p>
<p>When an experimental feature causes individuals to migrate between two mutually exclusive and exhaustive segments, similar situations can occur (though the book emphasizes this isn’t Simpson’s paradox per se).</p>
<p>For example, a feature that moves Level 2 users back to Level 1 might improve both levels’ data—Level 2’s user pool removes poorly performing users while Level 1’s pool gains better-performing users—but overall performance across both levels may stay the same or even worsen.</p>
<p>Ideally, segmentation should only use values determined before the experiment, so the experiment doesn’t cause users to change segments. In practice, however, this is hard to enforce in some cases.</p>
<h3 id="changing-user-base">Changing User Base</h3>
<p>When computing metrics or running experiments, all data comes from the existing user base. Especially for early-stage products and startups, early users may not represent the user base the business hopes to acquire for long-term growth.</p>
<h3 id="uncertain-confounding-factors">Uncertain Confounding Factors</h3>
<p>Shared resources and dependencies can cause experiments to fail:</p>
<ul>
<li><strong>Market resources</strong>: Homestays and ride-sharing have this problem—once supply becomes constrained, one group capturing too many resources deprives the other group of its original resources.</li>
<li><strong>Advertising campaign budgets</strong>.</li>
<li><strong>Model training for recommendation systems</strong>: If both variants’ models are trained on the full user data, knowledge is likely to leak between the two models within days.</li>
<li><strong>CPU or other computational resources</strong>.</li>
</ul>
<h3 id="others">Others</h3>
<p>Goodhart’s Law: When a measure becomes a target, it ceases to be a good measure.</p>
<p>The Lucas Critique observes that relationships found in historical observational data cannot be considered structural or causal. Policy decisions change the structure of economic models, so historical correlations no longer hold. Over time, even the causal relationships we previously relied on may change.</p>
<h2 id="long-running-experiments">Long-Running Experiments</h2>
<p>Reasons why short-term and long-term experiment results may differ:</p>
<ul>
<li>User learning effects.</li>
<li>Network effects.</li>
<li>Delayed experience and evaluation.</li>
<li>Ecosystem changes:
<ul>
<li>Launching other new features.</li>
<li>Seasonality.</li>
<li>Competitive landscape.</li>
<li>Government policies.</li>
<li>Concept drift.</li>
<li>Software performance degradation.</li>
</ul>
</li>
</ul>
<p>Methods for improving long-running experiments:</p>
<ul>
<li><strong>Cohort analysis</strong>: Long-term tracking of a stable cohort’s user data.</li>
<li><strong>Post-period analysis</strong>: After running for a period, close the experiment (or roll it out to all users), then continue measuring the difference between control and treatment groups.</li>
<li><strong>Time-staggered experiments</strong>: Use experiment start time t as a variable to create multiple treatment groups, comparing user performance differences across different values of t.</li>
<li><strong>Holdback and reversal experiments</strong>:
<ul>
<li>Holdback: Keep 10% of users in the control group.</li>
<li>Reversal: several months after the feature has shipped to everyone, revert a small fraction of users (e.g., 10%) to the control experience, then measure their performance.</li>
</ul>
</li>
</ul>
<h2 id="ethics">Ethics</h2>
<ul>
<li><strong>Respect for persons</strong>: Respect experiment participants as autonomous individuals and protect them when they lack this capacity. This principle focuses on transparency, truthfulness, and voluntariness (choice and consent).</li>
<li><strong>Beneficence</strong>: Protect people from harm. Properly assess risks and benefits, and appropriately balance them when reviewing proposed research.</li>
<li><strong>Justice</strong>: Ensure participants are not exploited and that risks and benefits are fairly distributed.</li>
</ul>
<h2 id="an-example-designing-a-slowdown-experiment">An Example: Designing a Slowdown Experiment</h2>
<p>Why can a slowdown experiment measure the impact of speed improvement on a product?</p>
<p>Assume that the relationship between a relevant metric (e.g., revenue) and performance (e.g., speed) can be well approximated by a linear fit near the current value. This is essentially a first-order Taylor expansion, or linear approximation.</p>
<p>That is, if we improve performance, the resulting metric change can be approximated by the metric change obtained from degrading performance.</p>
<p>Two additional reasons support this assumption:</p>
<ul>
<li>From our own experience as users, faster is always better for website speed.</li>
<li>We can also verify the linearity assumption: for example, test the metrics at 100 ms and 250 ms of added delay and check whether these points, together with the no-delay baseline, fall on a straight line.</li>
</ul>
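<p>A small numeric sketch of the idea (delays and metric values invented):</p>
<pre class="hljs"><code># Fit a line through two deliberate-slowdown points, then extrapolate
# to estimate the gain from a 50 ms speedup.
d1, m1 = 100, -0.8   # +100 ms delay cost 0.8% of revenue
d2, m2 = 250, -2.0   # +250 ms delay cost 2.0% of revenue

slope = (m2 - m1) / (d2 - d1)          # revenue change per ms of delay
predicted_speedup_gain = -slope * 50   # expected gain from a 50 ms speedup
</code></pre>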
<h2 id="alternative-methods">Alternative Methods</h2>
<p>How do we measure counterfactuals? Consider an experiment with a human population as test subjects. To determine the deviation between counterfactual and reality:</p>
<pre class="hljs"><code>  Outcome of affected group - Outcome of unaffected group
= (Outcome of affected group - Outcome of affected group had they not been affected) +
  (Outcome of affected group had they not been affected - Outcome of unaffected group)
= Effect of the change on the affected group + Selection bias
</code></pre>
<p>What we can observe is the performance of affected vs. unaffected groups, but what we want to know is the effect of the change on the affected group. So we want the selection bias in the system to be zero.</p>
<p>A/B testing is a system with zero selection bias, but in many real situations we can’t apply A/B testing to real problems—for example, there’s no randomization unit, the experiment would waste significant (opportunity) costs, or the experiment would be unethical.</p>
<p>These methods generally serve as alternatives to A/B testing, but they usually have larger estimation error and bias:</p>
<ul>
<li><strong>Interrupted time series</strong>: Split the experiment time into many small segments, uniformly (and randomly, or just round-robin) assign each segment to different treatments, then measure and analyze (some time-series models can be useful, such as Bayesian structural time series analysis).</li>
<li><strong>Interleaved experiments</strong>: Often used for ranking algorithms—show Algorithm A at odd positions and Algorithm B at even positions to compare differences.</li>
<li><strong>Regression discontinuity design</strong>: Consider this method when the affected population is identified by a clear threshold. Use the group just below the threshold as the control and the group just above as the treatment, comparing these two groups to reduce selection bias. For example, studying the impact of scholarships on students where students scoring 80+ receive scholarships—we study the 80-84 and 75-79 score groups.</li>
<li><strong>Instrumental variables and natural experiments</strong>: If we can find an instrument within the studied process that helps achieve random grouping, we can use it for causal analysis. For example, draft lotteries for military conscription or school assignment—if the lottery process is random, it can serve as the randomization unit.</li>
<li><strong>Propensity score matching</strong>: Although we can’t randomize, if we can identify confounding factors (covariates) between groups that might affect our judgment of the variable, or if we can determine that unidentifiable confounding factors won’t affect analysis, then we can analyze causal relationships between variables and outcomes. This is usually very difficult to achieve.</li>
<li><strong>Difference-in-differences</strong>: Identify a control group as similar as possible to the treatment group, assuming both groups share the same trends. For example, select two cities with similar characteristics as control cities—enable the new feature in one but not the other, then compare differences between the two cities.</li>
</ul>
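<p>Difference-in-differences, the last method above, reduces to simple arithmetic (the city metrics are invented):</p>
<pre class="hljs"><code># Assume both cities share the same underlying trend; the treatment
# effect is then the treated city's change minus the control's change.
control_before, control_after = 100.0, 110.0  # control city metric
treated_before, treated_after = 102.0, 120.0  # city with the feature

trend = control_after - control_before             # 10.0 shared trend
effect = (treated_after - treated_before) - trend  # 8.0 estimated effect
</code></pre>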
<p>These types of analyses should be used carefully, as they may fail due to:</p>
<ul>
<li><strong>Common causes</strong>: Smaller palms correlate with longer lifespans—actually, women live longer.</li>
<li><strong>Spurious or deceptive correlations</strong>: The length of words in the National Spelling Bee positively correlates with the number of people killed by venomous spiders that year—but this is obviously spurious.</li>
</ul>
<h1 id="mathematical-derivation">Mathematical Derivation</h1>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/key-iterations-trustworthy-online-controlled-experiments/en/image_1.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/key-iterations-trustworthy-online-controlled-experiments/en/image_1.png" alt="" class="article-image" width="1055" height="506" />
</picture>
</p>
<p>We use a two-sample t-test to calculate the p-value, which matches the actual experimental setup.</p>
<section><eqn><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>T</mi><mo>=</mo><mfrac><mi>Δ</mi><msqrt><mrow><mtext>var</mtext><mo stretchy="false">(</mo><mi>Δ</mi><mo stretchy="false">)</mo></mrow></msqrt></mfrac></mrow><annotation encoding="application/x-tex"> T = \frac{\mathit{\Delta}}{\sqrt{\text{var} (\mathit{\Delta})}} </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">T</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.4903em;vertical-align:-1.13em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3603em;"><span style="top:-2.175em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.935em;"><span class="svg-align" style="top:-3.2em;"><span class="pstrut" style="height:3.2em;"></span><span class="mord" style="padding-left:1em;"><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord mathit">Δ</span><span class="mclose">)</span></span></span><span style="top:-2.895em;"><span class="pstrut" style="height:3.2em;"></span><span class="hide-tail" style="min-width:1.02em;height:1.28em;"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.28em" viewBox="0 0 400000 1296" preserveAspectRatio="xMinYMin slice"><path d="M263,681c0.7,0,18,39.7,52,119
c34,79.3,68.167,158.7,102.5,238c34.3,79.3,51.8,119.3,52.5,120
c340,-704.7,510.7,-1060.3,512,-1067
l0 -0
c4.7,-7.3,11,-11,19,-11
H40000v40H1012.3
s-271.3,567,-271.3,567c-38.7,80.7,-84,175,-136,283c-52,108,-89.167,185.3,-111.5,232
c-22.3,46.7,-33.8,70.3,-34.5,71c-4.7,4.7,-12.3,7,-23,7s-12,-1,-12,-1
s-109,-253,-109,-253c-72.7,-168,-109.3,-252,-110,-252c-10.7,8,-22,16.7,-34,26
c-22,17.3,-33.3,26,-34,26s-26,-26,-26,-26s76,-59,76,-59s76,-60,76,-60z
M1001 80h400000v40h-400000z"/></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.305em;"><span></span></span></span></span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathit">Δ</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.13em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span></eqn></section><p>Where <eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Δ</mi><mo>=</mo><mover accent="true"><msup><mi>Y</mi><mi>t</mi></msup><mo stretchy="true">‾</mo></mover><mo>−</mo><mover accent="true"><msup><mi>Y</mi><mi>c</mi></msup><mo stretchy="true">‾</mo></mover></mrow><annotation encoding="application/x-tex">\mathit{\Delta} = \overline{Y^t} - \overline{Y^c}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathit">Δ</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.0029em;vertical-align:-0.0833em;"></span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9196em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" 
style="height:0.7196em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">t</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8396em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8833em;"></span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.5904em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">c</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span></span></span></span></eq> is the difference between the treatment group mean and control group mean. 
<eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mover accent="true"><msup><mi>Y</mi><mi>t</mi></msup><mo stretchy="true">‾</mo></mover></mrow><annotation encoding="application/x-tex">\overline{Y^t}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9196em;"></span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9196em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7196em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">t</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8396em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span></span></span></span></eq> and <eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mover accent="true"><msup><mi>Y</mi><mi>c</mi></msup><mo stretchy="true">‾</mo></mover></mrow><annotation encoding="application/x-tex">\overline{Y^c}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8833em;"></span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" 
style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.5904em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">c</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span></span></span></span></eq> follow normal distributions due to the Central Limit Theorem.</p>
<p>Considering the variance formula for independent variables <eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mtext>var</mtext><mo stretchy="false">(</mo><mi>a</mi><mi>x</mi><mo>+</mo><mi>b</mi><mi>y</mi><mo stretchy="false">)</mo><mo>=</mo><msup><mi>a</mi><mn>2</mn></msup><mtext>var</mtext><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>+</mo><msup><mi>b</mi><mn>2</mn></msup><mtext>var</mtext><mo stretchy="false">(</mo><mi>y</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\text{var}(ax+by) = a^2\text{var}(x) + b^2\text{var}(y)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord mathnormal">a</span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.0641em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord 
mtight">2</span></span></span></span></span></span></span></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.0641em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="mclose">)</span></span></span></span></eq>:</p>
<section><eqn><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mtext>var</mtext><mo stretchy="false">(</mo><mi>Δ</mi><mo stretchy="false">)</mo><mo>=</mo><mtext>var</mtext><mo stretchy="false">(</mo><mover accent="true"><msup><mi>Y</mi><mi>t</mi></msup><mo stretchy="true">‾</mo></mover><mo>−</mo><mover accent="true"><msup><mi>Y</mi><mi>c</mi></msup><mo stretchy="true">‾</mo></mover><mo stretchy="false">)</mo><mo>=</mo><mtext>var</mtext><mo stretchy="false">(</mo><mover accent="true"><msup><mi>Y</mi><mi>t</mi></msup><mo stretchy="true">‾</mo></mover><mo stretchy="false">)</mo><mo>+</mo><mtext>var</mtext><mo stretchy="false">(</mo><mover accent="true"><msup><mi>Y</mi><mi>c</mi></msup><mo stretchy="true">‾</mo></mover><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex"> \text{var}(\mathit{\Delta}) = \text{var}(\overline{Y^t} - \overline{Y^c}) = \text{var}(\overline{Y^t}) + \text{var}(\overline{Y^c}) </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord mathit">Δ</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.1696em;vertical-align:-0.25em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9196em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" 
style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7196em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">t</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8396em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.1333em;vertical-align:-0.25em;"></span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.5904em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">c</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.1696em;vertical-align:-0.25em;"></span><span class="mord text"><span 
class="mord">var</span></span><span class="mopen">(</span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9196em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7196em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">t</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8396em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.1333em;vertical-align:-0.25em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.5904em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">c</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8033em;"><span class="pstrut" 
style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span></eqn></section><p>Where <eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mtext>var</mtext><mo stretchy="false">(</mo><mover accent="true"><mi>Y</mi><mo stretchy="true">‾</mo></mover><mo stretchy="false">)</mo><mo>=</mo><mtext>var</mtext><mo stretchy="false">(</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><msubsup><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>n</mi></msubsup><msub><mi>Y</mi><mi>i</mi></msub><mo stretchy="false">)</mo><mo>=</mo><mfrac><mn>1</mn><msup><mi>n</mi><mn>2</mn></msup></mfrac><mo>∗</mo><mi>n</mi><mo>∗</mo><mtext>var</mtext><mo stretchy="false">(</mo><mi>Y</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mrow><mtext>var</mtext><mo stretchy="false">(</mo><mi>Y</mi><mo stretchy="false">)</mo></mrow><mi>n</mi></mfrac></mrow><annotation encoding="application/x-tex">\text{var}(\overline Y) = \text{var}(\frac 1n \sum^n_{i=1}Y_i)=\frac 1 {n^2} * n * \text{var}(Y) = \frac{\text{var}(Y)}n</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.1333em;vertical-align:-0.25em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" 
style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.1901em;vertical-align:-0.345em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em;"><span style="top:-2.655em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.394em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em;">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8043em;"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mrel mtight">=</span><span class="mord mtight">1</span></span></span></span><span style="top:-3.2029em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 
mtight"><span class="mord mathnormal mtight">n</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2997em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.2222em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.1901em;vertical-align:-0.345em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em;"><span style="top:-2.655em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em;"><span style="top:-2.786em;margin-right:0.0714em;"><span class="pstrut" style="height:2.5em;"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span 
class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.394em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.4653em;"></span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.355em;vertical-align:-0.345em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.01em;"><span style="top:-2.655em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.485em;"><span class="pstrut" 
style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord text mtight"><span class="mord mtight">var</span></span><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.22222em;">Y</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></eq></p>
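<p>As a sanity check, the statistic above can be computed directly from two samples. Below is a minimal sketch, assuming unpooled per-group variances as in the formulas above; the sample values are made up for illustration.</p>

```python
# T = delta / sqrt(var(delta)), where var(Ybar) = var(Y) / n for each group
# and var(delta) is the sum of the two terms (independent groups).
from statistics import mean, variance

def t_statistic(treatment, control):
    delta = mean(treatment) - mean(control)
    var_delta = variance(treatment) / len(treatment) + variance(control) / len(control)
    return delta / var_delta ** 0.5

# Hypothetical quality scores for a treatment and a control group.
treatment = [4.1, 3.8, 4.5, 4.2, 3.9, 4.4]
control = [3.6, 3.9, 3.5, 3.8, 3.7, 3.4]
print(round(t_statistic(treatment, control), 3))
```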
<p>The p-value is essentially a conditional probability: “the probability of observing the current difference or an even larger one, given that the null hypothesis is true,” i.e., <eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mtext>observing </mtext><mi>Δ</mi><mo>∣</mo><msub><mi>H</mi><mn>0</mn></msub><mtext> is true</mtext><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">P(\text{observing } \mathit{\Delta}\mid H_0 \text{ is true})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord text"><span class="mord">observing </span></span><span class="mord mathit">Δ</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">∣</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0813em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord text"><span class="mord"> is true</span></span><span class="mclose">)</span></span></span></span></eq>. 
The <eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Δ</mi></mrow><annotation encoding="application/x-tex">\mathit{\Delta}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathit">Δ</span></span></span></span></eq> here is represented by the normalized statistic <eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>T</mi></mrow><annotation encoding="application/x-tex">T</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">T</span></span></span></span></eq>. This is why p-values have a common interpretation across experiments under the null.</p>
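<p>Concretely, once T is computed, the two-sided p-value follows from the standard normal CDF under the large-sample approximation. A minimal sketch (the normal approximation is an assumption here; for small samples you would use a t distribution with the appropriate degrees of freedom):</p>

```python
import math

def two_sided_p_value(t):
    # P(observing |T| >= |t| under H0), via the standard normal CDF
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    phi = 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2)))
    return 2.0 * (1.0 - phi)

print(round(two_sided_p_value(1.96), 3))  # the familiar 0.05 threshold
```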
<p>A common mistake is to interpret the p-value as “the probability that the null hypothesis is true, given that we observed the current or larger difference.” This reading inverts the conditioning: the probability of “observing the current or larger difference” is specific to each experiment, so the inverted quantity has no universal interpretation. The relationship between the two quantities can be derived with Bayes’ theorem:</p>
<section><eqn><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mtable rowspacing="0.25em" columnalign="right left" columnspacing="0em"><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mi>P</mi><mo stretchy="false">(</mo><msub><mi>H</mi><mn>0</mn></msub><mtext> true</mtext><mi mathvariant="normal">∣</mi><mtext>observing </mtext><mi>Δ</mi><mo stretchy="false">)</mo></mrow></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>=</mo><mfrac><mrow><mi>P</mi><mo stretchy="false">(</mo><mtext>observing </mtext><mi>Δ</mi><mi mathvariant="normal">∣</mi><msub><mi>H</mi><mn>0</mn></msub><mtext> true</mtext><mo stretchy="false">)</mo><mi>P</mi><mo stretchy="false">(</mo><msub><mi>H</mi><mn>0</mn></msub><mtext> true</mtext><mo stretchy="false">)</mo></mrow><mrow><mi>P</mi><mo stretchy="false">(</mo><mtext>observing </mtext><mi>Δ</mi><mo stretchy="false">)</mo></mrow></mfrac></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow></mrow></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>=</mo><mfrac><mrow><mi>P</mi><mo stretchy="false">(</mo><msub><mi>H</mi><mn>0</mn></msub><mtext> true</mtext><mo stretchy="false">)</mo></mrow><mrow><mi>P</mi><mo stretchy="false">(</mo><mtext>observing </mtext><mi>Δ</mi><mo stretchy="false">)</mo></mrow></mfrac><mo>∗</mo><mi>P</mi><mo stretchy="false">(</mo><mtext>observing </mtext><mi>Δ</mi><mi mathvariant="normal">∣</mi><msub><mi>H</mi><mn>0</mn></msub><mtext> true</mtext><mo stretchy="false">)</mo></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow></mrow></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>=</mo><mfrac><mrow><mi>P</mi><mo stretchy="false">(</mo><msub><mi>H</mi><mn>0</mn></msub><mtext> true</mtext><mo 
stretchy="false">)</mo></mrow><mrow><mi>P</mi><mo stretchy="false">(</mo><mtext>observing </mtext><mi>Δ</mi><mo stretchy="false">)</mo></mrow></mfrac><mo>∗</mo><mtext>p-value</mtext></mrow></mstyle></mtd></mtr></mtable><annotation encoding="application/x-tex"> \begin{align*} P(H_0 \text{ true}|\text{observing } \mathit{\Delta}) &amp;= \frac {P(\text{observing } \mathit{\Delta}|H_0 \text{ true})P(H_0 \text{ true})} {P(\text{observing } \mathit{\Delta})} \\ &amp;= \frac {P(H_0 \text{ true})}{P(\text{observing } \mathit{\Delta})} * P(\text{observing } \mathit{\Delta}|H_0 \text{ true}) \\ &amp;= \frac {P(H_0 \text{ true})}{P(\text{observing } \mathit{\Delta})} * \text{p-value} \end{align*} </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:7.989em;vertical-align:-3.7445em;"></span><span class="mord"><span class="mtable"><span class="col-align-r"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:4.2445em;"><span style="top:-6.2445em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0813em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord text"><span class="mord"> true</span></span><span class="mord">∣</span><span class="mord text"><span class="mord">observing </span></span><span class="mord 
mathit">Δ</span><span class="mclose">)</span></span></span><span style="top:-3.5815em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"></span></span><span style="top:-0.9185em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:3.7445em;"><span></span></span></span></span></span><span class="col-align-l"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:4.2445em;"><span style="top:-6.2445em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord text"><span class="mord">observing </span></span><span class="mord mathit">Δ</span><span class="mclose">)</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord text"><span class="mord">observing </span></span><span class="mord mathit">Δ</span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span 
class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0813em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord text"><span class="mord"> true</span></span><span class="mclose">)</span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0813em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord text"><span class="mord"> true</span></span><span class="mclose">)</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.936em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span><span style="top:-3.5815em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" 
style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord text"><span class="mord">observing </span></span><span class="mord mathit">Δ</span><span class="mclose">)</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0813em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord text"><span class="mord"> true</span></span><span class="mclose">)</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.936em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord text"><span class="mord">observing </span></span><span class="mord mathit">Δ</span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span 
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0813em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord text"><span class="mord"> true</span></span><span class="mclose">)</span></span></span><span style="top:-0.9185em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord text"><span class="mord">observing </span></span><span class="mord mathit">Δ</span><span class="mclose">)</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0813em;margin-right:0.05em;"><span 
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord text"><span class="mord"> true</span></span><span class="mclose">)</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.936em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord text"><span class="mord">p-value</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:3.7445em;"><span></span></span></span></span></span></span></span></span></span></span></span></eqn></section><h2 id="variance-related-issues">Variance-Related Issues</h2>
<p>P-value calculation depends on variance. Common issues with variance estimation:</p>
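<p>As a minimal sketch of how variance feeds the p-value (function and variable names are my own), a two-sample test computes the standard error of the delta from the two groups’ sample variances:</p>

```python
import math
from statistics import mean, variance

def welch_p_value(treatment, control):
    """Two-sided p-value for a difference in means (Welch-style,
    with a normal approximation for the tail probability)."""
    # The group variances feed directly into the standard error of the delta.
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    z = (mean(treatment) - mean(control)) / se
    # Phi(z) via erf; adequate for the large n typical of online experiments.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

<p>Identical groups give a p-value of 1; a shift far larger than the standard error drives it toward 0.</p>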
<h3 id="variance-calculation-for-percentage-deltas-and-ratio-metrics">Variance Calculation for Percentage Deltas and Ratio Metrics</h3>
<p>Note that if the metric’s analysis unit differs from the experiment’s randomization unit, the analyzed values may not be independently and identically distributed (i.i.d.). For example, computing a page-level conversion rate while randomizing by user: a user who visits a page multiple times makes the page-level observations non-i.i.d. In this case, variance must be estimated from the numerator and denominator of the ratio separately, not by first computing the ratio and then taking its variance.</p>
<p>The correct calculation: we express the ratio metric as the ratio of two user-level metric averages:</p>
<section><eqn><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>M</mi><mo>=</mo><mfrac><mover accent="true"><mi>X</mi><mo stretchy="true">‾</mo></mover><mover accent="true"><mi>Y</mi><mo stretchy="true">‾</mo></mover></mfrac></mrow><annotation encoding="application/x-tex"> M = \frac {\overline X} {\overline Y} </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.3337em;vertical-align:-0.7733em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.5603em;"><span style="top:-2.2267em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" 
style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.07847em;">X</span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.7733em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span></eqn></section><p>Via the delta method (skipping the derivation), the final variance formula is:</p>
<section><eqn><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mtext>var</mtext><mo stretchy="false">(</mo><mi>Δ</mi><mi mathvariant="normal">%</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mn>1</mn><msup><mover accent="true"><msup><mi>Y</mi><mi>c</mi></msup><mo stretchy="true">‾</mo></mover><mn>2</mn></msup></mfrac><mtext>var</mtext><mo stretchy="false">(</mo><mover accent="true"><msup><mi>Y</mi><mi>t</mi></msup><mo stretchy="true">‾</mo></mover><mo stretchy="false">)</mo><mo>+</mo><mfrac><msup><mover accent="true"><msup><mi>Y</mi><mi>t</mi></msup><mo stretchy="true">‾</mo></mover><mn>2</mn></msup><msup><mover accent="true"><msup><mi>Y</mi><mi>c</mi></msup><mo stretchy="true">‾</mo></mover><mn>4</mn></msup></mfrac><mtext>var</mtext><mo stretchy="false">(</mo><mover accent="true"><msup><mi>Y</mi><mi>c</mi></msup><mo stretchy="true">‾</mo></mover><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex"> \text{var}(\mathit \Delta \%) = \frac 1 {\overline{Y^c}^2}\text{var}(\overline{Y^t}) + \frac{\overline{Y^t}^2}{\overline{Y^c}^4}\text{var}(\overline{Y^c}) </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord mathit">Δ</span><span class="mord">%</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.2988em;vertical-align:-0.9773em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span 
style="top:-2.11em;"><span class="pstrut" style="height:3.0873em;"></span><span class="mord"><span class="mord"><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.5904em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">c</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:1.0873em;"><span style="top:-3.3362em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span><span style="top:-3.3173em;"><span class="pstrut" style="height:3.0873em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.7643em;"><span class="pstrut" style="height:3.0873em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.9773em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" 
style="height:0.9196em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7196em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">t</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8396em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:2.7779em;vertical-align:-0.9773em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.8006em;"><span style="top:-2.1462em;"><span class="pstrut" style="height:3.1236em;"></span><span class="mord"><span class="mord"><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.5904em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">c</span></span></span></span></span></span></span></span></span></span><span 
style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:1.0873em;"><span style="top:-3.3362em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">4</span></span></span></span></span></span></span></span></span></span><span style="top:-3.3536em;"><span class="pstrut" style="height:3.1236em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.8006em;"><span class="pstrut" style="height:3.1236em;"></span><span class="mord"><span class="mord"><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9196em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7196em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">t</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8396em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:1.1236em;"><span style="top:-3.3725em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord 
mtight">2</span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.9773em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mord text"><span class="mord">var</span></span><span class="mopen">(</span><span class="mord overline"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8833em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.5904em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">c</span></span></span></span></span></span></span></span></span></span><span style="top:-3.8033em;"><span class="pstrut" style="height:3em;"></span><span class="overline-line" style="border-bottom-width:0.04em;"></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span></eqn></section><h3 id="outliers">Outliers</h3>
<p>Outliers can distort experiment results. A common remedy is to cap observed metrics at a reasonable threshold; other methods (such as trimming) are left for the reader to explore.</p>
<h3 id="improving-sensitivity">Improving Sensitivity</h3>
<p>Lower variance makes the same effect easier to detect, i.e., smaller p-values and higher statistical power. These methods can reduce variance:</p>
<ul>
<li><strong>Transform metrics through thresholding, binarization, and log transformation</strong>: Purchase amounts have larger variance than whether a purchase was made; per-user watch time has larger variance than whether viewing exceeded x hours.</li>
<li><strong>Remove noise through trigger analysis</strong>.</li>
<li><strong>Stratify within the sampling scope through stratified sampling, control variables, or CUPED</strong>. Combine results from each stratum to get the total—variance calculated this way is smaller.</li>
<li><strong>Choose finer-grained randomization units</strong>.</li>
<li><strong>Shared control group</strong>: Calculate an appropriately sized control group and compare it against multiple treatment groups. This reduces variance and can improve statistical power for all treatment groups, though it introduces some issues.</li>
</ul>
<h2 id="p-value-threshold-issues">P-Value Threshold Issues</h2>
<p>Beware of this fact: when the significance threshold is 0.05, testing a no-op feature (where the null hypothesis is true) 100 times means that, statistically, about 5 tests will incorrectly conclude significance (false positives). Here we introduce two concepts:</p>
<ul>
<li><strong>Type I error</strong>: The measurement shows a significant difference, but in reality there is no difference.</li>
<li><strong>Type II error</strong>: The measurement shows no significant difference, but in reality there is a difference.</li>
</ul>
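<p>The “about 5 in 100” claim is easy to check by simulating A/A tests, where the null hypothesis is true by construction (a seeded toy sketch using a normal-approximation z-test):</p>

```python
import math
import random
from statistics import mean, variance

random.seed(0)

def is_significant(a, b, alpha=0.05):
    """Two-sided z-test on the difference in means."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p < alpha

# A/A tests: both groups come from the same distribution, so every
# "significant" result is a false positive (a Type I error).
trials = 1000
false_positives = sum(
    is_significant([random.gauss(0, 1) for _ in range(200)],
                   [random.gauss(0, 1) for _ in range(200)])
    for _ in range(trials)
)
print(false_positives / trials)  # should land near alpha = 0.05
```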
<p>A smaller significance threshold (<eq><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>α</mi></mrow><annotation encoding="application/x-tex">\alpha</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal" style="margin-right:0.0037em;">α</span></span></span></span></eq>) reduces Type I errors but increases Type II errors, and vice versa.</p>
<p>To reduce the aforementioned “5 out of 100 tests incorrectly concluding significance” Type I errors, we can apply a stricter significance threshold to metrics confirmed to be unrelated (or indirectly related) to the experiment. <strong>The more confidence we need, the lower the significance threshold should be.</strong></p>
<p>Conversely, for sensitive guardrail metrics where missing a real change is costly, we may use a less strict significance threshold together with explicit power targets. Especially for metrics where we explicitly require that fluctuations not exceed x%, we should follow the industry principle that tests should have 80% statistical power (statistical power = 1 - Type II error rate; see specific approximation formulas elsewhere), and calculate thresholds based on x.</p>
<h2 id="how-large-should-sample-size-n-be">How Large Should Sample Size n Be?</h2>
<p>This section overlaps somewhat with the sensitivity improvement section.</p>
<p>What’s the theoretical basis for “larger n is better”? In practice, many t-test settings rely on the sample mean being approximately normal. According to the <strong>Central Limit Theorem</strong>, under appropriate conditions, the mean of a large number of mutually independent random variables converges in distribution to a normal distribution after proper standardization. So n needs to be sufficiently large.</p>
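<p>A quick seeded illustration: raw draws from a right-skewed distribution are far from normal, but their per-sample means are nearly symmetric, which is what the t-test machinery relies on.</p>

```python
import random

random.seed(42)

def skewness(xs):
    """Population skewness: third central moment over variance^1.5."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / (n * s2 ** 1.5)

# Exponential draws have theoretical skewness 2 ...
raw = [random.expovariate(1.0) for _ in range(100_000)]
# ... but means of n=200 draws are close to symmetric, as the CLT promises.
sample_means = [sum(random.expovariate(1.0) for _ in range(200)) / 200
                for _ in range(2_000)]
print(skewness(raw), skewness(sample_means))
```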
<p>The statistical power approximation formulas mentioned earlier can help calculate n given certain conditions, but these formulas require specifying the effect size x%. This is only suitable for some guardrail metrics.</p>
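<p>The power-based calculation can be sketched as follows (a standard approximation; treat it as indicative rather than the exact formula the text alludes to):</p>

```python
import math
from statistics import NormalDist

def required_n_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Approximate per-group sample size for detecting a mean difference
    delta: n = 2 * (z_{1-alpha/2} + z_power)^2 * sigma^2 / delta^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)          # 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)

# Detecting a 0.1 shift in a metric with standard deviation 1:
print(required_n_per_group(sigma=1.0, delta=0.1))
```

<p>With the defaults this reduces to the well-known rule of thumb of roughly 16σ²/δ² per group.</p>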
<p>To better approximate normality of the mean, we can also use empirical formulas based on skewness to calculate n. Metrics with large variation often also have high skewness. We can reduce skewness and n by capping differences (e.g., treating all values above 10 as 10), as long as capping doesn’t violate our assumptions.</p>
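<p>One published rule of thumb from the online-experimentation literature (treat the exact constant as indicative) sets n ≥ 355 × skewness² for the sample mean to be near-normal. A toy sketch of how capping lowers both skewness and the required n:</p>

```python
import math

def skewness(xs):
    """Population skewness: third central moment over variance^1.5."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / (n * s2 ** 1.5)

def min_n_for_normal_mean(xs):
    """Rule of thumb: n >= 355 * skewness^2."""
    return math.ceil(355 * skewness(xs) ** 2)

# A long-tailed metric (e.g., revenue) with a couple of extreme values:
revenue = [1, 0, 2, 0, 1, 0, 0, 500, 0, 1, 3, 0, 0, 200, 1, 0] * 50
capped = [min(x, 10) for x in revenue]  # cap everything above 10 at 10
print(min_n_for_normal_mean(revenue), min_n_for_normal_mean(capped))
```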
<p>We can also verify by constructing a null distribution, but the concepts involved are more complex, so we’ll skip that here.</p>
<h2 id="tbd-why-choose-the-t-distribution">TBD: Why Choose the t-Distribution?</h2>
<p>This requires comparing the characteristics of several sampling distributions—refer to the summaries in university textbooks.</p>
]]></content:encoded>
  </item>
  <item>
    <title>Transactions and Consensus from DDIA</title>
    <link>https://www.yujiachen.com/transactions-and-consensus-from-ddia/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/transactions-and-consensus-from-ddia/</guid>
    <pubDate>Mon, 27 Sep 2021 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<h1 id="some-reflections">Some Reflections</h1>
<p>Most applications are built by layering data models on top of one another.</p>
<p>The problem of surplus and scarcity of computing resources will always exist. “Resource-saving technologies only lead to increased resource usage” (<a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons paradox</a>).</p>
<p>Many problems discussed in the book follow a pattern: under constraint P (the real-world problem), find the lowest-cost C (consistency level) that still achieves the best A (availability/outcome). One example in the book is multi-core CPUs. While multi-core CPUs can be viewed as distributed systems, they are not subject to inter-machine network latency, and partitions within a single machine are effectively assumed away. Even so, to maximize throughput, multi-core designs may still trade some consistency for higher availability, accepting occasional redundant or incorrect computation to squeeze out more performance.</p>
<h1 id="are-paradigms-useful">Are Normal Forms Useful?</h1>
<p>Normal forms exist to standardize how we model and use data. But as hardware improves and application scenarios expand, breaking those rules can sometimes bring bigger gains. Otherwise NoSQL would not have emerged: store a large JSON document, skip strict relational modeling, and iteration speed can increase dramatically.</p>
<p>So should we still normalize? Normal forms are just a yardstick that tells us what to do in which scenario and what happens if we don’t—not rules we must rigidly follow.</p>
<h1 id="what-are-concurrency-problems">What Are Concurrency Problems?</h1>
<p>A single machine with a single thread can always avoid various concurrency problems, but this approach is too slow. So we develop multi-threading, multi-processing, multi-machine, and distributed systems. Along with these come concurrency problems—problems that fit the following condition:</p>
<blockquote>
<p>Under identical conditions, if the results of two tasks executed serially differ from those executed in parallel, a race condition has occurred, causing a concurrency problem.</p>
</blockquote>
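<p>The definition above can be made concrete with a hand-interleaved read-modify-write (a toy sketch rather than real threading, so the schedule is deterministic):</p>

```python
def run_serial():
    """Two increment "tasks" run one after the other."""
    counter = 0
    for _ in range(2):
        value = counter      # read
        counter = value + 1  # write
    return counter

def run_interleaved():
    """Both tasks read before either writes: one update is lost."""
    counter = 0
    v1 = counter      # task 1 reads 0
    v2 = counter      # task 2 also reads 0
    counter = v1 + 1  # task 1 writes 1
    counter = v2 + 1  # task 2 overwrites with 1
    return counter

# Identical inputs, different results: a race condition by the definition above.
print(run_serial(), run_interleaved())
```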
<h1 id="single-machine-concurrency-problems">Single-Machine Concurrency Problems</h1>
<p>Let’s first discuss concurrency problems that arise even with a single machine running multiple threads/processes.</p>
<p>Discussion model:</p>
<ol>
<li>There are exactly 2 transactions involved in the problem scenario.</li>
<li>For real-world problems caused by more than 2 transactions, we use induction to reduce the problem to 2 concurrent transactions. For example, we group multiple non-conflicting transactions into 1, or split the problem into different causes, each corresponding to 2 transactions.</li>
</ol>
<p>Scenario 1—Transaction 1 is reading, Transaction 2 is writing, leading to:</p>
<ul>
<li>Dirty reads</li>
<li>Non-repeatable reads / Read skew</li>
</ul>
<p>Scenario 2—Both transactions are writing, leading to:</p>
<ul>
<li>Lost updates</li>
<li>Dirty writes</li>
<li>Write skew</li>
</ul>
<h2 id="dirty-reads">Dirty Reads</h2>
<blockquote>
<p>Prerequisite: One transaction reads partial modification results of another uncommitted transaction.</p>
</blockquote>
<p>Without read-committed isolation, dirty reads occur in these scenarios:</p>
<ul>
<li>One transaction updates multiple objects while another transaction sees only some of the updated objects, not all.</li>
<li>One transaction aborts, and during rollback another transaction sees partially un-rolled-back objects.</li>
</ul>
<p>Examples:</p>
<p>Example 1: Alice has two bank accounts—Account 1 has 100 yuan and Account 2 has 0. Account 1 transfers 100 yuan to Account 2. The transaction deducts 100 from Account 1 and adds 100 to Account 2. If Alice reads her total balance after the deduction from Account 1 but before the addition to Account 2, the result is 0. (Note: non-repeatable reads can also cause the same situation—we’ll discuss this in the non-repeatable reads section.)</p>
<p>Example 2: Building on Example 1, Account 2 is a Class II account with a single-transfer limit of 50 yuan. If the transfer exceeds the limit, the transaction rolls back. After deducting 100 from Account 1, another query reads Account 1’s balance as 0 yuan. Then the transfer fails and rolls back. Due to transaction atomicity, Account 1’s balance never actually became 0—yet at a certain moment, another transaction saw it as 0.</p>
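<p>Example 2 can be played out step by step with a dictionary standing in for the database and no isolation at all (a toy sketch):</p>

```python
accounts = {"acc1": 100, "acc2": 0}

def read_total():
    """A second "transaction" that sums Alice's balances."""
    return accounts["acc1"] + accounts["acc2"]

before = read_total()    # 100, before the transfer starts
accounts["acc1"] -= 100  # transfer step 1: deduct from account 1
dirty = read_total()     # a concurrent read here sees 0 (a dirty read)
accounts["acc1"] += 100  # transfer limit exceeded: roll back step 1
after = read_total()     # 100 again, as if nothing happened

print(before, dirty, after)
```

<p>With read-committed isolation, the middle read would see either the pre-transfer or post-rollback state, never the in-flight 0.</p>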
<h2 id="dirty-writes">Dirty Writes</h2>
<blockquote>
<p>Prerequisite: One transaction modifies partial modification results of another uncommitted transaction.</p>
</blockquote>
<p>Without read-committed isolation, dirty writes occur in this scenario:</p>
<ul>
<li>Transaction 1 updates multiple objects while Transaction 2 modifies some of those objects (through updates, creates, deletes, or their rollbacks). Then Transaction 1 updates the remaining unmodified objects.</li>
</ul>
<p>Example:</p>
<p>On a trading website, a purchase updates both the item’s recipient and the payer. Alice and Bob simultaneously buy the same item. Alice’s transaction sets the recipient to Alice. Then Bob’s transaction changes the recipient to Bob and declares Bob as the payer. Then Alice’s transaction updates the payer to Alice. Result: Bob receives the item but Alice pays.</p>
<h2 id="non-repeatable-reads-read-skew">Non-Repeatable Reads / Read Skew</h2>
<blockquote>
<p>Prerequisite: Transaction 2’s execution starts after Transaction 1 begins but completes before Transaction 1 ends, and Transaction 1 reads Transaction 2’s modifications.</p>
</blockquote>
<p>Most business scenarios don’t want non-repeatable reads. These scenarios especially cannot tolerate them:</p>
<ul>
<li>Backups</li>
<li>Long-running analytical queries and integrity checks</li>
</ul>
<p>Non-repeatable reads are also called “read skew” because: ideally, a transaction should read all data in an instant. When non-repeatable reads occur, the transaction’s reads are spread across the timeline rather than happening at a single point—hence the read is “skewed.”</p>
<p>Example:</p>
<p>Continuing from dirty read Example 1: Alice starts a balance-reading Transaction 1 before initiating transfer Transaction 2. Transaction 1 first reads Account 2’s 0 yuan (this conforms to read-committed). Then Transaction 2 completes—Account 1 becomes 0, Account 2 becomes 100. Finally Transaction 1 reads Account 1’s 0 yuan (also conforming to read-committed). Total: 0 yuan.</p>
<h2 id="lost-updates">Lost Updates</h2>
<blockquote>
<p>Two transactions simultaneously execute read-modify-write sequences, where one overwrites the other’s write without incorporating the other’s latest value, ultimately causing some modified data to be lost.</p>
</blockquote>
<p>Examples:</p>
<ul>
<li>Incrementing counters</li>
<li>Modifying part of a complex object (e.g., multiple users editing a large JSON)</li>
</ul>
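<p>The standard fixes are atomic operations or explicit locks around the read-modify-write. A minimal Python sketch with a lock (without the lock, <code>counter += 1</code> across threads can lose updates, since it is not atomic):</p>

```python
import threading

counter = 0
lock = threading.Lock()

def safe_increment(times):
    global counter
    for _ in range(times):
        # Holding the lock makes the read-modify-write atomic,
        # so no thread overwrites another's update.
        with lock:
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 40000 with the lock held
```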
<h2 id="write-skew">Write Skew</h2>
<p>A generalization of lost updates, differing in its greater dependence on application-layer logic. It follows this pattern:</p>
<ol>
<li><strong>Read</strong>: Input some matching conditions and query.</li>
<li><strong>Decide</strong>: Based on query results, application-layer code decides the next action.</li>
<li><strong>Write</strong>: If the application decides to proceed, it initiates a database write.</li>
</ol>
<p>Generally, if two transactions updating different objects cause errors (usually semantic errors at the application layer), that’s write skew. If they update the same object, it’s more likely a dirty write or a lost update.</p>
<p>Why “write skew”: Referencing the meaning of read skew—ideally, a transaction’s write operations should happen in an instant. But when write skew occurs, a transaction typically reads stale data, the data gets modified, and the transaction unknowingly writes invalid data. The transaction’s reads and writes are spread across the timeline—hence the write is “skewed.”</p>
<p>Example: A user has an expense ledger with a balance constraint. Two transactions each insert expense items that individually don’t exceed the balance. But since neither notices the other, the combined expenses push the balance negative (application-layer logic violation).</p>
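<p>The same read-decide-write pattern, reduced to plain Python. Both “transactions” take their snapshot before either writes, each decision looks safe in isolation, and the two writes touch different rows:</p>

```python
# Write skew on the expense-ledger example: each transaction checks the
# balance constraint against a stale snapshot, decides "safe", then writes.
balance = 100
expenses = []

def check(snapshot_balance, snapshot_expenses, amount):
    # Steps 1 and 2: read, then decide at the application layer.
    return snapshot_balance - sum(snapshot_expenses) - amount >= 0

# Both transactions snapshot the ledger before either writes.
t1_ok = check(balance, list(expenses), 60)   # True: 100 - 0 - 60 >= 0
t2_ok = check(balance, list(expenses), 60)   # True: same stale snapshot

if t1_ok:
    expenses.append(60)   # Step 3: write
if t2_ok:
    expenses.append(60)

remaining = balance - sum(expenses)
print(remaining)  # -20: the application-layer invariant is violated
```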
<p>Here we discuss one solution with limited feasibility:</p>
<h3 id="materializing-conflicts">Materializing Conflicts</h3>
<blockquote>
<p>Write skew often occurs because query results contain no objects, so there’s nothing to lock. Solution: pre-create lockable objects.</p>
</blockquote>
<p>For the balance example above, we create a total balance table with a new field for pending changes—meaning only one change can affect the balance at a time. This transforms the problem into a lost update or dirty write problem.</p>
<p>The main issue with materializing conflicts is excessive database storage. For instance, should we materialize all room-and-time combinations for the next 6 months in a meeting room booking system?</p>
<p><strong>This method is generally not used unless absolutely necessary.</strong></p>
<h2 id="phantom-reads">Phantom Reads</h2>
<blockquote>
<p>A write in one transaction changing the query results of another transaction is called a phantom read.</p>
</blockquote>
<p>Phantom reads are a highly general concept. I believe most concurrency problems discussed so far could be classified as phantom reads.</p>
<h1 id="preventing-lost-updates">Preventing Lost Updates</h1>
<blockquote>
<p>We must prevent lost updates to avoid many other phantom-read-like anomalies. As we’ll see, the key to distributed transactions is achieving consensus, and consensus is largely about preventing lost updates.</p>
</blockquote>
<p>Lost update scenarios are simpler than other concurrency problems, and the solutions are easier to understand. So we discuss lost update prevention first.</p>
<h2 id="atomic-write-operations">Atomic Write Operations</h2>
<p>If the DB supports atomic write operations, use them whenever possible.</p>
<p>The DB can implement atomic writes through exclusive locks or by executing all atomic operations on a single thread.</p>
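<p>As an illustration, an in-database atomic increment (using SQLite here purely as a convenient single-node example; table and column names are made up):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('hits', 0)")

# The read-modify-write happens inside one statement, so no application-level
# race between reading the old value and writing the new one is possible.
conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'hits'")
conn.commit()

(value,) = conn.execute(
    "SELECT value FROM counters WHERE name = 'hits'"
).fetchone()
print(value)  # 1
```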
<h2 id="explicit-locking">Explicit Locking</h2>
<p>The application explicitly locks relevant objects. DDIA uses the <code>FOR UPDATE</code> keyword to represent application-layer requests to lock <code>SELECT</code> results.</p>
<p>This approach might seem to solve write skew too, but write skew also includes cases like “checking whether rows matching a search condition exist (expected result: empty).” In that case, there is nothing concrete to lock. We discuss write-skew solutions in other sections.</p>
<hr>
<p>Above we discussed two lock-based solutions. Now let’s discuss lock-free solutions.</p>
<h2 id="automatically-detecting-lost-updates">Automatically Detecting Lost Updates</h2>
<p>Allow updates to execute concurrently. If the transaction manager detects a lost-update risk, it aborts the current transaction and retries using a safe read-modify-write path.</p>
<p>The database can use snapshot-level isolation for detection. A rough intuition is that an object can only have one uncommitted version at a given moment, though finer-grained detection methods likely exist.</p>
<h2 id="atomic-compare-and-set">Atomic Compare-and-Set</h2>
<p>Only allow updates when the data hasn’t changed since the last read. If it has changed, fall back to another read-modify-write approach or retry.</p>
<p>Implementation: CAS (compare-and-swap)—modern CPUs support this instruction. Or explicitly add conditions like <code>WHERE content = 'old content'</code> during execution. The danger is that if <code>WHERE</code> executes against a snapshot, the <code>content</code> value may not be the latest.</p>
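<p>A conditional-UPDATE sketch of the second approach, again using SQLite for illustration; a rowcount of zero signals that the compare failed and the caller should re-read and retry (schema and names are made up):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO docs VALUES (1, 'old content')")
conn.commit()

def cas_update(old, new):
    # Only update if nobody changed the row since we read `old`.
    cur = conn.execute(
        "UPDATE docs SET content = ? WHERE id = 1 AND content = ?", (new, old)
    )
    conn.commit()
    return cur.rowcount == 1  # 0 matched rows means the compare failed

assert cas_update("old content", "edit A")       # succeeds
assert not cas_update("old content", "edit B")   # fails: content has changed
```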
<h2 id="conflict-resolution-and-replication">Conflict Resolution and Replication</h2>
<p>Details are discussed in the distributed concurrency section. The general idea: the application layer has logic, or data structures have logic, to handle conflicting writes. Or, design operations to be order-independent, so update conflicts naturally don’t occur.</p>
<h1 id="isolation-levels">Isolation Levels</h1>
<h2 id="read-committed">Read Committed</h2>
<blockquote>
<p>The most basic transaction isolation level: reads only see data that has been committed, and writes only overwrite data that has been committed.</p>
</blockquote>
<p>Solves:</p>
<ul>
<li>Dirty reads</li>
<li>Dirty writes</li>
</ul>
<p>Why “read committed”? My interpretation is: reads should observe only committed data.</p>
<p>Implementation methods:</p>
<h3 id="row-level-read-locks">Row-Level Read Locks</h3>
<p>Read locks certainly achieve read committed, but with drawbacks:</p>
<ul>
<li>Poor performance</li>
<li>Potential deadlocks</li>
</ul>
<h3 id="old-new-value-snapshots">Old/New Value Snapshots</h3>
<p>For each object pending update, maintain two versions: the old value and the new value the lock-holding transaction will set. Before the transaction commits, all other reads return the old value. Only after the write transaction commits does the system switch to the new value.</p>
<h3 id="multi-version-concurrency-control-mvcc">Multi-Version Concurrency Control / MVCC</h3>
<p>Discussed in the next section.</p>
<h2 id="snapshot-isolation">Snapshot Isolation</h2>
<blockquote>
<p>Each transaction reads a consistent snapshot of the database—once read, data doesn’t change.</p>
</blockquote>
<p>Prevents:</p>
<ul>
<li>Non-repeatable reads / Read skew</li>
</ul>
<h3 id="multi-version-concurrency-control-mvcc-2">Multi-Version Concurrency Control / MVCC</h3>
<p>The database maintains multiple committed versions of objects, adding <code>created_by</code> and <code>deleted_by</code> fields to each row, representing versions created by different transaction operations. A periodic garbage collection task cleans up versions no longer needed.</p>
<p>A transaction cannot see:</p>
<ul>
<li>Changes made by transactions still running when this transaction started</li>
<li>Changes made by any aborted transactions</li>
<li>Changes made by transactions that started after this transaction</li>
</ul>
<p>All other changes are visible to this transaction (i.e., changes from transactions that had already committed when this transaction started).</p>
<p>Conversely, a transaction can see:</p>
<ul>
<li>Objects created or updated by already-committed transactions before this transaction started</li>
<li>Objects not deleted by uncommitted transactions before this transaction started</li>
</ul>
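<p>The visibility rules above can be condensed into a single predicate. Below is a minimal sketch in plain Python, assuming monotonically increasing transaction ids and a snapshot recording the reader’s own id, the ids active at its start, and the aborted ids; all names are illustrative, not any real engine’s API:</p>

```python
def visible(version, snapshot):
    """Is this row version visible to the transaction described by snapshot?"""
    def committed_before_us(txid):
        return (
            txid is not None
            and txid < snapshot["txid"]          # started before us
            and txid not in snapshot["active"]   # already committed at our start
            and txid not in snapshot["aborted"]  # and not rolled back
        )

    created_ok = (committed_before_us(version["created_by"])
                  or version["created_by"] == snapshot["txid"])
    deleted = (committed_before_us(version["deleted_by"])
               or version["deleted_by"] == snapshot["txid"])
    return created_ok and not deleted

snapshot = {"txid": 10, "active": {7}, "aborted": {5}}
v_committed = {"created_by": 3, "deleted_by": None}
v_in_flight = {"created_by": 7, "deleted_by": None}   # running at our start
v_aborted   = {"created_by": 5, "deleted_by": None}
v_future    = {"created_by": 12, "deleted_by": None}  # started after us

print(visible(v_committed, snapshot))  # True
print(visible(v_in_flight, snapshot))  # False
print(visible(v_aborted, snapshot))    # False
print(visible(v_future, snapshot))     # False
```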
<p>MVCC indexing has roughly two implementation approaches:</p>
<ul>
<li><strong>Index points to all versions of an object</strong>: PostgreSQL uses this method, placing all versions on a single memory page for performance optimization.</li>
<li><strong>Persistent data structures</strong>: Typically a persistent B-tree. Different transactions create their own database entry nodes when writing. When reading, use the node corresponding to the latest committed transaction as the entry point.</li>
</ul>
<h2 id="serializable-isolation">Serializable Isolation</h2>
<blockquote>
<p>Even if transactions may execute in parallel, the final result is the same as if they executed one at a time (serially).</p>
</blockquote>
<p>Generally considered the strongest isolation level.</p>
<p>But achieving serializability in practice is very difficult. Here are three implementation approaches.</p>
<h3 id="actual-serial-execution">Actual Serial Execution</h3>
<p>With continuous hardware advances, database researchers have recognized that single-threaded transaction execution is feasible and efficient.</p>
<p>These conditions help achieve serializable isolation using memory and a single CPU:</p>
<ul>
<li>Transactions must be short and efficient—otherwise a slow transaction affects all others (since it’s single-threaded).</li>
<li>Limited to scenarios where the active dataset fits entirely in memory.</li>
<li>Write throughput must be low enough for a single CPU core to handle; otherwise partitioning is needed, ideally without cross-partition transactions.</li>
<li>Cross-partition transactions can be supported but must be a very small proportion.</li>
</ul>
<p>In practice, we can encapsulate transactions in stored procedures. Typical OLTP operations are short, and as long as user I/O is excluded from the transaction path, single-threaded execution can be very efficient. The business server packages logic as data and sends it to the database server, which executes it directly in memory. Historically, stored procedures were criticized because each database had its own language. A modern approach is to use general-purpose languages where possible (for example, Redis uses Lua).</p>
<h3 id="two-phase-locking">Two-Phase Locking</h3>
<p>The only widely used serialization algorithm for nearly 30 years.</p>
<ol>
<li>Use locking to achieve serializable isolation: transactions acquire shared locks before reading objects and exclusive locks before modifying them, excluding all other transactions from reading or writing the modified objects. The two phases are an expanding phase, during which a transaction only acquires locks, and a shrinking phase, during which it only releases them; in practice, locks are held until the transaction ends. The database system automatically detects deadlocks between transactions and forcibly aborts one to break the deadlock.</li>
<li>For the “nothing to lock” case discussed in the write-skew section (execute only when query results are empty), apply predicate locks. Predicate locks apply to all rows matching certain search conditions (similar to locking a <code>WHERE</code> predicate so overlapping ranges from concurrent transactions are disallowed). However, predicate locks are hard to implement and inefficient.</li>
<li>In practice, index-range locks often replace predicate locks by widening the protected scope. By locking one or more index ranges of queried objects, those ranges become exclusively locked. In the worst case, a single transaction may lock the whole table.</li>
</ol>
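<p>A toy lock table illustrating the shared/exclusive rules above. This sketch returns <code>False</code> instead of blocking, so the caller can abort and retry; a real database would queue waiters and run deadlock detection. All names are illustrative:</p>

```python
class LockManager:
    """Minimal strict-2PL lock table: shared locks coexist; an exclusive
    lock excludes all other readers and writers."""
    def __init__(self):
        self.shared = {}      # obj -> set of txids holding shared locks
        self.exclusive = {}   # obj -> txid holding the exclusive lock

    def acquire_shared(self, txid, obj):
        holder = self.exclusive.get(obj)
        if holder is not None and holder != txid:
            return False
        self.shared.setdefault(obj, set()).add(txid)
        return True

    def acquire_exclusive(self, txid, obj):
        other_readers = self.shared.get(obj, set()) - {txid}
        holder = self.exclusive.get(obj)
        if other_readers or (holder is not None and holder != txid):
            return False
        self.exclusive[obj] = txid
        return True

    def release_all(self, txid):
        # Strict 2PL: release everything at once, at commit or abort.
        for holders in self.shared.values():
            holders.discard(txid)
        self.exclusive = {o: t for o, t in self.exclusive.items() if t != txid}

lm = LockManager()
assert lm.acquire_shared(1, "x") and lm.acquire_shared(2, "x")  # readers coexist
assert not lm.acquire_exclusive(2, "x")  # blocked by txn 1's shared lock
lm.release_all(1)
assert lm.acquire_exclusive(2, "x")      # now txn 2 can write
```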
<h3 id="serializable-snapshot-isolation">Serializable Snapshot Isolation</h3>
<p>An optimistic-control algorithm proposed on top of MVCC in 2008, with limited real-world adoption. The DDIA author believes it will become a future standard in databases for these reasons:</p>
<ul>
<li>Pessimistic control blocks or aborts too many transactions, and the cost of those aborts and retries is high.</li>
<li>Hardware still has much room for improvement; as transactions run faster, genuine conflicts become rarer, so optimistic control strategies become the better bet.</li>
</ul>
<p>Building on MVCC snapshot isolation, we apply optimistic control with these principles:</p>
<ul>
<li>Before a transaction reads, check whether concurrent uncommitted writes could make that read stale.</li>
<li>Before a transaction commits, check whether writes that occurred after its read phase could create conflicts with other transactions.</li>
</ul>
<p>If conflicts are found, roll back.</p>
<p>Serializable Snapshot Isolation (SSI) relies on SSI locks, similar to index-range locks, but SSI locks only record—they don’t block. After a transaction commits, SSI locks notify other related transactions and are discarded.</p>
<p>The implementation tradeoff is lock granularity: too coarse may misjudge conflicts and expand a transaction’s impact; too fine may cause excessive metadata overhead.</p>
<h1 id="distributed-system-challenges">Distributed System Challenges</h1>
<ul>
<li>Network latency</li>
<li>Clock synchronization</li>
<li>Process pausing/crashing</li>
</ul>
<p>Why go distributed?</p>
<ul>
<li>Scalability</li>
<li>Fault tolerance</li>
<li>Low latency</li>
</ul>
<p>That said, if you can avoid opening Pandora’s box, keeping everything on one machine is worth trying.</p>
<p>Properties:</p>
<ul>
<li><strong>Safety</strong>: Properties that must never be violated—once violated, the system design has failed.</li>
<li><strong>Liveness</strong>: Availability the system guarantees under certain preconditions. If preconditions fail, restoring them returns the system to normal.</li>
</ul>
<p>The following discussions assume we’ve already solved some problems through transactions.</p>
<h2 id="byzantine-faults">Byzantine Faults</h2>
<p>Not worth considering.</p>
<ul>
<li>Too expensive.</li>
<li>Environmental issues like radiation can cause Byzantine faults, but the probability on Earth’s surface is extremely low. Of course, machines operating in space must consider this.</li>
<li>Software bugs can cause machine errors, but all machines run the same code. Bugs can’t be prevented unless all machines’ software is independently developed and only a few have bugs—which is clearly unrealistic.</li>
<li>Network intrusions could cause machine errors, but once an intruder can compromise one machine, there’s no reason to believe they can’t compromise all machines. Authentication, encryption, and firewalls are better approaches to network intrusion.</li>
</ul>
<h1 id="consensus">Consensus</h1>
<p>Consensus is one of the most important abstractions in distributed systems: all nodes agree on a proposal. Based on this, many distributed-system challenges can be addressed. The solutions below can achieve consensus. In that sense, consensus is a bit like Turing completeness: once you implement one strong consensus mechanism, you can derive many others from it.</p>
<h2 id="linearizability">Linearizability</h2>
<blockquote>
<p>Basic idea: Make a system appear as if there’s only one data copy, and all operations are atomic. As we’ll see, because linearizability has a simple definition, we can use whether other solutions can achieve linearizability to verify whether they’re consensus algorithms.</p>
</blockquote>
<p>Notes:</p>
<ul>
<li>Linearizability’s most intuitive requirement: once the system returns the latest value for a read, even if the related write hasn’t committed, all subsequent reads must return the latest value.</li>
<li>Atomic operations. This property can be expressed as CAS (compare-and-set), similar to preventing lost updates in single-machine transactions.</li>
<li>Building on the above, we <strong>don’t consider the effects of external network latency when observing the system</strong>—i.e., we don’t need to consider phantom reads. Once linearizability is achieved, we can naturally build distributed transactions to handle phantom reads. But when thinking about the problem model, be clear: transactions address data inconsistency between tables (a business-layer concern); distributed system challenges address data synchronization inconsistency between replicas (an infrastructure-layer concern).</li>
<li>Note the distinction between serializability and linearizability: the former concerns <strong>concurrent</strong> transaction results matching serial execution; the latter concerns data replicas appearing as a single copy—the strongest linearizability prevents the outside world from perceiving <strong>parallelism</strong>.</li>
<li>Actual serial execution and two-phase locking first achieve linearizability by restricting parallelism and concurrency, thereby achieving serializability. Serializable snapshot isolation, however, uses different snapshots for optimistic concurrency control; snapshot states and their changes can proceed in parallel, so linearizability is not guaranteed. Multiple SSI-based transactions may read different values, but conflicting ones cannot both commit. The first two approaches try to prevent this situation up front.</li>
</ul>
<p>Quorum can achieve linearizability, provided there’s no uncertain network latency.</p>
<blockquote>
<p>As long as there’s an unreliable network, there’s a risk of violating linearizability. This is CAP theory: the network is definitely unreliable; between availability and consistency, you can only choose one.</p>
</blockquote>
<h2 id="ordering-guarantees">Ordering Guarantees</h2>
<h3 id="causal-consistency">Causal Consistency</h3>
<p>Linearizability’s constraints are too strict. Can we achieve consensus with weaker requirements? We think of causal consistency: as long as causally related events occur in order and other events happen in parallel, the difficulty should be less than linearizability (which is equivalent to no parallelism).</p>
<blockquote>
<p>If a system obeys the order prescribed by causal relationships, we call it causally consistent. If neither of two operations happened before the other, they’re concurrent; otherwise they’re causal and can be ordered.</p>
</blockquote>
<p>How do we causally order operations? One approach is Lamport timestamps: assign each operation/client a sequence number composed of an incrementing counter plus a node ID. Sequence numbers are ordered by counter first, then node ID on ties. Every request carries a timestamp. Whenever a node or client observes a larger sequence number, it advances its own counter so the next request uses an even larger value. This allows ordering of operations.</p>
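<p>A minimal sketch of such a clock; Python tuples give the counter-first, node-id-on-ties ordering for free:</p>

```python
class LamportClock:
    """A node's Lamport timestamp: (counter, node_id)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Local event or outgoing request: advance the counter and stamp.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        # On seeing another node's timestamp, jump our counter past it so
        # our next stamp orders after everything we have observed.
        other_counter, _ = timestamp
        self.counter = max(self.counter, other_counter)

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()        # (1, 'A')
t2 = a.tick()        # (2, 'A')
b.observe(t2)        # B learns of A's progress
t3 = b.tick()        # (3, 'B'): causally after t2
assert t1 < t2 < t3  # tuple comparison: counter first, node id on ties
```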
<p>As long as there’s no uncertain network latency, ordering is guaranteed. With network latency, implementing CAS with Lamport timestamps requires each node to first confirm no concurrent CAS requests (resolving ties by sequence number), so network latency directly stalls the system. (Lamport timestamps’ operating environment requires more assumptions that we won’t discuss.)</p>
<h3 id="total-order-broadcast">Total Order Broadcast</h3>
<p>Lamport timestamps and other causal-consistency approaches can fail here because they behave like synchronous coordination models (decision waits for information from all nodes). If we switch to an asynchronous model, CAS can be implemented.</p>
<p>Regardless of implementation environment, CAS can be achieved with these two conditions:</p>
<ul>
<li><strong>Reliable delivery</strong>: No message loss. If a message is sent to one node, it must be sent to all nodes.</li>
<li><strong>Strict ordering</strong>: Messages are always delivered to each node in the same order.</li>
</ul>
<p>With this, CAS only needs to be implemented on one node, and all other nodes must follow that CAS operation because operations are reliably delivered and strictly ordered.</p>
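<p>A sketch of why this works: since every node applies the same delivered log in the same order, the CAS decisions are deterministic and identical everywhere. Plain Python, with log entries standing in for delivered messages:</p>

```python
def apply_log(log):
    """Replay a totally ordered log of CAS(expected, new) operations
    against a single register, recording which ones succeeded."""
    register = None
    results = []
    for expected, new in log:
        if register == expected:
            register = new
            results.append(True)
        else:
            results.append(False)
    return register, results

# The same totally ordered log is delivered to every node, so every node
# computes exactly this state and this success/failure sequence.
log = [(None, "A"), (None, "B"), ("A", "C")]
print(apply_log(log))  # ('C', [True, False, True])
```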
<p>So does implementing total order broadcast achieve consensus? We can verify by checking whether total order broadcast can achieve linearizability. It turns out that asynchronous models can’t handle the reading problem: <strong>system-internal network latency</strong> (not external—we said we don’t need to consider external network latency) may cause the outside world to read a new value followed by an old value.</p>
<p>Therefore:</p>
<ul>
<li>We say total order broadcast satisfies write linearizability (which is essentially serializability) but not read linearizability.</li>
<li>But is that really the case? Actually, if we treat reads (or reads requiring strict accuracy) as operations and add them to the operation queue, read linearizability is satisfied. ZooKeeper and etcd have similar implementations.</li>
</ul>
<p>The key takeaway from the two paragraphs above is:</p>
<blockquote>
<p>Fully asynchronous consensus is impossible to achieve. Synchronous consensus is possible but has performance issues.</p>
</blockquote>
<p>So how do we implement total order broadcast? (Answer: using linearizability…)</p>
<h2 id="implementing-total-order-broadcast">Implementing Total Order Broadcast</h2>
<p>Finally, we discuss achieving consensus through total order broadcast.</p>
<p>We first discuss a feasible approximate solution, then extend it to total order broadcast.</p>
<h3 id="two-phase-commit-2pc">Two-Phase Commit (2PC)</h3>
<p>Introduce a new component: the coordinator. The coordinator and nodes implement two-phase commit as follows:</p>
<ol>
<li>Send a prepare request to all nodes, asking if they can commit.</li>
<li>If all nodes return “yes,” the coordinator issues a commit request and nodes commit. If any node returns “no,” the coordinator tells all nodes to abort.</li>
</ol>
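<p>The happy path above can be sketched as follows (illustrative classes, no networking or failure handling):</p>

```python
def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    decision = all(votes)                         # commit only if all say yes
    for p in participants:                        # phase 2: broadcast decision
        p.commit() if decision else p.abort()
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "init"
    def prepare(self):
        self.state = "prepared" if self.can_commit else "vote-no"
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

nodes = [Participant(True), Participant(True)]
assert two_phase_commit(nodes) and all(n.state == "committed" for n in nodes)

nodes = [Participant(True), Participant(False)]  # one "no" aborts everyone
assert not two_phase_commit(nodes) and all(n.state == "aborted" for n in nodes)
```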
<p>This can clearly produce agreement in the happy path. The problem, as always, is network latency and failures. So we enhance the coordinator’s fault tolerance:</p>
<h3 id="fault-tolerant-consensus">Fault-Tolerant Consensus</h3>
<p>The book introduces several fault-tolerant consensus algorithms: VSR, Paxos, Raft, and Zab. I’m most familiar with Raft (MIT 6.824), so I’ll use Raft to explain how to extend 2PC for better fault tolerance.</p>
<p>The coordinator can also be the leader node. Whenever a problem occurs, if we can elect a new leader from available nodes to serve as coordinator, consensus remains valid. Electing a new node is itself a consensus problem—this consensus only needs to be acknowledged by more than half the nodes.</p>
<p>Whenever a node detects leader failure, it can declare itself the new leader. As long as it gains acknowledgment from more than half the nodes, it becomes the leader. This leader then acts as coordinator, ordering operations for follower nodes—any operation approved by more than half the nodes is immediately executed.</p>
<p>If more than half the nodes go down, the system enters a crash state. The benefit of requiring a majority: there’s always at least one node that participated in every vote, ensuring split-brain doesn’t occur and consensus is maintained.</p>
<h3 id="membership-and-coordination-services">Membership and Coordination Services</h3>
<p>Now we know how to achieve consensus. We also see that not all operations need consensus participation; only certain critical operations do. Therefore, we can use packaged systems like ZooKeeper (Zab) and etcd (Raft) to establish consensus for small, memory-resident datasets and build highly reliable coordination paths.</p>
]]></content:encoded>
  </item>
  <item>
    <title>Introduction to Ring-Allreduce</title>
    <link>https://www.yujiachen.com/ring-allreduce/</link>
    <guid isPermaLink="true">https://www.yujiachen.com/ring-allreduce/</guid>
    <pubDate>Thu, 24 May 2018 00:00:00 GMT</pubDate>
    <category>Tech</category>
    <content:encoded><![CDATA[<p><em>Translated by Claude from the Chinese original.</em></p>
<p>Today I stumbled upon the ring-allreduce GPU communication algorithm out of curiosity. I originally wanted to read about it on <a href="http://research.baidu.com/bringing-hpc-techniques-deep-learning/">Baidu Research’s page</a>, but couldn’t find the relevant content on their site. After reading the comments in the <a href="https://github.com/baidu-research/baidu-allreduce">baidu-allreduce</a> code, I understood it. It’s actually a fairly simple algorithm to explain, so I’ll give a brief overview.</p>
<p>If you want implementation details, you can read the comments on GitHub directly—they’re very clear: <a href="https://github.com/baidu-research/baidu-allreduce/blob/master/collectives.cu#L156">https://github.com/baidu-research/baidu-allreduce/blob/master/collectives.cu#L156</a></p>
<ul>
<li><em>Unless otherwise noted, the images in this post are from <a href="https://www.zhihu.com/question/57799212/answer/292494636?utm_source=ZHShareTargetIDMore&amp;utm_medium=social&amp;utm_oi=37729630945280">a Zhihu answer</a>, because I could not find the original Baidu Research article containing these diagrams.</em></li>
</ul>
<p>A major drawback of typical multi-GPU training is that one GPU needs to collect gradients from all other GPUs each time, then distribute the updated model back to all other GPUs. As shown below:</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_1.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_1.png" alt="" class="article-image" width="452" height="310" />
</picture>
</p>
<p>The biggest drawback of this model is that GPU 0’s communication time grows linearly with the number of GPUs. This is why ring-allreduce was developed, as shown below:</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_2.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_2.png" alt="" class="article-image" width="419" height="330" />
</picture>

The basic idea of this algorithm is to eliminate the central reducer and let data flow through a ring formed by the GPUs. The entire ring-allreduce process consists of two major steps: the first step is reduce-scatter, and the second step is allgather.</p>
<p>First step (reduce-scatter): We have n GPUs. We divide the data on each GPU into n equal chunks and assign each GPU its left and right neighbors (in the diagram, GPU 0’s left neighbor is GPU 4 and right neighbor is GPU 1; GPU 1’s left neighbor is GPU 0 and right neighbor is GPU 2, and so on). Then we perform n-1 rounds. In round i, GPU j sends its chunk (j - i) mod n to GPU j+1 and receives chunk (j - i - 1) mod n from GPU j-1, then performs a reduce operation on the received data. (All indices use modulo-n wrap-around, e.g., -1 mod n = n - 1.) The diagram below illustrates this:</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_3.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_3.png" alt="" class="article-image" width="480" height="335" />
</picture>

After n-1 rounds, the first step (reduce-scatter) of ring-allreduce is complete. At this point, each GPU holds one fully reduced chunk. The algorithm then enters allgather, which also takes n-1 rounds.</p>
<p>The second step, allgather, is straightforward: through n-1 rounds, each GPU forwards the reduced chunks around the ring. In round i, GPU j sends its (j - i + 1) mod n chunk to its right neighbor and receives the (j - i) mod n chunk from its left neighbor (matching the indices in the baidu-allreduce source). Unlike the first step, the received data does not need a reduce operation; it is copied directly into place.</p>
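<p>The two steps can be checked with a small simulation. This sketch follows the chunk indexing used in baidu-allreduce’s <code>collectives.cu</code>, with per-chunk scalars standing in for gradient tensors:</p>

```python
def ring_allreduce(data):
    """Simulate ring-allreduce: data[j][c] is GPU j's value for chunk c.
    Returns each GPU's final chunks (all equal to the per-chunk sums)."""
    n = len(data)
    chunks = [list(row) for row in data]

    # Reduce-scatter: in round i, GPU j sends chunk (j - i) % n to GPU j+1,
    # which reduces (sums) it into its own copy of that chunk.
    for i in range(n - 1):
        sent = [chunks[j][(j - i) % n] for j in range(n)]
        for j in range(n):
            c = (j - i - 1) % n              # chunk received from GPU j-1
            chunks[j][c] += sent[(j - 1) % n]

    # Allgather: in round i, GPU j sends its fully reduced chunk
    # (j - i + 1) % n to GPU j+1, which copies it directly into place.
    for i in range(n - 1):
        sent = [chunks[j][(j - i + 1) % n] for j in range(n)]
        for j in range(n):
            c = (j - i) % n                  # chunk received from GPU j-1
            chunks[j][c] = sent[(j - 1) % n]
    return chunks

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result = ring_allreduce(data)
# Every GPU ends with the element-wise sums of the three rows.
print(result)
```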
<p>Finally, the data on each GPU looks like this:</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_4.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_4.png" alt="" class="article-image" width="572" height="321" />
</picture>

If it’s still unclear, let’s walk through a 3-GPU example:</p>
<p>First, the reduce-scatter step:</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_5.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_5.png" alt="" class="article-image" width="772" height="913" />
</picture>
</p>
<p>Then the allgather step:</p>
<p>
<picture>
  <source srcset="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_6.webp" type="image/webp" />
  <img src="https://www.yujiachen.com/assets/posts/ring-allreduce/en/image_6.png" alt="" class="article-image" width="1080" height="1321" />
</picture>
</p>
<p>References:</p>
<p><a href="https://github.com/baidu-research/baidu-allreduce">https://github.com/baidu-research/baidu-allreduce</a></p>
<p><a href="https://www.zhihu.com/question/57799212/answer/292494636?utm_source=ZHShareTargetIDMore&amp;utm_medium=social&amp;utm_oi=37729630945280">https://www.zhihu.com/question/57799212/answer/292494636?utm_source=ZHShareTargetIDMore&amp;utm_medium=social&amp;utm_oi=37729630945280</a></p>
]]></content:encoded>
  </item>
  </channel>
</rss>
