Added 06 Dec 2025

CatOps

@catops
Subscribers: 5,059
Photos: 94
Videos: 5
Links: 2,660
Description:
DevOps and other issues by Yurii Rochniak (@grem1in) - SRE @ Preply && Maksym Vlasov (@MaxymVlasov) - Engineer @ Star. Opinions on our own. We do not post ads including event announcements. Please, do not bother us with such requests!

👥 Subscribers

5,059
Average per day: +2
Average per week: 0
Average per month: +16

📊 Messages per day

0.5
Last day: 1
Weekly average: 0.7
Daily average: 0.5

Status change history

Not officially verified 2025-12-06

Wall

Telegram channel statistics

👁 966 26-06-18 11:23
Continuing with our AI week.

"AI in SRE: What's Actually Coming in 2026" tells the story of AI coming to help with incident response.

The article suggests trying an AI tool for real investigations or for data collection for postmortems. To clarify: in my experience, you don't need a dedicated tool; a general-purpose AI agent with some harness (skills and scripts) will do. You should try it! AI does the job of data gathering incredibly well. Yet the results are indeed not perfect.

Another good point in this article is data quality. AI results are only as good as the context you provide. I have witnessed two prominent failure modes so far:

1. Inference on incomplete data: a person with limited access (typically a developer) asks their agent to investigate an alert. The agent comes to some conclusion. At the same time, a person with elevated access (typically a systems engineer) asks their agent to investigate the same alert and gets a different result, likely because some data is only available via kubectl events, etc. The fix is not to allow everyone to do everything; the fix is to revisit your observability pipelines and ensure that you ship all the relevant data, which is easier said than done.
2. Agent that cries "wolf": if you have a pollutant in your logs, or simply an event that happens very often, agents like to correlate it with everything. If your clusters are elastic, an agent could blame node count fluctuations for every error. The problem is that once a node count fluctuation actually causes an incident, you will be the one to ignore this hint from the agent, because it suggests it every single time.

If you are ready to share more AI failure modes specifically related to SRE in Ukrainian, welcome to our chat.

#ai #sre
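One way to make the second failure mode visible is to track how often an agent blames each signal and how often that blame is later confirmed by a human. The sketch below is only an illustration: the record fields (`blamed`, `confirmed_cause`), the function names, and the thresholds are all assumptions, not something taken from the article or from any particular tool.

```python
from collections import Counter


def signal_stats(investigations):
    """Per signal: how often the agent blamed it and how often that was right.

    Each investigation is a hypothetical record:
      {"blamed": set of signal names, "confirmed_cause": signal name or None}
    """
    total = len(investigations)
    blamed = Counter()
    confirmed = Counter()
    for inv in investigations:
        for signal in inv["blamed"]:
            blamed[signal] += 1
            if signal == inv["confirmed_cause"]:
                confirmed[signal] += 1
    return {
        signal: {
            "blame_rate": blamed[signal] / total,
            "precision": confirmed[signal] / blamed[signal],
        }
        for signal in blamed
    }


def noisy_signals(investigations, min_blame_rate=0.5, max_precision=0.2):
    """Signals blamed very often but rarely confirmed (thresholds are arbitrary)."""
    stats = signal_stats(investigations)
    return sorted(
        signal
        for signal, s in stats.items()
        if s["blame_rate"] >= min_blame_rate and s["precision"] <= max_precision
    )


history = [
    {"blamed": {"node_count_change"}, "confirmed_cause": "bad_deploy"},
    {"blamed": {"node_count_change", "oom_kill"}, "confirmed_cause": "oom_kill"},
    {"blamed": {"node_count_change"}, "confirmed_cause": None},
    {"blamed": {"node_count_change"}, "confirmed_cause": "dns_timeout"},
]
print(noisy_signals(history))
```

A signal that lands on this list is a candidate for cleaning up at the source or for an explicit note in the agent's context, rather than something to silently filter out.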
👁 1,420 26-05-14 14:31
Continuing with the security advisories.

NGINX ngx_http_rewrite_module vulnerability CVE-2026-42945.

~NGINX Plus and NGINX Open Source have a vulnerability in the *ngx_http_rewrite_module* module. This vulnerability exists when the *rewrite* directive is followed by a *rewrite*, *if*, or *set* directive and an unnamed Perl-Compatible Regular Expression (PCRE) capture (for example, $1, $2) with a replacement string that includes a question mark (?). An unauthenticated attacker along with conditions beyond its control can exploit this vulnerability by sending crafted HTTP requests. This may cause a heap buffer overflow in the NGINX worker process leading to a restart. Additionally, for systems with Address Space Layout Randomization (ASLR) disabled, code execution is possible. (CVE-2026-42945)

Don't confuse F5's NGINX Ingress Controller with the community-led ingress-nginx, which is now deprecated.

In any case, though, if you're using ngx_http_rewrite_module (and it's widely used!), you are likely vulnerable.

#security
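To get a rough idea of where to look first, you could scan your configs for `rewrite` directives whose replacement string contains both an unnamed capture (`$1`, `$2`, ...) and a question mark. The sketch below is a deliberately simple heuristic under stated assumptions: it only handles single-line, unquoted `rewrite` directives, it does not check whether another `rewrite`, `if`, or `set` follows, and it says nothing about whether your NGINX version is affected. The function and variable names are made up for illustration.

```python
import re

# Matches a single-line directive of the form: rewrite <regex> <replacement> [flag];
# Quoted arguments containing spaces are NOT handled by this simplified pattern.
REWRITE_RE = re.compile(r"\s*rewrite\s+(\S+)\s+(\S+)(?:\s+\w+)?\s*;")
UNNAMED_CAPTURE_RE = re.compile(r"\$\d")


def suspicious_rewrites(config_text):
    """Return (line_number, directive) pairs worth reviewing by hand."""
    hits = []
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        match = REWRITE_RE.match(line)
        if not match:
            continue
        replacement = match.group(2)
        if "?" in replacement and UNNAMED_CAPTURE_RE.search(replacement):
            hits.append((line_no, line.strip()))
    return hits


sample_config = """
server {
    rewrite ^/old/(.*)$ /new/$1?from=old last;
    rewrite ^/a$ /b permanent;
    set $x 1;
}
"""
for line_no, directive in suspicious_rewrites(sample_config):
    print(f"line {line_no}: {directive}")
```

Treat any hit as a pointer for manual review against the advisory, not as a verdict.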
👁 1,690 26-04-18 10:12
Do you trust your colleagues?

The article "Stop Using Pull Requests", from the same author as the previous article in the channel, argues that pull requests may not be ideal.

The core argument is that pull requests were originally created for low-trust open source environments, in which contributors may have never seen each other and often do not know each other at all. Development teams in the corporate world operate on a different set of assumptions.

It's interesting that this article also builds on the ideas of Thierry de Pauw. IIRC, I already posted his talk "Non blocking Pull Requests" on the channel, but in any case, I can do it again.

The main premise of the article is that you need to adopt T*D practices: test-driven development, trunk-based development, and another made-up T*D practice that basically means pair programming.

From my experience, I can say that eliminating pull requests is probably not something you can do in the short run, but measuring the waiting time before PRs are merged is a good practice. Another good practice is to team up on tasks or projects. So, basically pair programming, but several people can still work on different tasks within a project, share context on this project, and thus be able to review each other's work almost immediately without much context switching.

T*D practices are also nice. Honestly, I have the impression that the majority of people are using the trunk-based merge model and continuous deployment these days. Also, it's interesting how AI can facilitate test-driven development: spec (by a human) => tests (by a machine) => tests review (by humans) => coding (by a machine).

#culture #programming
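Measuring the waiting time before PRs are merged can start very small. Here is a minimal sketch, assuming you have already exported your pull requests as records with ISO 8601 `created_at` and `merged_at` timestamps; the record shape and the function names are assumptions for illustration, so adapt them to whatever your code hosting actually returns.

```python
from datetime import datetime
from statistics import median


def merge_wait_hours(pull_requests):
    """Hours between creation and merge for every merged PR."""
    waits = []
    for pr in pull_requests:
        if pr.get("merged_at") is None:
            continue  # still open, or closed without merging
        created = datetime.fromisoformat(pr["created_at"])
        merged = datetime.fromisoformat(pr["merged_at"])
        waits.append((merged - created).total_seconds() / 3600)
    return waits


def median_merge_wait_hours(pull_requests):
    """Median wait in hours, or None if nothing was merged."""
    waits = merge_wait_hours(pull_requests)
    return median(waits) if waits else None


# Hypothetical sample data.
sample_prs = [
    {"created_at": "2026-04-01T09:00:00+00:00", "merged_at": "2026-04-01T13:00:00+00:00"},
    {"created_at": "2026-04-02T10:00:00+00:00", "merged_at": "2026-04-03T10:00:00+00:00"},
    {"created_at": "2026-04-03T08:00:00+00:00", "merged_at": None},
]
print(median_merge_wait_hours(sample_prs))
```

The median is used here instead of the mean so that a few PRs that sat open for weeks do not dominate the number. Note that timestamps ending in `Z` are only parsed by `datetime.fromisoformat` on Python 3.11 and later.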
👁 1,610 26-03-24 08:22
You may already know that Trivy, a popular security scanner, was compromised last Friday.

- Here is a report by Wiz about this breach.
- Here is another article that goes beyond the GitHub Actions exploit.

If you run Trivy in any form, including locally, double-check what you ran and when.

Check whether your CI logs contain lines like the one below, especially if you don't normally use curl in your CI.

Terminate orphan process: pid (xxxx) (curl)

Check whether you have this file on your local machine or on a non-GHA executor: ~/.config/systemd/user/sysmon.py.

You may need to rotate a lot of credentials as fallout from this breach.

Also, as harsh as it sounds, this line from one of the articles above makes sense:

~Stop using Trivy. This isn’t the first time Aqua Security’s infrastructure has been compromised, and the `aqua-bot` account that enabled this attack was reportedly left exposed from a previous incident earlier in March that was never fully contained. That’s not a one-off failure; it’s an organizational pattern. A security scanning tool that can’t secure its own supply chain is a liability, not an asset. Remove `trivy-action` from your workflows and the Trivy CLI from your toolchains.

#security
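The two checks mentioned in the post can be scripted. This is a minimal sketch, not a complete detector: it assumes the `xxxx` in the log line stands for a numeric PID, it only looks for the exact log line and file path quoted above, and the function names are made up. A clean result does not prove you were unaffected, so still follow the linked reports.

```python
import re
from pathlib import Path

# Assumption: "xxxx" in the quoted log line is a numeric process ID.
ORPHAN_CURL_RE = re.compile(r"Terminate orphan process: pid \(\d+\) \(curl\)")

# File path quoted in the post, resolved against the current user's home.
SYSMON_PATH = Path.home() / ".config" / "systemd" / "user" / "sysmon.py"


def find_orphan_curl_lines(log_text):
    """Return CI log lines that report an orphaned curl process."""
    return [line for line in log_text.splitlines() if ORPHAN_CURL_RE.search(line)]


def sysmon_file_present(path=SYSMON_PATH):
    """True if the suspicious file exists at the given path."""
    return Path(path).is_file()


sample_log = """\
Post job cleanup.
Terminate orphan process: pid (1234) (curl)
Terminate orphan process: pid (5678) (node)
"""
print(find_orphan_curl_lines(sample_log))
print(sysmon_file_present())
```

To use it on real data, read each downloaded CI log into a string and pass it to `find_orphan_curl_lines`, and run `sysmon_file_present()` on every machine or self-hosted executor where Trivy was executed.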
👁 1,430 26-02-25 11:08
An article from OpenAI on how they created a complete project without any human-written code.

This is, of course, kind of marketing material for OpenAI, but it also has interesting points:

- [...] code throughput increased, our bottleneck became human QA capacity.
- [...] management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: give Codex a map, not a 1,000-page instruction manual.
- [...] the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist. Knowledge that lives in Google Docs, chat threads, or people’s heads are not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.

And the most important point, in my opinion:

[...] kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
...
In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.

In any case, it's an interesting read. Obviously, it's all related to a completely greenfield project. So, your mileage for decade-old monoliths may vary.

P.S. Also, according to the diagrams in this article, OpenAI uses VictoriaMetrics, which is also cool :)

#ai #programming
👁 1,800 26-01-09 09:21
I think this could be a good Friday read: "When Change Outruns Us" is a tale about sustained progress.

The main point of this article is that smart companies do not push for "constant change for the sake of change", but rather adopt a more cyclic pace, where periods of intensive work are followed by more relaxed times.

This article is particularly interesting to me because I've just finished listening to the book "Slow Productivity" by Cal Newport. One of the principles outlined in that book is that one should work at a natural pace. However, a constant run is no one's natural pace. Another observation in that book is that, starting from the second half of the 20th century, managers started to approximate work by "busyness", i.e. if you look busy, you are doing some work, even if in reality there are zero outcomes.

Many tech companies like to claim that they are "outcomes-oriented" or that they "value impact", but in my experience, "busyness" is still the approximation for work. Especially once a company grows beyond the size at which everyone naturally knows everyone else, as well as what they are doing.

#culture #mgmt
👁 1,730 25-12-06 10:53
At least Cloudflare is fast in sharing their postmortems.

https://blog.cloudflare.com/5-december-2025-outage/

A curious thing is this:

>>>Customers that have their web assets served by our older FL1 proxy AND had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as /cdn-cgi/trace.<<<

IIRC, in the previous incident on Nov 18, only the customers on the newer proxy version were impacted. So, one could say that Cloudflare had a single time-distributed total outage.

Another important thing:

>>>Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.<<<

Honestly, looking forward to seeing the write-up. I can only imagine how stressed their team is after taking down a big chunk of the Internet twice in less than 30 days.

#cloudflare #postmortem