Planet Crustaceans

This is a Planet instance for lobste.rs community feeds. To add/update an entry or otherwise improve things, fork this repo.

June 20, 2019

Indrek Lasn (indreklasn)

How did you get this title formatting? June 20, 2019 06:59 PM

How did you get this title formatting?

How To Use Redux with React Hooks June 20, 2019 10:44 AM

React Redux released hooks with the 7.1.0 version. This means we get to use the latest best practices with React.

June 19, 2019

Derek Jones (derek-jones)

A zero-knowledge proofs workshop June 19, 2019 11:33 PM

I was at the Zero-Knowledge proofs workshop run by BinaryDistrict on Monday and Tuesday. The workshop runs all week, but is mostly hacking for the remaining days (hacking would be interesting if I had a problem to code; more about this at the end).

Zero-knowledge proofs allow person A to convince person B that A knows the value of x, without revealing the value of x. There are two kinds of zero-knowledge proofs: interactive proof systems involve a sequence of messages being exchanged between the two parties, while in non-interactive systems (the primary focus of the workshop) there is no interaction.

The example usually given, of a zero-knowledge proof, involves Peggy and Victor. Peggy wants to convince Victor that she knows how to unlock the door dividing a looping path through a tunnel in a cave.

The ‘proof’ involves Peggy walking, unseen by Victor, down path A or B (see diagram below; image from Wikipedia). Once Peggy is out of view, Victor randomly shouts out A or B; Peggy then has to walk out of the tunnel using the path Victor shouted; there is a 50% chance that Peggy happened to choose the path selected by Victor. The proof is iterative; at the end of each iteration, Victor’s uncertainty of Peggy’s claim of being able to open the door is reduced by 50%. Victor has to iterate until he is sufficiently satisfied that Peggy knows how to open the door.

Alibaba example cave loop.
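The iterative halving described above can be sketched with a small simulation (a hypothetical illustration; the function name, round count, and trial count are arbitrary choices, not from the workshop):

```python
# A cheating Peggy (who cannot open the door) must hope Victor calls the
# path she already walked down, so each round halves her chance of
# passing undetected.
import random

def cheater_survives(rounds: int) -> bool:
    """Return True if a key-less Peggy passes every round by luck."""
    for _ in range(rounds):
        peggy_path = random.choice("AB")   # chosen before Victor shouts
        victor_call = random.choice("AB")
        if peggy_path != victor_call:      # wrong exit: caught cheating
            return False
    return True

# Over many trials, the survival rate approaches 1 / 2**rounds.
trials = 100_000
rate = sum(cheater_survives(10) for _ in range(trials)) / trials
print(rate)  # close to 1/1024 ≈ 0.001
```

After ten rounds Victor is already more than 99.9% sure Peggy is not guessing, which is why he simply iterates until he is satisfied.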

As the name suggests, non-interactive proofs do not involve any message passing; in the common reference string model, a string of symbols, generated by the person making the claim of knowledge, is encoded in such a way that it can be used by third parties to verify the claim of knowledge. At the workshop we got an overview of zk-SNARKs (zero-knowledge succinct non-interactive arguments of knowledge).

The ‘succinct’ component of zk-SNARK is what has made this approach practical. When non-interactive proofs were first proposed, the arguments of knowledge contained around one terabyte of data; these days common reference strings are around a kilobyte.

The fact that zero-knowledge ‘proofs’ are possible is very interesting, but do they have practical uses?

The hackathon aspect of the workshop was designed to address the practical use issue. Existing zero-knowledge proofs tend to involve the use of prime numbers, or the factors of very large numbers (as might be expected of a proof system that is heavily based on cryptographic techniques). Making use of zero-knowledge proofs requires mapping the problem to a form that has a known solution; this is very hard. Existing applications involve cryptography and blockchains (Zcash is a cryptocurrency with an option that provides privacy via zero-knowledge proofs), both heavy users of number theory.

The workshop introduced us to two languages which could be used for writing zero-knowledge applications: ZoKrates and snarky. The weekend before the workshop, I tried to install both languages: ZoKrates installed quickly and painlessly, while I could not get snarky installed (I was told that the first two hours of the snarky workshop were spent getting installs to work); I also noticed that ZoKrates had a greater presence than snarky on the web, in the form of pages discussing the language. It seemed to me that ZoKrates was the market leader. The workshop presenters included people involved with both languages; Jacob Eberhardt (one of the people behind ZoKrates) gave a great presentation, and had good slides. Team ZoKrates is clearly the one to watch.

As an experienced hack attendee, I was ready with an interesting problem to solve. After I explained the problem to those opting to use ZoKrates, somebody suggested that oblivious transfer could be used to solve my problem (and indeed, 1-out-of-n oblivious transfer does offer the required functionality).

My problem was: let’s say I have three software products; the customer has a copy of all three, and is willing to pay the license fee to use one of them. However, the customer does not want me to know which of the three products they are using. How can I send them a product-specific license key, without knowing which product they are going to use? Oblivious transfer involves a sequence of message exchanges (each exchange involves three messages, one for each product), with the final exchange requiring that I send three messages, each containing a separate product key (one for each product); the customer can only successfully decode the product-specific message they selected earlier in the process (decoding the other two messages produces random characters, i.e., no product key).
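One classic way to realise this exchange is the Even-Goldreich-Lempel RSA construction for oblivious transfer; a toy sketch of a 1-out-of-3 version is below. The RSA parameters are deliberately tiny (and completely insecure), and the integer "product keys" are hypothetical stand-ins for real license keys:

```python
# Toy 1-out-of-3 oblivious transfer (Even-Goldreich-Lempel style).
import random

# Sender's toy RSA key pair.
p, q = 1009, 1013
n = p * q                           # public modulus
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

keys = [111, 222, 333]              # one key per product, encoded < n

# 1. Sender publishes three random values, one per product.
x = [random.randrange(n) for _ in range(3)]

# 2. Receiver secretly picks product b and blinds the matching x value;
#    v reveals nothing about which x was used.
b = 1
k = random.randrange(n)
v = (x[b] + pow(k, e, n)) % n

# 3. Sender unblinds v against every x_i and masks each key with the
#    result; only index b reproduces the receiver's k.
c = [(keys[i] + pow(v - x[i], d, n)) % n for i in range(3)]

# 4. Receiver can strip the mask only from the chosen message; the
#    other two masks are effectively random values to the receiver.
recovered = (c[b] - k) % n
print(recovered)  # 222
```

The sender learns nothing about b, and the receiver learns nothing about the other two keys, which is exactly the property the licensing problem needs.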

Like most hackathons, problem ideas were somewhat contrived (a few people wanted to delve further into the technical details). I could not find an interesting team to join, and left them to it for the rest of the week.

There were 50-60 people on the first day, and 30-40 on the second. Many of the people I spoke to were recent graduates, and half of the speakers were doing or had just completed PhDs; the field is completely new. If zero-knowledge proofs take off, decisions made over the next year or two by the people at this workshop will impact the path the field follows. Otherwise, nothing happens, and a bunch of people will have interesting memories about stuff they dabbled in, when young.

Indrek Lasn (indreklasn)

I reckon those are for testing purposes only. June 19, 2019 05:43 PM

I reckon those are for testing purposes only. Check out the Move programming section to programmatically handle transactions:

https://developers.libra.org/docs/move-overview#move-transaction-scripts-enable-programmable-transactions

I’m Giving Out My Best Business Ideas Hoping Someone Will Build Them — Part II June 19, 2019 02:36 PM

This is a series where I post ideas and problems I want to see solved. The ideas range from tech to anything. The main idea behind the…

June 18, 2019

Indrek Lasn (indreklasn)

Thanks, Madhu. I love exploring and writing about cool new tech. Keep on rockin’ June 18, 2019 08:55 PM

Thanks, Madhu. I love exploring and writing about cool new tech. Keep on rockin’

Getting started with the Facebook Libra programming language June 18, 2019 03:06 PM

Facebook revealed its new global cryptocurrency and programming environment called Libra. Libra will let you buy things or send money to…

June 17, 2019

Indrek Lasn (indreklasn)

I’m Giving Out My Best Business Ideas Hoping Someone Will Build Them June 17, 2019 09:07 PM

Yup, I’m giving out some of my best ideas. I’m not a saint, but I definitely belong to the group of doing things that matter and create a…

Tobias Pfeiffer (PragTob)

What’s the Fastest Data Structure to Implement a Game Board in Elixir? June 17, 2019 03:00 PM

Ever wanted to implement something board game like in Elixir? Chess? Go? Islands? Well, then you’re gonna need a board! But what data structure would be the most efficient one to use in Elixir? Conventional wisdom for a lot of programming languages is to use some sort of array. However, most programming languages with immutable […]

Bogdan Popa (bogdan)

racket/gui saves the day June 17, 2019 07:00 AM

Yesterday, I bought an icon pack containing over 3,000 (!) SVG files and macOS utterly failed me when I tried to search the unarchived folder.

empty search screen

So I did what any self-respecting Racketeer would do. I used this as an excuse to play around with Racket’s built-in GUI library!

the final product

Marc Brooker (mjb)

Is Anatoly Dyatlov to blame? June 17, 2019 12:00 AM

Is Anatoly Dyatlov to blame?

Without a good safety culture, operators are bound to fail.

(Spoiler warning: contains spoilers for the HBO series Chernobyl, and for history.)

Recently, I enjoyed watching HBO's new series Chernobyl. Like everybody else on the internet, I have some thoughts about it. I'm not a nuclear physicist or engineer, but I do think a lot about safety and the role of operators.

The show tells the story of the accident at Chernobyl in April 1986, the tragic human impact, and the cleanup and investigation in the years that followed. One of the villains in the show is Anatoly Dyatlov, the deputy chief engineer of the plant. Dyatlov was present in the control room of reactor 4 when it exploded, and received a huge dose of radiation (the second, or perhaps third, large dose in his storied life of being near reactor accidents). HBO's portrayal of Dyatlov is of an arrogant and aggressive man whose refusal to listen to operators was a major cause of the accident. Some first-hand accounts agree [2, 3, 6], and others disagree [1]. Either way, Dyatlov spent over three years in prison for his role in the accident.

There's little debate that the reactor's design was deeply flawed. The International Nuclear Safety Advisory Group (INSAG) found [4] that certain features of the reactor "had a primary influence on the course of the accident and its consequences". During the time before the accident, operators had put the reactor into a mode where it was unstable, with reactivity increases leading to higher temperatures, and further reactivity increases. The IAEA (and Russian scientists) also found that the design of the control rods was flawed, both in that they initially increased (rather than decreased) reactivity when first inserted, and in that the machinery to insert them moved too slowly. They also found issues with the control systems, cooling systems, and the fact that some critical safety measures could be manually disabled. Authorities had been aware of many of these issues since an accident at the Ignalina plant in 1983 [4, page 13], but no major design or operational practice changes had been made by the time of the explosion in 1986.

In the HBO series' telling of the last few minutes before the event, Dyatlov was shown to dismiss concerns from his team that the reactor shouldn't be run for long periods of time at low power. Initially, Soviet authorities claimed that the dangers of doing this were made clear to operators (and Dyatlov ignored procedures). Later investigations by the IAEA found no evidence that running the reactor in this dangerous mode was forbidden [4, page 11]. The same is true of other flaws in the plant. Operators weren't clearly told that pushing the emergency shutdown (aka SCRAM, aka AZ-5) button could temporarily increase the reaction rate in some parts of the reactor. The IAEA also found that the reactors were safe in "steady state", and the accident would not have occurred without the actions of operators.

Who is to blame for the 1986 explosion at Chernobyl?

In 1995, Dyatlov wrote an article in which he criticized both the Soviet and IAEA investigations [5], and asked a powerful question:

How and why should the operators have compensated for design errors they did not know about?

If operators make mistakes while operating systems which have flaws they don't know about, is that "human error"? Does it matter if their ignorance of those flaws is because of their own inexperience, bureaucratic incompetence, or some vast KGB-led conspiracy? Did Dyatlov deserve death for his role in the accident, as the series suggests? As Richard Cook says in "How Complex Systems Fail" [7]:

Catastrophe requires multiple failures – single point failures are not enough. ... Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

And

After accidents, the overt failure often appears to have been inevitable and the practitioner’s actions as blunders or deliberate willful disregard of certain impending failure. ... That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

If Dyatlov and the other operators of the plant had known about the design issues with the reactor that had been investigated following the accident at Ignalina in 1983, would they have made the same mistake? It's hard to believe they would have. If the reactor design had been improved following the same accident, would the catastrophe have occurred? The consensus seems to be that it wouldn't have, and if it had, it would have taken a different form.

From "How Complex Systems Fail":

Most initial failure trajectories are blocked by designed system safety components. Trajectories that reach the operational level are mostly blocked, usually by practitioners.

The show's focus on the failures of practitioners to block the catastrophe, and maybe on their unintentional triggering of it, seems unfortunate to me. The operators, despite their personal failings, had not been set up for success, either by arming them with the right knowledge or by giving them the right incentives.

From my perspective, the show is spot-on in its treatment of the "cost of lies". Lies, and the incentive to lie, make it almost impossible to build a good safety culture. But not lying is not enough. A successful culture needs to find the truth, and then actively use it to both improve the system and empower operators. Until the culture can do that, we shouldn't be surprised when operators blunder or even bluster their way into disaster.

Footnotes

  1. BBC, Chernobyl survivors assess fact and fiction in TV series, 2019
  2. Svetlana Alexievich, "Voices from Chernobyl".
  3. Serhii Plokhy, "Chernobyl: The History of a Nuclear Catastrophe". This is my favorite book about the disaster (I've probably read over 20 books on it), covering a good breadth of history and people without being too dramatic. There are a couple of minor errors in the book (like confusing GW and GWh in multiple places), but those can be overlooked.
  4. INSAG-7 The Chernobyl Accident: Updating of INSAG-1, IAEA, 1992
  5. Anatoly Dyatlov, Why INSAG has still got it wrong, NEI, 1995
  6. Adam Higginbotham, "Midnight in Chernobyl: The Untold Story of the World's Greatest Nuclear Disaster"
  7. Richard Cook, How Complex Systems Fail

June 16, 2019

Derek Jones (derek-jones)

Lehman ‘laws’ of software evolution June 16, 2019 09:32 PM

The so-called Lehman laws of software evolution originated in a 1968 study and evolved during the 1970s; the book “Program Evolution: processes of software change” by Lehman and Belady was published in 1985.

The original work was based on measurements of OS/360, IBM’s flagship operating system for the computer industry’s flagship computer. IBM dominated the computer industry from the 1950s through to the early 1980s; OS/360 was the Microsoft Windows, Android, and iOS of its day (in fact, it had more developer mind share than any of these operating systems).

In its day, the Lehman dataset not only bathed in reflected OS/360 developer mind-share, it was the only public dataset of its kind. But today, this dataset wouldn’t get a second look. Why? Because it contains just 19 measurement points, specifying: release date, number of modules, fraction of modules changed since the last release, number of statements, and number of components (I’m guessing these are high level programs or interfaces). Some of the OS/360 data is plotted in graphs appearing in early papers, and can be extracted; some of the graphs contain 18, rather than 19, points, and some of the values are not consistent between plots (extracted data); in later papers Lehman does point out that no statistical analysis of the data appears in his work (the purpose of the plots appears to be decorative, some papers don’t contain them).

One of Lehman’s early papers says that “… conclusions are based, comes from systems ranging in age from 3 to 10 years and having been made available to users in from ten to over fifty releases.”, but no other details are given. A 1997 paper lists module sizes for 21 releases of a financial transaction system.

Lehman’s ‘laws’ started out as a handful of observations about one very large software development project. Over time ‘laws’ have been added, deleted and modified; the Wikipedia page lists the ‘laws’ from the 1997 paper, Lehman retired from research in 2002.

The Lehman ‘laws’ of software evolution are still widely cited by academic researchers, almost 50 years later. Why is this? The two main reasons are: the ‘laws’ are sufficiently vague that it’s difficult to prove them wrong, and Lehman made a large investment in marketing these ‘laws’ (e.g., publishing lots of papers discussing these ‘laws’, and supervising PhD students who researched software evolution).

The Lehman ‘laws’ are not useful, in the sense that they cannot be used to make predictions; they apply to large systems that grow steadily (i.e., the kind of systems originally studied), and so don’t apply to systems that are completely rewritten. These ‘laws’ are really an indication that software engineering research has been in a state of limbo for many decades.

Indrek Lasn (indreklasn)

Demystifying React Hooks June 16, 2019 11:33 AM

You probably heard about the new concept for React called hooks. Hooks were released in React version 16.8 and they let us write stateful…

June 15, 2019

Stig Brautaset (stig)

Digital Minimalism June 15, 2019 02:07 PM

I introduce Cal Newport's book, and how it's helping me take control of where I spend my limited currency in today's attention economy.

June 14, 2019

Jeff Carpenter (jeffcarp)

On Being Injured (Again) June 14, 2019 09:55 PM

tl;dr: I signed up for the SF Marathon (this would have been my first marathon), then overtrained, got injured, and am currently recovering. I’m probably going to defer my registration to 2020 and become a cheering squad this year. (╯°□°)╯︵ ┻━┻ This is a cycle I’ve been through over and over again. I literally wrote about this in 2015. Being injured massively sucks. I can’t exercise the way I usually do, and I don’t get see my running buddies.

Indrek Lasn (indreklasn)

How to set up a powerful API with GraphQL, Koa, and MongoDB — deploying to production June 14, 2019 07:58 PM

Our GraphQL runs smoothly locally, but what if we want to share it with the world?

June 13, 2019

Indrek Lasn (indreklasn)

Working on a startup is frustrating and here’s how you can combat it June 13, 2019 08:15 AM

There’s no doubt starting a new startup can feel lonely, depressive, and frustrating. That’s mostly because we have no idea where we will…

June 12, 2019

Indrek Lasn (indreklasn)

4 Daily habits successfully happy people have June 12, 2019 10:23 AM

What makes successful people happy? Some might say wealth, money equals happiness, right?

June 09, 2019

Ponylang (SeanTAllen)

Last Week in Pony - June 9, 2019 June 09, 2019 08:39 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

June 08, 2019

Bogdan Popa (bogdan)

Announcing marionette June 08, 2019 12:45 PM

I just released the first version of marionette (named after the protocol it implements), a Racket library that lets you remotely control the Firefox web browser. Think “puppeteer”, but for Firefox.

Derek Jones (derek-jones)

PCTE: a vestige of a bygone era of ISO standards June 08, 2019 02:24 AM

The letters PCTE (Portable Common Tool Environment) might stir vague memories, for some readers. Don’t bother checking Wikipedia, there is no article covering this PCTE (although it is listed on the PCTE acronym page).

The ISO/IEC 13719 standard, Information technology — Portable common tool environment (PCTE), along with its three parts, has reached its 5-yearly renewal time.

The PCTE standard, in itself, is not interesting; as far as I know it was dead on arrival. What is interesting is the mindset, from a bygone era, that thought such a standard was a good idea; and, the continuing survival of a dead on arrival standard sheds an interesting light on ISO standards in the 21st century.

PCTE came out of the European Union’s first ESPRIT project, which ran from 1984 to 1989. Dedicated workstations for software developers were all the rage (no, not those toy microprocessor-based thingies, but big beefy machines with 15-inch displays and over a megabyte of memory), and computer-aided software engineering (CASE) tools were going to provide a huge productivity boost.

PCTE is a specification for a tool interface, i.e., an interface whereby competing CASE tools could provide data interoperability. The promise of CASE tools never materialized, and they faded away, removing the need for an interface standard.

CASE tools and PCTE are from an era where lots of managers still thought that factory production methods could be applied to software development.

PCTE was a European-funded project coordinated by a (then) mainframe manufacturer. Big is beautiful, and specifications with clout are ISO standards (ECMA was used to fast track the document).

At the time Ada was the language that everybody was going to be writing in the future; so, of course, there is an Ada binding (there is also a C one, cannot ignore reality too much).

Why is there still an ISO standard for PCTE? All standards are reviewed every five years; countries have to vote to keep them, or not, or abstain. How has this standard managed to ‘live’ so long?

One explanation is that by being dead on arrival, PCTE never got the chance to annoy anybody, and nobody got to know anything about it. Standards committees tend to be content to leave things as they are; it would be impolite to vote to remove a document from the list of approved standards without knowing anything about the subject area covered.

The members of IST/5, the British Standards committee responsible (yes, it falls within programming languages), know they know nothing about PCTE (and that its usage is likely to be rare to non-existent), so could vote ABSTAIN. However, some member countries of SC22 might vote YES, because while they know they know nothing about PCTE, they probably know nothing about most of the documents, and a YES vote does not require any explanation (no, I am not suggesting some countries have joined SC22 to create a reason for flunkies to spend government money on international travel).

Prior to the Internet, ISO standards were only available in printed form. National standards bodies were required to hold printed copies of ISO standards, ready for when an order arrived. When a standard that had zero sales in the last five years came up for review, a pleasant person might show up at the IST/5 meeting (or have a quiet word with the chairman beforehand); did we really want to vote to keep this document as a standard? Just think of the shelf space (I never heard them mention the dead trees). Now they have pdfs occupying rotating rust.

June 06, 2019

Indrek Lasn (indreklasn)

Absolutely. I recommend doing the dependency updates monthly. June 06, 2019 09:59 AM

Absolutely. I recommend doing the dependency updates monthly.

June 05, 2019

Jeff Carpenter (jeffcarp)

Setting Up a Recruiter Auto-reply Bot June 05, 2019 08:16 PM

If you’re a software engineer, you’re likely familiar with unsolicited emails from recruiters. Most are probably template emails. Some of them are funny, some are thoughtful, and some of them ask you to move 3000 miles, take a 50% pay cut, and code in a language you don’t know. Recruiter emails have a measurable impact on productivity. If I were to hand-write a response to each one (taking 2 minutes), and I got 1 recruiter email a day, that’s 12 hours of work, or more than one full work day each year… gone.

June 04, 2019

Jan van den Berg (j11g)

The Fall (De Val) – Matthias M.R. Declercq June 04, 2019 06:07 PM

Matthias M.R. Declercq pulled off two remarkable things. Not only did he manage to find this extraordinary story about friendship, ambition and sacrifice, he was also able to write it down in exceptional fashion.

De Val – Matthias M.R. Declercq (2017) – 296 pages

The events described in ‘The Fall’ (‘De Val’) are real, but the book is not necessarily a biography. The story revolves around a group of five Belgian riders (flandriens) who are pretty well known in the cycling circuit. Some are even minor celebrities. Their lives and events — and especially the fall — are pretty well known and in some cases were front page news. As a writer you could easily overlook these stories because they were already so heavily documented.

But Declercq shows to have a keen eye for the story behind a story, and he was able to look past known facts and look for a deeper, collective connection between these five riders. And from their humble shared beginnings, Declercq takes the reader on a journey for each individual rider.

He does so with finesse. There is a dignified distance in his writing style (like a reporter) and this strikes the right tone of being an interested witness rather than a thrill seeker (the latter being the fate of many sport books of recent years). By doing so we get to hear the human perspective behind the stories. All these riders have lives, parents, wives, children and they sacrifice a lot. Which might be easy to forget when watching the Giro.

The fact that the often dramatic and heroic sport of cycling is a central subject of course helps the book, but it is mostly Declercq’s writing that makes this book stand out. I love cycling and I love good books by good writers. This book has both.

The post The Fall (De Val) – Matthias M.R. Declercq appeared first on Jan van den Berg.

Indrek Lasn (indreklasn)

Here are 3 super simple developer tips that will supercharge your project June 04, 2019 01:37 PM

GitHub repository code quality checks and automatic deployment

Saving small amounts of time here and there leads up to saving a big chunk of time. The more time you can save with automation, the more you can focus on other areas of your project and getting more done. Here are some of the most time-saving tips and tools I use at my current company.

Continuous Integration

Code checks with Github and CircleCi

Every time you push code to the repository, there should be automatic checks that run all unit tests, check ESLint rules, and build the project to ensure that a new deployment of the application will succeed.

Let’s say you work in a small team of 3 developers. Each time someone commits new code to the repository, you would have to run all code quality checks manually. Who in their right mind does that, or has time for it? That’s why we have continuous integration.
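As a rough illustration, a CircleCI configuration covering those checks might look like the sketch below (the Node image and the npm script names are assumptions about the project, not something from the post):

```yaml
# .circleci/config.yml
version: 2.1
jobs:
  build:
    docker:
      - image: cimg/node:lts
    steps:
      - checkout
      - run: npm ci            # install exact locked dependencies
      - run: npm run lint      # check ESLint rules
      - run: npm test          # run all unit tests
      - run: npm run build     # make sure the app still builds
workflows:
  check:
    jobs:
      - build
```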

Check out this tutorial on how I set up CI for my projects.

How to setup continuous integration (CI) with React, CircleCI, and GitHub

Automated dependency updates

Automated dependency updates at www.getnewly.com

Have you ever had the chance to update dependencies for an application that’s running in production?

It’s a nightmare: some packages have peer dependencies, some have bugs, some are not compatible. The worst-case scenario when updating packages is that the application won’t work anymore.

Automated dependency updates to the rescue!

Instead of updating once a month and hoping everything will continue to work, wake up to new pull requests for daily version updates. If a package updates, the bot will create a PR with the new version, changelogs, release notes, and package commits. Very useful, if I may say so.

I use dependabot (https://dependabot.com/)
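For reference, a minimal configuration in Dependabot's GitHub-native dependabot.yml format (which postdates the hosted service linked above; the npm ecosystem and daily schedule here are assumptions) might look like:

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "daily"
```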

Code formatter (Prettier)

To semicolon or not to semicolon? Why fight over petty things when you can save energy and focus on your next unicorn idea?

I fell in love with Prettier ever since I started using it. Prettier combined with ESLint and Stylelint will make your developer soul happy.

I found it most convenient to set up Prettier with ESLint using Wes Bos’s approach. Here’s the full tutorial on how to do that.
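As an illustration, a minimal .prettierrc that settles the semicolon debate up front might look like this (the specific option values are just examples, not a recommendation from the post):

```json
{
  "semi": true,
  "singleQuote": true,
  "trailingComma": "es5"
}
```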


Thanks for reading! ❤

Check out my Twitter if you have any questions or know how to improve the developer environment even further.

Indrek Lasn (@lasnindrek) | Twitter

Don’t forget to follow Newly publication for more awesome stuff!

Here are some of my previous articles you might enjoy:


Here are 3 super simple developer tips that will supercharge your project was originally published in Newly on Medium.

June 03, 2019

Jeremy Morgan (JeremyMorgan)

Thinking About Reusable Code June 03, 2019 06:09 AM

The mythical "reusable code" idea has existed for decades. It showed up shortly after the first lines of code were written. We preach re-usability and sometimes strive for it but it rarely becomes a reality. I've seen various levels of success with this over the years. Everything from "we have a reusable library that 75% of us use" to "we have shared code libraries here, but never use them in your projects".

A recent discussion led me to think about this. Why don't more software development organizations have shared code libraries? Is it the pointy-haired bosses preventing it? Team conflicts? Or is it the developers themselves?

We can't blame the pointy-haired bosses for this one. If you explain the basic concepts of reusable code to management, most would agree it's a great idea. Building something once, so you don't have to build it repeatedly? Easy to find the upside here.

Team conflicts can also contribute, usually in the form of people disagreeing about who gets to determine what code is shared. Developers themselves can also be opposed to it, due to not having enough time to build the libraries.

All of these are contributing factors to the lack of adoption, but the question you should ask is: do we need reusable code libraries at all?


What Are Shared Libraries and Why Do We Care?

If the tasks your developers are building contain code you can use for something else, you put that code in its own "library" to use later. This can be a DLL, a folder of snippets, a Node module, whatever. Connecting to a database? There's no reason to write that code for every piece of software that accesses a database. Create a DB_Connect class, and put that in a file you can copy to another project later.

It's easy. You take a function and, if it's abstract enough, parameterize it and make it available for other projects to use. When you start your project, you don't have to write code to connect to the database; you pull in the library and enter your parameters. Here are some upsides:

  • You write code once and use it multiple times (saving cycle times)
  • If tested thoroughly, it cuts regression errors
  • It enforces standards other groups can follow

These are all great reasons to use shared libraries. Countless articles and books have been written about code reuse, and most of you are familiar with them. The biggest selling point for this is not having to code "boring stuff" over and over, and not having wild variations of the same methods in the wild. This frees up time to work on exciting features.
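A minimal sketch of the DB_Connect idea described above, using Python's built-in sqlite3 module as a stand-in for whatever database the real shared library would wrap (the class name follows the example in the text; everything else is an assumption):

```python
# Reusable database-connection helper: the boilerplate lives in one
# shared class instead of being rewritten in every project.
import sqlite3

class DBConnect:
    """Configure once, reuse everywhere; parameterized per project."""

    def __init__(self, dsn: str):
        self.dsn = dsn  # each project passes its own database target

    def __enter__(self):
        self.conn = sqlite3.connect(self.dsn)
        return self.conn

    def __exit__(self, exc_type, exc, tb):
        # Commit on success, roll back on error, always close.
        if exc_type is None:
            self.conn.commit()
        else:
            self.conn.rollback()
        self.conn.close()

# Any project now opens a connection without rewriting the plumbing.
with DBConnect(":memory:") as conn:
    conn.execute("CREATE TABLE licenses (product TEXT, key TEXT)")
```

The context-manager shape means callers cannot forget to commit or close, which is exactly the kind of regression a well-tested shared library prevents.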


How Are They Used?

Here are some ways shared libraries are used in business:

  • Internal use - code that is shared with internal groups and used for projects
  • External use - code that is designed for programmers outside the organization to use
  • Forked - users can take your initial code and "fork" it to make it work for their specific needs

This falls in line with the DRY principle of software development. Don't repeat yourself.
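
As a tiny illustration of DRY (the helper and its parameters are hypothetical): instead of every report re-implementing the same formatting, factor it into one shared function that all callers reuse.

```python
def format_currency(amount, symbol="$"):
    """Shared helper: one place to change formatting for every caller."""
    return f"{symbol}{amount:,.2f}"

# Every report reuses the same helper instead of repeating the logic:
invoice_line = format_currency(1234.5)
payroll_line = format_currency(99.9)
print(invoice_line, payroll_line)  # $1,234.50 $99.90
```

If the formatting rules change, you change one function, not every project.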


Do We Want Shared Code Libraries?

Why isn't everyone doing this? Your organization may avoid shared libraries for a good reason. Not every project or team benefits from this and it's not the magic bullet to solve all development problems. Here are some reasons not to use code libraries.

  • Distributed Monolith Problem - If you put something into a library and everyone uses it, that doesn't mean it isn't bad. An inefficient database connector design means everyone is using an inefficient database connector. Changes to your implementation can still have negative cascading effects. A small change to an API could mean every single system is now broken.

  • Dependency Hell - If every piece of software relies on your library, you've created tight coupling. Myriad dependencies pile up, for instance when everything relies on a framework like Angular. If you have software built to use a specific version, upgrading it can set off a chain reaction across your applications.

  • Inflexible Design - Sometimes you're asked to use a design in a certain way that's different from what you actually need. Because "everyone else is doing it" you're forced to make your code work around it, which can then increase development time. Wait, why did we build shared libraries in the first place?

  • Programming Language Lock-In - Too many times I've heard "well our library is written in ____ so you have to use that," which locks you into a single programming language. This can lead to using the wrong tool for the job. So don't assume that because a team doesn't use shared libraries, they're unenlightened and don't want better productivity.


What to Examine if You Want to Use Reusable Code Libraries

If you're considering creating reusable code libraries, you should first see if it makes sense to do so.

  • Take a hard look at the design - Then look at it again. Will you really benefit from it? Do you have repeated code now? Are there any issues that may encourage tight coupling? 
  • Closely examine your dependencies - Are you creating dependencies others will have to work around? How often do they change? Can they bypass the ones you create? 
  • Start as abstract as possible - You want your libraries to be highly abstract, so that they perform the expected functions but can be overridden or extended to meet specialized use cases. Give teams the basics and let them evolve it how they'd like.
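
The "abstract but extensible" point above can be sketched like this (class names are hypothetical): the library ships the expected default behaviour, and a team overrides only the part it needs.

```python
class Logger:
    """Shared library class: sensible defaults, open for extension."""

    def format(self, message):
        # Default behaviour; subclasses may override this hook
        return f"[log] {message}"

    def log(self, message):
        return self.format(message)

class TeamLogger(Logger):
    """A team specializes only the formatting, reusing everything else."""

    def format(self, message):
        return f"[team-a] {message}"

print(Logger().log("hi"))      # [log] hi
print(TeamLogger().log("hi"))  # [team-a] hi
```

The base class keeps working for everyone, while specialized needs don't require changing the shared code.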

How to Get Developers to Implement Them

So you've decided you want a shared library. Here are some ways you can help teams adopt your library:

  • Steal it from an existing project - This is the easiest way to start a new library. Take an existing project and remove everything that's specialized and keep everything that can be re-used and feels solid.

  • Thoroughly test it - The last thing you want to do is introduce problems into the software. After building your abstract library, test it thoroughly so you can rely on it, and help developers find regressions. Don't hand them garbage and expect them to use it.

  • Examine all current use cases - By digging in and understanding how people will use your library, you can determine whether it will benefit them and whether to make changes. Do this on a method-by-method basis to ensure you're asking them to use something that will work for them with minimum effort. Like most anything in business, success comes from great communication. Top-down design is not your friend here; examine the needs of the group and design a library from those needs.


Conclusion

Most of the organizations I've been involved with do not use a shared code library. Most are in some stage of working toward one; I've only seen two where shared libraries were implemented and working well.

At one end of the spectrum you have a pattern where all repeated code is in a library. At the other you have no code library, and everyone builds software for their own projects. You'll find success somewhere between the two; how far you lean toward each extreme will depend on your needs.

There are abstract functions that should almost always be shared: authentication/authorization, database connections, and logging come to mind. These should come either from a shared library you developed or from an open source one built by someone else. Everything else should be examined in light of the overall design.

Don't jump to conclusions and take on a large project to build a giant code library that makes developers' lives worse. But don't exclude the idea altogether. Be purposeful and observant to find which strategy works best for your projects.




Andreas Zwinkau (qznc)

Thinking in Systems by Donella Meadows June 03, 2019 12:00 AM

Book review: A shallow introduction to Systems Thinking.

Read full article!

June 02, 2019

Gustaf Erikson (gerikson)

Gokberk Yaltirakli (gkbrk)

Gopher Server in Rust June 02, 2019 06:06 PM

I find Gopher really cool. I think it’s a really nice way to organize information into trees and hierarchies, and as we all know programmers can’t resist trees. So recently I took an interest in Gopher and started writing my own server.

Gopher, like HTTP, is a network protocol for retrieving information over the internet. One crucial difference is, it hasn’t been commercialized by adtech companies. This is probably because it doesn’t provide many opportunities for tracking, and it doesn’t have a significantly large user base.

But recently it’s been gaining traction; so we should provide a decent landscape for new gophers, full of oxidised servers. Since I started using Gopher more often, it’s beneficial for me if there’s more content out there. So I’m writing this blog post to walk you through how to write your own server. We’ll be doing this in Rust.

Before we jump into the details of the protocol, let’s set up a server that responds with “Hello world”. This will provide a skeleton that we can fill with Gopher-related code later.

Handling connections

use std::io::{self, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn handle_client(mut stream: TcpStream) -> io::Result<()> {
    write!(stream, "Hello world!")?;
    Ok(())
}

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:70")?;

    for stream in listener.incoming() {
        thread::spawn(move || handle_client(stream?));
    }

    Ok(())
}

In this skeleton, pretty much all of our code is going to be in the handle_client function. If we look at the RFC for Gopher; we can see that after establishing a connection, the client sends the selector for the resource they are interested in. Like /ProgrammingLanguages/Python. Let’s read one line from the socket and look at which selector they want.

Gopher protocol

use std::io::{BufRead, BufReader};

let mut line = String::new();
BufReader::new(stream.try_clone()?).read_line(&mut line)?;
let line = line.trim();

At this point, a traditional server would check the filesystem for the selector and a fancy web framework would go through the registered routes and possibly check some regular expressions. But for our toy server, a simple match statement will be more than enough.

let mut menu = GopherMenu::with_write(&stream);

match line {
    "/" | "" => {
        menu.info("Amazing home page of amazingness")?;
    }
    _ => {
        menu.info("Page not found")?;
    }
}
menu.end()?;

In the code above, GopherMenu comes from a crate called gophermap. It’s a crate that can parse and generate Gopher menus.
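
Under the hood, a Gopher menu entry (per RFC 1436) is a single line: an item-type character, the display string, then tab-separated selector, host, and port, terminated by CRLF. A quick sketch of that wire format, independent of the Rust crate:

```python
def menu_entry(item_type, display, selector, host, port):
    """Build one Gopher menu line as laid out in RFC 1436."""
    # item type and display string, then TAB-separated selector/host/port
    return f"{item_type}{display}\t{selector}\t{host}\t{port}\r\n"

# "1" is the item type for a directory/menu link:
line = menu_entry("1", "Page One", "/page1", "gkbrk.com", 70)
print(repr(line))
```

This is what GopherMenu writes to the socket for each entry, so you can debug a server with nothing more than netcat.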

Relative links

For relative links, we need to know the server address. Let’s put that in a constant and write a small helper.

const HOST: &str = "gkbrk.com";

let menu_link = |text: &str, selector: &str| {
    menu.write_entry(ItemType::Directory, text, selector, HOST, 70)
};

match line {
    "/" | "" => {
        menu.info("Hi!")?;
        menu.info("Try going to page 1")?;
        menu_link("Page One", "/page1")?;
        menu_link("Go to unknown link", "/test")?;
    }
    "/page1" => {
        menu.info("Yay! You found the secret page")?;
        menu_link("Home page", "/")?;
    }
    x => {
        menu.info(&format!("Unknown link: {}", x))?;
        menu_link("Home page", "/")?;
    }
};
menu.end()?;

Now we can link between our pages and start building proper pages. Hopefully this was a good start. If anyone reading this sets up their own Gopherspace, please let me know by leaving a comment or sending me an email.

Ponylang (SeanTAllen)

Last Week in Pony - June 2, 2019 June 02, 2019 11:35 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

June 01, 2019

Bit Cannon (wezm)

A Tiling Desktop Environment June 01, 2019 11:48 PM

I’ve been thinking about graphical shells recently. One of the great things about open source desktops1 is there is a plethora of choice when it comes to graphical shells. However they seem to fall into two camps:

  1. Full featured desktop environments that stick to the conventional stacking window metaphor.
  2. Narrowly featured window manager based environments that include tools like tiling window managers often optimised for efficient keyboard use.

I am currently using the second of these through the Awesome window manager. I’m really enjoying the keyboard centric operation, and almost never needing to manually position newly spawned windows. Each workspace (desktop) gets its own layout2, which describes how windows are laid out. For example, my most commonly used layout has one master window that takes up half the screen, then additional windows are stacked on the right half. The split between the two halves of the screen is easily adjusted with Super+h and Super+l. Layouts can be changed on the fly with Super+space to suit your work.

Another aspect I enjoy about Awesome is its snappiness. This is largely due to the lack of animation. Switching workspaces is instant, without any unnecessary flourishes. It seems that the animations used in many graphical shells these days tend to reduce the perceived performance of the system. Look how fast this iPhone is when animations are disabled.

The drawback to window manager based environments is that you give up the cohesive, full featured nature of a desktop environment. For example, these are features of GNOME that I had to research, install, and configure after moving to Awesome (items in italic are ones I haven’t actually taken the time to implement yet):

  • Compositor
  • Volume and brightness status and control
  • Network status and control
  • SSH agent, GPG agent, polkit agent
  • Screenshot tools
  • Media controls
  • Notifications
  • HiDPI support
    • Cursor sizing
  • Automounting of external drives
  • Automatic multi-monitor support
    • Desktop gracefully adapting to monitors being added and removed
  • Screen locking
  • Power management
    • Battery status
    • Low power warnings
  • Clipboard preservation
    • I.e. clipboard source can exit and you can still paste
  • Color management

Even with many of these implemented the components don’t always work as nicely as in GNOME. For example, my XPS 15 has a built-in 4K display and I connect it to an external 4K display at work. When dunst shows a notification on the built-in display the text is sized wrong, when it shows on the external display it is correct, even though the displays are identical resolution.

On the flip side Awesome has these things in its favour:

  • Lower resource usage (mostly RAM)
  • Alternate window management layouts:
    • Stacking
    • Tiling
    • Floating
    • Maximised
    • Full screen
    • And more
  • Keyboard oriented
  • Keyboard bindings completely customisable
  • Better use of screen space
    • I have no title bars on windows
    • Top bar is very short and can be instantly toggled with Super+b

So all this makes me wonder, where is the middle ground? Where is the desktop environment for professionals?

Pro Desktop

Mac OS is a popular choice for developers in some circles and has the cohesive full-featured experience that I mentioned above. I conducted an informal survey on Twitter to try to see what things Mac users are adding to the system to make it work better for them:

Hey Mac power users! I’m doing some research: What tools do you install to make the UI/graphical shell work better for you? Things that come to mind are: Alfred, Divvy, Spectacle, FastScripts, LaunchBar, chunkwm, that kind of thing.

The responses almost all included one or more of these elements:

  • Window management (often via the keyboard)
  • Keyboard remapping
  • Automation
  • Application launching
  • System stats

Some open source desktops have all these features but the ones that have them all seem to lack the polish and consistency of Mac OS, GNOME, or KDE and require a large investment in researching, installing, and configuring each desired feature. The ones that have the polish and consistency lack the customisation and keyboard control.

So where does that leave me? I want a desktop environment like GNOME but with more control over window management and more keyboard control.

Perhaps there is room for something that takes the place of gnome-shell in the typical GNOME desktop but built for this use case. gnome-shell is built on mutter and there are other desktop shells built on this too such as gala, and Budgie, so perhaps it would be possible to use mutter as the base window manager and compositor and build upon it.

I’ve been considering starting such a project but before diving in decided to write this post and do some more research to help clarify my thoughts. Something for me to ponder. 🤔

Comments


Why don’t you just…

Inevitably some folks will be thinking, “why don’t you just…”. Below are a few of these that I’ve thought of already. I may add more as time goes on.

Use KDE

One possible option is using KDE with an alternate window manager, although this does prevent you from using Wayland. I am a fan of Wayland but not yet a user. I believe it is the future of the graphics stack on open source desktops and I think its architecture makes sense given the way computers are used today.

My problem with KDE is the aesthetic. KDE and Qt really don’t seem to align with me. That’s not to say they’re bad or even ugly, it’s just not to my liking. I suppose as an ex-Mac user I feel more at home with GNOME/GTK. On the other hand it seems like someone familiar with Windows would feel more at home with KDE/Qt.

Things like menus attached to windows, icons on buttons, icons on menu items, Application launcher menu (“Start” button), bottom task bar, and apply buttons in configuration dialogs all feel very foreign to my Mac using past. Sure some of these may be configurable but I’m not sure I’d ever feel at home.

KDE neon: Icons on buttons, Apply button for configuration, task bar, Start-esque menu.

KDE neon: Icons in menus.

For comparison here is GNOME showing the same things. I prefer that it is less busy and, to be honest, more like Mac OS in some ways.

GNOME: No icons in menus, no task bar.

GNOME: No Apply button in settings.

Use Xfce

It is possible to use an alternate window manager with Xfce. However, while Xfce has made recent progress on HiDPI support it’s still a mishmash of blurry icons, and tiny controls in places.

Xubuntu 19.10 with 2x scaling: Blurry icons, tiny controls.

Use the gTile GNOME extension

gTile is more of a manual window resizer. It allows you to position windows on a grid but it doesn’t appear to have anything approaching Awesome’s layouts.

Stick with Awesome

It’s true that Awesome is working for me but it does feel a bit like I’m back in the dark ages needing to find and configure things that I’ve previously taken for granted. It is nice to build your own environment like this but the little imperfections like the dunst notifications mentioned above, or handling of external displays have me wanting more.


  1. I’m referring to these as open source desktops and not Linux desktops since they work on other systems too, like BSDs, and OpenIndiana. [return]
  2. My first Awesome includes a little information in the layouts. [return]

Pierre Chapuis (catwell)

Truncating an Alembic migrations history June 01, 2019 07:00 PM

In projects that use SQLAlchemy and Alembic via Flask-Migrate, you may want to truncate the migrations history. By that I mean: rewrite all the migrations up to some point as a single initial migration, to avoid replaying them every single time you create a new database instance. Of course, you only want to do that if you have already migrated all your database instances at least up to that point.

As far as I know, there is no Alembic feature to do this automatically. However, I found a way to avoid having to write the migration by hand. Here is an example of how you can achieve this with a project using Git, PostgreSQL, and environment variables for configuration.

First, checkout a commit of your project where the first migration you want to keep is the current migration, and create a temporary branch. Then, take a note of the ID of that migration (for instance abcd12345678), delete the whole migrations directory and reinitialize Alembic.

git checkout $my_commit
git checkout -b tmp-alembic
rm -rf migrations
flask db init

At this point, use Git to revert the changes to files where you should keep your customizations, such as script.py.mako and env.py. Then, create a temporary empty database to work with.

git checkout migrations/script.py.mako
git checkout migrations/env.py
createdb -T template0 my-empty-db

Now create the initial migration that corresponds to your model, with the ID that you noted previously, e.g.:

MY_DATABASE_URI="postgresql://postgres@localhost/my-empty-db" \
    flask db migrate --rev-id abcd12345678

Finally, you can delete the temporary database, commit your changes to your temporary branch, merge it into your main development branch and delete it:

dropdb my-empty-db
git commit
git checkout dev
git merge tmp-alembic
git branch -D tmp-alembic

Ponylang (SeanTAllen)

0.28.1 Released June 01, 2019 01:01 PM

Pony 0.28.1 is here! It includes a couple high-priority bug fixes. Updating as soon as possible is recommended. Please note, the Homebrew PR hasn’t yet been merged so you can’t update using Homebrew until that is done. All other supported platforms are good to go!

May 31, 2019

Jan van den Berg (j11g)

Why We Sleep – Matthew Walker May 31, 2019 07:07 PM

Why We Sleep by Matthew Walker is one of the most profound books I have ever read. It has directly impacted my attitude towards sleep and subsequently altered my behaviour. Books that change your behaviour are rare and this is one of them. You should read it.

Why We Sleep – Matthew Walker (2017) – 368 pages

We all know that sleep is important. But Walker dissects study, after study, after study to describe how important sleep exactly is, and what the devastating effects of too little sleep are. Walker presents what we know about sleep — which is still a large research area — and he comes to quite sobering, startling and stunning conclusions about the importance of sleep.

The shock-and-awe approach could leave you with a sense of defeat about how we approach sleep-related problems. Because, individually and as a society, we handle them very poorly. But Walker's optimism towards the end regarding solutions does provide a little bit of comfort.

Modern man has dug quite a hole for himself, with blue LED lights, caffeine, alcohol, iPads and online distractions, which all disrupt our (sleep) lives more than we can even begin to imagine. But Walker's solution is not to shy away from technology, but rather to embrace and expand it explicitly towards better sleep. So fortunately there is at least also some direction in this book. We badly need it.

Please read this book.

The post Why We Sleep – Matthew Walker appeared first on Jan van den Berg.

May 30, 2019

Derek Jones (derek-jones)

Cognitive capitalism chapter reworked May 30, 2019 01:22 AM

The Cognitive capitalism chapter of my evidence-based software engineering book took longer than expected to polish; in fact it got reworked, rather than polished (which still needs to happen, and there might be more text moving from other chapters).

Changing the chapter title, from Economics to Cognitive capitalism, helped clarify lots of decisions about the subject matter it ought to contain (the growth in chapter page count is more down to material moving from other chapters, than lots of new words from me).

I over-spent time down some interesting rabbit holes (e.g., real options), before realising that no public data was available, and unlikely to be available any time soon. Without data, there is not a lot that can be said in a data driven book.

Social learning is a criminally under researched topic in software engineering. Some very interesting work has been done by biologists (e.g., Joseph Henrich, and Kevin Laland), in the last 15 years; the field has taken off. There is a huge amount of social learning going on in software engineering, and virtually nobody is investigating it.

As always, if you know of any interesting software engineering data, please let me know.

Next, the Ecosystems chapter.

May 29, 2019

Jan van den Berg (j11g)

Humor schept evenwicht (Humor creates balance) – Jaap Bakker May 29, 2019 07:50 PM

Jaap Bakker, a local storyteller from a small rural town in the Netherlands (Urk), has written down anecdotes and jokes from the last hundred years or so. Either things he experienced first hand or that were told to him. So expect hundreds of fun little stories. Stories anyone can identify with, about human interaction and small town life, that make you smile, laugh or even burst out.

Humor Schept Evenwicht – Jaap Bakker (2005) – 87 pages

Needless to say, I am biased about this book, since all the stories are rooted in my hometown and therefore very relatable. But I do fear some stories would need added context to make sense to outsiders, so it could have used some outside editing to make it more coherent. Nonetheless, I thought it was a delightful read.

The post Humor schept evenwicht (Humor creates balance) – Jaap Bakker appeared first on Jan van den Berg.

Gokberk Yaltirakli (gkbrk)

Writing a Simple IPFS Crawler May 29, 2019 04:20 PM

IPFS is a peer-to-peer protocol that allows you to access and publish content in a decentralized fashion. It uses hashes to refer to files. Short of someone posting hashes on a website, discoverability of content is pretty low. In this article, we’re going to write a very simple crawler for IPFS.

It’s challenging to have a traditional search engine in IPFS because content rarely links to each other. But there is another way than just blindly following links like a traditional crawler.

Enter DHT

In IPFS, the content for a given hash is found using a Distributed Hash Table. This means our IPFS daemon receives requests about the location of IPFS objects. When all the peers do this, a key-value store is distributed among them; hence the name Distributed Hash Table. Even though we won't see all the queries, we will still see a fraction of them. We can use these to discover when people put files on IPFS and announce them on the DHT.

Fortunately, IPFS lets us see those DHT queries from the log API. For our crawler, we will use the Rust programming language and the ipfsapi crate for communicating with IPFS. You can add ipfsapi = "0.2" to your Cargo.toml file to get the dependency.

Using IPFS from Rust

Let’s test if our IPFS daemon and the IPFS crate are working by trying to fetch and print a file.

let api = IpfsApi::new("127.0.0.1", 5001);

let bytes = api.cat("QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u")?;
let data = String::from_utf8(bytes.collect())?;

println!("{}", data);

This code should grab the contents of the hash, and if everything is working print “Hello World”.

Getting the logs

Now that we can download files from IPFS, it’s time to get all the logged events from the daemon. To do this, we can use the log_tail method to get an iterator of all the events. Let’s get everything we get from the logs and print it to the console.

for line in api.log_tail()? {
    println!("{}", line);
}

This gets us all the log lines, but we are only interested in DHT events, so let's filter a little. A DHT announcement looks like this in the JSON logs.

{
  "duration": 235926,
  "event": "handleAddProvider",
  "key": "QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u",
  "peer": "QmeqzaUKvym9p8nGXYipk6JpafqqQAnw1ZQ4xBoXWcCrLb",
  "session": "ffffffff-ffff-ffff-ffff-ffffffffffff",
  "system": "dht",
  "time": "2018-03-12T00:32:51.007121297Z"
}

We are interested in all the log entries with the event handleAddProvider. And the hash of the IPFS object is key. We can filter the iterator like this.

let logs = api.log_tail()
        .unwrap()
        .filter(|x| x["event"].as_str() == Some("handleAddProvider"))
        .filter(|x| x["key"].is_string());

for log in logs {
    let hash = log["key"].as_str().unwrap().to_string();
    println!("{}", hash);
}

Grabbing the valid images

As a final step, we're going to save all the valid image files that we come across, using the image crate. Basically, for each object we find, we're going to try parsing it as an image file. If that succeeds, we likely have a valid image that we can save.

Let’s write a function that loads an image from IPFS, parses it with the image crate and saves it to the images/ folder.

fn check_image(hash: &str) -> Result<(), Error> {
    let api = IpfsApi::new("127.0.0.1", 5001);

    let data: Vec<u8> = api.cat(hash)?.collect();
    let img = image::load_from_memory(data.as_slice())?;

    println!("[!!!] Found image on hash {}", hash);

    let path = format!("images/{}.jpg", hash);
    let mut file = File::create(path)?;
    img.save(&mut file, image::JPEG)?;

    Ok(())
}

And then connect it to our main loop. We're checking each image in a separate thread because IPFS can take a long time to resolve a hash or time out.

for log in logs {
    let hash = log["key"].as_str().unwrap().to_string();
    println!("{}", hash);

    thread::spawn(move|| check_image(&hash));
}

Possible improvements / future work

  • File size limits: Checking the size of objects before downloading them
  • More file types: Saving more file types. Determining the types using a utility like file.
  • Parsing HTML: When the object is valid HTML, parse it and index the text in order to provide search

Evolving Neural Net classifiers May 29, 2019 04:20 PM

As a research interest, I play with evolutionary algorithms a lot. Recently I’ve been messing around with Neural Nets that are evolved rather than trained with backpropagation.

Because this is a blog post, and to further demonstrate that literally anything can result in evolution, I’m going to be using a hill climbing algorithm. Here’s the gist of it.

  1. Initially, we will start with a Neural Network with random weights.
  2. We’re going to clone the network, pick a weight and change it to a random number.
  3. Evaluate the old network and the new network and get their scores
  4. If the new network has done better or the same as the old one, replace the old network with it
  5. Repeat until the results are satisfactory
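
The steps above can be sketched as a generic hill climber, independent of neural networks (the quadratic objective here is a stand-in for the network's score, and all names are illustrative):

```python
import random

def hill_climb(score, candidate, mutate, iterations=1000):
    """Keep a mutated candidate whenever it scores at least as well (step 4)."""
    best = candidate
    for _ in range(iterations):
        new = mutate(best)          # step 2: clone and perturb
        if score(new) >= score(best):  # steps 3-4: compare and maybe replace
            best = new
    return best

random.seed(0)
# Toy objective: maximise -(x - 3)^2, whose optimum is x = 3.
result = hill_climb(
    score=lambda x: -(x - 3) ** 2,
    candidate=0.0,
    mutate=lambda x: x + random.uniform(-0.5, 0.5),
)
print(round(result, 2))
```

For the neural net, the candidate is the weight vector and the mutation changes one weight at random, but the loop is the same.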

The algorithm

The algorithm is shown below. All it does is split the given data into training and test parts, randomly change the neural network weights until the score improves, and then use the test data to determine how good we did.

def train_and_test(X, y, nn_size, iterations=1000, test_size=None, stratify=None):
    random.seed(445)
    np.random.seed(445)
    net = NeuralNetwork(nn_size)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=stratify
    )

    score = 0
    for i in range(iterations):
        score = net.get_score(X_train, y_train)

        new = net.clone()
        new.mutate()
        new_score = new.get_score(X_train, y_train)

        if new_score >= score:
            net = new
            score = new_score

    print(f"Training set: {len(X_train)} elements. Error: {score}")

    score = net.get_classify_score(X_test, y_test)

    print(f"Test set: {score} / {len(X_test)}. Score: {score / len(X_test) * 100}%")

Iris flower dataset

If you are learning about classifiers, the Iris flower dataset is probably the first thing you're going to test. It is basically the “Hello World” of classification.

The dataset includes petal and sepal size measurements from 3 different Iris species. The goal is to get measurements and classify which species they are from.

You can find more information on the dataset here.

data = pandas.read_csv("IRIS.csv").values

name_to_output = {
    "Iris-setosa": [1, 0, 0],
    "Iris-versicolor": [0, 1, 0],
    "Iris-virginica": [0, 0, 1],
}

rows = data.shape[0]
data_input = data[:, 0:4].reshape((rows, 4, 1)).astype(float)
data_output = np.array(list(map(lambda x: name_to_output[x], data[:, 4]))).reshape(
    (rows, 3)
)

train_and_test(data_input, data_output, (4, 4, 3), 10000, 0.2)
Training set: 120 elements. Error: -5.697678436657024
Test set: 29 / 30. Score: 96.66666666666667%

96% accuracy isn't bad for such a simple algorithm. But that accuracy comes from training on 120 samples and testing on 30. Let's see if it's good at generalization by turning our train/test split into 0.03/0.97.

As you can see below; just by training on 4 samples, our network is able to classify the rest of the data with a 94% accuracy.

train_and_test(data_input, data_output, (4, 4, 3), 10000, 0.97)
Training set: 4 elements. Error: -0.8103166051741318
Test set: 138 / 146. Score: 94.52054794520548%

Cancer diagnosis dataset

This dataset includes some measurements about tumors, and classifies them as either Benign (B) or Malignant (M).

You can find the dataset and more information about it here.

data = pandas.read_csv("breast_cancer.csv").values[1:]

rows = data.shape[0]

name_to_output = {"B": [1, 0], "M": [0, 1]}

data_input = data[:, 2:32].reshape((rows, 30, 1)).astype(float) / 100
data_output = np.array(list(map(lambda x: name_to_output[x], data[:, 1]))).reshape(
    (rows, 2)
)

train_and_test(data_input, data_output, (30, 30, 15, 2), 10000, 0.3)
Training set: 397 elements. Error: -5.626705318006574
Test set: 159 / 171. Score: 92.98245614035088%

To see if the network is able to generalize, let’s train it on 11 samples and test it on 557. You can see below that it has an 86% accuracy after seeing a tiny amount of samples.

train_and_test(data_input, data_output, (30, 30, 15, 2), 10000, 0.98)
Training set: 11 elements. Error: -0.2742514647152907
Test set: 481 / 557. Score: 86.35547576301616%

Glass classification dataset

This dataset has some material measurements, like how much of each element was found in a piece of glass. Using these measurements, the goal is to classify which of the 8 glass types it was from.

This dataset doesn’t separate cleanly, and there aren’t a lot of samples you get. So I cranked up the iteration number and added more hidden layers. Deep learning baby!

You can find more information on the dataset here.

data = pandas.read_csv("glass.csv").values[1:]

rows = data.shape[0]
data_input = data[:, :-1].reshape((rows, 9, 1)).astype(float)
data_output = np.array(list(map(lambda x: np.eye(8)[int(x)], data[:, -1]))).reshape((rows, 8))

train_and_test(data_input, data_output, (9, 9, 9, 9, 8), 20000, 0.3, stratify=data_output)
Training set: 149 elements. Error: -8.261249669954738
Test set: 47 / 64. Score: 73.4375%

After I saw this result, I wasn't super thrilled. But after going through the other solutions on Kaggle and looking at their results, I found that this wasn't bad compared to other classifiers.

But where’s the Neural Network code?

Here it is. While it’s a large chunk of code, I find that this is the least interesting part of the project. This is basically a bunch of matrices getting multiplied and mutated randomly. You can find a bunch of tutorials/examples of this on the internet.

import numpy as np
import random
import pandas
from sklearn.model_selection import train_test_split

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes
        weight_shapes = [(a, b) for a, b in zip(layer_sizes[1:], layer_sizes[:-1])]
        self.weights = [
            np.random.standard_normal(s) / s[1] ** 0.5 for s in weight_shapes
        ]
        self.biases = [np.random.rand(s, 1) for s in layer_sizes[1:]]

    def predict(self, a):
        for w, b in zip(self.weights, self.biases):
            a = self.activation(np.matmul(w, a) + b)
        return a

    def get_classify_score(self, images, labels):
        predictions = self.predict(images)
        num_correct = sum(
            [np.argmax(a) == np.argmax(b) for a, b in zip(predictions, labels)]
        )
        return num_correct

    def get_score(self, images, labels):
        predictions = self.predict(images)
        predictions = predictions.reshape(predictions.shape[0:2])
        return -np.sum(np.abs(np.linalg.norm(predictions-labels)))

    def clone(self):
        nn = NeuralNetwork(self.layer_sizes)
        nn.weights = np.copy(self.weights)
        nn.biases = np.copy(self.biases)
        return nn

    def mutate(self):
        for _ in range(self.weighted_random([(20, 1), (3, 2), (2, 3), (1, 4)])):
            l = self.weighted_random([(l.flatten().shape[0], i) for i, l in enumerate(self.weights)])
            shape = self.weights[l].shape
            layer = self.weights[l].flatten()
            layer[np.random.randint(0, layer.shape[0])] = np.random.uniform(-2, 2)  # randint's upper bound is exclusive
            self.weights[l] = layer.reshape(shape)

            if np.random.uniform() < 0.01:
                b = self.weighted_random([(b.flatten().shape[0], i) for i, b in enumerate(self.biases)])
                shape = self.biases[b].shape
                bias = self.biases[b].flatten()
                bias[np.random.randint(0, bias.shape[0])] = np.random.uniform(-2, 2)  # randint's upper bound is exclusive
                self.biases[b] = bias.reshape(shape)

    @staticmethod
    def activation(x):
        return 1 / (1 + np.exp(-x))

    @staticmethod
    def weighted_random(pairs):
        total = sum(pair[0] for pair in pairs)
        r = np.random.randint(1, total + 1)  # upper bound is exclusive; include `total`
        for (weight, value) in pairs:
            r -= weight
            if r <= 0: return value
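The train_and_test helper used throughout the post isn't shown. A minimal sketch of how it might work, assuming the same hill-climbing idea (clone the best network, mutate the clone, keep it only if it scores better): the real helper presumably builds a NeuralNetwork from the layer sizes, so the signature here is simplified, and a hypothetical one-parameter ToyModel with the same clone/mutate/get_score interface stands in so the loop can run on its own.

```python
import random

class ToyModel:
    """Hypothetical stand-in with the same interface as NeuralNetwork."""
    def __init__(self):
        self.w = 0.0

    def clone(self):
        m = ToyModel()
        m.w = self.w
        return m

    def mutate(self):
        # Random tweak, analogous to NeuralNetwork.mutate
        self.w += random.uniform(-0.5, 0.5)

    def get_score(self, inputs, outputs):
        # Negative squared error; higher is better, like in the post.
        return -sum((self.w * x - y) ** 2 for x, y in zip(inputs, outputs))

    def get_classify_score(self, inputs, outputs):
        return sum(1 for x, y in zip(inputs, outputs) if abs(self.w * x - y) < 1)

def train_and_test(model, data_input, data_output, iterations, test_size):
    # Split into train/test, then hill-climb: keep a mutation only if it
    # improves the training score.
    split = int(len(data_input) * (1 - test_size))
    train_in, test_in = data_input[:split], data_input[split:]
    train_out, test_out = data_output[:split], data_output[split:]

    best = model
    for _ in range(iterations):
        candidate = best.clone()
        candidate.mutate()
        if candidate.get_score(train_in, train_out) > best.get_score(train_in, train_out):
            best = candidate

    correct = best.get_classify_score(test_in, test_out)
    print(f'Training set: {len(train_in)} elements. '
          f'Error: {best.get_score(train_in, train_out)}')
    print(f'Test set: {correct} / {len(test_in)}. '
          f'Score: {100 * correct / len(test_in)}%')
    return best
```

The loop never backtracks: a mutation that doesn't beat the current best is simply thrown away, which is what makes this a (1+1)-style evolutionary strategy rather than gradient descent.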

Phone Location Logger May 29, 2019 04:20 PM

If you are using Google Play Services on your Android phone, Google receives and keeps track of your location history. This includes your GPS coordinates and timestamps. Because of the privacy implications, I have revoked pretty much all permissions from Google Play Services and disabled my Location History on my Google settings (as if they would respect that).

But while it might be creepy for a random company to have this data, it would be useful if I still had it. After all, who doesn't want to know the location of a park they stumbled upon randomly on a vacation 3 years ago?

I remember seeing some location trackers while browsing through F-Droid. I found various applications there, and picked one that was recently updated. The app was a Nextcloud companion app, with support for custom servers. Since I didn’t want a heavy Nextcloud install just to keep track of my location, I decided to go with the custom server approach.

In the end, I decided that the easiest path is to make a small CGI script in Python that appends JSON encoded lines to a text file. Because of this accessible data format, I can process this file in pretty much every programming language, import it to whatever database I want and query it in whatever way I see fit.

The app I went with is called PhoneTrack. You can find the APK and source code links on F-Droid. It replaces the placeholders in the URL you give it, so a URL that logs every parameter looks like this: https://example.com/cgi-bin/locationrecorder.py?acc=%ACC&alt=%ALT&batt=%BATT&dir=%DIR&lat=%LAT&lon=%LON&sat=%SAT&spd=%SPD&timestamp=%TIMESTAMP

Here's the script in all its glory.

import cgi
import json

PATH = '/home/databases/location.txt'

print('Content-Type: text/plain\n')
form = cgi.FieldStorage()

# Check authentication token
if form.getvalue('token') != 'SECRET_VALUE':
    raise Exception('Nope')

obj = {
    'accuracy':   form.getvalue('acc'),
    'altitude':   form.getvalue('alt'),
    'battery':    form.getvalue('batt'),
    'bearing':    form.getvalue('dir'),
    'latitude':   form.getvalue('lat'),
    'longitude':  form.getvalue('lon'),
    'satellites': form.getvalue('sat'),
    'speed':      form.getvalue('spd'),
    'timestamp':  form.getvalue('timestamp'),
}

with open(PATH, 'a+') as log:
    line = json.dumps(obj)
    log.write(f'{line}\n')
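Since the log is just one JSON object per line, reading it back is a few lines of Python. A quick sketch (the field names match the script above; the temporary file is only there to make the demo self-contained):

```python
import json
import tempfile

def read_locations(path):
    # Each line of the log file is one JSON-encoded location fix.
    with open(path) as log:
        return [json.loads(line) for line in log if line.strip()]

# Demo: write two fake entries to a temporary file, then read them back.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('{"latitude": "53.38", "longitude": "-1.47"}\n')
    f.write('{"latitude": "53.39", "longitude": "-1.46"}\n')
    demo_path = f.name

points = read_locations(demo_path)
```

From here it's trivial to feed the list into a spreadsheet, a database, or a map renderer.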

Reverse Engineering the Godot File Format May 29, 2019 04:20 PM

I’ve been messing around with the Godot game engine recently. After writing some examples that load the assets and map data from files, I exported it and noticed that Godot bundled all the resources into a single .pck file. It was packing all the game resources and providing them during runtime as some sort of virtual file system.

Of course, after I was finished with learning gamedev for the day, I was curious about that file. I decided to give myself a small challenge and parse it using only hexdump and Python. I opened the pack file with my hex editor and was met with a mixture of binary and ASCII data. Here's the beginning of the file from hexdump -C game.pck.

00000000  47 44 50 43 01 00 00 00  03 00 00 00 01 00 00 00  |GDPC............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000050  00 00 00 00 1e 00 00 00  40 00 00 00 72 65 73 3a  |........@...res:|
00000060  2f 2f 2e 69 6d 70 6f 72  74 2f 61 6d 67 31 5f 66  |//.import/amg1_f|
00000070  72 31 2e 70 6e 67 2d 32  32 66 30 32 33 32 34 65  |r1.png-22f02324e|
00000080  63 62 39 34 32 34 39 37  62 61 61 31 38 30 37 33  |cb942497baa18073|
00000090  37 36 62 37 31 64 30 2e  73 74 65 78 00 3d 00 00  |76b71d0.stex.=..|
000000a0  00 00 00 00 28 04 00 00  00 00 00 00 de a0 23 30  |....(.........#0|
000000b0  cf 7c 59 5c fb 73 5d f6  a7 f8 12 a7 40 00 00 00  |.|Y\.s].....@...|
000000c0  72 65 73 3a 2f 2f 2e 69  6d 70 6f 72 74 2f 61 6d  |res://.import/am|
000000d0  67 31 5f 6c 66 31 2e 70  6e 67 2d 62 33 35 35 35  |g1_lf1.png-b3555|
000000e0  34 66 64 31 39 36 37 64  65 31 65 62 32 63 64 31  |4fd1967de1eb2cd1|
000000f0  32 33 32 65 32 31 38 33  33 30 32 2e 73 74 65 78  |232e2183302.stex|

Immediately we can see that the file starts with some magic bytes, and before our ASCII filename begins there are a lot of zeros. They might be keeping that space as padding or for future extensions, or maybe it can even contain data depending on your settings; but for our immediate goal it doesn't matter.

What looks interesting here are the two integers right before our path, 1e 00 00 00 and 40 00 00 00. They come right before real data, so they are probably lengths or counts. Assuming they are little-endian unsigned integers is a safe bet, because otherwise they would be huge numbers that have no business being the length of anything.

The first number is 30 and the second one is 64. Now, what’s the name of that file? res://.import/amg1_fr1.png-22f02324ecb942497baa1807376b71d0.stex. Exactly 64 bytes. That means we now know that the paths are prefixed by their length.

If we look at the next path, we can see that a similar pattern of being length prefixed still applies. The first integer we found, 30, is most likely the number of files we have. And a rough eyeballing of the file contents reveals that to be the case.

Let’s get a little Python here and try to read the file with our knowledge so far. We’ll read the first integer and loop through all the files, trying to print their names.

import struct

pack = open('game.pck', 'rb')
pack.read(0x54) # Skip the empty padding

file_count = struct.unpack('<I', pack.read(4))[0]

name_len = struct.unpack('<I', pack.read(4))[0]
name = pack.read(name_len).decode('utf-8')

print(f'The first file is {name}')

Running this code produces the following output, success!

The first file is res://.import/amg1_fr1.png-22f02324ecb942497baa1807376b71d0.stex

Now let's try to loop file_count times and see if our results are good. One thing we should notice is the data following the ASCII text, before the next name begins. If we miss that, we will read the rest of the data wrong and end up with garbage. Let's go back to our hexdump and count how many bytes we need to skip until the next length. Looks like we have 32 extra bytes. Let's account for those and print everything.

for i in range(file_count):
  name_len = struct.unpack('<I', pack.read(4))[0]
  name = pack.read(name_len)
  pack.read(32)
  print(name)
...
b'res://MapLoader.gd.remap'
b'res://MapLoader.gdc\x00'
b'res://MapLoader.tscn'
b'res://Player.gd.remap\x00\x00\x00'
b'res://Player.gdc'
...

Much better. The only issue is the trailing null bytes on some file names. This shouldn't be a huge problem though; it's probably padding, and it doesn't even matter if we consider the strings null-terminated. Let's just get rid of the trailing null bytes.

name = pack.read(name_len).rstrip(b'\x00').decode('utf-8')

After this change, we can get a list of all the resource files in a Godot pack file.

Getting the file contents

Sure, getting the list of files contained in the pack is useful. But our tool isn't much use if it can't get the file contents as well.

The file contents are stored separately from the file names. The thing we parsed so far is only the file index, like a table of contents. It’s useful to have it that way so when the game needs a resource at the end of the file, the game engine won’t have to scan the whole file to get there.

But how can we find where the contents are without going through the whole thing? With offsets, of course. Every entry we read from the index contains, along with the file name, the offset and the size of the file. It's not the easiest thing to explain how you discover something after the fact, but it's a combination of being familiar with other file formats and a bunch of guesswork. Now I'd like to direct your attention to the 32 bytes we skipped earlier.

Since we already have the file names, we can make assumptions like text files being smaller than other resources like images and sprites. This can be made even easier by putting files with known lengths there, but just for the sake of a challenge let’s pretend that we can’t create these files.

After each guess, we can easily verify it by checking with hexdump or plugging the new logic into our Python script.

The 4 byte integer that follows the file name is our offset. If we go to the beginning of the file and count that many bytes, we should end up where our file contents begin.

offset = struct.unpack('<I', pack.read(4))[0]

This is followed by 4 empty bytes and then another integer, which is our size. Those 4 bytes might be used for other purposes, but again they are irrelevant for our goal.

pack.read(4)
size = struct.unpack('<I', pack.read(4))[0]

The offset is where the file contents begin, and the size is how many bytes they take up. So the range between offset and offset + size is the file contents. And because we ended up reading 12 more bytes from each entry, we should make our padding 20 instead of 32.

Reading a File

To finish up, let’s read the map file I put in my game. It’s a plaintext file with JSON contents, so it should be easy to see if everything looks complete.

import struct

pack = open('game.pck', 'rb')
pack.read(0x54) # Skip the empty padding

file_count = struct.unpack('<I', pack.read(4))[0]

for i in range(file_count):
  name_len = struct.unpack('<I', pack.read(4))[0]
  name = pack.read(name_len).rstrip(b'\x00').decode('utf-8')

  offset = struct.unpack('<I', pack.read(4))[0]
  pack.read(4)
  size = struct.unpack('<I', pack.read(4))[0]
  pack.read(20)

  print(name)

  if name == 'res://maps/map01.tres':
    pack.seek(offset)
    content = pack.read(size).decode('utf-8')
    print(content)
    break
...
res://Player.gd.remap
res://Player.gdc
res://Player.tscn
res://block.gd.remap
res://block.gdc
res://block.tscn
res://default_env.tres
res://icon.png
res://icon.png.import
res://maps/map01.tres
{
    "spawn_point": {"x": 5, "y": 3},
    "blocks": [
        {"x": 5, "y": 5, "msg": "Welcome to Mr Jumpy Man", "jumpLimit": 0},
        {"x": 7, "y": 5, "msg": "We're still waiting on the trademark"},
        {"x": 9, "y": 5, "texture": "platformIndustrial_001.png"},
        {"x": 11, "y": 6, "msg": "If you fall down,\nyou will be teleported to the start point"},
...

Everything looks good! Using this code, we should be able to extract the resources of Godot games.
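One way to sanity-check the layout we inferred, without needing a real game, is to build a tiny fake .pck in memory with that same structure (magic bytes, zero padding up to 0x54, a file count, length-prefixed names, then the offset/pad/size/pad blocks) and run the same parsing logic over it. This is only a sketch of our reverse-engineered subset, not Godot's full format (real files null-pad names and store an MD5 in those trailing 16 bytes):

```python
import io
import struct

def build_fake_pck(files):
    # files: list of (name, content) pairs.
    header = io.BytesIO()
    header.write(b'GDPC' + b'\x00' * (0x54 - 4))   # magic + padding up to 0x54
    header.write(struct.pack('<I', len(files)))     # file count

    names = [name.encode() for name, _ in files]
    # Each index entry: 4 (name length) + name + 32 (offset/pad/size/pad).
    index_size = sum(4 + len(n) + 32 for n in names)
    data_offset = 0x54 + 4 + index_size

    body = io.BytesIO()
    for n, (_, content) in zip(names, files):
        header.write(struct.pack('<I', len(n)))
        header.write(n)
        header.write(struct.pack('<I', data_offset + body.tell()))  # offset
        header.write(b'\x00' * 4)                                   # unknown
        header.write(struct.pack('<I', len(content)))               # size
        header.write(b'\x00' * 20)                                  # md5 etc.
        body.write(content)
    return header.getvalue() + body.getvalue()

def read_pck(data):
    # Same parsing logic as the script above, over an in-memory buffer.
    pack = io.BytesIO(data)
    pack.read(0x54)
    file_count = struct.unpack('<I', pack.read(4))[0]
    files = {}
    for _ in range(file_count):
        name_len = struct.unpack('<I', pack.read(4))[0]
        name = pack.read(name_len).rstrip(b'\x00').decode('utf-8')
        offset = struct.unpack('<I', pack.read(4))[0]
        pack.read(4)
        size = struct.unpack('<I', pack.read(4))[0]
        pack.read(20)
        here = pack.tell()
        pack.seek(offset)
        files[name] = pack.read(size)
        pack.seek(here)
    return files
```

If the round trip gives back the files we put in, our offsets and paddings line up.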

Finishing words

Now, before the angry comments section gets all disappointed, let me explain. I am fully aware that Godot is open source. Yes, I could've looked at the code to learn exactly how it works. No, that wouldn't have been as fun. Kthxbye.

Mastodon Bot in Common Lisp May 29, 2019 04:20 PM

If you post a programming article to Hacker News, Reddit or Lobsters, you will notice that soon after it gets to the front page, it gets posted to Twitter automatically.

But why settle for Twitter when you can have this on Mastodon? In this article we will write a Mastodon bot that regularly checks the Lobste.rs front page and posts new links to Mastodon.

Since this is a Mastodon bot, let’s start by sending a post to our followers.

Sending a Mastodon post

To interact with Mastodon, we are going to use a library called Tooter. To get the API keys you need, just log in to Mastodon and go to Settings > Development > New Application. Once you create an application, the page will show all the API keys you need.

(defun get-mastodon-client ()
  (make-instance 'tooter:client
                 :base "https://botsin.space"
                 :name "lobsterbot"
                 :key "Your client key"
                 :secret "Your client secret"
                 :access-token "Your access token")
  )

This function will create a Mastodon client whenever you call it. Now, let’s send our first message.

(tooter:make-status (get-mastodon-client) "Hello world!")

Now that we can send messages, the next step in our project is to fetch the RSS feed.

Fetching the RSS feed

Fetching resources over HTTP is really straightforward with Common Lisp, the drakma library provides an easy-to-use function called http-request. In order to get the contents of my blog, all you need to do is

(drakma:http-request "https://gkbrk.com")

So let’s write a function that takes a feed URL and returns the RSS items.

There is one case we need to handle here. When you fetch text/html, drakma handles the decoding for you; but it doesn't do this when we fetch application/rss, and instead returns a byte array.

(defvar *feed-url* "https://lobste.rs/rss")

(defun get-rss-feed ()
  "Gets rss feed of Lobste.rs"
  (let* ((xml-text (babel:octets-to-string (drakma:http-request *feed-url*)))
         (xml-tree (plump:parse xml-text)))
    (plump:get-elements-by-tag-name xml-tree "item")
    ))

This function fetches an RSS feed, parses the XML and returns the <item> tags in it. In our case, these tags contain each post on Lobste.rs.

Creating structs for the links

A struct in Common Lisp is similar to a struct in C and other languages. It is one object that stores multiple fields.

(defstruct lobsters-post
  title
  url
  guid
  )

Getting and setting fields of a struct can be done like this.

; Pretend that we have a post called p
(setf (lobsters-post-title p) "An interesting article") ; Set the title
(print (lobsters-post-title p))                         ; Print the title

Let’s map the RSS tags to our struct fields.

(defun find-first-element (tag node)
  "Search the XML node for the given tag name and return the text of the first one"
  (plump:render-text (car (plump:get-elements-by-tag-name node tag)))
  )

(defun parse-rss-item (item)
  "Parse an RSS item into a lobsters-post"
  (let ((post (make-lobsters-post)))
    (setf (lobsters-post-title post) (find-first-element "title" item))
    (setf (lobsters-post-url post) (find-first-element "link" item))
    (setf (lobsters-post-guid post) (find-first-element "guid" item))
    post
    ))

Now, we can make the previous get-rss-feed function return lobsters-post structs instead of raw XML nodes.

(defun get-rss-feed ()
  "Gets rss feed of Lobste.rs"
  (let* ((xml-text (babel:octets-to-string (drakma:http-request *feed-url*)))
         ; Tell the parser that we want XML tags instead of HTML
         ; This is needed because <link> is a self-closing tag in HTML
         (plump:*tag-dispatchers* plump:*xml-tags*)
         (xml-tree (plump:parse xml-text))
         (items (plump:get-elements-by-tag-name xml-tree "item"))
         )
    (reverse (map 'list #'parse-rss-item items))
    ))

Posting the first link to Mastodon

(defun share-post (item)
  "Takes a lobsters-post and posts it on Mastodon"
  (tooter:make-status (get-mastodon-client) (format nil "~a - ~a ~a"
                                                    (lobsters-post-title item)
                                                    (lobsters-post-guid item)
                                                    (lobsters-post-url item)))
  )

(share-post (car (get-rss-feed)))

Keeping track of shared posts

We don’t want our bot to keep posting the same links. One solution to this is to keep all the links we already posted in a file called links.txt.

Every time we come across a link, we will record it to our "database". This basically appends the link followed by a newline to the file. Not very fancy, but certainly enough for our purposes.

(defun record-link-seen (item)
  "Writes a link to the links file to keep track of it"
  (with-open-file (stream "links.txt"
                          :direction :output
                          :if-exists :append
                          :if-does-not-exist :create)
    (format stream "~a~%" (lobsters-post-guid item)))
  )

In order to filter our links before posting, we will go through each line in that file and check if our link is in there.

(defun is-link-seen (item)
  "Returns if we have processed a link before"
  (with-open-file (stream "links.txt"
                          :if-does-not-exist :create)
    (loop for line = (read-line stream nil)
       while line
       when (string= line (lobsters-post-guid item)) return t))
  )

Now let’s wrap this all up by creating a task that

  • Fetches the RSS feed
  • Gets the top 10 posts
  • Filters out the links that we shared before
  • Posts them to Mastodon
(defun run-mastodon-bot ()
  (let* ((first-ten (subseq (get-rss-feed) 0 10))
         (new-links (remove-if #'is-link-seen first-ten))
         )
    (loop for item in new-links do
         (share-post item)
         (record-link-seen item))
    ))

How you schedule this to run regularly is up to you. Set up a cron job, make a timer or just run it manually all the time.

You can find the full code here.

Fetching ActivityPub Feeds May 29, 2019 04:20 PM

Mastodon is a federated social network that uses the ActivityPub protocol to connect separate communities into one large network. Both Mastodon and the ActivityPub protocol are increasing in usage every day. Compared to formats like RSS, which are pull-based, ActivityPub is push-based. This means rather than your followers downloading your feed regularly to check if you have shared anything, you send each follower (or each server as an optimization) the content you shared.

While this decreases latency in your followers receiving your updates, it does complicate the implementation of readers. But fortunately, it is still possible to pull the feed of ActivityPub users. Just like the good old days.

In this article, we're going to start from a handle like leo@niu.moe and end up with a feed of my latest posts.

WebFinger

First of all, let’s look at how the fediverse knows how to find the ActivityPub endpoint for a given handle. The way this is done is quite similar to email.

To find the domain name, let’s split the handle into the username and domain parts.

handle           = 'leo@niu.moe'
username, domain = handle.split('@')

Next, we need to make a request to the domain’s webfinger endpoint in order to find more data about the account. This is done by performing a GET request to /.well-known/webfinger.

wf_url = 'https://{}/.well-known/webfinger'.format(domain)
wf_par = {'resource': 'acct:{}'.format(handle)}
wf_hdr = {'Accept': 'application/jrd+json'}

# Perform the request
wf_resp = requests.get(wf_url, headers=wf_hdr, params=wf_par).json()

Now we have our WebFinger response. We can filter this data to find the correct ActivityPub endpoint. We need to do this because WebFinger can return a variety of URLs, not just ActivityPub ones.

Filtering the endpoints

The response we get from WebFinger looks like this.

{
  "subject": "acct:leo@niu.moe",
  "aliases": [
    "https://niu.moe/@leo",
    "https://niu.moe/users/leo"
  ],
  "links": [
    {
      "rel": "http://webfinger.net/rel/profile-page",
      "type": "text/html",
      "href": "https://niu.moe/@leo"
    },
    {
      "rel": "http://schemas.google.com/g/2010#updates-from",
      "type": "application/atom+xml",
      "href": "https://niu.moe/users/leo.atom"
    },
    {
      "rel": "self",
      "type": "application/activity+json",
      "href": "https://niu.moe/users/leo"
    }
  ]
}

Depending on the server, there might be more or fewer entries in the links key. What we are interested in is the URL with the type application/activity+json. Let's go through the array and find the URL we're looking for.

matching = (link['href'] for link in wf_resp['links'] if link['type'] == 'application/activity+json')
user_url = next(matching, None)

Fetching the feed link

We can fetch our feed URL using requests like before. One detail to note here is the content type that we need to specify in order to get the data in the format we want.

as_header = {'Accept': 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"'}
user_json = requests.get(user_url, headers=as_header).json()

user_json is a dictionary that contains information about the user. This information includes the username, profile summary, profile picture and other URLs related to the user. One such URL is the “Outbox”, which is basically a feed of whatever that user shares publicly.

This is the final URL we need to follow, and we will have the user feed.

feed_url  = user_json['outbox']

In ActivityPub, the feed is an OrderedCollection, and those can be paginated. The first page can be empty or have all the content, or there can be one event per page; this is completely up to the implementation. In order to handle this transparently, let's write a generator that fetches the next pages as they are requested.

def parse_feed(url):
    feed = requests.get(url, headers=as_header).json()

    if 'orderedItems' in feed:
        for item in feed['orderedItems']:
            yield item

    next_url = None
    if 'first' in feed:
        next_url = feed['first']
    elif 'next' in feed:
        next_url = feed['next']

    if next_url:
        for item in parse_feed(next_url):
            yield item

Now, for the purposes of a blog post and for writing simple feed parsers, this code works with most servers. But it is not a fully spec-compliant function for grabbing all the pages of content. Technically, next and first can be embedded pages of events instead of links, but I haven't come across that in the wild. It is probably a good idea to write your code to cover more edge cases when dealing with servers on the internet.
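To exercise the paging logic without hitting a live server, the HTTP call can be injected as a parameter. This hedged variant also accepts first/next given as embedded page objects rather than URLs, which the spec allows; in real use, fetch would be something like lambda url: requests.get(url, headers=as_header).json(), and the example.com pages below are entirely made up.

```python
def parse_feed(page, fetch):
    # `page` is either a URL or an already-embedded collection/page object;
    # `fetch` turns a URL into parsed JSON.
    if isinstance(page, str):
        page = fetch(page)

    for item in page.get('orderedItems', []):
        yield item

    # `first`/`next` may each be a URL or an embedded page object.
    next_page = page.get('first') or page.get('next')
    if next_page:
        yield from parse_feed(next_page, fetch)

# Offline demo with a fake server: two pages of one item each.
fake_server = {
    'https://example.com/outbox': {'first': 'https://example.com/outbox?page=1'},
    'https://example.com/outbox?page=1': {
        'orderedItems': [{'id': 1}],
        'next': {'orderedItems': [{'id': 2}]},   # embedded page, no URL
    },
}
items = list(parse_feed('https://example.com/outbox', fake_server.get))
```

Injecting the fetcher also makes it easy to add the caching or rate limiting you would want against real servers.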

Printing the first 10 posts

The posts in ActivityPub contain HTML and while this is okay for web browsers, we should strip the HTML tags before printing them to the terminal.

Here’s how we can do that with the BeautifulSoup and html modules.

def clean_html(s):
    text = BeautifulSoup(s, 'html.parser').get_text()
    return html.unescape(text)

i = 0
for item in parse_feed(feed_url):
    try:
        # Only new posts ("Create" activities)
        assert item['type'] == 'Create'
        content = item['object']['content']
        text = clean_html(content)

        print(text)
        i += 1
    except:
        continue

    if i == 10:
        break

Future Work

Mastodon is not the only implementation of ActivityPub, and each implementation can do things in slightly different ways. While writing code to interact with ActivityPub servers, you should always consult the specification document.


Plaintext budgeting May 29, 2019 04:20 PM

For the past ~6 months, I've been using an Android application to keep track of my daily spending. To my annoyance, I found out that the app doesn't have export functionality. I didn't want to invest more time in a platform I couldn't get my data out of, so I started looking for another solution.

I've looked into budgeting systems before, both command-line (ledger) and GUI (GNUCash). Both are great software, and I can appreciate how double-entry bookkeeping is useful for accounting purposes. But while they are powerful, they're not as simple as they could be.

I decided to go with CSV files. CSV is one of the most universal file formats, it’s simple and obvious. I can process it with pretty much every programming language and import it to pretty much every spreadsheet software. Or… I could use a shell script to run calculations with SQLite.

If I ever want to migrate to another system, it will probably be possible to convert this file with a shell script, or even a single sed command.

I create monthly CSV files in order to keep everything nice and tidy, but the script adapts to everything from a single CSV file to one file for each day/hour/minute.

Here’s what an example file looks like:

Date,Amount,Currency,Category,Description
2019-04-02,5.45,EUR,Food,Centra
2019-04-03,2.75,EUR,Transport,Bus to work
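To back up the claim that this format is processable from pretty much any language, here is a small sketch that sums spending per category using only Python's standard library (the sample rows are the same ones shown above):

```python
import csv
import io
from collections import defaultdict

SAMPLE = """Date,Amount,Currency,Category,Description
2019-04-02,5.45,EUR,Food,Centra
2019-04-03,2.75,EUR,Transport,Bus to work
"""

def totals_by_category(csv_text):
    # Sum the Amount column per Category.
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row['Category']] += float(row['Amount'])
    return dict(totals)
```

The same few lines would work unchanged on a whole directory of monthly files concatenated together.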

And here’s the script:

#!/bin/sh

days=${1:-7}

cat *.csv | sed '/^Date/d' > combined.csv.temp

output=$(sqlite3 <<EOF
create table Transactions(Date, Amount, Currency, Category, Description);
.mode csv
.import combined.csv.temp Transactions
.mode list

select 'Amount spent today:',
coalesce(sum(Amount), 0) from Transactions where Date = '$(date +%Y-%m-%d)';

select '';
select 'Last $days days average:',
sum(Amount)/$days, Currency from Transactions where Date > '$(date --date="-$days days" +%Y-%m-%d)'
group by Currency;

select '';
select 'Last $days days by category';
select '=======================';

select Category, sum(Amount) from Transactions
where Date > '$(date --date="-$days days" +%Y-%m-%d)'
group by Category order by sum(Amount) desc;
EOF
      )

rm combined.csv.temp

echo "$output" | sed 's/|/ /g'

This is the output of the command

[leo@leo-arch budget]$ ./budget.sh
Amount spent today: 8.46

Last 7 days average: 15.35 EUR

Last 7 days by category
=======================
Groceries 41.09
Transport 35.06
Food 31.35
[leo@leo-arch budget]$ ./budget.sh 5
Amount spent today: 8.46

Last 5 days average: 11.54 EUR

Last 5 days by category
=======================
Groceries 29.74
Transport 17.06
Food 10.9
[leo@leo-arch budget]$

Rendering GPS traces May 29, 2019 04:20 PM

If you ask a bunch of people to upload GPS traces as they walk or drive, and you combine those traces, you get a rudimentary map. In fact, this is one of the primary data sources of OpenStreetMap. That data is freely available, so we can use it in a small project.

To draw simple outlines, iterating over the GPS track points and putting them on an image should be enough. It will give us the main roads and the general city structure, and the result should be recognizable when compared to an actual map.

To begin, let's get the coordinates of the place we'll be mapping. In my case, this will be Sheffield. If you go to OpenStreetMap and hit Export, it will let you select an area with a bounding box and get its coordinates. We'll put those coordinates in our script.

# Area format is left, bottom, right, top
AREA = [-1.4853, 53.3730, -1.4557, 53.3893]

The other thing we should get out of the way is the output size. Let's go with a nice 720p picture.

WIDTH  = 1280
HEIGHT = 720

Getting the GPS data

OpenStreetMap provides an API that we can use in order to fetch GPS track data. It gives us the data for a given region in the XML format.

A small disclaimer about the API: it's normally meant for editing, which means you should try to keep your usage very light. While making this project and iterating on the code, I kept all my API calls in a local cache. I'll write about this in the future.

You can find the documentation for the API here. Here’s the gist of it.

bbox = ','.join(map(str, AREA))  # join() needs strings; AREA holds floats
url = 'https://api.openstreetmap.org/api/0.6/trackpoints'
xml = requests.get(url, params={'bbox': bbox}).text

This should get an XML document with the GPS trackpoints. Let’s parse it and get the latitude/longitude pairs. The coordinates are held in <trkpt> tags.

root = ET.fromstring(xml)
selector = './/{http://www.topografix.com/GPX/1/0}trkpt'

for trkpt in root.findall(selector):
  print(trkpt.attrib['lat'], trkpt.attrib['lon'])

As you can see, this is relatively straightforward. The XML selector might look weird; it just means "get all the trkpt elements that belong to that URL's namespace".
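The selector can be tried offline on a minimal inline snippet (the coordinates below are made up), which also makes the namespace behaviour easy to see:

```python
import xml.etree.ElementTree as ET

# A tiny hand-written document in the GPX 1.0 namespace.
GPX = """<gpx xmlns="http://www.topografix.com/GPX/1/0">
  <trk><trkseg>
    <trkpt lat="53.38" lon="-1.47"/>
    <trkpt lat="53.39" lon="-1.46"/>
  </trkseg></trk>
</gpx>"""

root = ET.fromstring(GPX)
# The default xmlns applies to every element, so the tag must be qualified.
selector = './/{http://www.topografix.com/GPX/1/0}trkpt'
points = [(p.attrib['lat'], p.attrib['lon']) for p in root.findall(selector)]
```

Note that the attribute values come back as strings; they need a float() conversion before any arithmetic.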

Plotting the data

Let’s start with a small example, creating an empty image and drawing a pixel in it.

img = Image.new('L', (WIDTH, HEIGHT), color='white')
img.putpixel((5, 5), 1)

This code will draw a single black pixel on an empty image. The rest should look pretty clear now: go through the lat/lon pairs and plot them as pixels. But before we get to that step, there is one more hurdle to get through, and that is mapping the GPS coordinates to image coordinates.

Mapping coordinates for our map

The problem is, we have a 1280x720 image, and we can't ask Python to put a pixel on (52.6447, -8.6337). We already know the exact area of the map we're drawing and the size of our output. What we need to do is take those two ranges and interpolate where a given coordinate falls on our image. For this, we can use the interp function from numpy.

y, x = point
x = int(interp(x, [AREA[0], AREA[2]], [0, WIDTH]))
y = int(interp(y, [AREA[1], AREA[3]], [HEIGHT, 0]))

try:
    img.putpixel((x, y), 1)
except:
    # In case math goes wrong
    pass
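What interp does here is plain linear rescaling; a pure-Python sketch of the same mapping, using the Sheffield bounding box and 1280x720 size from above, looks like this:

```python
AREA = [-1.4853, 53.3730, -1.4557, 53.3893]   # left, bottom, right, top
WIDTH, HEIGHT = 1280, 720

def to_pixel(lat, lon):
    # Rescale longitude to 0..WIDTH and latitude to HEIGHT..0
    # (image rows grow downwards while latitude grows upwards).
    x = (lon - AREA[0]) / (AREA[2] - AREA[0]) * WIDTH
    y = (1 - (lat - AREA[1]) / (AREA[3] - AREA[1])) * HEIGHT
    return int(x), int(y)
```

One difference to keep in mind: numpy's interp clamps out-of-range inputs to the edges, while this sketch happily returns coordinates outside the image, which is why the original code wraps putpixel in a try/except.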

Drawing the map

We are now able to get GPS trackpoints, and we know how to map them to image coordinates. So let's loop through everything and put the pixels on our map. Since each page has a limit of 5000 points, we should also iterate through the pages.

for page in range(15):
  for point in get_points(AREA, page):
    y, x = point
    x = int(interp(x, [AREA[0], AREA[2]], [0, WIDTH]))
    y = int(interp(y, [AREA[1], AREA[3]], [HEIGHT, 0]))
    try:
      img.putpixel((x, y), 1)
    except:
      pass

Getting points with pagination

Here’s a generator function to return OpenStreetMap trackpoints with pagination.

import xml.etree.ElementTree as ET
import requests

sess = requests.Session()

def get_points(area, page=0):
  bbox = ','.join(map(str, area))
  xml = sess.get('https://api.openstreetmap.org/api/0.6/trackpoints',
    params={'bbox': bbox, 'page': page}).text
  root = ET.fromstring(xml)

  for trkpt in root.findall('.//{http://www.topografix.com/GPX/1/0}trkpt'):
    yield float(trkpt.attrib['lat']), float(trkpt.attrib['lon'])

Results

[Images: renderings of Tokyo, Limerick, and Sheffield]

If you come up with any cool-looking renders, or better ways to plot this data, either leave a comment about it or send me an email.

Free Hotel Wifi with Python and Selenium May 29, 2019 04:20 PM

Recently I took my annual leave and decided to visit my friend during the holidays. I stayed at a hotel for a few days but to my surprise, the hotel charged money to use their wifi. In $DEITY‘s year 2000 + 18, can you imagine?

But they are not so cruel. You see, these generous people let you use the wifi for 20 minutes. 20 whole minutes. That’s almost half a Minecraft video.

If they let each device use the internet for a limited amount of time, they must have a way of identifying each device. And the way a router tells devices apart is by their MAC address. Fortunately for us, we can change our MAC address easily.

Enter macchanger

There is a really useful command-line tool called macchanger. It lets you manually change, randomize, and restore the MAC address of your devices. The idea here is to randomize our MAC regularly (every 20 minutes) in order to keep using the free wifi indefinitely.

There are 3 small commands you need to run. This is needed because macchanger can’t work while your network interface is connected to the router.

# Bring network interface down
ifconfig wlp3s0 down

# Get random MAC address
macchanger -r wlp3s0

# Bring the interface back up
ifconfig wlp3s0 up

In the commands above, wlp3s0 is the name of my network interface. You can find yours by running ip a. If you run those commands, you can fire up your browser and you will be greeted with the page asking you to pay or try it for 20 minutes. After your time is up, you can run the commands again and keep doing it.

But this is manual labor, and doing it 3 times an hour is too repetitive. Hmm. What’s a good tool to automate repetitive stuff?

Enter Selenium

First, let's get those commands out of the way. Using the os module, we can run macchanger from our script.

import os

interface = 'wlp3s0'

os.system(f'sudo ifconfig {interface} down')
os.system(f'sudo macchanger -r {interface}')
os.system(f'sudo ifconfig {interface} up')

After these commands our computer should automatically connect to the network as a completely different device. Let’s fire up a browser and try to use the internet.

from selenium import webdriver

d = webdriver.Chrome()
d.get('http://example.com')
d.get('https://www.wifiportal.example/cp/sponsored.php')

The sponsored.php URL is where I ended up after pressing the Free Wifi link, so the script should open the registration form for us. Let’s fill the form.

In my case, all it asked for was an email address and a full name. If there are more fields, you can fill them in a similar fashion.

import random

num   = random.randint(0, 99999)
email = f'test{num}@gmail.com'

d.find_element_by_name('email').send_keys(email)
d.find_element_by_name('name').send_keys('John Doe\n')

This should fill the form and press enter to submit it. Afterwards, the portal asked me if I wanted to subscribe to their emails or something like that. Of course, we click Reject without even reading it and close the browser.

d.find_elements_by_class_name('reject')[0].click()
d.close()

After this, you should have an internet connection. You can either run the script whenever you notice your connection is gone, or put it on a cron job / while loop.
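As a sketch of the while-loop variant: the interface name and the 20-minute interval are the assumptions from above, and the Selenium registration steps would go where the comment is. The renewal loop itself is commented out so the sketch is safe to run as-is.

```python
import os

INTERFACE = "wlp3s0"  # your interface name, from `ip a`

def mac_reset_commands(interface):
    # The same three macchanger steps as before, as shell command strings.
    return [
        f"sudo ifconfig {interface} down",
        f"sudo macchanger -r {interface}",
        f"sudo ifconfig {interface} up",
    ]

def renew_mac(interface, run=os.system):
    # Run each command in order; `run` is injectable for testing.
    for cmd in mac_reset_commands(interface):
        run(cmd)

# The actual loop would look like this:
# while True:
#     renew_mac(INTERFACE)
#     ...re-run the Selenium registration steps here...
#     time.sleep(20 * 60)  # wait out the free window
```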

Generating Vanity Infohashes for Torrents May 29, 2019 04:20 PM

In the world of Bittorrent, each torrent is identified by an infohash. It is basically the SHA1 hash of the torrent metadata that tells you about the files. And people, when confronted with something that’s supposed to be random, like to control it to some degree. You can see this behaviour in lots of different places online. People try to generate special Bitcoin wallets, Tor services with their nick or 4chan tripcodes that look cool. These are all done by repeatedly generating the hash until you find a result that you like. We can do the exact same thing with torrents as well.

The structure of torrent files

Before we start tweaking our infohash, let’s talk about torrent files first. A torrent file is a bencoded dictionary. It contains information about the files, their names, how large they are and hashes for each piece. This is stored in the info section of the dictionary. The rest of the dictionary includes a list of trackers, the file comment, the creation date and other optional metadata. The infohash is quite literally the SHA1 hash of the info section of the torrent. Any modification to the file contents changes the infohash, while changing the other metadata doesn’t.
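To make this concrete, here is a minimal pure-Python bencoding sketch (a simplification of what bencoding libraries do; the sample field values below are made up) showing that the infohash depends only on the info section:

```python
import hashlib

def bencode(value):
    # Minimal bencoding: integers, byte strings, lists, and dictionaries.
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings, emitted in sorted order.
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(value.items())) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")

# A toy torrent: only the info dict feeds into the infohash.
torrent = {
    b"announce": b"http://tracker.example/announce",  # metadata, not hashed
    b"info": {b"name": b"example.iso", b"piece length": 262144, b"pieces": b""},
}
infohash = hashlib.sha1(bencode(torrent[b"info"])).hexdigest()
print(infohash)
```

Changing the announce URL, comment, or creation date leaves this hash untouched; changing anything inside info does not.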

This gives us two ways of affecting the hash without touching the file contents. The first one is adding a separate key called vanity and changing its value. While this would be really flexible and cause the least user-visible change, it adds a non-standard key to our dictionary. Fortunately, torrent files are supposed to be flexible and handle unknown keys gracefully.

The other thing we can do is to add a prefix to the file name. This should keep everything intact aside from a random value in front of our filename.

Parsing the torrent file

First of all, let’s read our torrent file and parse it. For this purpose, I’m using the bencoder module.

import bencoder

target = 'arch-linux.torrent'
with open(target, 'rb') as torrent_file:
    torrent = bencoder.decode(torrent_file.read())

Calculating the infohash

The infohash is the hash of the info section of the file. Let's write a function to calculate that. We return the hash as a hex digest, which gives us the usual 40-character infohash format.

import hashlib

def get_infohash(torrent):
    encoded = bencoder.encode(torrent[b'info'])
    return hashlib.sha1(encoded).hexdigest()

Prefixing the name

Let’s do the method with prefixing the name first. We will start from 0 and keep incrementing the name prefix until the infohash starts with cafe.

original_name = torrent[b'info'][b'name'].decode('utf-8')

vanity = 0
while True:
    torrent[b'info'][b'name'] = '{}-{}'.format(vanity, original_name).encode('utf-8')
    if get_infohash(torrent).startswith('cafe'):
        print(vanity, get_infohash(torrent))
        break
    vanity += 1

This code will increment our vanity number in a loop and print it and the respective infohash when it finds a suitable one.

Adding a separate key to the info section

While the previous section works well, it still causes a change that is visible to the user. Let’s work around that by modifying the data in a bogus key called vanity.

vanity = 0
while True:
    torrent[b'info'][b'vanity'] = str(vanity).encode('utf-8')
    if get_infohash(torrent).startswith('cafe'):
        print(vanity, get_infohash(torrent))
        break
    vanity += 1

Saving the modified torrent files

While it is possible to do the modification to the file yourself, why not go all the way and save the modified torrent file as well? Let’s write a function to save a given torrent file.

def save_torrent(torrent, name):
    with open(name, 'wb+') as torrent_file:
        torrent_file.write(bencoder.encode(torrent))

You can use this function after finding an infohash that you like.

Cool ideas for infohash criteria

  • Release groups can prefix their infohashes with their name/something unique to them
  • Finding smaller infohashes - should slowly accumulate 0’s in the beginning
  • Infohashes with the least entropy - should make them easier to remember
  • Infohashes with the most digits
  • Infohashes with no digits

May 28, 2019

Jan van den Berg (j11g)

Use PostgreSQL REPLACE() to replace dots with commas (dollar to euro) May 28, 2019 08:58 AM

If you have set up your database tables correctly, you might be using double-precision floating-point numbers to store currency values. This works great because dollars use dots to represent decimals.

The problem starts when it’s not actually dollars you are storing but euros, and maybe you need to copy query output to Excel or LibreOffice Calc to work with these Euro values.

Both of these spreadsheet programs don't know how to correctly handle the dots or how to correctly import them – at least not without some tricks. There are different ways to deal with this after you have copied the data over to your spreadsheet; find and replace is a common one.

But I like to start at the source. (Yes, you can change your system locale and all that, but I would advise against that for other reasons).

So assuming this is a query you would like to run regularly, instead of running this (which will give you the dotted price):

SELECT product, 
price as price_with_dot
FROM products

You can use REPLACE() to replace the dot with a comma, after rounding the price and casting it to text. Note that ROUND() with a precision argument needs a numeric value, so the double-precision float is cast first.

SELECT product, 
REPLACE(ROUND(price::numeric, 2)::text, '.', ',') AS price_with_comma
FROM products

For good measure, I also use ROUND() to round to two decimals.

The post Use PostgreSQL REPLACE() to replace dots with commas (dollar to euro) appeared first on Jan van den Berg.

Andreas Zwinkau (qznc)

Should version control and build systems merge? May 28, 2019 12:00 AM

At scale, version control and build systems seem to merge, but there is no unified tool available yet.

Read full article!

May 27, 2019

Pete Corey (petecorey)

How I Actually Wrote My First Ebook May 27, 2019 12:00 AM

It’s been nearly three months since I released my first book, Secure Meteor. Time has flown, and I couldn’t be happier with how it’s been embraced by the Meteor community. In the early days of creating Secure Meteor (and the middle days, and the late days…), I wasn’t sure about the best way of actually writing a self-published, technical ebook.

I’m not talking about how to come up with the words and content. You’re on your own for that. I’m talking about how to get those words from my mind into a digital artifact that can be consumed by readers.

What editor do I use? Word? Emacs? Ulysses? Scrivener? Something else?

If I’m using a plain-text editor, what format do I write in? Markdown? If so, what flavor? LaTeX? If so, what distribution? HTML? Something else?

How do I turn what I’ve written into a well typeset final product? Pandoc? LaTeX? CSS? Something else?

The fact that you can purchase a copy of Secure Meteor is proof enough that I landed on answers to all of these questions. Let’s dive into the nuts and bolts of the process and workflow I came up with to create the digital artifact that is Secure Meteor!

Please note that I’m not necessarily advocating for this workflow. This process has taught me lots of lessons, and I’ll go over what I’ve come to believe towards the end of this article.

Writing in Scrivener

I’ve been a long-time user of Ulysses; I use it to write all of my online content. That said, I wasn’t sure it was up to the task of writing a several-hundred page technical book. I had heard wonderful things about Scrivener, so I decided to try it out on this project.

At its heart, Scrivener is a rich-text editor. To write Secure Meteor, I used a subset of Scrivener’s rich-text formatting tools to describe the pieces of my book. “Emphasis” and “code span” character styles were used for inline styling, and the “code block” style was used for sections of source code.

For example, this section of text in Scrivener:

Eventually looks like this in the final book:

I added a few application keyboard shortcuts to make toggling between these styles easier:

With those shortcuts I can hit ^I to switch to the inline “code span” style, ^C to switch to a “code block”, and ^N to clear the current style. Scrivener’s built-in i shortcut for “emphasis” was also very helpful.

I also added a custom “Pete’s Tips” paragraph style which is used to highlight callouts and points of emphasis throughout various chapters. In Scrivener, my tips are highlighted in yellow:

And in the final book, they’re floated left and styled for emphasis:

Organizing Content

In the early days, I was lost in the various ways of organizing a Scrivener project. Should I have one document per chapter? Should I have a folder per chapter and a document per section? Should I use the “Title”/”Header 1”/”Header 2” paragraph styles with unnamed Scrivener documents, or should I just use document names to indicate chapter/section names?

Ultimately I landed on a completely hierarchical organization scheme that doesn’t use any “Title” or “Header” paragraph styles.

Every document in the root of my Scrivener project is considered a chapter in Secure Meteor. Chapters without sub-sections are simply named documents. Chapters with sub-sections are named folders. The first document in that folder is unnamed, and any following sub-sections are named documents (or folders, if we want to go deeper).

This organization scheme worked out really well for me when it came time to lay out my final document and build my table of contents.

Scrivomatic

Unfortunately, Scrivener’s compiler support for syntax-highlighted code blocks isn’t great (read: non-existent). If I wanted my book to be styled the way I wanted, I had no choice but to do the final rendering outside of Scrivener.

I decided on using Pandoc to render my book into HTML, and found Scrivomatic to be an unbelievably useful tool for working with Pandoc within the context of a Scrivener project.

After installing Scrivomatic and its various dependencies, I added a “front matter” document to my Scrivener project:


---
title: "<$projecttitle>"
author:
  - Pete Corey
keywords: 
  - Meteor
  - Security
pandocomatic_:
  use-template:
    - secure-meteor-html
---

After adding my front matter, I added a “Scrivomatic” compile format, once again, following Scrivomatic’s instructions. It’s in this compile format that I added a prefix and suffix for “Pete’s Tips” paragraph styles that wraps each tip in a <p> tag with a tip class:

Next, I added the secure-meteor-html template referenced in my front matter to my ~/.pandoc/pandocomatic.yaml configuration file:


  secure-meteor-html:
    setup: []
    preprocessors: []
    metadata:
      notes-after-punctuation: false
    postprocessors: []
    cleanup: []
    pandoc:
      from: markdown
      to: html5
      standalone: true
      number-sections: false
      section-divs: true
      css: ./stylesheet.css
      self-contained: true
      toc: true
      toc-depth: 4
      base-header-level: 1
      template: ./custom.html
      

Note that I’m using ./custom.html and ./stylesheet.css as my HTML and CSS template files. Those will live within my Scrivener project folder (~/Secure Meteor).

Also note that I’m telling Pandoc to build a table of contents, which it happily does, thanks to the project structure we went over previously.

My custom.html is a stripped down and customized version of Scrivomatic’s default HTML template. To get the styling and structure of my title page just right, I built it out manually in the template:


$if(title)$
<header id="title-block-header">
    <div>
        <h1 class="title">Secure Meteor</h1>
        <p class="subtitle">Learn the ins and outs of securing your Meteor application from a Meteor security professional.</p>
        <p class="author">Written by Pete Corey.</p>
    </div>
</header>
$endif$

My CSS template, which you can see here, was also based on a stripped down version of Scrivomatic’s default CSS template. A few callouts to mention are that I used Typekit to pull down the font I wanted to use:


@import url("https://use.typekit.net/ssa1tke.css");

body { 
  font-family: "freight-sans-pro",sans-serif;
  ...
}

I added the styling for “Pete’s Tips” floating sections:


.tip {
    font-size: 1.6em;
    float: right;
    max-width: 66%;
    margin: 0.5em 0 0.5em 1em;
    line-height: 1.6;
    color: #ccc;
    text-align: right;
}

And I set up various page-break-* rules around the table of contents, chapters, sections, and code blocks:


#TOC {
    page-break-after: always;
}

h1 {
    page-break-before: always
}

h1,h2,h3,h4,h5,h6 {
    page-break-after: avoid;
}

.sourceCode {
    page-break-inside: avoid;
}

My goals with these rules were to always start a chapter on a new page, to avoid section headings hanging at the end of pages, and to avoid code blocks being broken in half by page breaks.

Generating a well-formatted HTML version of my book had the nice side effect of letting me easily publish sample chapters online.

HTML to PDF

Pandoc, through Scrivomatic, was doing a great job of converting my Scrivener project into an HTML document, but now I wanted to generate a PDF document as a final artifact that I could give to my customers. Pandoc’s PDF generation uses LaTeX to typeset and format documents, and after much pain and strife, I decided I didn’t want to go that route.

I wanted to turn my HTML document, which was perfectly styled, into a distributable PDF.

The first route I took was to simply open the HTML document in Chrome and “print” it to a PDF document. This worked, but I wanted an automated solution that didn’t require I remember margin settings and page sizes. I also wanted a solution that allowed me to append styled page numbers to the footer of every page in the book, aside from the title page (which was built in our HTML template, outside the context of our Scrivener project and our generated table of contents).

I landed on writing a Puppeteer script that renders the HTML version of Secure Meteor into its final PDF. There are quite a few things going on in this script. First, it renders the title page by itself into first.pdf:


await page.pdf({
  path: "first.pdf",
  pageRanges: "1",
  ...
});

Next, it saves the rest of the pages to rest.pdf, including a custom footer that renders the current page number:


await page.pdf({
  path: "rest.pdf",
  pageRanges: "2-",
  footerTemplate: "...",
  ...
});

Finally, first.pdf and rest.pdf are merged together using the pdf-merge NPM package, which uses pdftk under the hood:


await pdfMerge([`${__dirname}/first.pdf`, `${__dirname}/rest.pdf`], {
  output: `${__dirname}/out.pdf`,
  libPath: "/usr/local/bin/pdftk"
});

By rendering the title separately from the rest of the book we’re able to place page numbers on the internal pages of our book, while keeping the title page footer free. This is another reason for building the title page into our HTML template. If we built it with Scrivener, Scrivomatic would count it as a page when generating our table of contents, which we don’t want.

Fine Tuning Page Breaks and Line Wraps

Finally, I had a mostly automated process for going from a draft in Scrivener to a rendered PDF. I could compile my Scrivener project down to HTML and then run my ./puppeteer script to generate a final PDF.

After looking through this final PDF, I realized that it still needed quite a bit of work.

Some code blocks overflowed out of the page. I went through each page, looking for these offending blocks of code and manually trimmed them down to size by truncating lines cleanly at a certain character count, when appropriate, or by adding line breaks where possible.

I also noticed many unaesthetic page breaks: section headers too close to the bottom of a page, large gaps at the bottom of pages caused by subsequent large code blocks, poorly floated “Pete’s Tips”. I had no choice but to start on page one and work my way through each of these issues.

I didn’t want to change the text of the book, so my only choice was to manually modify the generated HTML and add page-break-* styles on specific elements. Eventually, I massaged the book into a form I was happy with. Unfortunately, any changes I make to the text in Scrivener will force me to redo these manual changes.

Eventually, I had my final PDF. If you’d like to see how it turned out, go grab a copy of Secure Meteor or check out a few of the sample chapters!

Final Thoughts

I’m a few months removed from this whole process, and I have far more thoughts now than I did when I first started.

Would I use this workflow to write another book? Probably not. For all of Scrivener’s power, I don’t think rich-text editing is my jam. I’m more inclined to use Ulysses, which I know and love, to write in a plain-text format. If I had to choose today, I’d write in a flavor of Markdown or begin my journey up LaTeX’s steep learning curve.

I also need to find a better renderer than a browser. There’s a whole host of CSS functionality that’s proposed or deprecated that would make rendering paged media in the browser more feasible, like CSS-only page numbers, orphans and widows, and more, but none of it works in current versions of Chrome and Firefox. Prince seems to promise some of this functionality, but its price tag is too steep for me. Then again, working directly with LaTeX seems like it would alleviate these problems altogether.

Ultimately, I wanted to document this process because figuring this stuff out was ridiculously difficult. Writing the words of the book was easy in comparison. Hopefully this will act as a guide to others to show what’s currently possible, and some potential pitfalls to avoid.

May 26, 2019

Derek Jones (derek-jones)

Evidence on the distribution and diversity of Christianity: 1900-2000 May 26, 2019 10:11 PM

I recently read an article saying that Christianity had 33,830 denominations, with 150 having more than 1 million followers. Checking the references, World Christian Encyclopedia was cited as the source; David Barrett had spent 12 years traveling the world, talking to people to collect the data. An evidence-based man, after my own heart.

Checking the second-hand book sites, I found a copy of the 1982 edition available for a few pounds, and placed an order (this edition lists 20,800 denominations; how many more are there to be ‘discovered’?).

The book that arrived was a bit larger than I had anticipated. This photograph shows just how large this book is, compared to other dead-tree data sources in my collection (on top, in red, is your regular 400 page book):

World Christian Encyclopedia.

My interest in a data-driven discussion of the spread and diversity of religions, was driven by wanting ideas for measuring the spread and diversity of programming languages. Bill Kinnersley’s language list contains information on 2,500 programming languages, and there are probably an order of magnitude more languages waiting to be written about.

The data is available to researchers, but is not public :-(

The World Christian Encyclopedia is way too detailed for my needs. I usually leave unwanted books on the book table of my local train station’s Coffee shop. I have left some unusual books there in the past, but this one feels like it needs a careful owner; I will see whether the local charity shop will take it in.

Gonçalo Valério (dethos)

Pixels Camp v3 May 26, 2019 06:24 PM

Like I did in previous years/versions, this year I participated again on Pixels.camp, a kind of conference plus hackathon. For those who aren’t aware, it is one of the biggest (if not the biggest) technology events in Portugal (from a technical perspective, not counting the Web Summit).

So, as I did in previous editions, I’m gonna leave here a small list with the nicest talks I was able to attend.

Lockpicking versus IT security

This one was super interesting. Walter Belgers showed the audience a set of problems in how locks are made, and compared those mistakes with the ones regularly made by software developers.

At least for me, the most impressive parts of the whole presentation were the demonstrations of the flaws in regular (and high-security) locks.

Talk description here.


Containers 101

“Everybody” uses containers nowadays. In this talk the speaker took a step back and went through the history and the major details behind this technology. Then he showed how you could implement a part of it yourself using common Linux features and tools.

I will add the video here, as soon as it becomes available online.

Talk description here.


Static and dynamic analysis of events for threat detection

This one was a nice overview about Siemens infrastructure for threat detection, their approaches and used tools. It was also possible to understand some of the obstacles and challenges a company must address to protect a global infrastructure.

Talk description here.


Protecting Crypto exchanges from a new wave of man-in-the-browser attacks

This presentation used the theme of protecting crypto-currency exchanges, but gave lots of good hints on how to improve the security of any website or web application. The second half of the talk covered a kind of attack called man-in-the-browser and a demonstration of it. In my opinion, this last part was weaker; I left with the impression that it lacked details about the most crucial part of the attack while spending a lot of time on less important stuff.

Talk description here.

May 25, 2019

Indrek Lasn (indreklasn)

This article is wrong, implausible and fully misinformed. May 25, 2019 10:42 AM

This article is wrong, implausible and fully misinformed. Plenty of 20-year-olds are successful CEOs. To give you a couple of examples, here are successful companies founded by CEOs in their 20s:

Michael Dell, 20's
Mark Zuckerberg, 20's
Bill Gates, 20's
Steve Jobs, 20's

Do you need more proof? And yes, I’m in my 20's.

May 24, 2019

Siddhant Goel (siddhantgoel)

Not everything needs to be async May 24, 2019 10:00 PM

Writing asynchronous code is popular these days. Look at this search trend from the last 5 years.

Async Python

I have the feeling that the number of tutorials on the internet explaining asynchronous code has increased quite a bit since Python started supporting the async/await keywords. Even though Python has always had support for running asynchronous code using the asyncore module (or using libraries like Twisted), I don't think that asyncore was used as much as the new asyncio. This is a pure gut feeling though; I have no numbers to back that claim up.

Anyway, asyncio makes it slightly easier to write asynchronous code. Slightly, because I don't know if I can call the API as intuitive, or dare I say, "Pythonic". This article does a much better job of explaining why asyncio is what it is.

Even if we put asyncio aside, I don't think asynchronous code is ever easy. There's just so much going on under the hood that it's difficult to keep your head from spinning, before you can actually get to writing the application logic.

But that's not what this blog post is about. This blog post is about how not everything needs to be async. And if some code you're working on absolutely must be async, why it makes sense to stop for a minute and consider the consequences of introducing this extra level of complexity.

This has nothing to do with Python, or asyncio, or any async framework in general. All I want to say, is, if you think you want to write asynchronous code, think twice.

Synchronous is much simpler

Synchronous code is simple to write. It's also much easier to reason about, and it's a lot less likely to contain concurrency or thread-safety bugs than asynchronous code. As programmers, our job is to solve business problems reliably in the least possible time. Synchronous code fits those criteria quite well. So if I'm given a choice between writing synchronous or asynchronous code, I can say with a reasonable amount of confidence that I'll prefer synchronous.

Would async really help?

Next, if asynchronous code is absolutely required, it makes sense to think about what it's going to do underneath, and what performance gains it's going to bring.

For instance, if you're writing a web request handler which calls out to a few external APIs and combines those responses to finally return a response to your user, yes, asynchronous code would absolutely help. The time that the external resources make your request handler wait can be used to serve other user requests.

On the other hand, if your request handler is fetching a few rows from a database server that's running on the same machine as the app server, it's not going to make that much of a difference if it were async.
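A minimal sketch of the external-API case with asyncio: the endpoint names and delays are made up, and asyncio.sleep stands in for network latency. Because the awaits overlap, the handler takes roughly as long as the slowest call, not the sum of all three.

```python
import asyncio

async def call_api(name, delay):
    # Stand-in for an external HTTP call; awaiting frees the event loop.
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def handler():
    # The three "API calls" run concurrently via gather().
    return await asyncio.gather(
        call_api("users", 0.1),
        call_api("orders", 0.1),
        call_api("billing", 0.1),
    )

print(asyncio.run(handler()))
```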

Is it safe?

Oftentimes we end up using abstractions that hide away the implementation details and provide a nice API for us to work with. In these cases, it's important to know what exactly is being hidden, and how that abstraction works underneath.

For example, Python provides an abstraction called ThreadPoolExecutor, which allows you to run functions in separate threads (there is also ProcessPoolExecutor which lets you separate things on a process-level).

The way this works is that you submit a callable to the pool, and the pool returns a Future object immediately. And when the function has finished running, the results (or the exception) would be stored in this future object.

Since there are Future objects involved (which you can await on), it can be tempting to use this abstraction to write async code. But because there are now multiple threads involved, it's not that simple anymore. The functions being submitted to the thread pool should only make use of resources that are thread-safe. If two callables are submitted to the pool, both referencing a particular object which is not thread-safe, there's potential for weird concurrency bugs.
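For illustration, here is a minimal sketch of that Future-based flow; the work function is a stand-in for any blocking call.

```python
from concurrent.futures import ThreadPoolExecutor, Future

def work(n):
    # Stand-in for a blocking task, e.g. a network request.
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    future = pool.submit(work, 7)      # returns a Future immediately
    assert isinstance(future, Future)
    print(future.result())             # blocks until work(7) finishes: 49
    results = list(pool.map(work, range(5)))
    print(results)                     # [0, 1, 4, 9, 16]
```

The thread-safety caveat applies to whatever work touches: if it mutated a shared object, two submitted calls could race.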


Closing thoughts - async is useful (and cool), but there is a time and place for everything. It may result in increased CPU utilization without necessarily bringing speed improvements, so it's helpful to keep that in mind when writing async code.

Gustaf Erikson (gerikson)

March May 24, 2019 07:53 AM

Skisser för sommaren - Bosön mars 2019

Kristallvertikalaccent i grönt - Stockholm mars 2019

Mar 2018 | Mar 2017 | Mar 2016 | Mar 2015 | Mar 2014 | Mar 2013 | Mar 2012 | Mar 2011 | Mar 2010 | Mar 2009

May 23, 2019

Joe Nelson (begriffs)

Unicode programming, with examples May 23, 2019 12:00 AM

Most programming languages evolved awkwardly during the transition from ASCII to 16-bit UCS-2 to full Unicode. They contain internationalization features that often aren’t portable or don’t suffice.

Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may look visually confusable.

Human languages are highly varied and internally inconsistent, and any application which treats strings as more than an opaque byte stream must embrace the complexity. Realistically this means using a mature third-party library.

This article illustrates text processing ideas with example programs. We’ll use the International Components for Unicode (ICU) library, which is mature, portable, and powers the international text processing behind many products and operating systems.

IBM (the maintainers of ICU) officially support a C, C++ and Java API. We’ll use the C API here for a better view into the internals. Many languages have bindings to the library, so these concepts should be applicable to your language of choice.

Concepts

Before getting into the example code, it’s important to learn the terminology. Let’s start at the most basic question.

What is a “character?”

“Character” is an overloaded term. What a native speaker of a language identifies as a letter or symbol is often stored as multiple values in the internal Unicode representation. The representation is further obscured by an additional encoding in memory, on disk, or during network transmission.

Let’s start at the abstraction closest to the user: the grapheme cluster. A “grapheme” is a graphical unit that a reader recognizes as a single element of the writing system. It’s the character as a user would understand it. For example, 山, ä and క్క are graphemes. Pieces of a single grapheme always stay together in print; breaking them apart is either nonsense or changes the meaning of the symbol. They are rendered as “glyphs,” i.e. markings on paper or screen which vary by font, style, or position in a word.

You might imagine that Unicode assigns each grapheme a unique number, but that is not true. It would be wasteful because there is a combinatorial explosion between letters and diacritical marks. For instance (o, ô, ọ, ộ) and (a, â, ạ, ậ) follow a pattern. Rather than assigning a distinct number to each, it’s more efficient to assign a number to o and a, and then to each of the combining marks. The graphemes can be built from letters and combining marks e.g. ậ = a + ◌̂ + ◌̣.

In reality Unicode takes both approaches. It assigns numbers to basic letters and combining marks, but also to some of their more common combinations. Many graphemes can thus be created in more than one way. For instance ộ can be specified in five ways:

  • A: U+006f (o) + U+0302 (◌̂) + U+0323 (◌̣)
  • B: U+006f (o) + U+0323 (◌̣) + U+0302 (◌̂)
  • C: U+00f4 (ô) + U+0323 (◌̣)
  • D: U+1ecd (ọ) + U+0302 (◌̂)
  • E: U+1ed9 (ộ)

The numbers (written U+xxxx) for each abstract character and each combining symbol are called “codepoints.” Every Unicode string is expressed as a list of codepoints. As illustrated above, multiple strings of codepoints may render into the same sequence of graphemes.

To meaningfully compare strings codepoint by codepoint for equality, both strings should be represented in a consistent way. A standardized choice of codepoint decomposition for graphemes is called a “normal form.”

One choice is to decompose a string into as many codepoints as possible like examples A and B (with a weighting factor of which combining marks should come first). That is called Normalization Form Canonical Decomposition (NFD). Another choice is to do the opposite and use the fewest codepoints possible like example E. This is called Normalization Form Canonical Composition (NFC).
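
The practical consequence is easy to demonstrate without any library at all. Below is a minimal pure-C sketch (the array names `nfc` and `nfd` are mine, purely illustrative): the same grapheme ộ, stored in the two normal forms, compares unequal byte for byte.

```c
#include <string.h>

/* "ộ" in NFC: the single codepoint U+1ED9,
 * which UTF-8 encodes in three bytes */
static const char nfc[] = "\xe1\xbb\x99";

/* "ộ" in NFD: U+006F U+0302 U+0323,
 * five UTF-8 bytes in total */
static const char nfd[] = "o\xcc\x82\xcc\xa3";

/* Both spellings render as the identical grapheme, yet
 * strcmp(nfc, nfd) != 0 and their byte lengths differ
 * (3 vs 5). Normalize to a common form before comparing. */
```

Only after normalizing both strings to the same form does a codepoint-by-codepoint comparison become meaningful.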

A core concept to remember is that, although codepoints are the building blocks of text, they don’t match up 1-1 with user-perceived characters (graphemes). Operations such as taking the length of an array of codepoints, or accessing arbitrary array positions are typically not useful for Unicode programs. Programs must also be mindful of the combining characters, like diacritical marks, when inserting or deleting codepoints. Inserting U+0061 into the asterisk position U+006f U+0302 (*) U+0323 changes the string “ộ” into “ôạ” rather than “ộa”.

Glyphs vs graphemes

It’s not just fonts that cause graphemes to be rendered into varying glyphs. The rules of some languages cause glyphs to change through contextual shaping. For instance the Arabic letter “heh” has four forms, depending on which sides are flanked by letters. When isolated it appears as ﻩ and in the final/initial/medial position in a word it appears as ﻪ/ﻫ/ﻬ respectively. Similarly, Greek displays lower-case sigma differently at the end of the word (final form) than elsewhere. Some glyphs change based on visual order. In a right-to-left language the starting parenthesis “(” mirrors to display as “)”.

Not only do individual graphemes’ glyphs vary, graphemes can combine to form single glyphs. One way is through ligatures. The latin letters “fi” often join the dot of the i with the curve of the f (presentation form U+FB01 fi). Another way is language irregularity. The Arabic ا and ل, when contiguous, must form ﻻ.

Conversely, a single grapheme can split into multiple glyphs. For instance in some Indic languages, vowels can split and surround preceding consonants. In Bengali, U+09CC ৌ surrounds U+09AE ম to become মৌ . Try placing a cursor at the end of this text box and pressing backspace:

How are codepoints encoded?

In 1990, Unicode codepoints were 16 bits wide. That choice turned out to be too small for the symbols and languages people wanted to represent, so the committee extended the standard to 21 bits. That’s fine in the abstract, but how the 21 bits are stored in memory or communicated between computers depends on practical factors.

It’s an unusual memory size. Computer hardware doesn’t typically access memory in 21-bit chunks. Networking protocols, too, are better geared toward transmitting eight bits at a time. Thus, codepoints are broken into sequences of more conventionally sized blocks called code units for persistence on disk, transmission over networks, and manipulation in memory.

The Unicode Transformation Formats (UTF) describe different ways to map between codepoints and code units. The transformation formats are named after the bit width of their code units (7, 8, 16, or 32), as well as the endianness (BE or LE). For instance: UTF-8, or UTF-16BE. In addition to the UTFs, there’s another – more complex – encoding called Punycode. It is designed to conform with the limited ASCII character subset used for Internet host names.
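
To make the mapping concrete, here is a hand-rolled sketch of the UTF-8 transformation (the function name `cp_to_utf8` is mine, not ICU’s; real code should use a library routine such as ICU’s `U8_APPEND`, which appears later in this article). The leading bits of the first byte announce the sequence length, and continuation bytes each carry six bits of the codepoint.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one codepoint as UTF-8 code units. Returns the number
 * of bytes written (1-4), or 0 for surrogates and for values
 * beyond U+10FFFF, which are not valid scalar values. */
size_t cp_to_utf8(uint32_t cp, unsigned char out[4])
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;               /* surrogate range */
	if (cp < 0x80) {                /* ASCII passes through */
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {               /* 110xxxxx 10xxxxxx */
		out[0] = 0xC0 | (cp >> 6);
		out[1] = 0x80 | (cp & 0x3F);
		return 2;
	}
	if (cp < 0x10000) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
		out[0] = 0xE0 | (cp >> 12);
		out[1] = 0x80 | ((cp >> 6) & 0x3F);
		out[2] = 0x80 | (cp & 0x3F);
		return 3;
	}
	if (cp <= 0x10FFFF) {           /* 11110xxx and three trailers */
		out[0] = 0xF0 | (cp >> 18);
		out[1] = 0x80 | ((cp >> 12) & 0x3F);
		out[2] = 0x80 | ((cp >> 6) & 0x3F);
		out[3] = 0x80 | (cp & 0x3F);
		return 4;
	}
	return 0;
}
```

For example, U+0061 stays the single byte 0x61, while U+1F41A becomes the four bytes f0 9f 90 9a.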

A final bit of terminology. A “plane” is a continuous group of 65,536 code points. There are 17 planes, identified by the numbers 0 to 16. Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes (1 through 16) are called “supplementary planes.”
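
The plane of a codepoint falls directly out of its value. Since each plane spans 0x10000 codepoints, the plane number is just the bits above the low sixteen (`plane_of` is an illustrative name, not a standard function):

```c
#include <stdint.h>

/* Plane number of a codepoint: 0 for the BMP,
 * 1-16 for the supplementary planes. */
int plane_of(uint32_t cp)
{
	return (int)(cp >> 16);
}
```

So U+0061 lives in plane 0 (the BMP) and U+1F41A in supplementary plane 1.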

Which encoding should you choose?

For transmission and storage, use UTF-8. Programs which move ASCII data can handle it without modification. Machine endianness does not affect UTF-8, and the byte-sized units work well in networks and filesystems.

Some sites, like UTF-8 Everywhere go even further and recommend using UTF-8 for internal manipulation of text in program memory. However, I would suggest you use whatever encoding your Unicode library favors for this. You’ll be performing operations through the library API, not directly on code units. As we’re seeing, there is too much complexity between glyphs, graphemes, codepoints and code units to be manipulating the units directly. Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.

It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s true that every code unit can hold a full codepoint. However, the relationship between codepoints and glyphs isn’t straightforward, so there isn’t a programmatic advantage to storing the string this way.

UTF-32 also wastes at minimum 11 (32 - 21) bits per codepoint, and typically more. For instance, UTF-16 requires only one 16-bit code unit to encode points in the Basic Multilingual Plane (the most commonly encountered points). Thus UTF-32 typically doubles the space that UTF-16 requires for BMP text.

There are times to manipulate UTF-32, such as when examining a single codepoint. We’ll see examples below.
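
Here, roughly, is how UTF-16 packs a codepoint into code units (a sketch with an illustrative name, `cp_to_utf16`; real code should lean on a library). BMP codepoints occupy a single unit, while supplementary codepoints subtract 0x10000 and split the remaining 20 bits across a lead/trail surrogate pair.

```c
#include <stdint.h>

/* Encode a codepoint as UTF-16. Returns the number of code
 * units written (1 or 2). Assumes cp is a valid scalar value
 * (not itself a surrogate, and at most U+10FFFF). */
int cp_to_utf16(uint32_t cp, uint16_t out[2])
{
	if (cp < 0x10000) {
		out[0] = (uint16_t)cp;   /* BMP: one unit */
		return 1;
	}
	cp -= 0x10000;                   /* 20 bits remain */
	out[0] = 0xD800 | (cp >> 10);    /* lead: high ten bits */
	out[1] = 0xDC00 | (cp & 0x3FF);  /* trail: low ten bits */
	return 2;
}
```

For example, U+1F41A encodes as the surrogate pair D83D DC1A, while a BMP codepoint like U+006F is simply 006F.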

ICU example programs

The programs in this article are ready to compile and run. They require the ICU C library called ICU4C, which is available on most platforms through the operating system package manager.

ICU provides five libraries for linking (we need the first two):

Package Contents
icu-uc Common (uc) and Data (dt/data) libraries
icu-io Ustdio/iostream library (icuio)
icu-i18n Internationalization (in/i18n) library
icu-le Layout Engine
icu-lx Paragraph Layout

To use ICU4C, set the compiler and linker flags with pkg-config in your Makefile. (Pkg-config may also need to be installed on your computer.)

CFLAGS  = -std=c99 -pedantic -Wall -Wextra \
          `pkg-config --cflags icu-uc icu-io`
LDFLAGS = `pkg-config --libs icu-uc icu-io`

The examples in this article conform to the C89 standard, but we specify C99 in the Makefile because the ICU header files use C99-style (//) comments.

Generating random codepoints

To start getting a feel for ICU’s I/O and codepoint manipulation, let’s make a program to output completely random (but valid) codepoints. You could use this program as a basic fuzz tester, to see whether its output confuses other programs. A real fuzz tester ought to have the ability to take an explicit seed for repeatable output, but we will omit that functionality from our simple demo.

This program has limited portability because it gets entropy from /dev/urandom, a Unix device. To generate good random numbers using only the C standard library, see my other article. Also POSIX provides pseudo-random number functions.

/* for constants like EXIT_FAILURE */
#include <stdlib.h>
/* we'll be using standard C I/O to read random bytes */
#include <stdio.h>

/* to determine codepoint categories */
#include <unicode/uchar.h>
/* to output UTF-32 codepoints in proper encoding for terminal */
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
	long i = 0, linelen;
	/* somewhat non-portable: /dev/urandom is unix specific */
	FILE *f = fopen("/dev/urandom", "rb");
	UFILE *out;
	/* UTF-32 code unit can hold an entire codepoint */
	UChar32 c;
	/* to learn about c */
	UCharCategory cat;

	if (!f)
	{
		fputs("Unable to open /dev/urandom\n", stderr);
		return EXIT_FAILURE;
	}

	/* optional length to insert line breaks */
	linelen = argc > 1 ? strtol(argv[1], NULL, 10) : 0;

	/* have to obtain a Unicode-aware file handle. This function
	 * has no failure return code, it always works. */
	out = u_get_stdout();

	/* read a random 32 bits, presumably forever */
	while (fread(&c, sizeof c, 1, f))
	{
		/* Scale 32-bit value to a number within code planes
		 * zero through fourteen. (Planes 15-16 are private-use)
		 *
		 * The modulo bias is insignificant. The first 65536
		 * codepoints are minutely favored, being generated by
		 * 4370 different 32-bit numbers each. The remaining
		 * 917504 codepoints are generated by 4369 numbers each.
		 *
		 * Take the remainder on the value as unsigned, so
		 * random bytes that read as a negative UChar32 can't
		 * produce a negative remainder. */
		c = (UChar32)((uint32_t)c % 0xF0000);
		cat = u_charType(c);

		/* U_UNASSIGNED are "non-characters" with no assigned
		 * meanings for interchange. U_PRIVATE_USE_CHAR are
		 * reserved for use within organizations, and
		 * U_SURROGATE are designed for UTF-16 code units in
		 * particular. Don't print any of those. */
		if (cat != U_UNASSIGNED && cat != U_PRIVATE_USE_CHAR &&
		    cat != U_SURROGATE)
		{
			u_fputc(c, out);
			if (linelen && ++i >= linelen)
			{
				i = 0;
				/* there are a number of Unicode
				 * linebreaks, but the standard ASCII
				 * \n is valid, and will interact well
				 * with a shell */
				u_fputc('\n', out);
			}
		}
	}

	/* should never get here */
	fclose(f);
	return EXIT_SUCCESS;
}

A note about the mysterious U_UNASSIGNED category, the “non-characters.” These are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. The Unicode Standard sets aside 66 non-character code points. The last two code points of each plane are noncharacters (U+FFFE and U+FFFF on the BMP). In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0…U+FDEF.

Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. They are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.
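
The rule behind those 66 codepoints is simple enough to check by hand. Here is a small sketch (`is_noncharacter` is an illustrative helper of mine, not an ICU function; the program above relies instead on ICU’s category test):

```c
#include <stdint.h>

/* True for the 66 permanently reserved noncharacters:
 * the contiguous range U+FDD0..U+FDEF, plus the last two
 * codepoints of each of the 17 planes (U+xxFFFE, U+xxFFFF). */
int is_noncharacter(uint32_t cp)
{
	return (cp >= 0xFDD0 && cp <= 0xFDEF)
	    || ((cp & 0xFFFF) >= 0xFFFE && cp <= 0x10FFFF);
}
```

So U+FFFE and U+10FFFF are noncharacters, while an ordinary letter like U+0061 is not.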

Manipulating codepoints

We discussed non-characters in the previous section, but there are also Private Use codepoints. Unlike non-characters, those for private use are designated for interchange between systems. However the precise meaning and glyphs for these characters is specific to the organization using them. The same codepoints can be used for different things by different people.

Unicode provides a large area for private use: a small code block in the BMP, as well as two entire planes, 15 and 16. Because browsers and text editors typically render PUA codepoints as empty boxes, we can exploit plane 15 to make a visually confusing code. Ultimately it’s a cheesy substitution cipher, but it’s kind of fun.

Below is a program to shift characters in the BMP to/from plane 15, the Private Use Area A. Example output of an encoded string: 󰁂󰁥󰀠󰁳󰁵󰁲󰁥󰀠󰁴󰁯󰀠󰁤󰁲󰁩󰁮󰁫󰀠󰁹󰁯󰁵󰁲󰀠󰁏󰁶󰁡󰁬󰁴󰁩󰁮󰁥󰀡󰀊

#include <stdio.h>
#include <stdlib.h>
/* for strcmp in argument parsing */
#include <string.h>

#include <unicode/ustdio.h>

void usage(const char *prog)
{
	puts("Shift base multilingual plane to/from PUA-A\n");
	printf("Usage: %s [-d]\n\n", prog);
	puts("Encodes stdin (or decode with -d)");
	exit(EXIT_SUCCESS);
}

int main(int argc, char **argv)
{
	UChar32 c;
	UFILE *in, *out;
	enum { MODE_ENCODE, MODE_DECODE } mode = MODE_ENCODE;

	if (argc > 2)
		usage(argv[0]);
	else if(argc > 1)
	{
		if (strcmp(argv[1], "-d") == 0)
			mode = MODE_DECODE;
		else
			usage(argv[0]);
	}

	out = u_get_stdout();

	in = u_finit(stdin, NULL, NULL);
	if (!in)
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* u_fgetcx returns UTF-32. U_EOF happens to be 0xFFFF,
	 * not -1 like EOF typically is in stdio.h */
	while ((c = u_fgetcx(in)) != U_EOF)
	{
		/* -1 for UChar32 actually signifies invalid character */
		if (c == (UChar32)0xFFFFFFFF)
		{
			fputs("Invalid character.\n", stderr);
			continue;
		}
		if (mode == MODE_ENCODE)
		{
			/* Move the BMP into the Supplementary
			 * Private Use Area-A, which begins
			 * at codepoint 0xf0000 */
			if (0 < c && c < 0xe000)
				c += 0xf0000;
		}
		else
		{
			/* Move the Supplementary Private Use
			 * Plane down into the BMP */
			if (0xf0000 < c && c < 0xfe000)
				c -= 0xf0000;
		}
		u_fputc(c, out);
	}

	/* if you u_finit it, then u_fclose it */
	u_fclose(in);

	return EXIT_SUCCESS;
}

Examining UTF-8 code units

So far we’ve been working entirely with complete codepoints. This next example gets into their representation as code units in a transformation format, namely UTF-8. We will read a codepoint as a hexadecimal program argument, convert it to between one and four UTF-8 bytes, and print the hex values of those bytes.

/*** utf8.c ***/

#include <stdio.h>
#include <stdlib.h>

#include <unicode/utf8.h>

int main(int argc, char **argv)
{
	UChar32 c;
	/* ICU defines its own bool type to be used
	 * with their macro */
	UBool err = FALSE;
	/* ICU uses C99 types like uint8_t */
	uint8_t bytes[4] = {0};
	/* probably should be size_t not int32_t, but
	 * just matching what their macro expects */
	int32_t written = 0, i;
	char *parsed;

	if (argc != 2)
	{
		fprintf(stderr, "Usage: %s codepoint\n", *argv);
		exit(EXIT_FAILURE);
	}
	c = strtol(argv[1], &parsed, 16);
	if (!*argv[1] || *parsed)
	{
		fprintf(stderr,
			"Cannot parse codepoint: U+%s\n", argv[1]);
		exit(EXIT_FAILURE);
	}

	/* this is a macro, and updates the variables
	 * directly. No need to pass addresses.
	 * We're saying: write to "bytes", tell us how
	 * many were "written", limit it to four */
	U8_APPEND(bytes, written, 4, c, err);
	if (err == TRUE)
	{
		fprintf(stderr, "Invalid codepoint: U+%s\n", argv[1]);
		exit(EXIT_FAILURE);
	}

	/* print in format 'xxd -r' can read */
	printf("0: ");
	for (i = 0; i < written; ++i)
		printf("%02x", bytes[i]);
	puts("");
	return EXIT_SUCCESS;
}

Suppose you compile this to a program named utf8. Here are some examples:

# ascii characters are unchanged
$ ./utf8 61
0: 61

# other codepoints require more bytes
$ ./utf8 1F41A
0: f09f909a

# format is compatible with "xxd"
$ ./utf8 1F41A | xxd -r
🐚

# surrogates (used in UTF-16) are not valid codepoints
$ ./utf8 DC00
Invalid codepoint: U+DC00

Reading lines into internal UTF-16 representation

Unlimited line length

Here’s a useful helper function named u_wholeline() which reads a line of any length into a dynamically allocated buffer. It reads as UChar*, which is ICU’s standard UTF-16 code unit array.

/* to properly test realloc */
#include <errno.h>
#include <stdlib.h>

#include <unicode/ustdio.h>

/* line Feed, vertical tab, form feed, carriage return,
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
	((c) >= 0xa && (c) <= 0xd) || \
	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

/* allocates buffer, caller must free */
UChar *u_wholeline(UFILE *f)
{
	/* assume most lines are shorter
	 * than 128 UTF-16 code units */
	size_t i, sz = 128;
	UChar c, *s = malloc(sz * sizeof(*s)), *s_new;

	if (!s)
		return NULL;

	/* u_fgetc returns UTF-16, unlike u_fgetcx */
	for (i = 0; (s[i] = u_fgetc(f)) != U_EOF && !NEWLINE(s[i]); ++i)
		if (i + 1 >= sz)
		{
			/* double the buffer when it runs out; growing
			 * one unit early leaves room for the write in
			 * the loop condition above */
			sz *= 2;
			s_new = realloc(s, sz * sizeof(*s));
			if (!s_new)
			{
				free(s);
				return NULL;
			}
			s = s_new;
		}

	/* if terminated by CR, eat LF */
	if (s[i] == 0xd && (c = u_fgetc(f)) != 0xa)
		u_fungetc(c, f);
	/* s[i] will either be U_EOF or a newline; wipe it */
	s[i] = '\0';

	return s;
}

Limited line length

The previous example reads an entire line. However, reading a limited number of code units from UTF-16 lines is trickier. Truncating a Unicode string is always a little dangerous, since it may split a word or break contextual shaping.

UTF-16 also has surrogate pairs, which are how that transformation format expresses codepoints outside the BMP. Without the proper precaution, ending a UTF-16 string early can split a surrogate pair.
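
Surrogate detection itself is just bit masking. Here is a pure-C sketch equivalent to ICU’s U16_IS_LEAD/U16_IS_TRAIL macros, plus the reverse of the pairing arithmetic (the function names are mine):

```c
#include <stdint.h>

/* Lead surrogates occupy 0xD800..0xDBFF,
 * trail surrogates 0xDC00..0xDFFF. */
int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Reassemble the codepoint from a surrogate pair: ten bits
 * from each unit, plus the 0x10000 offset. */
uint32_t from_pair(uint16_t lead, uint16_t trail)
{
	return 0x10000u
	     + ((uint32_t)(lead - 0xD800) << 10)
	     + (trail - 0xDC00);
}
```

For instance, the pair D835 DFD8 (which appears in the sample run below) reassembles into U+1D7D8.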

The following example reads lines in chunks of at most three UTF-16 code units at a time. If it reads two consecutive codepoints from supplementary planes it will fail. The program accepts a “fix” argument to make it push a final unpaired surrogate back onto the stream for a future read.

/*** codeunit.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>

/* BUFSZ set to be very small so that lines must be read in
 * many chunks. Helps illustrate split surrogate pairs */
#define BUFSZ 4

void printHex(const UChar *s)
{
	while (*s)
		printf("%x ", *s++);
	putchar('\n');
}

/* yeah, slightly annoying duplication */
void printHex32(const UChar32 *s)
{
	while (*s)
		printf("%x ", *s++);
	putchar('\n');
}

int main(int argc, char **argv)
{
	UFILE *in;
	/* read line into ICU's default UTF-16 representation */
	UChar line[BUFSZ];
	/* A buffer to hold codepoints of "line" as UTF-32 code
	 * units.  The length is sufficient because it requires
	 * fewer (or at least no greater) code units in UTF-32 to
	 * encode the string */
	UChar32 codepoints[BUFSZ];
	UChar *final;
	UErrorCode err = U_ZERO_ERROR;

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* read lines one small BUFSZ chunk at a time */
	while (u_fgets(line, BUFSZ, in))
	{
		/* correct for split surrogate pairs only
		 * if the "fix" argument is present */
		if (argc > 1 && strcmp(argv[1], "fix") == 0)
		{
			final = line + u_strlen(line);
			/* want to consider the character before \0
			 * if such exists */
			if (final > line)
				final--;
			/* if it is the lead unit of a surrogate pair */
			if (U16_IS_LEAD(*final))
			{
				/* push it back for a future read, and
				 * truncate the string */
				u_fungetc(*final, in);
				*final = '\0';
			}
		}

		printf("UTF-16    : ");
		printHex(line);
		u_strToUTF32(
			codepoints, BUFSZ, NULL,
			line, -1, &err);
		printf("Error?    : %s\n", u_errorName(err));
		printf("Codepoints: ");
		printHex32(codepoints);

		/* reset potential errors and go for another chunk */
		err = U_ZERO_ERROR;
		*codepoints = '\0';
	}

	u_fclose(in);
	return EXIT_SUCCESS;
}

If the program reads two weird numerals 𝟘𝟙 (different from 01), neither of which are in the BMP, it finds one codepoint but chokes on the broken pair:

$ echo -n 𝟘𝟙 | ./codeunit
UTF-16    : d835 dfd8 d835
Error?    : U_INVALID_CHAR_FOUND
Codepoints: 1d7d8
UTF-16    : dfd9
Error?    : U_INVALID_CHAR_FOUND
Codepoints:

However if we pass the “fix” argument, the program will read two complete codepoints:

$ echo -n 𝟘𝟙 | ./codeunit fix
UTF-16    : d835 dfd8
Error?    : U_ZERO_ERROR
Codepoints: 1d7d8
UTF-16    : d835 dfd9
Error?    : U_ZERO_ERROR
Codepoints: 1d7d9

Perhaps a better way to read a line with limited length is to use a “break iterator” to stop on a word boundary. We’ll see more about that later.

Extracting, iterating codepoints in UTF-16 string

Our next example will rather laboriously remove diacritical marks from a string. There’s an easier way to do this called “transformation,” but doing it manually provides an opportunity to decompose characters and iterate over them with the U16_NEXT macro.

/*** nomarks.c ***/

#include <stdlib.h>

#include <unicode/uchar.h>
#include <unicode/unorm2.h>
#include <unicode/ustdio.h>
#include <unicode/utf16.h>

/* Limit to how many decomposed UTF-16 units a single
 * codepoint will become in NFD. I don't know the
 * correct value here so I chose a value that seems
 * to be overkill */
#define MAX_DECOMP_LEN 16

int main(void)
{
	long i, n;
	UChar32 c;
	UFILE *in, *out;
	UChar decomp[MAX_DECOMP_LEN];
	UErrorCode status = U_ZERO_ERROR;
	UNormalizer2 *norm;

	out = u_get_stdout();

	in = u_finit(stdin, NULL, NULL);
	if (!in)
	{
		/* using stdio functions with stderr and ustdio
		 * with stdout. Mixing the two on a single file
		 * handle would probably be bad. */
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* create a normalizer, in this case one going to NFD */
	norm = (UNormalizer2 *)unorm2_getNFDInstance(&status);
	if (U_FAILURE(status)) {
		fprintf(stderr,
			"unorm2_getNFDInstance(): %s\n",
			u_errorName(status));
		return EXIT_FAILURE;
	}

	/* consume input as UTF-32 units one by one */
	while ((c = u_fgetcx(in)) != U_EOF)
	{
		/* Decompose c to isolate its n combining character
		 * codepoints. Saves them as UTF-16 code units.  FYI,
		 * this function ignores the type of "norm" and always
		 * denormalizes */
		n = unorm2_getDecomposition(
			norm, c, decomp, MAX_DECOMP_LEN, &status
		);

		if (U_FAILURE(status)) {
			fprintf(stderr,
				"unorm2_getDecomposition(): %s\n",
				u_errorName(status));
			u_fclose(in);
			return EXIT_FAILURE;
		}

		/* if c does not decompose and is not itself
		 * a diacritical mark */
		if (n < 0 && ublock_getCode(c) !=
		    UBLOCK_COMBINING_DIACRITICAL_MARKS)
			u_fputc(c, out);

		/* walk canonical decomposition, reuse c variable */
		for (i = 0; i < n; )
		{
			/* the U16_NEXT macro iterates over UChar (aka
			 * UTF-16), advancing by one or two elements as
			 * needed to get a codepoint. It saves the result
			 * in UTF-32. The macro updates i and c. */
			U16_NEXT(decomp, i, n, c);
			/* output only if not combining diacritical */
			if (ublock_getCode(c) !=
			    UBLOCK_COMBINING_DIACRITICAL_MARKS)
				u_fputc(c, out);
		}
	}

	u_fclose(in);
	/* u_get_stdout() doesn't need to be u_fclose'd */
	return EXIT_SUCCESS;
}

Here’s an example of running the program:

$ echo "résumé façade" | ./nomarks
resume facade

Transformation

ICU provides a rich domain specific language for transforming strings. For example, our entire program in the previous section can be replaced by the transformation NFD; [:Nonspacing Mark:] Remove; NFC. This means to perform a canonical decomposition, remove nonspacing marks, and then canonically compose again. (In fact our program above didn’t re-compose.)

The program below echoes stdin to stdout, but passes the output through a transformation.

/*** trans-stream.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

int main(int argc, char **argv)
{
	UChar32 c;
	UParseError pe;
	UFILE *in, *out;
	UTransliterator *t;
	UErrorCode status = U_ZERO_ERROR;
	UChar *xform_id;
	size_t n;

	if (argc != 2)
	{
		fprintf(stderr,
			"Usage: %s \"translation rules\"\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* the UTF-16 string should never be longer than the UTF-8
	 * argv[1], so this should be safe */
	n = strlen(argv[1]) + 1;
	xform_id = malloc(n * sizeof(UChar));
	u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

	/* create transliterator by identifier */
	t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
	                 NULL, -1, &pe, &status);
	/* don't need the identifier any more */
	free(xform_id);
	if (U_FAILURE(status)) {
		fprintf(stderr, "utrans_open(%s): %s\n",
		        argv[1], u_errorName(status));
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* transparently transliterate stdout */
	u_fsettransliterator(out, U_WRITE, t, &status);
	if (U_FAILURE(status)) {
		fprintf(stderr,
		        "Failed to set transliterator on stdout: %s\n",
		        u_errorName(status));
		u_fclose(in);
		return EXIT_FAILURE;
	}

	/* what looks like a simple echo loop actually
	 * transliterates characters */
	while ((c = u_fgetcx(in)) != U_EOF)
		u_fputc(c, out);

	utrans_close(t);
	u_fclose(in);
	return EXIT_SUCCESS;
}

As mentioned, it can emulate our earlier “nomarks” program:

$ echo "résumé façade" | ./trans "NFD; [:Nonspacing Mark:] Remove; NFC"
resume facade

It can also transliterate between scripts like this:

$ echo "miirekkaḍiki veḷutunnaaru?" | ./trans "Telugu"
మీరెక్కడికి వెళుతున్నఅరు?

Applying the transformation to a stream with u_fsettransliterator is a simple way to do things. However I did discover and file an ICU bug which will be fixed in version 65.1.

A more robust way to apply transformations is by manipulating UChar strings directly. The technique is also probably more applicable in real applications.

Here’s a rewrite of trans-stream that operates on strings directly:

/*** trans-string.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

/* max number of UTF-16 code units to accumulate while looking
 * for an unambiguous transliteration. Has to be fairly long to
 * handle names in Name-Any transliteration like
 * \N{LATIN CAPITAL LETTER O WITH OGONEK AND MACRON} */
#define CONTEXT 100

int main(int argc, char **argv)
{
	UErrorCode status = U_ZERO_ERROR;
	UChar c, *end;
	UChar input[CONTEXT] = {0}, *buf, *enlarged;
	UFILE *in, *out; 
	UTransPosition pos;
	int32_t width, sizeNeeded, bufLen;

	size_t n;
	UChar *xform_id;
	UTransliterator *t;

	/* bufLen must be able to hold at least CONTEXT, and
	 * will be increased as needed for transliteration */
	bufLen = CONTEXT;
	buf = malloc(sizeof(UChar) * bufLen);

	if (argc != 2)
	{
		fprintf(stderr,
			"Usage: %s \"translation rules\"\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* allocate and read identifier, like earlier example */
	n = strlen(argv[1]) + 1;
	xform_id = malloc(n * sizeof(UChar));
	u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

	t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
	                 NULL, -1, NULL, &status);
	free(xform_id);
	if (U_FAILURE(status)) {
		fprintf(stderr, "utrans_open(%s): %s\n",
		        argv[1], u_errorName(status));
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	end = input;
	/* append UTF-16 code units one at a time for incremental
	 * transliteration */
	while ((c = u_fgetc(in)) != U_EOF)
	{
		/* we consider at most CONTEXT consecutive code units
		 * for transliteration (minus one for \0) */
		if (end - input >= CONTEXT-1)
		{
			fprintf(stderr,
				"Exceeded max (%i) code units "
				"for context.\n",
				CONTEXT);
			break;
		}
		*end++ = c;
		*end = '\0';

		/* copy string so far to buf to operate on */
		u_strcpy(buf, input);
		pos.start = pos.contextStart = 0;
		pos.limit = pos.contextLimit = end - input;
		sizeNeeded = -1;
		utrans_transIncrementalUChars(
			t, buf, &sizeNeeded, bufLen, &pos, &status
		);
		/* if buf not big enough for transliterated result */
		if (status == U_BUFFER_OVERFLOW_ERROR)
		{
			/* utrans_transIncrementalUChars sets sizeNeeded,
			 * so resize the buffer */
			if ((enlarged =
			     realloc(buf, sizeof(UChar)*sizeNeeded))
			    == NULL)
			{
				fprintf(stderr,
					"Unable to grow buffer.\n");
				/* fail gracefully and display
				 * what we can */
				break;
			}
			buf = enlarged;
			bufLen = sizeNeeded;
			u_strcpy(buf, input);
			pos.start = pos.contextStart = 0;
			pos.limit = pos.contextLimit = end - input;
			sizeNeeded = -1;

			/* one more time, but with sufficient space */
			status = U_ZERO_ERROR;
			utrans_transIncrementalUChars(
				t, buf, &sizeNeeded, bufLen,
				&pos, &status
			);
		}
		/* handle errors other than U_BUFFER_OVERFLOW_ERROR */
		if (U_FAILURE(status)) {
			fprintf(stderr,
				"utrans_transIncrementalUChars(): %s\n",
				u_errorName(status));
			break;
		}

		/* print buf[0 .. pos.start - 1] */
		u_printf("%.*S", pos.start, buf);

		/* Remove the code units which were processed,
		 * shifting back the remaining ones which could
		 * not be unambiguously transliterated. Then hit
		 * the loop to get another code unit and try again. */
		u_strcpy(input, buf+pos.start);
		end = input + (pos.limit - pos.start);
	}

	/* if any leftovers from incremental transliteration */
	if (end > input)
	{
		/* transliterate input array in place, do our best */
		width = end - input;
		utrans_transUChars(
			t, input, NULL, CONTEXT, 0, &width, &status);
		u_printf("%S", input);
	}

	utrans_close(t);
	u_fclose(in);
	free(buf);
	return U_SUCCESS(status) ? EXIT_SUCCESS : EXIT_FAILURE;
}

Punycode

Punycode is a representation of Unicode within the limited ASCII character subset used for internet host names. If you enter a non-ASCII URL into a web browser navigation bar, the browser translates to Punycode before making the actual DNS lookup.

The encoding is part of the more general process of Internationalizing Domain Names in Applications (IDNA), which also normalizes the string.

Note that not all Unicode strings can be successfully encoded. For instance codepoints like “⒈” include a period in the glyph and are used for numbered lists. Converting that dot to the ASCII hostname would inadvertently specify a subdomain. ICU turns the offending character into U+FFFD (the “replacement character”) in the output and returns an error.

The following program uses uidna_nameToASCII or uidna_nameToUnicode as needed to translate between Unicode and punycode.

/*** puny.c ***/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* uidna stands for International Domain Names in 
 * Applications and contains punycode routines */
#include <unicode/uidna.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

void chomp(UChar *s)
{
	/* unicode characters that split lines */
	UChar splits[] =
		{0xa, 0xb, 0xc, 0xd, 0x85, 0x2028, 0x2029, '\0'};
	if (s)
		s[u_strcspn(s, splits)] = '\0';
}

int main(int argc, char **argv)
{
	UFILE *in;
	UChar input[1024], output[1024];
	UIDNAInfo info = UIDNA_INFO_INITIALIZER;
	UErrorCode status = U_ZERO_ERROR;
	UIDNA *idna = uidna_openUTS46(UIDNA_DEFAULT, &status);

	/* default action is performing punycode */
	int32_t (*action)(
			const UIDNA*, const UChar*, int32_t, UChar*, 
			int32_t, UIDNAInfo*, UErrorCode*
		) = uidna_nameToASCII;

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* the "decode" option reverses our action */
	if (argc > 1 && strcmp(argv[1], "decode") == 0)
		action = uidna_nameToUnicode;

	/* u_fgets includes the newline, so we chomp it */
	u_fgets(input, sizeof(input)/sizeof(*input), in);
	chomp(input);

	action(idna, input, -1, output,
		sizeof(output)/sizeof(*output),
		&info, &status);

	if (U_SUCCESS(status) && info.errors!=0)
		fputs("Bad input.\n", stderr);

	u_printf("%S\n", output);

	uidna_close(idna);
	u_fclose(in);
	return 0;
}

Example of using the program:

$ echo "façade.com" | ./puny
xn--faade-zra.com

# not every string is allowed

$ echo "a⒈.com" | ./puny
Bad input.
a�.com

Changing case

The C standard library has functions like toupper which operate on a single character at a time. ICU has equivalents like u_toupper, but working on single codepoints isn’t sufficient for proper casing. Let’s examine the program and see why.

/*** pointcase.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/uchar.h>
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
	UChar32 c;
	UFILE *in, *out;
	UChar32 (*op)(UChar32) = NULL;

	/* set op to one of the casing operations
	 * in uchar.h */
	if (argc < 2 || strcmp(argv[1], "upper") == 0)
		op = u_toupper;
	else if (strcmp(argv[1], "lower") == 0)
		op = u_tolower;
	else if (strcmp(argv[1], "title") == 0)
		op = u_totitle;
	else
	{
		fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* operates on UTF-32 */
	while ((c = u_fgetcx(in)) != U_EOF)
		u_fputc(op(c), out);

	u_fclose(in);
	return EXIT_SUCCESS;
}

# not quite right, ß should become SS:

$ echo "Die große Stille" | ./pointcase upper
DIE GROßE STILLE

# also wrong, final sigma should be ς:

$ echo "ΣΊΣΥΦΟΣ" | ./pointcase lower
σίσυφοσ

As you can see, some graphemes need to “expand” into a greater number, and others are position-sensitive. To do this properly, we have to operate on entire strings rather than individual characters. Here is a program to do it right:

/*** strcase.c ***/

#include <locale.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 1024

/* wrapper function for u_strToTitle with signature
 * matching the other casing functions */
int32_t title(UChar *dest, int32_t destCapacity,
		const UChar *src, int32_t srcLength,
		const char *locale, UErrorCode *pErrorCode)
{
	return u_strToTitle(dest, destCapacity, src,
			srcLength, NULL, locale, pErrorCode);
}

int main(int argc, char **argv)
{
	UFILE *in;
	char *locale;
	UChar line[BUFSZ], cased[BUFSZ];
	UErrorCode status = U_ZERO_ERROR;
	int32_t (*op)(
			UChar*, int32_t, const UChar*, int32_t,
			const char*, UErrorCode*
		) = NULL;

	/* casing is locale-dependent */
	if (!(locale = setlocale(LC_CTYPE, "")))
	{
		fputs("Cannot determine system locale\n", stderr);
		return EXIT_FAILURE;
	}

	if (argc < 2 || strcmp(argv[1], "upper") == 0)
		op = u_strToUpper;
	else if (strcmp(argv[1], "lower") == 0)
		op = u_strToLower;
	else if (strcmp(argv[1], "title") == 0)
		op = title;
	else
	{
		fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* Ideally we should change case up to the last word
	 * break and push the remaining characters back for
	 * a future read if the line was longer than BUFSZ.
	 * Currently, if the string is truncated, the final
	 * character would incorrectly be considered
	 * terminal, which affects casing rules in Greek. */
	while (u_fgets(line, BUFSZ, in))
	{
		op(cased, BUFSZ, line, -1, locale, &status);
		/* if casing increases string length, and goes
		 * beyond buffer size like the german ß -> SS */
		if (status == U_BUFFER_OVERFLOW_ERROR)
		{
			/* Just issue a warning and read another line.
			 * Don't treat it as severely as other errors. */
			fputs("Line too long\n", stderr);
			status = U_ZERO_ERROR;
		}
		else if (U_FAILURE(status))
		{
			fputs(u_errorName(status), stderr);
			break;
		}
		else
			u_printf("%S", cased);
	}

	u_fclose(in);
	return U_SUCCESS(status)
		? EXIT_SUCCESS : EXIT_FAILURE;
}

This works better.

$ echo "Die große Stille" | ./strcase upper
DIE GROSSE STILLE

$ echo "ΣΊΣΥΦΟΣ" | ./strcase lower
σίσυφος

Counting words and graphemes

Let’s make a version of wc (the Unix word count program) that knows more about Unicode. Our version will properly count grapheme clusters and word boundaries.

For example, regular wc gets confused by the ancient Ogham script. This was a series of notches scratched into fence posts, and has a space character which is nonblank.

$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | wc
       1       1      37

One word, you say? Puh-leaze, if your program can’t handle Medieval Irish carvings then I want nothing to do with it. Here’s one that can:

/*** uwc.c ***/

#include <locale.h>
#include <stdlib.h>

#include <unicode/ubrk.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 512

/* line feed, vertical tab, form feed, carriage return,
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
	((c) >= 0xa && (c) <= 0xd) || \
	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

int main(void)
{
	UFILE *in;
	char *locale;
	UChar line[BUFSZ];
	UBreakIterator *brk_g, *brk_w;
	UErrorCode status = U_ZERO_ERROR;
	long ngraph = 0, nword = 0, nline = 0;
	size_t len;

	/* word breaks are locale-specific, so we'll obtain
	 * LC_CTYPE from the environment */
	if (!(locale = setlocale(LC_CTYPE, "")))
	{
		fputs("Cannot determine system locale\n", stderr);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* create an iterator for graphemes */
	brk_g = ubrk_open(
		UBRK_CHARACTER, locale, NULL, -1, &status);
	/* and another for the edges of words */
	brk_w = ubrk_open(
		UBRK_WORD, locale, NULL, -1, &status);

	/* yes, this is sensitive to splitting end of line
	 * surrogate pairs and can be improved by our previous
	 * function for reading bounded lines */
	while (u_fgets(line, BUFSZ, in))
	{
		len = u_strlen(line);

		ubrk_setText(brk_g, line, len, &status);
		ubrk_setText(brk_w, line, len, &status);

		/* Start at beginning of string, count breaks.
		 * Could have been a for loop, but this looks
		 * simpler to me. */
		ubrk_first(brk_g);
		while (ubrk_next(brk_g) != UBRK_DONE)
			ngraph++;

		ubrk_first(brk_w);
		while (ubrk_next(brk_w) != UBRK_DONE)
			if (ubrk_getRuleStatus(brk_w) ==
			    UBRK_WORD_LETTER)
				nword++;

		/* count the newline if it exists */
		if (len > 0 && NEWLINE(line[len-1]))
			nline++;
	}

	printf("locale  : %s\n"
	       "Grapheme: %ld\n"
	       "Word    : %ld\n"
	       "Line    : %ld\n",
	       locale, ngraph, nword, nline);

	/* clean up iterators after use */
	ubrk_close(brk_g);
	ubrk_close(brk_w);
	u_fclose(in);
}

Much better:

$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | ./uwc
locale  : en_US.UTF-8
Grapheme: 14
Word    : 4
Line    : 1

When comparing strings, we can be more or less strict. A familiar example is case sensitivity, but Unicode provides other options. Comparing strings for equality is a degenerate case of sorting: a sort must not only decide whether strings are equal, it must also put unequal strings in order. In Unicode, sorting is called “collation,” and the Unicode collation algorithm supports multiple levels of increasing strictness.

Level       Description
Primary     base characters
Secondary   accents
Tertiary    case/variant
Quaternary  punctuation
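The tie-breaking scheme itself can be sketched without ICU at all. The toy comparator below is an illustration of the idea only, not ICU's API (real collation elements are far more involved): it exhausts one level across the whole string before consulting the next.

```c
#include <stddef.h>

/* A toy collation element: one weight per level. */
typedef struct {
    int primary;   /* base character */
    int secondary; /* accent */
    int tertiary;  /* case/variant */
} CollElem;

/* Compare two strings of collation elements up to `levels`
 * (1..3). A deeper level is only consulted when every
 * shallower level ties. */
int coll_compare(const CollElem *a, size_t alen,
                 const CollElem *b, size_t blen, int levels)
{
    size_t i, n = alen < blen ? alen : blen;
    int lvl;
    for (lvl = 1; lvl <= levels; lvl++) {
        for (i = 0; i < n; i++) {
            int wa = lvl == 1 ? a[i].primary
                   : lvl == 2 ? a[i].secondary : a[i].tertiary;
            int wb = lvl == 1 ? b[i].primary
                   : lvl == 2 ? b[i].secondary : b[i].tertiary;
            if (wa != wb)
                return wa < wb ? -1 : 1;
        }
        if (alen != blen)
            return alen < blen ? -1 : 1;
    }
    return 0;
}
```

With elements standing in for “Coo” and “coö”, the two compare equal at the primary level and diverge at the secondary level, which is exactly the behavior the table describes.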

Each level acts as a tie-breaker when strings match in previous levels. When searching we can choose how deep to check before declaring strings equal. To illustrate, consider a text file called words.txt containing these words:

Cooperate
coöperate
COÖPERATE
co-operate
final
fides

We will write a program called ugrep, where we can specify a comparison level and search string. If we search for “cooperate” at the tertiary level, it matches nothing:

$ ./ugrep 3 cooperate < words.txt
# it's an exact match, no results

It is possible to shift certain “ignorable” characters (like ‘-’) down to the quaternary level while conducting the original level 3 search:

$ ./ugrep 3i cooperate < words.txt
4: co-operate

Doing the same search at the secondary level disregards case, but is still sensitive to accents.

$ ./ugrep 2 cooperate < words.txt
1: Cooperate

Once again, we can allow ignorables at this level.

$ ./ugrep 2i cooperate < words.txt
1: Cooperate
4: co-operate

Finally, going only to the primary level, we match words with the same base letters, modulo case and accents.

$ ./ugrep 1 cooperate < words.txt
1: Cooperate
2: coöperate
3: COÖPERATE

Note that the idea of a “base character” is dependent on locale. In Swedish, the letters o and ö are quite distinct, and not minor variants as in English. Setting the locale prior to search restricts the results even at the primary level.

$ LC_COLLATE=sv_SE ./ugrep 1 cooperate < words.txt
1: Cooperate

One note about the tertiary level: it distinguishes not just case, but also ligature presentation forms. The “final” in our words.txt is spelled with the single-codepoint ligature “ﬁ” (U+FB01), so searching for a plain “fi” misses it at the tertiary level.

$ ./ugrep 3 fi < words.txt
6: fides

# vs

$ ./ugrep 2 fi < words.txt
5: final
6: fides

Pretty flexible, right? Let’s see the code.

/*** ugrep.c ***/

#include <locale.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/ucol.h>
#include <unicode/usearch.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 1024

int main(int argc, char **argv)
{
	char *locale;
	UFILE *in;
	UCollator *col;
	UStringSearch *srch = NULL;
	UErrorCode status = U_ZERO_ERROR;
	UChar *needle, line[BUFSZ];
	UColAttributeValue strength;
	int ignoreInsignificant = 0, asymmetric = 0;
	size_t n;
	long i;

	if (argc != 3)
	{
		fprintf(stderr,
			"Usage: %s {1,2,@,3}[i] pattern\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* cryptic parsing for our cryptic options */
	switch (*argv[1])
	{
		case '1':
			strength = UCOL_PRIMARY;
			break;
		case '2':
			strength = UCOL_SECONDARY;
			break;
		case '@':
			strength = UCOL_SECONDARY, asymmetric = 1;
			break;
		case '3':
			strength = UCOL_TERTIARY;
			break;
		default:
			fprintf(stderr,
				"Unknown strength: %s\n", argv[1]);
			return EXIT_FAILURE;
	}
	/* length of argv[1] is >0 or we would have died */
	ignoreInsignificant = argv[1][strlen(argv[1])-1] == 'i';

	n = strlen(argv[2]) + 1;
	/* if UTF-8 could encode it in n, then UTF-16
	 * should be able to as well */
	needle = malloc(n * sizeof(*needle));
	u_strFromUTF8(needle, n, NULL, argv[2], -1, &status);

	/* searching is a degenerate case of collation,
	 * so we read the LC_COLLATE locale */
	if (!(locale = setlocale(LC_COLLATE, "")))
	{
		fputs("Cannot determine system collation locale\n",
		      stderr);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	col = ucol_open(locale, &status);
	ucol_setStrength(col, strength);

	if (ignoreInsignificant)
		/* shift ignorable characters down to
		 * quaternary level */
		ucol_setAttribute(col, UCOL_ALTERNATE_HANDLING,
		                  UCOL_SHIFTED, &status);

	/* Assumes every line fits in BUFSZ. A longer line
	 * would arrive in chunks, each incrementing i, so
	 * real code should detect and handle that case. */
	for (i = 1; u_fgets(line, BUFSZ, in); ++i)
	{
		/* first time through, set up all options */
		if (!srch)
		{
			srch = usearch_openFromCollator(
				needle, -1, line, -1,
			    col, NULL, &status
			);
			if (asymmetric)
				usearch_setAttribute(
					srch, USEARCH_ELEMENT_COMPARISON,
					USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD,
					&status
				);
		}
		/* afterward just switch text */
		else
			usearch_setText(srch, line, -1, &status);

		/* check if keyword appears in line */
		if (usearch_first(srch, &status) != USEARCH_DONE)
			u_printf("%ld: %S", i, line);
	}

	usearch_close(srch);
	ucol_close(col);
	u_fclose(in);
	free(needle);

	return EXIT_SUCCESS;
}

Comparing strings modulo normalization

In the concepts section, we saw a single grapheme can be constructed with different combinations of codepoints. In many cases when comparing strings for equality, we’re most interested in the strings being perceived by the user in the same way rather than a simple byte-for-byte match.

The ICU library provides a unorm_compare function which returns a value similar to strcmp, and acts in a normalization-independent way. It normalizes both strings incrementally while comparing them, so it can stop early if it finds a difference.

Here is code to check that the five ways of representing ộ are equivalent:

#include <stdio.h>
#include <unicode/unorm2.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	UChar s[][4] = {
		{0x006f,0x0302,0x0323,0},
		{0x006f,0x0323,0x0302,0},
		{0x00f4,0x0323,0,0},
		{0x1ecd,0x0302,0,0},
		{0x1ed9,0,0,0}
	};

	const size_t n = sizeof(s)/sizeof(s[0]);
	size_t i;

	for (i = 0; i < n; ++i)
		printf("%zu == %zu: %d\n", i, (i+1)%n,
			unorm_compare(
				s[i], -1, s[(i+1)%n], -1, 0, &status));
}

Output:

0 == 1: 0
1 == 2: 0
2 == 3: 0
3 == 4: 0
4 == 0: 0

A return value of 0 means the strings are equal.
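To make concrete what unorm_compare is smoothing over, here is the same grapheme in two normalization forms spelled out as raw UTF-8 bytes, using only the standard library (no ICU involved):

```c
#include <string.h>

/* "é" two ways:
 *   NFC: the precomposed codepoint U+00E9        -> 2 bytes
 *   NFD: 'e' + U+0301 COMBINING ACUTE ACCENT     -> 3 bytes */
static const char nfc[] = "\xC3\xA9";
static const char nfd[] = "e\xCC\x81";

/* A naive byte-for-byte check calls them different, even
 * though they render identically on screen. */
int naive_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

naive_equal(nfc, nfd) is false, while unorm_compare would report the two strings equal.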

Confusable strings

Because Unicode introduces so many graphemes, there are more possibilities for scammers to confuse people using lookalike glyphs. For instance, domains like adoḅe.com or pаypal.com (with Cyrillic а) can direct unwary visitors to phishing sites. ICU contains an entire module for detecting “confusables,” those strings which are known to look too similar when rendered in common fonts. Each string is assigned a “skeleton” such that confusable strings get the same skeleton.
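ICU's entry point for this is the uspoof family of functions in unicode/uspoof.h. To show the skeleton idea itself, here is a stdlib-only toy that maps a few hand-picked Cyrillic lookalikes to their Latin doubles; the real module consults the full Unicode confusables data rather than a three-entry table.

```c
#include <string.h>

/* Toy "skeleton": rewrite known lookalike UTF-8 sequences
 * to their ASCII doubles, so confusable strings come out
 * byte-identical. */
static const struct { const char *from; char to; } lookalikes[] = {
    {"\xD0\xB0", 'a'},  /* U+0430 CYRILLIC SMALL A  */
    {"\xD0\xBE", 'o'},  /* U+043E CYRILLIC SMALL O  */
    {"\xD1\x80", 'p'},  /* U+0440 CYRILLIC SMALL ER */
};

void skeleton(const char *in, char *out, size_t outsz)
{
    size_t o = 0, i, n = sizeof lookalikes / sizeof *lookalikes;
    while (*in && o + 1 < outsz) {
        for (i = 0; i < n; i++) {
            size_t len = strlen(lookalikes[i].from);
            if (strncmp(in, lookalikes[i].from, len) == 0) {
                out[o++] = lookalikes[i].to;
                in += len;
                break;
            }
        }
        if (i == n)          /* no mapping: copy byte as-is */
            out[o++] = *in++;
    }
    out[o] = '\0';
}
```

Feeding it “pаypal.com” (with the Cyrillic а) and “paypal.com” produces the same skeleton, which is exactly the property a spoofing check relies on.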

For an example, see my utility utofu. It has a little extra complexity from its sqlite access code, so I won’t reproduce it here. It’s designed to watch Unicode strings over time and detect changes that might be spoofing.

The method of operation is this:

  1. Read line as UTF-8
  2. Convert to Normalization Form C for consistency
  3. Calculate skeleton string
  4. Insert UTF-8 version of normalized input and its skeleton into a database if the skeleton doesn’t already exist
  5. Compare the normalized input string to the string in the database having the corresponding skeleton. If they are not an exact match, die with an error.

Further reading

Unicode and internationalization is a huge topic. I could only scratch the surface in this article. I read and enjoyed sections from several books and reference materials, and would recommend them.

May 22, 2019

Indrek Lasn (indreklasn)

Hey, not the type of guy to promote, but we’re building a community of doers/makers. May 22, 2019 09:45 AM

Hey, not the type of guy to promote, but we’re building a community of doers/makers. Heck, who knows — you might find a potential co-founder here. https://app.getnewly.com/join/?r=G2KE-kzff

May 20, 2019

Pete Corey (petecorey)

Minimum Viable Phoenix May 20, 2019 12:00 AM

Starting at the Beginning

Phoenix ships with quite a few bells and whistles. Whenever you fire up mix phx.new to create a new web application, forty-six files are created and spread across thirty directories!

This can be overwhelming to developers new to Phoenix.

To build a better understanding of the framework and how all of its moving pieces interact, let’s strip Phoenix down to its bare bones. Let’s start from zero and slowly build up to a minimum viable Phoenix application.

.gitignore


+.DS_Store

Minimum Viable Elixir

Starting at the beginning, we need to recognize that all Phoenix applications are Elixir applications. Our first step in the process of building a minimum viable Phoenix application is really to build a minimum viable Elixir application.

Interestingly, the simplest possible Elixir application is simply an *.ex file that contains some source code. To set ourselves up for success later, let’s place our code in lib/minimal/application.ex. We’ll start by simply printing "Hello." to the console.


IO.puts("Hello.")

Surprisingly, we can execute our newly written Elixir application by compiling it:


➜ elixirc lib/minimal/application.ex
Hello.

This confused me at first, but it was explained to me that in the Elixir world, compilation is also evaluation.

lib/minimal/application.ex


+IO.puts("Hello.")

Generating Artifacts

While our execution-by-compilation works, it’s really nothing more than an on-the-fly evaluation. We’re not generating any compilation artifacts that can be re-used later, or deployed elsewhere.

We can fix that by moving our code into a module. Once we compile our newly modularized application.ex, a new Elixir.Minimal.Application.beam file will appear in the root of our project.

We can run our compiled Elixir program by running elixir in the directory that contains our *.beam file and specifying an expression to evaluate using the -e flag:


➜ elixir -e "Minimal.Application.start()"
Hello.

Similarly, we could spin up an interactive shell (iex) in the same directory and evaluate the expression ourselves:


iex(1)> Minimal.Application.start()
Hello.

.gitignore


+*.beam
.DS_Store

lib/minimal/application.ex


-IO.puts("Hello.")
+defmodule Minimal.Application do
+  def start do
+    IO.puts("Hello.")
+  end
+end

Incorporating Mix

This is great, but manually managing our *.beam files and bootstrap expressions is a little cumbersome. Not to mention the fact that we haven’t even started working with dependencies yet.

Let’s make our lives easier by incorporating the Mix build tool into our application development process.

We can do that by creating a mix.exs Elixir script file in the root of our project that defines a module that uses Mix.Project and describes our application. We write a project/0 callback in our new MixProject module whose only requirement is to return our application’s name (:minimal) and version ("0.1.0").


def project do
  [
    app: :minimal,
    version: "0.1.0"
  ]
end

While Mix only requires that we return the :app and :version configuration values, it’s worth taking a look at the other configuration options available to us, especially :elixir, :start_permanent, :build_path, and :elixirc_paths.

Next, we need to specify an application/0 callback in our MixProject module that tells Mix which module we want to run when our application fires up.


def application do
  [
    mod: {Minimal.Application, []}
  ]
end

Here we’re pointing it to the Minimal.Application module we wrote previously.

During the normal application startup process, Elixir will call the start/2 function of the module we specify with :normal as the first argument, and whatever we specify ([] in this case) as the second. With that in mind, let’s modify our Minimal.Application.start/2 function to accept those parameters:


def start(:normal, []) do
  IO.puts("Hello.")
  {:ok, self()}
end

Notice that we also changed the return value of start/2 to be an :ok tuple whose second value is a PID. Normally, an application would spin up a supervisor process as its first act of life and return its PID. We’re not doing that yet, so we simply return the current process’ PID.

Once these changes are done, we can run our application with mix or mix run, or fire up an interactive Elixir shell with iex -S mix. No bootstrap expression required!

.gitignore


 *.beam
-.DS_Store
+.DS_Store
+/_build/

lib/minimal/application.ex


 defmodule Minimal.Application do
-  def start do
+  def start(:normal, []) do
     IO.puts("Hello.")
+    {:ok, self()}
   end

mix.exs


+defmodule Minimal.MixProject do
+  use Mix.Project
+
+  def project do
+    [
+      app: :minimal,
+      version: "0.1.0"
+    ]
+  end
+
+  def application do
+    [
+      mod: {Minimal.Application, []}
+    ]
+  end
+end

Pulling in Dependencies

Now that we’ve built a minimum viable Elixir project, let’s turn our attention to the Phoenix framework. The first thing we need to do to incorporate Phoenix into our Elixir project is to install a few dependencies.

We’ll start by adding a deps array to the project/0 callback in our mix.exs file. In deps we’ll list :phoenix, :plug_cowboy, and :jason as dependencies.

By default, Mix stores downloaded dependencies in the deps/ folder at the root of our project. Let’s be sure to add that folder to our .gitignore. Once we’ve done that, we can install our dependencies with mix deps.get.

The reliance on :phoenix makes sense, but why are we already pulling in :plug_cowboy and :jason?

Under the hood, Phoenix uses the Cowboy web server, and Plug to compose functionality on top of our web server. It would make sense that Phoenix relies on :plug_cowboy to bring these two components into our application. If we try to go on with building our application without installing :plug_cowboy, we’ll be greeted with the following errors:

** (UndefinedFunctionError) function Plug.Cowboy.child_spec/1 is undefined (module Plug.Cowboy is not available)
    Plug.Cowboy.child_spec([scheme: :http, plug: {MinimalWeb.Endpoint, []}
    ...

Similarly, Phoenix relies on a JSON serialization library to be installed and configured. Without either :jason or :poison installed, we’d receive the following warning when trying to run our application:

warning: failed to load Jason for Phoenix JSON encoding
(module Jason is not available).

Ensure Jason exists in your deps in mix.exs,
and you have configured Phoenix to use it for JSON encoding by
verifying the following exists in your config/config.exs:

config :phoenix, :json_library, Jason

Heeding that advice, we’ll install :jason and add that configuration line to a new file in our project, config/config.exs.

.gitignore


 /_build/
+/deps/

config/config.exs


+use Mix.Config
+
+config :phoenix, :json_library, Jason

mix.exs


   app: :minimal,
-  version: "0.1.0"
+  version: "0.1.0",
+  deps: [
+    {:jason, "~> 1.0"},
+    {:phoenix, "~> 1.4"},
+    {:plug_cowboy, "~> 2.0"}
+  ]
 ]
 

Introducing the Endpoint

Now that we’ve installed our dependencies on the Phoenix framework and the web server it uses under the hood, it’s time to define how that web server incorporates into our application.

We do this by defining an “endpoint”, which is our application’s interface into the underlying HTTP web server, and our clients’ interface into our web application.

Following Phoenix conventions, we define our endpoint by creating a MinimalWeb.Endpoint module that uses Phoenix.Endpoint and specifies the :name of our OTP application (:minimal):


defmodule MinimalWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :minimal
end

The __using__/1 macro in Phoenix.Endpoint does quite a bit of heavy lifting. Among many other things, it loads the endpoint’s initial configuration, sets up a plug pipeline using Plug.Builder, and defines helper functions to describe our endpoint as an OTP process. If you’re curious about how Phoenix works at a low level, start your search here.

Phoenix.Endpoint uses the value we provide in :otp_app to look up configuration values for our application. Phoenix will complain if we don’t provide a bare minimum configuration entry for our endpoint, so we’ll add that to our config/config.exs file:


config :minimal, MinimalWeb.Endpoint, []

But there are a few configuration values we want to pass to our endpoint, like the host and port we want to serve from. These values are usually environment-dependent, so we’ll add a line at the bottom of our config/config.exs to load another configuration file based on our current environment:


import_config "#{Mix.env()}.exs"

Next, we’ll create a new config/dev.exs file that specifies the :host and :port we’ll serve from during development:


use Mix.Config

config :minimal, MinimalWeb.Endpoint,
  url: [host: "localhost"],
  http: [port: 4000]

If we were to start our application at this point, we’d still be greeted with Hello. printed to the console, rather than a running Phoenix server. We still need to incorporate our Phoenix endpoint into our application.

We do this by turning our Minimal.Application into a proper supervisor and instructing it to load our endpoint as a supervised child:


use Application

def start(:normal, []) do
  Supervisor.start_link(
    [
      MinimalWeb.Endpoint
    ],
    strategy: :one_for_one
  )
end

Once we’ve done that, we can fire up our application using mix phx.server or iex -S mix phx.server and see that our endpoint is listening on localhost port 4000.

Alternatively, if you want to use our old standby of mix run, either configure Phoenix to serve all endpoints on startup, which is what mix phx.server does under the hood:


config :phoenix, :serve_endpoints, true

Or configure your application’s endpoint specifically:


config :minimal, MinimalWeb.Endpoint, server: true

config/config.exs


+config :minimal, MinimalWeb.Endpoint, []
+
 config :phoenix, :json_library, Jason
+
+import_config "#{Mix.env()}.exs"

config/dev.exs


+use Mix.Config
+
+config :minimal, MinimalWeb.Endpoint,
+  url: [host: "localhost"],
+  http: [port: 4000]

lib/minimal/application.ex


 defmodule Minimal.Application do
+  use Application
+
   def start(:normal, []) do
-    IO.puts("Hello.")
-    {:ok, self()}
+    Supervisor.start_link(
+      [
+        MinimalWeb.Endpoint
+      ],
+      strategy: :one_for_one
+    )
   end
 

lib/minimal_web/endpoint.ex


+defmodule MinimalWeb.Endpoint do
+  use Phoenix.Endpoint, otp_app: :minimal
+end

Adding a Route

Our Phoenix endpoint is now listening for inbound HTTP requests, but this doesn’t do us much good if we’re not serving any content!

The first step in serving content from a Phoenix application is to configure our router. A router maps requests sent to a route, or path on your web server, to a specific module and function. That function’s job is to handle the request and return a response.

We can add a route to our application by making a new module, MinimalWeb.Router, that uses Phoenix.Router:


defmodule MinimalWeb.Router do
  use Phoenix.Router
end

And we can instruct our MinimalWeb.Endpoint to use our new router:


plug(MinimalWeb.Router)

The Phoenix.Router module generates a handful of helpful macros, like match, get, post, etc… and configures itself as a module-based plug. This is the reason we can seamlessly incorporate it into our endpoint using the plug macro.

Now that our router is wired into our endpoint, let’s add a route to our application:


get("/", MinimalWeb.HomeController, :index)

Here we’re instructing Phoenix to send any HTTP GET requests for / to the index/2 function in our MinimalWeb.HomeController “controller” module.

Our MinimalWeb.HomeController module needs to use Phoenix.Controller and provide our MinimalWeb module as a :namespace configuration option:


defmodule MinimalWeb.HomeController do
  use Phoenix.Controller, namespace: MinimalWeb
end

Phoenix.Controller, like Phoenix.Endpoint and Phoenix.Router, does quite a bit. It establishes itself as a plug by using Phoenix.Controller.Pipeline, and it uses the :namespace module we provide to do some automatic layout and view module detection.

Because our controller module is essentially a glorified plug, we can expect Phoenix to pass conn as the first argument to our specified controller function, and any user-provided parameters as the second argument. Just like any other plug’s call/2 function, our index/2 should return our (potentially modified) conn:


def index(conn, _params) do
  conn
end

But returning an unmodified conn like this is essentially a no-op.

Let’s spice things up a bit and return a simple HTML response to the requester. The simplest way of doing that is to use Phoenix’s built-in Phoenix.Controller.html/2 function, which takes our conn as its first argument, and the HTML we want to send back to the client as the second:


Phoenix.Controller.html(conn, """
  Hello.
""")

If we dig into html/2, we’ll find that it’s using Plug’s built-in Plug.Conn.send_resp/3 function:


Plug.Conn.send_resp(conn, 200, """
  Hello.
""")

And ultimately send_resp/3 is just modifying our conn structure directly:


%{
  conn
  | status: 200,
    resp_body: """
      Hello.
    """,
    state: :set
}

These three expressions are identical, and we can use whichever one we choose to return our HTML fragment from our controller. For now, we’ll follow best practices and stick with Phoenix’s html/2 helper function.

lib/minimal_web/controllers/home_controller.ex


+defmodule MinimalWeb.HomeController do
+  use Phoenix.Controller, namespace: MinimalWeb
+
+  def index(conn, _params) do
+    Phoenix.Controller.html(conn, """
+      Hello.
+    """)
+  end
+end

lib/minimal_web/endpoint.ex


   use Phoenix.Endpoint, otp_app: :minimal
+
+  plug(MinimalWeb.Router)
 end
 

lib/minimal_web/router.ex


+defmodule MinimalWeb.Router do
+  use Phoenix.Router
+
+  get("/", MinimalWeb.HomeController, :index)
+end

Handling Errors

Our Phoenix-based web application is now successfully serving content from the / route. If we navigate to http://localhost:4000/, we’ll be greeted by our friendly HomeController.

But behind the scenes, we’re having issues. Our browser automatically requests the /favicon.ico asset from our server, and having no idea how to respond to a request for an asset that doesn’t exist, Phoenix kills the request process and automatically returns a 500 HTTP status code.

We need a way of handling requests for missing content.

Thankfully, the stack trace Phoenix gave us when it killed the request process gives us a hint for how to do this:

Request: GET /favicon.ico
  ** (exit) an exception was raised:
    ** (UndefinedFunctionError) function MinimalWeb.ErrorView.render/2 is undefined (module MinimalWeb.ErrorView is not available)
        MinimalWeb.ErrorView.render("404.html", %{conn: ...

Phoenix is attempting to call MinimalWeb.ErrorView.render/2 with "404.html" as the first argument and our request’s conn as the second, and is finding that the module and function don’t exist.

Let’s fix that:


defmodule MinimalWeb.ErrorView do
  def render("404.html", _assigns) do
    "Not Found"
  end
end

Our render/2 function is a view, not a controller, so we just have to return the content we want to render in our response, not the conn itself. That said, the distinctions between views and controllers may be outside the scope of building a “minimum viable Phoenix application,” so we’ll skim over that for now.

Be sure to read more about the ErrorView module, and how it incorporates into our application’s endpoint. Also note that the module called to render errors is customizable through the :render_errors configuration option.

lib/minimal_web/views/error_view.ex


+defmodule MinimalWeb.ErrorView do
+  def render("404.html", _assigns) do
+    "Not Found"
+  end
+end

Final Thoughts

So there we have it. A “minimum viable” Phoenix application. It’s probably worth pointing out that we’re using the phrase “minimum viable” loosely here. I’m sure there are people who can come up with more “minimal” Phoenix applications. Similarly, I’m sure there are concepts and tools that I left out, like views and templates, that would cause people to argue that this example is too minimal.

The idea was to explore the Phoenix framework from the ground up, building each of the requisite components ourselves, without relying on automatically generated boilerplate. I’d like to think we accomplished that goal.

I’ve certainly learned a thing or two!

If there’s one thing I’ve taken away from this process, it’s that there is no magic behind Phoenix. Everything it’s doing can be understood with a little familiarity with the Phoenix codebase, a healthy understanding of Elixir metaprogramming, and a little knowledge about Plug.

May 19, 2019

Ponylang (SeanTAllen)

Last Week in Pony - May 19, 2019 May 19, 2019 08:47 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

May 18, 2019

Andreas Zwinkau (qznc)

Companies are AI May 18, 2019 12:00 AM

Depending on the definition of intelligence, companies are intelligent beings.

Read full article!

May 17, 2019

Derek Jones (derek-jones)

Background checks on pointer values being considered for C May 17, 2019 06:58 PM

DR 260 is a defect report submitted to WG14, the C Standards committee, in 2001. It was never resolved, was then generally ignored for 10 years, caught the attention of a research group a few years ago, and is now back on WG14’s agenda. The following discussion covers two of the three questions raised in the DR.

Consider the following fragment of code:

int *p, *q;

    p = malloc (sizeof (int)); assert (p != NULL);  // Line A
    (free)(p);                                      // Line B
    // more code
    q = malloc (sizeof (int)); assert (q != NULL);  // Line C
    if (memcmp (&p, &q, sizeof p) == 0)             // Line D
       {*p = 42;                                    // Line E
        *q = 43;}                                   // Line F

Section 6.2.4p2 of the C Standard says:
“The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime.”

The call to free, on line B, ends the lifetime of the storage (allocated on line A) pointed to by p.

There are two proposed interpretations of the sentence, in 6.2.4p2.

  1. “becomes indeterminate” is treated as effectively storing a value in the pointer, i.e., some bit pattern denoting an indeterminate value. This interpretation requires that any other variables that had been assigned p‘s value, prior to the free, also have an indeterminate value stored into them,
  2. the value held in the pointer is to be treated as an indeterminate value (for instance, a memory management unit may prevent any access to the corresponding storage).

What are the practical implications of the two options?

The call to malloc, on line C, could return a pointer to a location that is identical to the pointer returned by the first call to malloc, i.e., the second call might immediately reuse the free’d storage.

Effectively storing a value in the pointer, in response to the call to free, means the subsequent call to memcmp would always return a non-zero value, and the questions raised below would not apply; it would be a nightmare to implement, especially in a multi-process environment.

If the sentence in section 6.2.4p2 is interpreted as treating the pointer value as indeterminate, then the definition of malloc needs to be updated to specify that all returned values are determinate, i.e., any indeterminacy that may exist gets removed before a value is returned (the memory management unit must allow read/write access to the storage).

The memcmp, on line D, does a byte-wise compare of the pointer values (a byte-wise compare side-steps indeterminate value issues). If the comparison is exact, an assignment is made via p, line E, and via q, line F.

Does the assignment via p result in undefined behavior, or is the conformance status of the code unaffected by its presence?

Nobody is impugning the conformance status of the assignment via q, on line F.

There are people who think that the assignment via p, on line E, should be treated as undefined behavior, despite the fact that the values of p and q are byte-wise identical. When this issue was first raised (by those trouble makers in the UK ;-), yours truly was less than enthusiastic, but there were enough knowledgeable people in the opposing camp to keep the ball rolling for a while.

The underlying issue some people have with some subsequent uses of p is its provenance, the activities it has previously been associated with.

Provenance can be included in the analysis process by associating a unique number with the address of every object, at the start of its lifetime; these p-numbers are not reused.

The value returned by the call to malloc, on line A, would include a pointer to the allocated storage, plus an associated p-number; the call on line C could return a pointer having the same value, but its p-number is required to be different. Implementations are not required to allocate any storage for p-numbers, treating them purely as conceptual quantities. Your author knows of two implementations that do allocate storage for p-numbers (in a private area), and track usage of p-numbers: the Model Implementation C Checker, which was validated as handling all of C90, and Cerberus, which handles a substantial subset of C11. I don’t believe that the other tools that check array bounds and use-after-free are based on provenance (corrections welcome).

If provenance is included as part of a pointer’s value, the behavior of operators needs to be expanded to handle the p-number (conceptual or not) component of a pointer.

The rules might specify that p-numbers are conceptually compared by the call to memcmp, on line D; hence p and q are considered to never compare equal. There is an existing practice of regarding byte compares as just that, i.e., no magic ever occurs when comparing bytes (otherwise known as objects having type unsigned char).

Having p-numbers be invisible to memcmp would be consistent with existing practice. The pointer indirection operation on line E (generating undefined behavior) is where p-numbers get involved and cause the undefined behavior to occur.
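To make the conceptual p-number mechanism concrete, here is a toy model in Python (purely illustrative; it is not how the Model Implementation C Checker or Cerberus are implemented): each allocation gets a fresh p-number, byte-wise address comparison ignores it, and only dereferencing checks it.

```python
# Toy provenance model: a pointer value is (address, p-number).
next_pnum = 0
live = {}  # address -> p-number of the currently live allocation there

def c_malloc(addr):
    """Allocate at a fixed address and attach a fresh p-number."""
    global next_pnum
    next_pnum += 1
    live[addr] = next_pnum
    return (addr, next_pnum)

def c_free(ptr):
    """End the lifetime of the allocation this pointer refers to."""
    addr, _ = ptr
    del live[addr]

def deref_ok(ptr):
    """Dereferencing checks provenance; a stale p-number is undefined behavior."""
    addr, pnum = ptr
    return live.get(addr) == pnum

p = c_malloc(0x1000)
c_free(p)
q = c_malloc(0x1000)      # storage reused: same address, new p-number
assert p[0] == q[0]       # byte-wise identical addresses (memcmp sees no difference)
assert deref_ok(q) and not deref_ok(p)  # only the indirection via p is flagged
```

Under this model memcmp on the addresses compares equal, while the dereference via p is the operation where the stale p-number surfaces, matching the behavior described above.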

There are other situations where pointer values, that were once indeterminate, can appear to become ‘respectable’.

For a variable, defined in a function, “… its lifetime extends from entry into the block with which it is associated until execution of that block ends in any way.”; section 6.2.4p3.

In the following code:

int x;
static int *p=&x;

void f(int n)
{
   int *q = &n;
   if (memcmp (&p, &q, sizeof p) == 0)
      *p = 0;
   p = &n; // assign an address that will soon cease to exist.
} // Lifetime of pointed to object, n, terminates here

int main(void)
{
   f(1); // after this call, p has an indeterminate value
   f(2);
}

the pointer p has an indeterminate value after any call to f returns.

In many implementations, the second call to f will result in n having the same address it had on the first call, and memcmp will return zero.

Again, there are people who have an issue with the assignment involving p, because of its provenance.

One proposal to include provenance contains substantial changes to existing wording in the C Standard. The rationale for these proposals looks more like a desire to change wording to make things clearer for those making the change, than a desire to address DR 260. Everybody thinks their proposed changes make the wording clearer (including yours truly); such claims are just marketing puff (and self-delusion); confirmation from the results of an A/B test would add substance to such claims.

It is probably possible to explicitly include support for provenance by making a small number of changes to existing wording.

Is the cost of supporting provenance (i.e., changing existing wording may introduce defects into the standard, the greater the amount of change the greater the likelihood of introducing defects), worth the benefits?

What are the benefits of introducing provenance?

Provenance makes it possible to easily specify that the uses of p, in the two previous examples (and a third given in DR 260), are undefined behavior (if that is WG14’s final decision).

Provenance also provides a model that might make it easier to reason about programs; it’s difficult to say one way or the other, without knowing what the model is.

Supporters claim that provenance would enable tool vendors to flag various snippets of code as suspicious. Tool vendors can already do this, they don’t need permission from the C Standard to flag anything they fancy.

The C Standard requires a conforming implementation to diagnose certain constructs. A conforming implementation can issue as many messages as it likes, for any other construct, e.g., for line A in the first example, a compiler might print “This is the 1,000,000’th call to malloc I have translated, ring this number to claim your prize!”

Before any changes are made to wording in the C Standard, WG14 needs to decide what the behavior should be for these examples; it could decide to continue ignoring them for another 20 years.

Once a decision is made, the next question is how to update wording in the standard to specify the behavior that has been decided on.

While provenance is an interesting idea, the benefits it provides appear to be not worth the cost of changing the C Standard.

Indrek Lasn (indreklasn)

Not yet ;) May 17, 2019 03:38 PM

Not yet ;)

I wholeheartedly agreed with this. May 17, 2019 03:34 PM

I wholeheartedly agreed with this.

May 14, 2019

Derek Jones (derek-jones)

A prisoner’s dilemma when agreeing to a management schedule May 14, 2019 11:52 PM

Two software developers, both looking for promotion/pay-rise by gaining favorable management reviews, are regularly given projects to complete by a date specified by management; the project schedules are sometimes unachievable, with probability p.

Let’s assume that both developers are simultaneously given a project, and the corresponding schedule. If the specified schedule is unachievable, High quality work can only be performed by asking for more time, otherwise performing Low quality work is the only way of meeting the schedule.

If either developer faces an unachievable deadline, they have to immediately decide whether to produce High or Low quality work. A High quality decision requires that they ask management for more time, and incur a penalty they perceive to be C (saying they cannot meet the specified schedule makes them feel less worthy of a promotion/pay-rise); a Low quality decision is perceived to be likely to incur a penalty of Q_1 (because of its possible downstream impact on project completion), if one developer chooses Low, and Q_2, if both developers choose Low. It is assumed that: Q_1 < Q_2 < C.

This is a prisoner’s dilemma problem. The following mathematical results are taken from: “The Effects of Time Pressure on Quality in Software Development: An Agency Model”, by Robert D. Austin (cannot find a downloadable pdf).

There are two Nash equilibria for the decision made by the two developers: Low-Low and High-High (i.e., both perform Low quality work, or both perform High quality work). Low-High is not a stable equilibrium, in that on the next iteration the two developers may switch their decisions.

High-High is a pure strategy (i.e., always use it), when: 1 - Q_1/C <= p

High-High is Pareto superior to Low-Low when: 1 - Q_2/(C - Q_1 + Q_2) < p < 1 - Q_1/C
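As a rough illustration, the two conditions can be checked by plugging in hypothetical penalty values (these numbers are made up for the example, not taken from Austin’s paper):

```python
# Hypothetical penalty values, chosen only to satisfy Q1 < Q2 < C.
Q1, Q2, C = 1.0, 2.0, 5.0

def high_high_pure(p):
    """High-High is a pure strategy when 1 - Q1/C <= p."""
    return 1 - Q1 / C <= p

def high_high_pareto_superior(p):
    """High-High is Pareto superior to Low-Low when
    1 - Q2/(C - Q1 + Q2) < p < 1 - Q1/C."""
    return 1 - Q2 / (C - Q1 + Q2) < p < 1 - Q1 / C

# With these values the thresholds work out to 0.8 and 2/3 respectively.
assert high_high_pure(0.9) and not high_high_pure(0.5)
assert high_high_pareto_superior(0.75) and not high_high_pareto_superior(0.9)
```

With these particular values, High-High is only sustainable as a pure strategy when schedules are unachievable at least 80% of the time, which hints at why the equilibrium is hard to reach in practice.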

How might management use this analysis to increase the likelihood that a High-High quality decision is made?

Evidence shows that 50% of developer estimates of task effort underestimate the actual effort; there is sufficient uncertainty in software development that the likelihood of consistently producing accurate estimates is low (i.e., p is a very fuzzy quantity). Managers wanting to increase the likelihood of a High-High decision could be generous when setting deadlines (e.g., multiply developer estimates by 200% when setting the deadline for delivery), but managers are often under pressure from customers to specify aggressively short deadlines.

The penalty for a developer admitting that they cannot deliver by the specified schedule, C, could be set very low (e.g., by management not taking this factor into account when deciding developer promotion/pay-rise). But this might encourage developers to always give this response. If all developers mutually agreed to cooperate, to always give this response, none of them would lose relative to the others; but there is an incentive for the more capable developers to defect, and the less capable developers to want to use this strategy.

Regular code reviews are a possible technique for motivating High-High, by increasing the likelihood of any lone Low decision being detected. A Low-Low decision may go unreported by those involved.

To summarise: an interesting analysis that appears to have no practical use, because reasonable estimates of the values of the variables involved are unavailable.

May 13, 2019

Simon Zelazny (pzel)

How I learned to never match on os:cmd output May 13, 2019 10:00 PM

A late change in requirements from a customer had me scrambling to switch an HDFS connector script from a Python program to the standard Hadoop tool hdfs.

The application that was launching the connector script was written in Erlang, and was responsible for uploading some files to an HDFS endpoint, like so:

UploadCmd = lists:flatten(io_lib:format("hdfs put ~p ~p", [Here, There])),
"" = os:cmd(UploadCmd),

This was all fine and dandy when the UploadCmd was implemented in full by me. When I switched out the Python script for the hdfs command, all my tests continued to work, and the data was indeed being written successfully to my local test hdfs node. So off to production it went.

Several hours later I got notified that there were some problems with the new code. After inspecting the logs, it became clear that the hdfs command was producing unexpected output (WARN: blah blah took longer than expected (..)) and causing the Erlang program to treat the upload operation as failed.

As is the case for reasonable Erlang applications, the writing process would crash upon a failed match, then restart and attempt to continue where it left off — by trying to upload Here to There. Now, this operation kept legitimately failing, because it had in fact succeeded the first time, and HDFS would not allow us to overwrite There (unless we added a -f flag to put).

The solution

The quick-and-dirty solution was to wrap the UploadCmd in a script that captured the exit code, and then printed it out at the end, like so:

sh -c '{UploadCmd}; RES=$?; echo; echo $RES'

Now, your Erlang code can match on the last line of the output and interpret it as an integer exit code. Not the most elegant of solutions, but elegant enough to work around os:cmd/1's blindness to exit codes.
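The same wrapper-and-parse idea can be sketched in Python (a hypothetical helper, not the article's Erlang code), which makes the parsing step explicit:

```python
import subprocess

def run_with_exit_code(cmd):
    # Wrap cmd so its exit code is echoed as the last line of output;
    # the extra blank echo guarantees the code lands on its own line
    # even if cmd's output lacks a trailing newline.
    wrapped = cmd + "; RES=$?; echo; echo $RES"
    out = subprocess.run(["sh", "-c", wrapped],
                         capture_output=True, text=True).stdout
    lines = out.rstrip("\n").split("\n")
    # Last line is the exit code; everything before it is the real output.
    return int(lines[-1]), "\n".join(lines[:-1]).rstrip("\n")

code, output = run_with_exit_code("echo 'WARN: noisy dependency'")
assert code == 0 and output == "WARN: noisy dependency"
```

The noisy WARN line no longer matters: success or failure is decided by the exit code, not by whether the output happened to be empty.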

Lesson learned

The UNIX way states that programs should be silent on success and vocal on error. Sadly, many applications don't follow the UNIX way, and the bigger the application at hand, the higher the probability that one of its dependencies will use STDOUT or STDERR as its own personal scratchpad.

My lesson: never rely on os:cmd/1 output in production code, unless the command you're running is fully under your control, and you can be certain that its outputs are completely and exhaustively specified by you.

I do heavily rely on os:cmd output in test code, and I have no intention of stopping. Early feedback about unexpected output is great in tests.

Indrek Lasn (indreklasn)

How to setup continuous integration (CI) with React, CircleCI, and GitHub May 13, 2019 01:07 PM

To ensure the highest grade of quality code, we need to run multiple checks on each commit/pull request. Running code checks is especially useful when working in a team and making sure everyone follows the best and latest practices.

What kind of checks are we talking about? For starters, running our unit tests to make sure everything passes, building and bundling our frontend to make sure the build won’t fail on production, and running our linters to enforce a standard.

At my current company, we run many checks before any code can be committed to the repository.

Code checks at Newly

CI lets us run code checks automatically. Who wants to run all those commands before pushing code to the repository?

Getting started

I’ve chosen CircleCI due to its generous free tier, Github thanks to its community, and React since it’s easy and fun to use.

Create React App

Create your React app however you like. For simplicity’s sake, I’m using CRA.

Creating Github repository

Once you’re finished with CRA, push the code to your Github repository.

Setting up CI with CircleCI

If you already have a CircleCI account, great! If not, make one here.

Once you’ve logged in, click on “Add Projects”

Adding a Project to CircleCI

Find your repository and click “Set Up Project”

Setting up a project

Now we should see instructions.

Installation instructions

Simple enough, let’s create a folder called .circleci and place the config.yml inside the folder.

CircleCI config.yml

We specify the CircleCI version, orbs, and workflows. Orbs are shareable configuration packages for your builds. A workflow is a set of rules for defining a collection of jobs and their run order.

Push the code to your repository

Start building

Head back to CircleCI and press “Start building”

STAAAART BUILDIN! :D

Build succeeded

If you click on the build, you can monitor what actually happened. For this case, the welcome orb is a demo and doesn’t do much.

Setting up our CircleCI with React

Use the config.yml setup to run test, lint, and build checks with React.

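For reference, a hedged sketch of what such a config.yml might look like; the Docker image tag and npm script names here are assumptions, not taken from the article's embedded gist:

```yaml
# Sketch only: assumes package.json defines "lint", "test", and "build" scripts.
version: 2.1
jobs:
  checks:
    docker:
      - image: cimg/node:lts
    steps:
      - checkout
      - run: npm ci
      - run: npm run lint                    # eslint formatting check
      - run: npm test -- --watchAll=false    # unit tests, non-interactive
      - run: npm run build                   # production build check
workflows:
  commit-checks:
    jobs:
      - checks
```

Each commit triggers the workflow, and the commit only goes green if every step succeeds.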

After you’ve pushed this code, give the orb the permissions it needs.

Settings -> Security -> Yes, allow orbs

Now each commit/PR runs the workflow jobs.

Check CircleCI for the progress of jobs. Here’s what CircleCI is doing for each commit:

  • Sets up the React project
  • Runs eslint to check the formatting of the code
  • Runs unit tests
  • Runs test coverage

All of the above workflow jobs have to succeed for the commit and build to be successful.

Now each commit has a green, red or yellow tick indicating the status! Handy.

You can find the demo repository here:

wesharehoodies/circleci-react-example

Thanks for reading, check out my Twitter for more.

Indrek Lasn (@lasnindrek) | Twitter

Here are some of my previous articles you might enjoy:


How to setup continuous integration (CI) with React, CircleCI, and GitHub was originally published in freeCodeCamp.org on Medium, where people are continuing the conversation by highlighting and responding to this story.

Pete Corey (petecorey)

Is My Apollo Client Connected to the Server? May 13, 2019 12:00 AM

When you’re building a real-time, subscription-heavy front-end application, it can be useful to know if your client is actively connected to the server. If that connection is broken, maybe because the server is temporarily down for maintenance, we’d like to be able to show a message explaining the situation to the user. Once we re-establish our connection, we’d like to hide that message and go back to business as usual.

That’s the dream, at least. Trying to implement this functionality using Apollo turned out to be more trouble than we expected on a recent client project.

Let’s go over a few of the solutions we tried that didn’t solve the problem, for various reasons, and then let’s go over the final working solution we came up with. Ultimately, I’m happy with what we landed on, but I didn’t expect to uncover so many roadblocks along the way.

What Didn’t Work

Our first attempt was to build a component that polled for an online query on the server. If the query ever failed with an error on the client, we’d show a “disconnected” message to the user. Presumably, once the connection to the server was re-established, the error would clear, and we’d re-render the children of our component:


const Connected = props => {
  return (
    <Query query={gql`{ online }`} pollInterval={5000}>
      {({error, loading}) => {
        if (loading) {
            return <Loader/>;
        }
        else if (error) {
            return <Message/>;
        }
        else {
            return props.children;
        }
      }}
    </Query>
  );
}

Unfortunately, our assumptions didn’t hold up. Apparently when a query fails, Apollo (react-apollo@2.5.5) will stop polling on that failing query, stopping our connectivity checker dead in its tracks.

NOTE: Apparently, this should work, and in various simplified reproductions I built while writing this article, it did work. Here are various issues and pull requests documenting the problem, merging in fixes (which others claim don’t work), and documenting workarounds:


We thought, “well, if polling is turned off on error, let’s just turn it back on!” Our next attempt used startPolling to try restarting our periodic heartbeat query.


if (error) {
  startPolling(5000);
}

No dice.

Our component successfully restarts polling and carries on refetching our query, but the Query component returns values for both data and error, along with a networkStatus of 8, which indicates that “one or more errors were detected.”

If a query returns both an error and data, how are we to know which to trust? Was the query successful? Or was there an error?

We also tried to implement our own polling system with various combinations of setTimeout and setInterval. Ultimately, none of these solutions seemed to work because Apollo was returning both error and data for queries, once the server had recovered.

NOTE: This should also work, though it would be unnecessary, if it weren’t for the issues mentioned above.


Lastly, we considered leveraging subscriptions to build our connectivity detection system. We wrote a online subscription which pushes a timestamp down to the client every five seconds. Our component subscribes to this publication… And then what?

We’d need to set up another five second interval on the client that flips into an error state if it hasn’t seen a heartbeat in the last interval.

But once again, once our connection to the server is re-established, our subscription won’t re-instantiate in a sane way, and our client will be stuck showing a stale disconnected message.

What Did Work

We decided to go a different route and implemented a solution that leverages the SubscriptionClient lifecycle and Apollo’s client-side query functionality.

At a high level, we store our online boolean in Apollo’s client-side cache, and update this value whenever Apollo detects that a WebSocket connection has been disconnected or reconnected. Because we store online in the cache, our Apollo components can easily query for its value.

Starting things off, we added a purely client-side online query that returns a Boolean!, and a resolver that defaults to being “offline”:


const resolvers = {
    Query: { online: () => false }
};

const typeDefs = gql`
  extend type Query {
    online: Boolean!
  }
`;

const apolloClient = new ApolloClient({
  ...
  typeDefs,
  resolvers
});

Next we refactored our Connected component to query for the value of online from the cache:


const Connected = props => {
  return (
    <Query query={gql`{ online @client }`}>
      {({error, loading}) => {
        if (loading) {
            return <Loader/>;
        }
        else if (error) {
            return <Message/>;
        }
        else {
            return props.children;
        }
      }}
    </Query>
  );
}

Notice that we’re not polling on this query. Any time we update our online value in the cache, Apollo knows to re-render this component with the new value.

Next, while setting up our SubscriptionClient and WebSocketLink, we added a few hooks to detect when our client is connected, disconnected, and later reconnected to the server. In each of those cases, we write the appropriate value of online to our cache:


subscriptionClient.onConnected(() =>
    apolloClient.writeData({ data: { online: true } })
);

subscriptionClient.onReconnected(() =>
    apolloClient.writeData({ data: { online: true } })
);

subscriptionClient.onDisconnected(() =>
    apolloClient.writeData({ data: { online: false } })
);

And that’s all there is to it!

Any time our SubscriptionClient detects that it’s disconnected from the server, we write online: false into our cache, and any time we connect or reconnect, we write online: true. Our component picks up each of these changes and shows a corresponding message to the user.

Huge thanks to this StackOverflow comment for pointing us in the right direction.

May 12, 2019

Ponylang (SeanTAllen)

Last Week in Pony - May 12, 2019 May 12, 2019 08:57 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

Pages From The Fire (kghose)

Mixins or composition? May 12, 2019 12:14 AM

Mixins are great for “horizontal scaling” by adding functionality to a class over time. Reading mixed in code has an element of “gotcha” because the methods are scattered over multiple classes. Composition is great for handling complex functionality by insulating individual parts into their own classes and just exposing the bare interface to each other …

May 10, 2019

Carlos Fenollosa (carlesfe)

What are the differences between OpenBSD and Linux? May 10, 2019 09:32 AM

Maybe you have been reading recently about the release of OpenBSD 6.5 and wonder, "What are the differences between Linux and OpenBSD?"

I've also been there at some point in the past and these are my conclusions.

They also apply, to some extent, to other BSDs. However, an important disclaimer applies to this article.

This list is aimed at people who are used to Linux and are curious about OpenBSD. It is written to highlight the most important changes from their perspective, not the absolute most important changes from a technical standpoint.

Please bear with me.

A terminal is a terminal is a terminal

The first thing to realize is that, on the surface, the changes are minimal. Both are UNIX-like. You get a terminal, X windows, Firefox, Libreoffice...

Most free software can be recompiled, though some proprietary software isn't available for OpenBSD. Don't expect any visual changes. Indeed, the difference between KDE and GNOME on Linux is bigger than the difference between KDE on Linux and KDE on OpenBSD.

Under the hood, there are some BIG differences with relatively little practical impact:

  • BSD licensing vs GNU licensing
  • "Whole OS" model where some base packages are treated as first-class citizens with the kernel, VS bare Kernel + everything is 3rd party
  • Documentation is considered as important as code VS good luck with Stack Overflow and reading mailing lists
  • Whenever a decision has to be made, security and correctness is prioritized VS general-purpose and popularity and efficiency

Do these make little sense to you? I know, it's difficult to fully grasp. Your reference point is "Windows VS Linux", two systems so different in so many aspects that comparing them is like comparing an elephant with a sparrow. To the untrained eye, telling a pigeon from a turtledove is not so easy.

They're philosophical distinctions whose ramifications are not immediately visible. They can't be explained; you need to understand them through usage. That's why the typical recommendation is "just try OpenBSD and see"

Practical differences

So, what are some of the actual, tangible, practical differences?

Not many, really. Some are "features" and some are "undesired" side effects. With every decision there is a trade-off. Let's see some of them.

First of all, OpenBSD is a simpler system. It's very comfortable for sysadmins. All pieces are glued together following the UNIX philosophy, focusing on simplicity. Not sure what this means? Think rc VS systemd. This cannot be overstated: many people are attracted to OpenBSD in the first place because it's much more minimal than Linux and even FreeBSD.

OpenBSD also has excellent man pages with practical examples. Use man. Really.

The base system prefers different default daemons/servers/defaults than Linux.

  • apache/nginx: httpd
  • postfix/sendmail: opensmtpd
  • ntp: openntpd
  • bash: ksh

Are these alternatives better or worse? Well, these cover 90% of the use cases, while being robust and simpler to admin. Think: "knowing what we know today about email, how would we write a modern mail server from scratch, without all the old cruft?"

Voilà, OpenSMTPd.

The same goes for the rest, and there are more projects on the way (openssl -> libressl)

Security and system administration

W^X, ipsec, ASLR, kernel relinking, RETGUARD, pledge, unveil, etc.

Do these sound familiar? Most were OpenBSD innovations which trickled down to the rest of the unices.

"Does this mean that OpenBSD is more secure than Linux?"

I'd say it's different but equivalent, but OpenBSD's security approach is more robust over time.

System administration and package upgrading is a bit different, but equivalent too, at least on x86. If you use a different arch, you'll need to recompile OpenBSD stuff from time to time.

"But Carlos, you haven't yet told me a single feature which is relevant for my day to day use!"

That's because there is probably none. There are very few things OpenBSD does that Linux does not.

However, what they do, they do better. Is that important for you?

Why philosophical differences matter

Let's jump to some of the not-so-nice ramifications of OpenBSD's philosophy:

Most closed-source Linux software does not work: skype, slack, etc. If that's important for you, use the equivalent web apps, or try FreeBSD, which has a Linux compatibility layer

Some Linux-kernel-specific software does not work either. Namely, docker.

The same for drivers: OpenBSD has excellent drivers, but a smaller number of them. You need to choose your hardware carefully. Hint: choose a Thinkpad

This includes compatibility drivers: modern/3rd party filesystems, for example, are not so well supported.

Because of the focus on security and simplicity, and not on speed or optimizations, software runs a bit slower than on Linux. In my experience (and in some benchmarks) about 10%-20% slower.

Battery life on laptops is also affected. My x230 can run for 5 hours on Linux, 3:30 on OpenBSD. More modern laptops and bigger batteries are a practical solution for most of the people.

So what do I choose?

"Are you telling me that the positives are intangible and the negatives mean a slower system and less software overall?"

At the risk of being technically wrong, but with the goal of empathizing with the Linux user, I'll say yes.

But think about what attracted you to Linux in the first place. It was not a faster computer, more driver availability or more software than Windows. It was probably a sense of freedom, the promise of a more robust, more secure, more private system.

OpenBSD is just the next step on that ladder.

In reality: it means that the intangibles are intangible for you, at this point in time. For other people, these features are what draws them to OpenBSD. For me, the system architecture, philosophy, and administration is 10x better than Linux's.

Let me turn the question around: can you live with these drawbacks if it means you will get a more robust, easier to admin, simpler system?

Now you're thinking: "Maybe Linux is a good tradeoff between freedom, software availability, and newbie friendliness". And, for most people, that can be the case. Hey, I use Linux too. I'm just opening another door for you.

How to try OpenBSD

So what, did I pique your interest? Are you just going to close this browser tab without trying? Go ahead and spin up a VM or install OpenBSD on an old machine and see for yourself.

Life isn't black or white. Maybe OpenBSD can not be your daily OS, but it can be your "travel-laptop OS". Honestly, I know very few people that use OpenBSD as their only system.

That is my case, for example. My daily driver is OSX, not Linux, because I need to use MS Office and other software which is Windows or Mac only for work.

However, when I arrive home, I switch to OpenBSD on my x230. I enjoy using OpenBSD much more than OSX these days.

What are you waiting for? Download OpenBSD and learn what all the fuss is about!

Tags: openbsd, unix

Comments? Tweet

Stig Brautaset (stig)

Learning Guitar Update May 10, 2019 08:53 AM

I try to keep myself honest--and on target!--by posting an update on my guitar learning journey.

May 09, 2019

Frederik Braun (freddyb)

Chrome switching the XSSAuditor to filter mode re-enables old attack May 09, 2019 10:00 PM

Recently, Google Chrome changed the default mode for their Cross-Site Scripting filter XSSAuditor from block to filter. This means that instead of blocking the page load completely, XSSAuditor will now continue rendering the page but modify the bits that have been detected as an XSS issue.

In this blog post, I will argue that the filter mode is a dangerous approach by re-stating the arguments from the whitepaper titled X-Frame-Options: All about Clickjacking? that I co-authored with Mario Heiderich in 2013.

After that, I will elaborate on XSSAuditor's other shortcomings and revisit the history of back-and-forth in its default settings. In the end, I hope to convince you that XSSAuditor's contribution is not just negligible but actually negative, and that it should therefore be removed completely.


JavaScript à la Carte

When you allow websites to frame you, you basically give them full permission to decide which parts of your very own JavaScript can be executed and which cannot. That sounds crazy, right? So, let’s say you have three script blocks on your website. The website that frames you doesn’t mind two of them - but really hates the third one: maybe a framebuster, maybe some other script relevant for security purposes. So the website that frames you just turns that one script block off - and leaves the other two intact. Now how does that work?

Well, it’s easy. All the framing website is doing is using the browser’s XSS filter to selectively kill JavaScript on your page. This worked in IE some years ago and no longer does - but it still works perfectly fine in Chrome. Let’s have a look at an annotated code example.

Here is the evil website, framing your website on example.com and sending something that looks like an attempt to XSS you! Only that you don’t have any XSS bugs. The injection is fake - and resembles a part of the JavaScript that you actually use on your site:

<iframe src="//example.com/index.php?code=%3Cscript%20src=%22/js/security-libraries.js%22%3E%3C/script%3E"></iframe>

Now we have your website. The content of the code parameter above is part of your website anyway - no injection here, just a match between URL and site content:

<!doctype html>
<h1>HELLO</h1>
<script src="/js/security-libraries.js"></script>
<script>
// assumes that the libraries are included
</script>

The effect is compelling. The load of the security libraries will be blocked by Chrome’s XSS Auditor, violating the assumption in the following script block, which will run as usual.

Existing and Future Countermeasures

So, as we have seen, defaulting to filter was a bad decision, and it can be overridden with the X-XSS-Protection: 1; mode=block header. You could also disallow websites from putting you in an iframe with X-Frame-Options: DENY, but that still leaves an attack vector, as your website could be opened as a top-level window. (The Cross-Origin-Opener-Policy will help, but does not yet ship in any major browser.) Surely, Chrome might fix that one bug and stop exposing onerror from internal error pages. But that's not enough.
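For concreteness, here is a hypothetical minimal WSGI app (not from the post; any web framework has an equivalent) showing how these two response headers could be attached to every response:

```python
# A hypothetical minimal WSGI app demonstrating the two headers discussed
# above. The app name and body are illustrative only.
def app(environ, start_response):
    headers = [
        ('Content-Type', 'text/html'),
        # Tell legacy XSS filters to block the whole page instead of filtering.
        ('X-XSS-Protection', '1; mode=block'),
        # Refuse to be framed at all.
        ('X-Frame-Options', 'DENY'),
    ]
    start_response('200 OK', headers)
    return [b'<h1>HELLO</h1>']
```

This only mitigates the framing-based selective-blocking trick described earlier; as the post argues, it does not address the auditor's other problems.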

Other shortcomings of the XSSAuditor

XSSAuditor has numerous problems in detecting XSS. In fact, there are so many that the Chrome Security Team does not treat bypasses as security bugs in Chromium. For example, XSSAuditor scans parameters individually and thus allows for easy bypasses on pages that have multiple injection points, as an attacker can just split their payload in half. Furthermore, XSSAuditor is only relevant for reflected XSS vulnerabilities. It is completely useless against other XSS vulnerabilities like persistent XSS, Mutation XSS (mXSS) or DOM XSS. DOM XSS has become more prevalent with the rise of JavaScript libraries and frameworks such as jQuery or AngularJS. In fact, a 2017 research paper about exploiting DOM XSS through so-called script gadgets discovered that XSSAuditor is easily bypassed in 13 out of 16 tested JS frameworks.

History of XSSAuditor defaults

Here's a rough timeline:

Conclusion

Taking all things into consideration, I'd highly suggest removing XSSAuditor from Chrome completely. In fact, Microsoft announced last year that they'd remove the XSS filter from Edge. Unfortunately, a suggestion to retire XSSAuditor initiated by the Google Security Team was eventually dismissed by the Chrome Security Team.

This blog post does not represent the position of my employer.
Thanks to Mario Heiderich for providing valuable feedback: Supporting arguments and useful links are his. Mistakes are all mine.

Andreas Zwinkau (qznc)

What is ASPICE? May 09, 2019 12:00 AM

The automotive industry knows how to develop software as demonstrated by ASPICE.

Read full article!

May 07, 2019

Phil Hagelberg (technomancy)

in which another game is jammed May 07, 2019 07:52 PM

All the games I've created previously have used the LÖVE framework, which I heartily recommend and have really enjoyed using. It's extremely flexible but provides just the right level of abstraction to let you do any kind of 2D game. I have even created a text editor in it. But for the 2019 Lisp Game Jam I teamed up again with Emma Bukacek (we first worked together on Goo Runner for the previous jam) and wanted to try something new: TIC-80.

tic-80 screenshot

TIC-80 is what's referred to as a "fantasy console"1; that is, a piece of software which embodies an imaginary computer which never actually existed. Hearkening back to the days of the Commodore 64, it has a 16-color palette, a 64kb limit on the amount of code you can load into it, and 80kb of space for data (sprites, maps, sound, and music). While these limitations may sound severe, the idea is that they can be liberating because there is no pressure to create something polished; the medium demands a rough, raw style.

The really impressive thing about TIC-80, which you notice right away, is how accessible it makes game development. It's one file to download (or not even download; it runs perfectly fine in a browser) and you're off to the races; the code editor, sprite editor, mapper, sound editor, and music tracker are all built-in. But the best part is that you can explore other people's games (with the SURF command), and once you've played them, hit ESC to open the editor and see how they did it. You can make changes to the code, sprites, etc and immediately see them reflected. This kind of explore-and-tinker approach encourages you to experiment and see for yourself what happens.

In fact, try it now! Go to This is my Mech and hit ESC, then go down to "close game" and press Z to close it. You're in the console now, so hit ESC again to go to the editor, and press the sprite editor button at the top left. Change some of the character sprites, then hit ESC to go back to the console and type RUN to see what it does! The impact of the accessibility and immediacy of the tool simply can't be overstated; it calls out to be hacked and fiddled and tweaked.

Having decided on the platform, Emma and I threw around a few game ideas but landed on making an adventure/comedy game based on the music video I'll form the Head by MC Frontalot, which is in turn a parody of the 1980s cartoon Voltron, a mecha series about five different pilots who work together to form a giant robot that fights off the monster of the week. Instead of making the game about combat, I wanted a theme of cooperation, which led to a gameplay focused around dialog and conversation.

I'll form the head music video

I focused more on the coding and the art, and Emma did most of the writing and all of the music. One big difference between coding TIC-80 games and LÖVE ones is that you can't pull in any 3rd-party libraries; you have the Lua/Fennel standard library, the TIC-80 API, and whatever you write yourself. In fact, TIC-80's code editor supports only a single file. I'm mostly OK with TIC-80's limitations, but that seemed like a bit much, especially when collaborating, so I split out several different files and edited them in Emacs, using a Makefile to concatenate them together and TIC-80's "watch" functionality to load it in upon changes. In retrospect, while having functionality organized into different files was nice, it wasn't worth the downside of having the line numbers be incorrect, so I wouldn't do that part again.

The file watch feature was pretty convenient, but it's worth noting that the changes were only applied when you started a new game. (Not necessarily restarting the whole TIC-80 program, just the RUN command.) There's no way to load in new code from a file without restarting the game. You can evaluate new code with the EVAL command in the console and then RESUME to see the effect it has on a running game, but that only applies to a single line of code typed into the console, which is pretty limiting compared to LÖVE's full support for hot-loading any module from disk at any time that I wrote about previously. This was the biggest disadvantage of developing in TIC-80 by a significant margin. Luckily our game didn't have much state, so constantly restarting it wasn't a big deal, but for other games it would be.2

Another minor downside of collaborating on a TIC-80 game is that the cartridge is a single binary file. You can set it up so it loads the source from an external file, but the rest of the game (sprites, map, sound, and music) are all stored in one place. If you use git to track it, you will find that one person changing a sprite and another changing a music track will result in a conflict you can't resolve using git. Because of this, we would claim a "cartridge lock" in chat so that only one of us was working on non-code assets at a time, but it would be much nicer if changes to sprites could happen independently of changes to music without conflict.

screenshot of the game

Since the game consisted of mostly dialog, the conversation system was the central place to start. We used coroutines to allow a single conversation to be written in a linear, top-to-bottom way and react to player input but still run without blocking the main event loop. For instance, the function below moves the Adam character, says a line, and then asks the player a question which has two possible responses, and reacts differently depending on which response is chosen. In the second case, it sets convos.Adam so that the next time you talk to that character, a different conversation will begin:

(fn all.Adam2 []
  (move-to :Adam 48 25)
  (say "Hey, sorry about that.")
  (let [answer (ask "What's up?" ["What are you doing?"
                                  "Where's the restroom?"])]
    (if (= answer "Where's the restroom?")
        (say "You can pee in your pilot suit; isn't"
             "technology amazing? Built-in"
             "waste recyclers.")
        (= answer "What are you doing?")
        (do (say "Well... I got a bit flustered and"
                 "forgot my password, and now I'm"
                 "locked out of the system!")
            (set convos.Adam all.Adam25)
            (all.Adam25)))))
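The same flow can be sketched in Python with a generator playing the role of the coroutine (hypothetical event names, not the game's actual API): the conversation reads top-to-bottom, yields "say"/"ask" events to the event loop, and receives the player's answer back without ever blocking.

```python
# A generator-based dialog sketch, analogous to the Fennel coroutine above.
# Event tuples and dialog text here are illustrative only.
def adam2():
    yield ('say', "Hey, sorry about that.")
    # Yield an "ask" event; the driver sends the chosen answer back in.
    answer = yield ('ask', "What's up?",
                    ["What are you doing?", "Where's the restroom?"])
    if answer == "Where's the restroom?":
        yield ('say', "You can pee in your pilot suit.")
    else:
        yield ('say', "Well... I forgot my password and I'm locked out!")

# Driving it: prime with send(None), then feed answers back in.
convo = adam2()
print(convo.send(None))                   # the first 'say' event
print(convo.send(None))                   # the 'ask' event
print(convo.send("What are you doing?"))  # branch chosen by the answer
```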

There was some syntactic redundancy with the questions which could have been tidied up with a macro. In older versions of Fennel, the macro system is tied to the module system, which is normally fine, but TIC-80's single-file restriction makes it so that style of macros were unavailable. Newer versions of Fennel don't have this restriction, but unfortunately the latest stable version of TIC-80 hasn't been updated yet. Hopefully this lands soon! The new version of Fennel also includes pattern matching, which probably would have made a custom question macro unnecessary.

The vast majority of the code is dialog/conversation code; the rest is for walking around with collision detection, and flying around in the end-game sequence. This is pretty standard animation fare but was a lot of fun to write!

rhinos animation

I mentioned TIC-80's size limit already; with such a dialog-heavy game we did run into that on the last day. We were close enough to the deadline with more we wanted to add that it caused a bit of a panic, but all we had to do was remove a bunch of commented code and we were able to squeeze what we needed in. Next time around I would use single-space indents just to save those few extra bytes.

All in all I think the downsides of TIC-80 were well worth it for a pixel-art style, short game. Being able to publish the game to an HTML file and easily publish it to itch.io (the site hosting the jam) was very convenient. It's especially helpful in a jam situation because you want to make it easy for as many people as possible to play your game so they can rate it; if it's difficult to install a lot of people won't do it. I've never done my own art for a game before, but having all the tools built-in convinced me to give it a try, and it turned out pretty good despite me not having any background in pixel art, or art of any kind.

Anyway, I'd encourage you to give the game a try. The game won first place in the game jam, and you can finish it in around ten minutes in your browser. And if it looks like fun, why not make your own in TIC-80?


[1] The term "fantasy console" was coined by PICO-8, a commercial product with limitations even more severe than TIC-80. I've done a few short demos with PICO-8 but I much prefer TIC-80, not just because it's free software, but because it supports Fennel, has a more comfortable code editor, and has a much more readable font. PICO-8 only supports a fixed-precision decimal fork of Lua. The only two advantages of PICO-8 are the larger community and the ability to set flags on sprites.

[2] I'm considering looking into adding support in TIC-80 for reloading the code without wiping the existing state. The author has been very friendly and receptive to contributions in the past, but this change might be a bit too much for my meager C skills.

Pepijn de Vos (pepijndevos)

Google Summer of Code is excluding half the world from participating May 07, 2019 12:00 AM

I recently came across someone who wanted to mentor a Yosys VHDL frontend as a Google Summer of Code project. This sounded fun, so I wrote a proposal, noting that GSoC starts before my summer holiday, and planning accordingly. Long story short, there are limited spots and my proposal was not accepted. I have confirmed with the mentoring organization that my availability was the primary factor in this.

While I understand their decision, it seems odd from an organizational viewpoint. Surely others would have the same problem? Indeed I heard from one person that they coped by just working ridiculous hours, while another said they never applied because of the mismatch. Google seems to be aware that this is an issue, stating in their FAQ:

Can the schedule be adjusted if my school ends late/starts early? No. We know that the schedule doesn’t work for some students, but it’s impossible to make a single timeline that works for everyone. Some organizations may allow a participant to start a little early or end a little late – but this is usually measured in days, not weeks. The monthly evaluation dates cannot be changed.

But how big is this problem, and where do accepted proposals come from? I decided to find out. Wikipedia has a long page of summer vacation dates for each country, and there is also this pdf which contains the following helpful graphic.

holidays

Most summer vacations run from July to August, while GSoC runs from May 27 to August 19, excluding most of Europe and many other countries from participating (unless you lie in your proposal or work 70 hours per week).

The next question is whether this is reflected in accepted proposals. Since country of origin is not disclosed, this requires some digging. I scraped a few hundred names from the GSoC website, and scraped their locations from a LinkedIn search. This is of course not super reliable, but should give some indication.

      1 Argentina
      5 Australia
      1 Bangladesh
      6 Brazil
      1 Canada
      1 Chile
      7 China
      1 Denmark
      3 Egypt
      5 France
      9 Germany
      2 Ghana
      4 Greece
      2 Hong Kong
    212 India
      4 Indonesia
      1 Israel
      4 Italy
      2 Kazakhstan
      2 Kenya
      1 Lithuania
      2 Malaysia
      2 Mexico
      1 Nepal
      2 Nigeria
      1 Paraguay
      1 Peru
      3 Poland
      1 Portugal
      2 Qatar
      4 Romania
      4 Russian Federation
      1 Serbia
      4 Singapore
      1 South Africa
     10 Spain
      8 Sri Lanka
      2 Sweden
      2 Switzerland
      1 Tank
      2 Turkey
      4 Ukraine
      2 United Arab Emirates
      1 United Kingdom
     78 United States
     70 unknown
      1 Uruguay
      1 Uzbekistan
      3 Vietnam

Holy moly, so many Indians (212), followed by a large number of Americans (78), and then Spain (10), Germany (9), and the rest of the world. No Dutchies in this subset. For all European countries combined I counted 51 participants, still a reasonable number, even though Spain and Germany have the same holiday mismatch as the Netherlands. Tell me your secret! Interestingly, Wikipedia states that India has very short holidays, but makes special exceptions for summer programmes:

Summer vacation lasts for no more than six weeks in most schools. The duration may decrease to as little as three weeks for older students, with the exception of two month breaks being scheduled to allow some high school and university students to participate in internship and summer school programmes.

Anyway, I think a big international company like Google could try to be a bit more flexible, and for example let students work for a subset of the monthly evaluation periods that align with their holiday.

Appendix

To scrape the names, I scrolled down on the project page until I got bored, and then entered some JS in the browser console.

Array.prototype.map.call(document.querySelectorAll(".project-card h2"), function(x) { return x.innerText })

I saved this to a file and wrote a Selenium script to search LinkedIn. LinkedIn was being really annoying, serving me different versions of various pages with completely different HTML tags, so this only works half of the time.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import json
import time
from urllib.parse import urlencode

with open('data.json') as f:
    data = json.load(f)

driver = webdriver.Firefox()
driver.implicitly_wait(5)
driver.get('https://www.linkedin.com')

username = driver.find_element_by_id('login-email')
username.send_keys('email')
password = driver.find_element_by_id('login-password')
password.send_keys('password')
sign_in_button = driver.find_element_by_id('login-submit')
sign_in_button.click()

for name in data:
    try:
        first, last = name.split(' ', 1)
    except ValueError:
        continue
    if last.endswith('-1'):
        last = last[:-2]
    params = urlencode({"firstName": first, "lastName": last})
    driver.get("https://www.linkedin.com/search/results/people/?" + params)
    try:
        location = driver.find_element_by_css_selector('.search-result--person .subline-level-2').text
        print('"%s", "%s"' % (name, location))
    except NoSuchElementException:
        print('"%s", "%s"' % (name, 'unknown'))
        continue

And finally some quick Bash hax to count the countries. (All US locations only list their state)

cat output.csv | cut -d\" -f 4 | sed "s/Area$/Area, United States/i" | awk -F, '{print $NF}' | awk '{$1=$1};1' | sort | uniq -c
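As a sketch of an alternative to the Bash pipeline (not the script actually used), the same tally can be done with collections.Counter, assuming the same two-column CSV rows where US results list only a state or metro "Area":

```python
import re
from collections import Counter

def count_countries(rows):
    # Each row is ("Name", "Location"); the country is the last
    # comma-separated component. US metro areas ending in "Area" are
    # normalized to "United States", mirroring the sed step above.
    counts = Counter()
    for _name, location in rows:
        country = location.split(',')[-1].strip()
        if re.search(r'Area$', country, re.IGNORECASE):
            country = 'United States'
        counts[country] += 1
    return counts

# Example with a couple of made-up rows:
rows = [("A B", "Amsterdam, Netherlands"),
        ("C D", "San Francisco Bay Area"),
        ("E F", "unknown")]
print(count_countries(rows))
```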

Andreas Zwinkau (qznc)

Accidentally Turing-Complete May 07, 2019 12:00 AM

A list of things that were not supposed to be Turing-complete, but are.

Read full article!

May 06, 2019

Ponylang (SeanTAllen)

Last Week in Pony - May 6, 2019 May 06, 2019 04:16 PM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

May 04, 2019

Pierre Chapuis (catwell)

Changing the SSH port on Arch Linux May 04, 2019 06:00 PM

I often change the default SSH port from 22 to something else on servers I run. It is kind of a dangerous operation, especially when the only way you have to connect to that server is SSH.

The historical way to do this is editing sshd_config and setting the Port variable, but with recent versions of Arch Linux and the default configuration, this will not work.

The reason is that SSH is configured with systemd socket activation. So what you need to do is run sudo systemctl edit sshd.socket and set the contents of the file to:

[Socket]
ListenStream=MY_PORT
Accept=yes

where MY_PORT is the port number you want.

I hope this short post will help other people avoid trouble; at least it will be a reminder for me the next time I have to set up an Arch server...

Derek Jones (derek-jones)

C Standard meeting, April-May 2019 May 04, 2019 01:05 AM

I was at the ISO C language committee meeting, WG14, in London this week (apart from the few hours on Friday morning, which was scheduled to be only slightly longer than my commute to the meeting would have been).

It has been three years since the committee last met in London (the meeting was planned for Germany, but there was a hosting issue, and Germany are hosting next year), and around 20 people attended, plus 2-5 people dialing in. Some regular attendees were not in the room because of schedule conflicts; nine of those present were in London three years ago, and I had met three of those present (this week) at WG14 meetings prior to the last London meeting. I had thought that Fred Tydeman was the longest serving member in the room, but talking to Fred I found out that I was involved a few years earlier than him (our convenor is also a long-time member); Fred has attended more meetings than me, since I stopped being a regular attender 10 years ago. Tom Plum, who dialed in, has been a member from the beginning, and Larry Jones, who dialed in, predates me. There are still original committee members active on the WG14 mailing list.

Having so many relatively new meeting attendees is a good thing, in that they are likely to be keen and willing to do things; it’s also a bad thing for exactly the same reason (i.e., if it’s not really broken, don’t fix it).

The bulk of committee time was spent discussing the proposals contained in papers that have been submitted (listed in the agenda). The C Standard is currently being revised; WG14 are working to produce C2X. If a person wants the next version of the C Standard to support particular functionality, then they have to submit a paper specifying the desired functionality; for any proposal to have any chance of success, the interested parties need to turn up at multiple meetings and argue for it.

There were three common patterns in the proposals discussed (none of these patterns are unique to the London meeting):

  • change existing wording, based on the idea that the change will stop compilers generating code that the person making the proposal considers to be undesirable behavior. Some proposals fitting this pattern were for niche uses, with alternative solutions available. If developers don’t have the funding needed to influence the behavior of open source compilers, submitting a proposal to WG14 offers a low cost route. Unless the proposal is a compelling use case, affecting lots of developers, WG14’s incentive is to not adopt the proposal (accepting too many proposals will only encourage trolls),
  • change/add wording to be compatible with C++. There are cost advantages, for vendors who have to support C and C++ products, to having the two language be as mutually consistent as possible. Embedded systems are a major market for C, but this market is not nearly as large for C++ (because of the much larger overhead required to support C++). I pointed out that WG14 needs to be careful about alienating a significant user base, by slavishly following C++; the C language needs to maintain a separate identity, for long term survival,
  • add a new function to the C library, based on its existence in another standard. Why add new functions to the C library? In the case of math functions, it’s to increase the likelihood that the implementation will be correct (maths functions often have dark corners that are difficult to get right), and for string functions it’s the hope that compilers will do magic to turn a function call directly into inline code. The alternative argument is not to add any new functions, because the common cases are already covered, and everything else is niche usage.

At the 2016 London meeting Peter Sewell gave a presentation on the Cerberus group’s work on a formal definition of C; this work has resulted in various papers questioning the interpretation of wording in the standard, i.e., possible ambiguities or inconsistencies. At this meeting the submitted papers focused on pointer provenance, and I was expecting to hear about the fancy optimizations this work would enable (which would be a major selling point of any proposal). No such luck, the aim of the work was stated as clearly specifying the behavior (a worthwhile aim), with no major new optimizations being claimed (formal methods researchers often oversell their claims, Peter is at the opposite end of the spectrum and could do with an injection of some positive advertising). Clarifying behavior is a worthwhile aim, but not at the cost of major changes to existing wording. I have had plenty of experience of asking WG14 for clarification of existing (what I thought to be ambiguous) wording, only to be told that the existing wording was clear and not ambiguous (to those reviewing my proposed defect). I wonder how many of the wording ambiguities that the Cerberus group claim to have found would be accepted by WG14 as a defect that required a wording change?

Winner of the best pub quiz question: Does the C Standard require an implementation to be able to exactly represent floating-point zero? No, but it is now required in C2X. Do any existing conforming implementations not support an exact representation for floating-point zero? There are processors that use a logarithmic representation for floating-point, but I don’t know if any conforming implementation exists for such systems; all implementations I know of support an exact representation for floating-point zero. Logarithmic representation could handle zero using a special bit pattern, with cpu instructions doing the right thing when operating on this bit pattern, e.g., 0.0+X == X, (I wonder how much code would break, if the compiler mapped the literal 0.0 to the representable value nearest to zero).
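For what it's worth, the exact-zero property is easy to confirm on typical hardware; a quick Python check (assuming the usual IEEE-754 binary64 doubles, which is an assumption about the machine, not something the standard requires):

```python
import struct

# On IEEE-754 binary64 systems, 0.0 is the all-zero bit pattern,
# i.e. exactly representable.
assert struct.pack('>d', 0.0) == b'\x00' * 8

# And adding zero is the identity, as in the 0.0 + X == X case above.
for x in (1.5, -2.25, 1e300):
    assert 0.0 + x == x

print("floating-point zero is exact on this machine")
```

On a hypothetical logarithmic-representation machine, as the post notes, neither property would come for free.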

Winner of the best good intentions corrupted by the real world: intmax_t, an integer type capable of representing any value of any signed integer type (i.e., a largest representable integer type). The concept of a unique largest has issues in a world that embraces diversity.

Today’s C development environment is very different from 25 years ago, let alone 40 years ago. The number of compilers in active use has decreased by almost two orders of magnitude, the number of commonly encountered distinct processors has shrunk, the number of very distinct operating systems has shrunk. While it is not a monoculture, things appear to be heading in that direction.

The relevance of WG14 decreases, as the number of independent C compilers, in widespread use, decreases.

What is the purpose of a C Standard in today’s world? If it were not already a standard, I don’t think a committee would be set up to standardize the language today.

Is the role of WG14 now, the arbiter of useful common practice across widely used compilers? Documenting decisions in revisions of the C Standard.

Work on the Cobol Standard ran for almost 60 years; WG14 has to be active for another 20 years to equal this.

May 02, 2019

Maxwell Bernstein (tekknolagi)

Recursive Python objects May 02, 2019 08:24 PM

Recently for work I had to check that self-referential Python objects could be string-ified without endless recursion. In the process of testing my work, I had to come up with a way of making self-referential built-in types (e.g. dict, list, set, and tuple).

Making a self-referential list is the easiest task because list is just a dumb mutable container. Make a list and append a reference to itself:

ls = []
ls.append(ls)
>>> ls
[[...]]
>>>

dict is similarly easy:

d = {}
d['key'] = d
>>> d
{'key': {...}}
>>>

Making a self-referential tuple is a little bit trickier because tuples cannot be modified after they are constructed (unless you use the C-API, in which case this is much easier — but that’s cheating). In order to close the loop, we’re going to have to use a little bit of indirection.

class C:
  def __init__(self):
    self.val = (self,)

  def __repr__(self):
    return self.val.__repr__()

>>> C()
((...),)
>>>

Here we create a class that stores a pointer to itself in a tuple. That way the tuple contains a pointer to an object that contains the tuple — A->B->A.

The solution is nearly the same for set:

class C:
  def __init__(self):
    self.val = set((self,))

  def __repr__(self):
    return self.val.__repr__()

>>> C()
{set(...)}
>>>

Note that simpler solutions like directly adding to the set (below) don’t work because sets are not hashable, and hashable containers like tuple depend on the hashes of their contents.

s = set()
s.add(s)  # nope
s.add((s,))  # still nope

There’s not a whole lot of point in doing this, but it was a fun exercise.
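Circling back to the original motivation (stringifying these without endless recursion): CPython's repr machinery detects the cycle itself and substitutes an ellipsis, which is easy to confirm with the examples from above:

```python
# Self-referential list and dict, as constructed earlier in the post.
ls = []
ls.append(ls)
d = {}
d['key'] = d

# repr() notices the cycle and emits "..." instead of recursing forever.
assert repr(ls) == '[[...]]'
assert repr(d) == "{'key': {...}}"
print("repr terminates on recursive objects")
```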

Cryptolosophy (awn)

to slice or not to slice May 02, 2019 12:00 AM

Go is an incredibly useful programming language because it hands you a fair amount of power while remaining fairly succinct. Here are a few bits of knowledge I’ve picked up in my time spent with it.

Say you have a fixed-size byte array and you want to pass it to a function that only accepts slices. That’s easy, you can “slice” it:

var bufarray [32]byte
bufslice := bufarray[:] // []byte

Going the other way is harder. The standard solution is to allocate a new array and copy the values over:

bufslice := make([]byte, 32)
var bufarray [32]byte
copy(bufarray[:], bufslice)

“What if I don’t want to make a copy?”, I hear you ask. You could be handling sensitive data or maybe you’re just optimizing the shit out of something. In any case we can grab a pointer and do it ourselves:

bufarrayptr := (*[32]byte)(unsafe.Pointer(&bufslice[0])) // *[32]byte (same memory region)
bufarraycpy := *(*[32]byte)(unsafe.Pointer(&bufslice[0])) // [32]byte (copied to new memory region)

A pointer to the first element of the slice is passed to unsafe.Pointer which is then cast to “pointer to fixed-size 32 byte array”. Dereferencing this will return a copy of the data as a new fixed-size byte array.

The unsafe cat is out of the bag so why not get funky with it? We can make our own slices, with blackjack and hookers:

func ByteSlice(ptr *byte, len int, cap int) []byte {
    var sl = struct {
        addr uintptr
        len  int
        cap  int
    }{uintptr(unsafe.Pointer(ptr)), len, cap}
    return *(*[]byte)(unsafe.Pointer(&sl))
}

This function will take a pointer, a length, and a capacity; and return a slice with those attributes. Using this, another way to convert an array to a slice would be:

var bufarray [32]byte
bufslice := ByteSlice(&bufarray[0], 32, 32)

We can take this further to get slices of arbitrary types, []T, as long as the memory region being mapped to divides the size of T. For example, to get a []uint32 representation of our [32]byte we would divide the length and capacity by four (a uint32 value consumes four bytes) and end up with a slice of size eight:

var sl = struct {
    addr uintptr
    len  int
    cap  int
}{uintptr(unsafe.Pointer(&bufarray[0])), 8, 8}
uint32slice := *(*[]uint32)(unsafe.Pointer(&sl))

But there is a catch. This “raw” construction converts the unsafe.Pointer object into a uintptr—a “dumb” integer address—which will not describe the region of memory you want if the runtime or garbage collector moves the original object around. To ensure that this doesn’t happen you can allocate your own memory using system-calls or a C allocator like malloc. This is exactly what we had to do in memguard: the system-call wrapper is available here. To avoid memory leaks, remember to free your allocations!

It seems a bit wasteful to have a garbage collector and not use it though, so why don’t we let it handle some of the freeing for us? First create a container structure to work with:

type buffer struct {
    Bytes []byte
}

Add some generic constructor and destructor functions:

import "github.com/awnumar/memguard/memcall"

func alloc(size int) *buffer {
    if size < 1 {
        return nil
    }
    return &buffer{memcall.Alloc(size)}
}

func (b *buffer) free() {
    if b.Bytes == nil {
        // already been freed
        return
    }
    memcall.Free(b.Bytes)
    b.Bytes = nil
}

We use runtime.SetFinalizer to register a function for the runtime to run some time after it finds our object unreachable. Modifying alloc to include this looks like:

func alloc(size int) *buffer {
    if size < 1 {
        return nil
    }

    buf := &buffer{memcall.Alloc(size)}

    runtime.SetFinalizer(buf, func(buf *buffer) {
        go buf.free()
    })

    return buf
}
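Putting the pieces together, here is a self-contained version of the lifecycle (substituting make for memcall.Alloc so the sketch runs anywhere; the finalizer is a safety net, not a replacement for calling free yourself):

```go
package main

import (
	"fmt"
	"runtime"
)

type buffer struct {
	Bytes []byte
}

// alloc mirrors the constructor above, but uses make instead of
// memcall.Alloc so this sketch is self-contained.
func alloc(size int) *buffer {
	if size < 1 {
		return nil
	}

	buf := &buffer{make([]byte, size)}

	// If the caller forgets to free, the GC will eventually do it.
	runtime.SetFinalizer(buf, func(buf *buffer) {
		buf.free()
	})

	return buf
}

// free is idempotent: calling it twice is safe.
func (b *buffer) free() {
	if b.Bytes == nil {
		return // already freed
	}
	b.Bytes = nil
}

func main() {
	b := alloc(32)
	fmt.Println(len(b.Bytes)) // 32
	b.free()
	fmt.Println(b.Bytes == nil) // true
}
```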

Alright I think that’s enough shenanigans for one post.

May 01, 2019

Bogdan Popa (bogdan)

Using GitHub Actions to Test Racket Code May 01, 2019 02:00 PM

Like Alex Harsányi, I’ve been looking for a good, free-as-in-beer, alternative to Travis CI. For now, I’ve settled on GitHub Actions because using them is straightforward and because it saves me from creating yet another account with some other company.

Marc Brooker (mjb)

Some risks of coordinating only sometimes May 01, 2019 12:00 AM

Some risks of coordinating only sometimes

Sometimes-coordinating systems have dangerous emergent behaviors

A classic cloud architecture is built of small clusters of nodes (typically one to nine[1]), with coordination used inside each cluster to provide availability, durability and integrity in the face of node failures. Coordination between clusters is avoided, making it easier to scale the system while meeting tight availability and latency requirements. In reality, however, systems sometimes do need to coordinate between clusters, or clusters need to coordinate with a central controller. Some of these circumstances are operational, such as around adding or removing capacity. Others are triggered by the application, where the need to present a client API which appears consistent requires either the system itself, or a layer above it, to coordinate across otherwise-uncoordinated clusters.

The costs and risks of re-introducing coordination to handle API requests or provide strong client guarantees are well explored in the literature. Unfortunately, other aspects of sometimes-coordinated systems do not get as much attention, and many designs are not robust in cases where coordination is required for large-scale operations. Results like CAP and CALM[2] provide clear tools for thinking through when coordination must occur, but offer little help in understanding the dynamic behavior of the system when it does occur.

One example of this problem is reacting to correlated failures. At scale, uncorrelated node failures happen all the time. Designing to handle them is straightforward, as the code and design is continuously validated in production. Large-scale correlated failures also happen, triggered by power and network failures, offered load, software bugs, operator mistakes, and all manner of unlikely events. If systems are designed to coordinate during failure handling, either as a mesh or by falling back to a controller, these correlated failures bring sudden bursts of coordination and traffic. These correlated failures are rare, so the way the system reacts to them is typically untested at the scale at which it is currently operating when they do happen. This increases time-to-recovery, and sometimes requires that drastic action is taken to recover the system. Overloaded controllers, suddenly called upon to operate at thousands of times their usual traffic, are a common cause of long time-to-recovery outages in large-scale cloud systems.

A related issue is the work that each individual cluster needs to perform during recovery or even scale-up. In practice, it is difficult to ensure that real-world systems have both the capacity required to run, and spare capacity for recovery. As soon as a system can’t do both kinds of work, it runs the risk of entering a mode where it is too overloaded to scale up. The causes of failure here are technical (load measurement is difficult, especially in systems with rich APIs), economic (failure headroom is used very seldom, making it an attractive target to be optimized away), and social (people tend to be poor at planning for relatively rare events).

Another risk of sometimes-coordination is changing quality of results. It’s well known how difficult it is to program against APIs which offer inconsistent consistency, but this problem goes beyond just API behavior. A common design for distributed workload schedulers and placement systems is to avoid coordination on the scheduling path (which may be latency and performance critical), and instead distribute or discover stale information about the overall state of the system. In steady state, when staleness is approximately constant, the output of these systems is predictable. During failures, however, staleness may increase substantially, leading the system to make worse choices. This may increase churn and stress on capacity, further altering the workload characteristics and pushing the system outside its comfort zone.

The underlying cause of each of these issues is that the worst-case behavior of these systems may diverge significantly from their average-case behavior, and that many of these systems are bistable with a stable state in normal operation, and a stable state at “overloaded”. Within AWS, we are starting to settle on some patterns that help constrain the behavior of systems in the worst case. One approach is to design systems that do a constant amount of coordination, independent of the offered workload or environmental factors. This is expensive, with the constant work frequently going to waste, but worth it for resilience. Another emerging approach is designing explicitly for blast radius, strongly limiting the ability of systems to coordinate or communicate beyond some limited radius. We also design for static stability, the ability for systems to continue to operate as best they can when they aren’t able to coordinate.

More work is needed in this space, both in understanding how to build systems which strongly avoid congestive collapse during all kinds of failures, and in building tools to characterize and test the behavior of real-world systems. Distributed systems and control theory are natural partners.

Footnotes:

  1. Cluster sizing is a super interesting topic in its own right. Nine seems arbitrary here, but isn't: it suits the most durable consensus systems, because nine nodes spread across three datacenters can tolerate one datacenter failure (losing three nodes) plus one host failure while still retaining a healthy majority. Chain-replicated and erasure-coded systems will obviously choose differently, as will anything with read replicas, or cost, latency or other constraints.
  2. See Keeping CALM: When Distributed Consistency is Easy by Hellerstein and Alvaro. It's a great paper, and a very powerful conceptual tool.

April 29, 2019

Pete Corey (petecorey)

Generating Realistic Pseudonyms with Faker.js and Deterministic Seeds April 29, 2019 12:00 AM

Last week we talked about using decorators to conditionally anonymize users of our application to build a togglable “demo mode”. In our example, we anonymized every user by giving them the name "Jane Doe" and the phone number "555-867-5309". While this works, it doesn’t make for the most exciting demo experience. Ideally, we could incorporate more variety into our anonymized user base.

It turns out that with a little help from Faker.js and deterministic seeds, we can do just that!

Faker.js

Faker.js is a library that “generate[s] massive amounts of realistic fake data in Node.js and the browser.” This sounds like it’s exactly what we need.

As a first pass at incorporating Faker.js into our anonymization scheme, we might try generating a random name and phone number in the anonymize function attached to our User model:


const faker = require('faker');
const _ = require('lodash');

userSchema.methods.anonymize = function() {
  return _.extend({}, this, {
    name: faker.name.findName(),
    phone: faker.phone.phoneNumber()
  });
};

We’re on the right path, but this approach has problems. Every call to anonymize will generate a new name and phone number for a given user. This means that the same user might be given multiple randomly generated identities if they’re returned from multiple resolvers.

Consistent Random Identities

Thankfully, Faker.js once again comes to the rescue. Faker.js lets us specify a seed which it uses to configure its internal pseudo-random number generator. This generator is what’s used to generate fake names, phone numbers, and other data. By seeding Faker.js with a consistent value, we’ll be given a consistent stream of randomly generated data in return.

Unfortunately, it looks like Faker.js’ faker.seed function accepts a number as its only argument. Ideally, we could pass the _id of our model being anonymized.

However, a little digging shows us that the faker.seed function calls out to a local Random module:


Faker.prototype.seed = function(value) {
  var Random = require('./random');
  this.seedValue = value;
  this.random = new Random(this, this.seedValue);
}

And the Random module calls out to the mersenne library, which supports seeds in the form of an array of numbers:


if (Array.isArray(seed) && seed.length) {
  mersenne.seed_array(seed);
}

Armed with this knowledge, let’s update our anonymize function to set a random seed based on the user’s _id. We’ll first need to turn our _id into an array of numbers:


this._id.split("").map(c => c.charCodeAt(0));

And then pass that array into faker.seed before returning our anonymized data:


userSchema.methods.anonymize = function() {
  faker.seed(this._id.split("").map(c => c.charCodeAt(0)));
  return _.extend({}, this, {
    name: faker.name.findName(),
    phone: faker.phone.phoneNumber()
  });
};

And that’s all there is to it.

Now every user will be given a consistent anonymous identity every time their user document is anonymized. For example, a user with an _id of "5cb0b6fd8f6a9f00b8666dcb" will always be given a name of "Arturo Friesen", and a phone number of "614-157-9046".

Final Thoughts

My client ultimately decided not to go this route, and decided to stick with obviously fake “demo mode” identities. That said, I think this is an interesting technique that I can see myself using in the future.

Seeding random number generators with deterministic values is a powerful technique for generating pseudo-random, but repeatable data.
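The core idea stands on its own, independent of Faker.js. Here is a minimal sketch using a tiny mulberry32 PRNG and an invented name table (the helper names and the name list are mine, purely for illustration):

```javascript
// Derive a numeric seed from a string id (a simple rolling hash).
function seedFromId(id) {
  return id.split("").reduce((acc, c) => (acc * 31 + c.charCodeAt(0)) >>> 0, 0);
}

// mulberry32: a tiny deterministic PRNG returning floats in [0, 1).
function mulberry32(a) {
  return function () {
    a |= 0;
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const names = ["Arturo Friesen", "Jane Doe", "Joe Schmoe"];

// The same id always seeds the same PRNG state, so it always
// maps to the same pseudonym.
function pseudonymFor(id) {
  const rand = mulberry32(seedFromId(id));
  return names[Math.floor(rand() * names.length)];
}

console.log(
  pseudonymFor("5cb0b6fd8f6a9f00b8666dcb") ===
  pseudonymFor("5cb0b6fd8f6a9f00b8666dcb")
); // true
```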

That said, it’s worth considering if this is really enough to anonymize our users’ data. By consistently replacing a user’s name, we’re just masking one aspect of their identity in our application. Is that enough to truly anonymize them, or will other attributes or patterns in their behavior reveal their identity? Is it worth risking the privacy of our users just to build a more exciting demo mode? These are all questions worth asking.

April 28, 2019

Ponylang (SeanTAllen)

Last Week in Pony - April 28, 2019 April 28, 2019 08:53 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

Jeff Carpenter (jeffcarp)

Measuring My Chinese Progress April 28, 2019 05:20 AM

Last summer I started learning Mandarin Chinese. To start I began taking classes at a Chinese language school in SF. For more practice I started an Instagram @jeffcarp_zh and tried writing a couple blog posts. Almost a year later, I’m still going to Chinese class on a semi-regular basis (1 hour a week except when I’m taking a break) and keep up a daily spaced-repetition flashcard habit using the Pleco Chinese dictionary app (usually on the train into work).

April 25, 2019

Derek Jones (derek-jones)

Dimensional analysis of the Halstead metrics April 25, 2019 05:30 PM

One of the driving forces behind the Halstead complexity metrics was physics envy; the early reports by Halstead use the terms software physics and software science.

One very simple, and effective technique used by scientists and engineers to check whether an equation makes sense, is dimensional analysis. The basic idea is that when performing an operation between two variables, their measurement units must be consistent; for instance, two lengths can be added, but a length and a time cannot be added (a length can be divided by time, returning distance traveled per unit time, i.e., velocity).

Let’s run a dimensional analysis check on the Halstead equations.

The input variables to the Halstead metrics are: eta_1, the number of distinct operators, eta_2, the number of distinct operands, N_1, the total number of operators, and N_2, the total number of operands. These quantities can be interpreted as units of measurement in tokens.

The formulas are:

  • Program length: N = N_1 + N_2
    There is a consistent interpretation of this equation: operators and operands are both kinds of tokens, and number of tokens can be interpreted as a length.
  • Calculated program length: hat{N} = eta_1 log_2 eta_1 + eta_2 log_2 eta_2
    There is a consistent interpretation of this equation: the operand of a logarithm has to be dimensionless, and the convention is to treat the operand as a ratio (if no denominator is specified, the value 1 having the same dimensions as the numerator is taken, giving a dimensionless result), the value returned is dimensionless, which can be multiplied by a variable having any kind of dimension; so again two (token) lengths are being added.
  • Volume: V = N * log_2 eta
    A volume has units of length^3 (i.e., it is created by multiplying three lengths). There is only one length in this equation; the equation is misnamed, it is a length.
  • Difficulty: D = {eta_1 / 2 } * {N_2 / eta_2}
    Here the dimensions of eta_1 and eta_2 cancel, leaving the dimensions of N_2 (a length); now Halstead is interpreting length as a difficulty unit (whatever that might be).
  • Effort: E =  D * V
    This equation multiplies two variables, both having a length dimension; the result should be interpreted as an area. In physics work is force times distance, and power is work per unit time; the term effort is not defined.

Halstead is claiming that a single dimension, program length, contains so much unique information that it can be used as a measure of a variety of disparate quantities.
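To make the unit bookkeeping concrete, here is a small numeric sketch of the formulas above (the token counts are invented for illustration, not taken from any real program):

```python
import math

# Illustrative token counts.
eta1, eta2 = 10, 20   # distinct operators / operands
N1, N2 = 50, 30       # total operators / operands (lengths, in tokens)

N = N1 + N2                                              # length + length -> a length
N_hat = eta1 * math.log2(eta1) + eta2 * math.log2(eta2)  # dimensionless log times a count -> a length
V = N * math.log2(eta1 + eta2)                           # length * dimensionless -> still a length, not a volume
D = (eta1 / 2) * (N2 / eta2)                             # the eta units cancel, leaving a length
E = D * V                                                # length * length -> an area, not an "effort"

print(N, round(N_hat, 2), round(V, 2), D, round(E, 2))
```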

Halstead’s colleagues at Purdue were rather damning in their analysis of these metrics. Their report Software Science Revisited: A Critical Analysis of the Theory and Its Empirical Support points out the lack of any theoretical foundation for some of the equations, that the analysis of the data was weak and that a more thorough analysis suggests theory and data don’t agree.

I pointed out in an earlier post, that people use Halstead’s metrics because everybody else does. This post is unlikely to change existing herd behavior, but it gives me another page to point people at, when people ask why I laugh at their use of these metrics.

Wesley Moore (wezm)

What I Learnt Building a Lobsters TUI in Rust April 25, 2019 05:00 AM

As a learning and practice exercise I built a crate for interacting with the Lobsters programming community website. It's built on the asynchronous Rust ecosystem. To demonstrate the crate I also built a terminal user interface (TUI).

Screenshot: the TUI running in Alacritty

Try It

crates.io

Pre-built binaries with no runtime dependencies are available for:

  • FreeBSD 12 amd64
  • Linux armv6 (Raspberry Pi)
  • Linux x86_64
  • MacOS
  • NetBSD 8 amd64
  • OpenBSD 6.5 amd64

Downloads Source Code

The TUI uses the following key bindings:

  • j or ↓ — Move cursor down
  • k or ↑ — Move cursor up
  • h or ← — Scroll view left
  • l or → — Scroll view right
  • Enter — Open story URL in browser
  • c — Open story comments in browser
  • q or Esc — Quit

As mentioned in the introduction the motivation for starting the client was to practice using the async Rust ecosystem and it kind of spiralled from there. The resulting TUI is functional but not especially useful, since it just opens links in your browser. I can imagine it being slightly more useful if you could also view and reply to comments without leaving the UI.

Building It

The client proved to be an interesting challenge, mostly because Lobsters doesn't have a full API. This meant I had to learn how to set up and use a cookie jar alongside reqwest in order to make authenticated requests. Logging in requires supplying a cross-site request forgery token, which Rails uses to prevent CSRF attacks. To handle this I needed to fetch the login page first, note the token, then POST to the login endpoint. I could have tried to extract the token from the markup with a regex or substring matching but instead used kuchiki to parse the HTML and then match on the meta element in the head.

Once I added support for writing with the client (posting comments), not just reading, I thought I best not test against the real site. Fortunately the site's code is open source. I took this as an opportunity to use my new-found Docker knowledge and run it with Docker Compose. That turned out pretty easy since I was able to base it on one of the Dockerfiles for a Rails app I run. If you're curious the Alpine Linux based Dockerfile and docker-compose.yml can be viewed in this paste.

After I had the basics of the client worked out I thought it would be neat to fetch the front page stories and render them in the terminal in a style similar to the site itself. I initially did this with ansi_term. It looked good but lacked interactivity so I looked into ways to build a TUI along the lines of tig. I built it several times with different crates, switching each time I hit a limitation. I tried:

  • easycurses, which lived up to its name and produced a working result quickly. I'd recommend this if your needs aren't too fancy, however I needed more control than it provided.
  • pancurses didn't seem to be able to use colors outside the core 16 from ncurses.

Finally I ended up going a bit lower-level and used termion. It does everything itself but at the same time you lose the conveniences ncurses provides. It also doesn't support Windows, so my plans of supporting that were thwarted. Some time after I had the termion version working I revisited tui-rs, which I had initially dismissed as unsuitable for my task. In hindsight it would probably have been perfect, but we're here now.

In addition to async and TUI I also learned more about:

  • Building a robust and hopefully user friendly command line tool.
  • Documenting a library.
  • Publishing crates.
  • Dockerising a Rails app that uses MySQL.
  • How to build and publish pre-built binaries for many platforms.
  • How to accept a password in the terminal without echoing it.
  • Setting up multi-platform CI builds on Sourcehut.

Whilst the library and UI aren't especially useful the exercise was worth it. I got to practice a bunch of things and learn some new ones at the same time.



Previous Post: Cross Compiling Rust for FreeBSD With Docker

April 24, 2019

Átila on Code (atilaneves)

Type inference debate: a C++ culture phenomenon? April 24, 2019 09:22 AM

I read two C++ subreddit threads today on using the auto keyword. They’re both questions: the first one asks why certain people seem to dislike using type inference, while the second asks about what commonly taught guidelines should be considered bad practice. A few replies there mention auto. This confuses me for more than one […]

Derek Jones (derek-jones)

C2X and undefined behavior April 24, 2019 02:00 AM

The ISO C Standard is currently being revised by WG14, to create C2X.

There is a rather nebulous clustering of people who want to stop compilers using undefined behaviors to generate what these people (and probably most other developers) consider to be very surprising code. For instance, always printing both "p is true" and "p is false" when executing the code: bool p; if ( p ) printf("p is true"); if ( !p ) printf("p is false"); (possible because p is uninitialized, and accessing an uninitialized value is undefined behavior).

This sounds like a good thing; nobody wants compilers generating surprising code.

All the proposals I have seen, so far, involve doing away with constructs that can produce undefined behavior. Again, this sounds like a good thing; nobody likes undefined behaviors.

The problem is, there is a reason for labeling certain constructs as producing undefined behavior; the behavior is who-knows-what.

Now the C Standard could specify the who-knows-what behavior; for instance, it could specify that the result of dividing by zero is 42. Standard-conforming compilers would then have to generate code to check whether the denominator was zero, and return 42 for this case (until Intel, ARM and other processor vendors ‘updated’ the behavior of their divide instructions). Way back when, a design decision was made: the behavior of divide by zero is undefined, not 42 or any other value; code efficiency and compactness were considered to be more important.

I have not seen anybody arguing that the behavior of divide by zero should be specified. But I have seen people arguing that once C’s integer representation is specified as being twos-complement (currently it can also be ones-complement or signed-magnitude), then arithmetic overflow becomes defined. Wrong.

Twos-complement is a specification of a representation, not a specification of behavior. What is the behavior when the result of adding two integers cannot be represented? The result might be to wrap (the behavior expected by many developers), to saturate at the maximum value (frequently needed in image and signal processing), to raise a signal (overflow is not usually supposed to happen), or something else.

WG14 could define the behavior for when the result of an arithmetic operation is not representable in the number of bits available. Standard-conforming compilers targeting processors whose arithmetic instructions did not behave as required would have to generate code, for any operation that could overflow, to do what was necessary. The embedded market is a heavy user of C; in this market memory is limited and processor performance is never fast enough, so the overhead of supporting a defined behavior could just be too high (a more attractive solution is code review, to make sure the undefined behavior cannot occur).

Is there another way of addressing the issue of compiler writers’ use/misuse of undefined behavior? Yes, offer them money. Compiler writing is a business, at least at the level at which gcc and llvm operate. If people really are keen to influence the code generated by gcc and llvm, money is the solution. Wot, no money? Then stop complaining.

April 23, 2019

Pierre Chapuis (catwell)

Spicing things up April 23, 2019 09:45 PM

In my last post I told you I had plans that I was not ready to talk about yet. Well, the time has come. I am happy to announce that I am now the CTO and co-founder of a startup called Chilli.

Chilli is not a typical startup, it is an eFounders project. You may know eFounders as the first startup studio in France, which originated companies such as Front, Aircall and Spendesk. The way they usually work is that they identify a problem that needs solving and find founders to tackle it, providing them both support and funding in exchange for equity. When the studio was created, I had doubts about the model, but later on I became quite enthusiastic about it.

Most eFounders companies are Software-as-a-Service businesses, and several of them were born of a need identified in traditional SMBs and SMEs. However, many pivoted to serve a different market, either tech companies or enterprises, and we can see the same pattern in other SaaS companies as well. So we end up with software that doesn't sell in the market it was originally designed for, and SMBs left on the side of the road with unaddressed digital needs. The reason, we believe, lies with the SaaS-to-SMBs distribution model, and that is the issue Chilli intends to solve.

We are certain that the solution to that problem must involve software. However, we also think technology alone will not be enough; a human touch is necessary, which is why my co-founder and CEO Julien comes from a consulting background. What we will build is a hybrid platform to help leaders identify the pain points in their companies and match them with the best digital tools to solve them. By starting from the customer's needs, we will work around the distribution cost issues and become the missing link between SaaS vendors and traditional SMBs and SMEs.

For me, this is a new and exciting challenge. Despite having been a very early employee at startups twice, I have never been a founder yet, and it is something I have wanted to do for a while. Moreover, it means I will be doing a lot of Product and Web development again, which will change from the last five years I spent mostly in the world of systems software in C and Lua.

On that note, our Web stack is (typed) Python 3 / Flask and TypeScript / Angular, and I am looking for a full stack developer to join the team. This is a junior to mid position based in Paris, France (no remote); since most of the work is on the frontend experience with Python is not a requirement. If you are interested, get in touch.

April 22, 2019

Richard Kallos (rkallos)

Imperium in Imperio - A Bridge is Made of Planks; Every Plank is a Bridge April 22, 2019 02:30 PM

In the previous three installments, I discussed visualizing ideal final states of things, examining the current state of your life, and how to bridge the two. In the final post in this series, I show how these techniques can be applied at any scale; from a daily to-do list to planning the course of your entire life.

In the previous post, I showed how I write Structural Tension charts. The paragraph on top is where I write the ideal state of some thing that I want to accomplish, whether it’s a finished project, an instilled habit or an attained achievement. The paragraph on the bottom is where I describe as objectively and nonjudgmentally as possible the current reality of the thing I’m setting out to accomplish. The space in between the two paragraphs gets filled with a list of actionable steps I can take to move from where I am to where I want to be.

To me, these steps resemble planks on a bridge. The easier and smaller steps tend to go down near the bottom of the page, and the later tasks tend to be a bit more abstract, large in scope, and usually depend on previous tasks. Sometimes the steps I write out are really big items, like “Get a job as a software developer”. Writing this down doesn’t fill me with the will to go out and get things done. How do I even go about starting such a huge task? The answer: Make a ST chart for that task. If there are any tasks in that new chart that are too large, you can make yet more charts. In the end, you wind up with a forest (in the graph theory sense) of charts that all serve to plot a course for your life.

For example, the ST chart I shared with you in the previous post could have been a single step in a larger chart where I set a course to gain better insight to my emotions, and the step titled “Experiment with active forms of meditation” could have its own chart where I describe the styles I’ve tried, what’s gone well, and what has yet to be tested. Furthermore, I could have a completely separate chart about setting up a regular journaling habit where I list “Try keeping a meditation journal” as one of the steps.

So far the method that I’ve had the most success with is to keep track of these charts on paper, but as much as I deeply enjoy the feel of pen and paper, I can’t help but think that organizing these charts is a task that computers are well-suited to. I’ve noticed there is a lot of software that could be great at managing these ST charts, but I think the two most promising ones are Emacs and TiddlyWiki. I’ll hopefully have more to say in the future about if/how I’ve adapted my system to allow for the help of software.

In conclusion, the process of writing ST charts may lead you to write out steps that are a bit too large to tackle on their own. Fortunately, you can take advantage of the recursive nature of ST charts and break each step down into their own chart, and it becomes easier to plan out projects or ambitions of any size.

Pete Corey (petecorey)

Anonymizing GraphQL Resolvers with Decorators April 22, 2019 12:00 AM

As software developers and application owners, we often want to show off what we’re working on to others, especially if there’s some financial incentive to do so. Maybe we want to give a demo of our application to a potential investor or a prospective client. The problem is that staging environments and mocked data are often lifeless and devoid of the magic that makes our project special.

In an ideal world, we could show off our application using production data without violating the privacy of our users.

On a recent client project we managed to do just that by modifying our GraphQL resolvers with decorators to automatically return anonymized data. I’m very happy with the final solution, so I’d like to give you a run-through.

Setting the Scene

Imagine that we’re working on a Node.js application that uses Mongoose to model its data on the back-end. For context, imagine that our User Mongoose model looks something like this:


const userSchema = new Schema({
  name: String,
  phone: String
});

const User = mongoose.model('User', userSchema);

As we mentioned before, we’re using GraphQL to build our client-facing API. The exact GraphQL implementation we’re using doesn’t matter. Let’s just assume that we’re assembling our resolver functions into a single nested object before passing them along to our GraphQL server.

For example, a simple resolver object that supports a user query might look something like this:


const resolvers = {
  Query: {
    user: (_root, { _id }, _context) => {
      return User.findById(_id);
    }
  }
};

Our goal is to return an anonymized user object from our resolver when we detect that we’re in “demo mode”.

Updating Our Resolvers

The most obvious way of anonymizing our users when in “demo mode” would be to find every resolver that returns a User and manually modify the result before returning:


const resolvers = {
  Query: {
    user: async (_root, { _id }, context) => {
      let user = await User.findById(_id);

      // If we're in "demo mode", anonymize our user:
      if (context.user.demoMode) {
        user.name = 'Jane Doe';
        user.phone = '(555) 867-5309';
      }

      return user;
    }
  }
};

This works, but it’s a high touch, high maintenance solution. Not only do we have to comb through our codebase modifying every resolver function that returns a User type, but we also have to remember to conditionally anonymize all future resolvers that return User data.

Also, what if our anonymization logic changes? For example, what if we want anonymous users to be given the name 'Joe Schmoe' rather than 'Jane Doe'? Doh!

Thankfully, a little cleverness and a little help from Mongoose opens the doors to an elegant solution to this problem.

Anonymizing from the Model

We can improve on our previous solution by moving the anonymization logic into our User model. Let’s write an anonymize Mongoose method on our User model that scrubs the current user’s name and phone fields and returns the newly anonymized model object:


userSchema.methods.anonymize = function() {
  // `this` is a Mongoose document, so we copy its plain fields with
  // toObject() before overriding the sensitive ones. Extending `this`
  // directly would copy Mongoose's internal properties, not the fields.
  return _.extend(this.toObject(), {
    name: 'Jane Doe',
    phone: '(555) 867-5309'
  });
};

We can refactor our user resolver to make use of this new method:


async (_root, { _id }, context) => {
  let user = await User.findById(_id);

  // If we're in "demo mode", anonymize our user:
  if (context.user.demoMode) {
    return user.anonymize();
  }

  return user;
}

Similarly, if we had any other GraphQL/Mongoose types we wanted to anonymize, such as a Company, we could add an anonymize function to the corresponding Mongoose model:


companySchema.methods.anonymize = function() {
  // As above, copy the document's plain fields before overriding them.
  return _.extend(this.toObject(), {
    name: 'Initech'
  });
};

And we can refactor any resolvers that return a Company GraphQL type to use our new anonymizer before returning a result:


async (_root, { _id }, context) => {
  let company = await Company.findById(_id);

  // If we're in "demo mode", anonymize our company:
  if (context.user.demoMode) {
    return company.anonymize();
  }

  return company;
}

Going Hands-off with a Decorator

Our current solution still requires that we modify every resolver in our application that returns a User or a Company. We also need to remember to conditionally anonymize any users or companies we return from resolvers we write in the future.

This is far from ideal.

Thankfully, we can automate this entire process. If you look at our two resolver functions up above, you’ll notice that the anonymization process done by each of them is nearly identical.

We anonymize our User like so:


// If we're in "demo mode", anonymize our user:
if (context.user.demoMode) {
  return user.anonymize();
}

return user;

Similarly, we anonymize our Company like so:


// If we're in "demo mode", anonymize our company:
if (context.user.demoMode) {
  return company.anonymize();
}

return company;

Because both our User and Company Mongoose models implement an identical interface in our anonymize function, the process for anonymizing their data is the same.

In theory, we could crawl through our resolvers object, looking for any resolvers that return a model with an anonymize function, and conditionally anonymize that model before returning it to the client.

Let’s write a function that does exactly that:


const anonymizeResolvers = resolvers => {
  return _.mapValues(resolvers, resolver => {
    if (_.isFunction(resolver)) {
      return decorateResolver(resolver);
    } else if (_.isArray(resolver)) {
      // Check for arrays before objects; _.isObject returns true for
      // arrays too, so this branch must come first.
      return _.map(resolver, resolver => anonymizeResolvers(resolver));
    } else if (_.isObject(resolver)) {
      return anonymizeResolvers(resolver);
    } else {
      return resolver;
    }
  });
};

Our new anonymizeResolvers function takes our resolvers map and maps over each of its values. If the value we’re mapping over is a function, we call a soon-to-be-written decorateResolver function that will wrap the function in our anonymization logic. Otherwise, we either recursively call anonymizeResolvers on the value if it’s an array or an object, or return it if it’s any other type of value.

Our decorateResolver function is where our anonymization magic happens:


const decorateResolver = resolver => {
  return async function(_root, _args, context) {
    let result = await resolver(...arguments);
    if (context.user.demoMode &&
        _.chain(result)
         .get('anonymize')
         .isFunction()
         .value()
    ) {
      return result.anonymize();
    } else {
      return result;
    }
  };
};

In decorateResolver we replace our original resolver function with a new function that first calls out to the original, passing through any arguments our new resolver received. Before returning the result, we check if we’re in demo mode and that the result of our call to resolver has an anonymize function. If both checks hold true, we return the anonymized result. Otherwise, we return the original result.

We can use our newly constructed anonymizeResolvers function by wrapping it around our original resolvers map before handing it off to our GraphQL server:


const resolvers = anonymizeResolvers({
  Query: {
    ...
  }
});

Now any GraphQL resolver that returns a Mongoose model with an anonymize function will return anonymized data when in demo mode, regardless of where the query lives or when it’s written.
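To see the end-to-end behavior concretely, here’s a minimal, self-contained sketch of the same idea in plain Node — no lodash or Mongoose, with a stand-in model and resolver rather than the article’s actual implementation:

```javascript
// A stand-in "model" that implements the anonymize interface:
const user = {
  name: 'Real Name',
  phone: '(123) 456-7890',
  anonymize() {
    return { ...this, name: 'Jane Doe', phone: '(555) 867-5309' };
  }
};

// A simplified decorateResolver without lodash:
const decorateResolver = resolver => async (root, args, context) => {
  const result = await resolver(root, args, context);
  if (context.user.demoMode && typeof (result && result.anonymize) === 'function') {
    return result.anonymize();
  }
  return result;
};

const resolvers = {
  Query: {
    user: decorateResolver(async () => user)
  }
};

// In demo mode the result is anonymized; otherwise it passes through:
resolvers.Query.user(null, {}, { user: { demoMode: true } })
  .then(result => console.log(result.name)); // "Jane Doe"
```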

Final Thoughts

While I’ve been using Mongoose in this example, it’s not a requirement for implementing this type of solution. Any mechanism for “typing” objects and making them conform to an interface should get you where you need to go.

The real magic here is the automatic decoration of every resolver in our application. I’m incredibly happy with this solution, and thankful that GraphQL’s resolver architecture made it so easy to implement.

My mind is buzzing with other decorator possibilities. Authorization decorators? Logging decorators? The sky seems to be the limit. Well, the sky and the maximum call stack size.
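For instance, a logging decorator could take exactly the same shape as decorateResolver. The sketch below is a hypothetical illustration, not part of the solution above:

```javascript
// Hypothetical: wrap a resolver so every call is timed and logged.
const logResolver = (name, resolver) => async (root, args, context) => {
  const start = Date.now();
  try {
    return await resolver(root, args, context);
  } finally {
    console.log(`${name} resolved in ${Date.now() - start}ms`);
  }
};

// Usage: decorate a resolver just like anonymizeResolvers would.
const user = logResolver('Query.user', async (_root, { _id }) => {
  return { _id, name: 'Jane Doe' };
});

user(null, { _id: 42 }, {}).then(result => console.log(result.name));
```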

April 21, 2019

Gonçalo Valério (dethos)

Easy backups with Borg April 21, 2019 05:55 PM

One of the oldest and most frequent pieces of advice given to people working with computers is “create backups of your stuff”. People know it, they are sick of hearing it, they even give it to other people, but a large percentage of them still don’t do it.

There are many tools out there to help you fulfill this task, but throughout the years the one I end up relying on the most is definitely Borg. It is really easy to use, has good documentation and runs very well on Linux machines.

Here is how they describe it:

BorgBackup (short: Borg) is a deduplicating backup program. Optionally, it supports compression and authenticated encryption.

The main goal of Borg is to provide an efficient and secure way to backup data. The data deduplication technique used makes Borg suitable for daily backups since only changes are stored. The authenticated encryption technique makes it suitable for backups to not fully trusted targets.

Borg’s Website

The built-in encryption and de-duplication features are some of its more important selling points.
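For those curious what that looks like in practice, a typical Borg workflow on the command line is roughly the following (the repository path is a placeholder, and the retention numbers are just one possible policy):

```shell
# Create an encrypted repository (you will be prompted for a passphrase):
borg init --encryption=repokey /path/to/repo

# Create a deduplicated, compressed archive of your home directory:
borg create --stats --compression lz4 /path/to/repo::{hostname}-{now} ~/

# Keep 7 daily, 4 weekly, and 6 monthly archives; prune the rest:
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /path/to/repo
```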

Until recently I’ve had a hard time recommending it to less technical people, since Borg is mostly available through the command line and can take some work to implement the desired backup “policy”. There is a web-based graphical user interface, but I generally don’t like those as a replacement for native desktop applications.

However in the last few months I’ve been testing this GUI frontend for Borg, called Vorta, that I think will do the trick for family and friends that ask me what can they use to backup their data.

The tool is straightforward to use and supports the majority of Borg’s functionality; once you set up the repository you can instruct it to regularly perform your backups and forget about it.

I’m not going to describe how to use it, because a quick search on the internet will turn up lots of articles with that information.

The only advice that I would like to leave here about Vorta relates to the encryption and the settings chosen when creating your repository. At least on the version I used, the recommended repokey option will store your passphrase in a local SQLite database in clear text, which is kind of problematic.

This seems to be viewed as a feature:

Fallback to save repo passwords. Only used if no Keyring available.

Github Repository

But I could not find the documentation about how to avoid this “fallback”.

Ponylang (SeanTAllen)

Last Week in Pony - April 21, 2019 April 21, 2019 11:13 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

April 20, 2019

Bit Cannon (wezm)

Two Years on Linux April 20, 2019 10:00 PM

This is the sixth post in my series on finding an alternative to Mac OS X. The previous post in the series recapped my first year away from Mac OS and my move to FreeBSD on my desktop computer.

The search for the ideal desktop continues and my preferences evolve as I gain more experience. In this post I summarise where I’m at two years after switching away from Mac OS. This includes leaving FreeBSD on the desktop and switching from GNOME to Awesome. I’ll cover the motivation, benefits, and drawbacks of giving up a complete desktop environment for a “build your own” desktop.

Embracing Awesome

If I were to identify a general trend in my time away from Mac OS it would be one of gradual migration. Initially I was looking to replicate my Mac OS experience. I landed on elementary OS as it shared many of the same values as Mac OS. Over time, I moved to vanilla GNOME and gradually dropped some of the tools I initially felt were essential, like Albert and Enpass. Instead, I opted for built-in functionality or command line tools.

These gateway tools allowed me to remain not too far outside my computing comfort zone. As time goes on though, I’m adopting more platform-native options, like using the built-in GNOME search instead of a dedicated app like Albert.

GNOME was working pretty well for me and even got updated from 3.18 to 3.28 on FreeBSD (although it’s remained there and the current version is now 3.32). Despite this, high resource usage, some conversations, blog posts, and a shift in workflow led me to reevaluate tiling window managers.

I was using the terminal more than ever before. I’ve been comfortable in the terminal for a long time but I realised that I was using the tiling features of Tilix and Neovim a lot. I was also using the tiling feature of GNOME to show two apps side-by-side.

The memory usage and log spamming of gnome-shell was bothering me too. The former overflowed into a snarky tweet that led to a conversation that more or less convinced me that the use of JavaScript in gnome-shell was not the ultimate cause of the memory issues but the fact that such an issue went unfixed for years made me evaluate other options. Note: As of GNOME 3.30 the leak should be largely fixed.

I had a good conversation with a friend and long-time Linux proponent about his use of i3, and he commented that he felt I’d probably like a tiling window manager. I’ve tried i3 before but didn’t really like its semi-manual management of layouts. This did prompt me to start looking around though.

I read some interesting blog posts:

It was a comment on the post above that really piqued my curiosity. It mentioned spectrwm as a possible candidate. I installed it and was really taken by its primary/secondary tiling model and the sensible defaults approach. I tweaked and ran spectrwm on my XPS 15 for a while but eventually ran into some limitations of its configuration and integrated bar. At this point I was mostly enjoying a tiling window manager for the first time. I spent some time poring over the Arch Linux Wiki’s Comparison of tiling window managers page. I reviewed most of the options on that page, looking for ones that supported the primary/secondary model from spectrwm, were well maintained, configurable, came with a usable base configuration, and did not have many dependencies.

Eventually I landed on Awesome. It’s a well established project and uses Lua for configuration, which is a simple, easy to learn language that allows almost any configuration to be created. I’ve been happily using it on all my systems for about four months now.

Awesome Window Manager - Using the 'centerwork' layout while working on my linux.conf.au badge

It’s not all roses though; the thing about switching from a desktop environment to just a window manager is that it makes you really realise all the things you get for free from the desktop environment. After settling into Awesome I needed to build or find replacements for the following features that I took for granted in GNOME:

  • Brightness control with keyboard buttons
  • Volume control with keyboard buttons
  • Setting the DPI correctly for a HiDPI display
  • Display adjustment when adding/removing an external display
  • Automatically unlocking the keyring upon login so that I didn’t need to enter the password for SSH and GPG keys.
  • Displaying the battery and volume level in the top bar
  • Trackpad/mouse configuration:
    • Trackpad acceleration
    • Natural scrolling
    • Enable Clickfingers behaviour
  • Double buffering of windows to prevent tearing and black fills where shadows should be present.
  • Notifications

I did solve all these challenges. Check out my xprofile and rc.lua if you’re curious.
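As a taste of the kind of glue this involves, media keys can be bound in Awesome’s rc.lua roughly like this — a sketch assuming amixer and xbacklight are the chosen backends, following the default rc.lua’s globalkeys pattern rather than my exact configuration:

```lua
-- In rc.lua, alongside the other global key bindings:
globalkeys = gears.table.join(globalkeys,
  awful.key({}, "XF86AudioRaiseVolume",
    function() awful.spawn("amixer -q set Master 5%+") end),
  awful.key({}, "XF86AudioLowerVolume",
    function() awful.spawn("amixer -q set Master 5%-") end),
  awful.key({}, "XF86MonBrightnessUp",
    function() awful.spawn("xbacklight -inc 10") end),
  awful.key({}, "XF86MonBrightnessDown",
    function() awful.spawn("xbacklight -dec 10") end)
)
root.keys(globalkeys)
```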


Moving on From FreeBSD

From Oct 2017 to Jan 2019 I ran FreeBSD as the primary OS on my desktop computer. Similarly, I hosted this website and others on a FreeBSD server for more than two years. I recently rebuilt my personal server infrastructure on Docker, hosted by Alpine Linux and went back to Arch Linux on my desktop computer.

It wasn’t any one thing in isolation that led to this switch. It was lots of little things that culminated in a broken system one day that pushed me over the edge. I will just list some issues that come to mind, in no particular order. This post would be very long if I went into detail on each item. I’m aware that there are solutions and workarounds to some of these, like running Linux in bhyve, but it was the sum of the whole, not any individual item, that made me switch:

  • ZFS on Linux being ported to FreeBSD:
    • One of the reasons I used FreeBSD was for ZFS. I did so on the assumption that the FreeBSD implementation was more stable and “more canonical” than ZFS on Linux (ZoL). However, the announcement that ZoL is being ported to FreeBSD to get its bug fixes, improvements, and wider developer base suggested that was wrong.
  • I wanted/needed to use Docker more.
  • The portion of the community that likes to point out that jails existed before Docker and are somehow better.
    • In my experience the jails user experience is terrible compared to Docker and lacks a lot of the features that Docker automatically takes care of, such as networking, file system layers/caching, distribution of images.
  • Attending linux.conf.au:
  • The general fear and loathing of all change that some of the community exhibit.
    • They decry everything that doesn’t keep things the way they were in 1970 as a violation of the “UNIX philosophy”, as though everything done by the UNIX grandfathers was perfect and unchangeable.
  • Working on my Rust powered linux.conf.au e-Paper badge, a project that targeted Raspbian, which was easier to test with a Linux host.
  • More advanced virtualisation:
    • Such as built in graphics support, no need for VNC workarounds.
  • Losing hours to slow networking in virtualised environments, something that just works on Linux.
  • The reaction to the improved FreeBSD Code of Conduct last year by some of the community deeply troubled me.
  • Graphics support:
    • The recent drm-kmod work that brings modern graphics support to FreeBSD is a great improvement but it’s a port of Linux code. If I’m running a bunch of Linux code anyway maybe it’s better to just go to the source.
  • The onerous process required to contribute patches to update a port and find someone to review and merge them.
  • Bugs with patches supplied that sit unmerged for months unless you know the right people to nudge.
  • Continued use of tools that are unfamiliar to the vast majority of developers these days (Subversion, patch based workflow).
    • I can and did deal with this but I think it’s a huge barrier to entry for new contributors.
  • An Electron port that no one seems to be able to get over the line.
    • I’m no Electron fan but if the choice is no app or an Electron app I’d at least like the option to run it.
    • There’s a US$850 bounty on this issue, $50 of which I added myself.

Apologies, I know the above list is a bit ranty. For something less ranty, read this great post by Alexander Leidinger that outlines some things he thinks the project needs to do to stay relevant.

I called out some community behaviour, and reactions above but want to point out that these folks don’t represent the whole community. Lots of the BSD community are lovely and are doing the best they can with the comparatively small resources they have available. I thank them for their efforts.

The clincher was a failed upgrade in January 2019. I think I followed the handbook but something happened to the ZFS pool that prevented the system from booting from it. I was able to boot off an install flash drive and mount the pool fine but it refused to boot by itself. I spent several hours trying to fix it but in the end it was the final straw. I carefully backed everything up and then did a clean Arch Linux + ZFS install.

With the knowledge that ZoL was a lot more mature than I had originally thought I decided to install Arch onto the NVMe drive and then have /home live on a zpool comprised of the 3 SSDs.

One drawback to using ZFS for /home is that Dropbox stops working due to their brain-dead requirement that you must use ext4. There are hacks to work around it but I didn’t have proper Dropbox support on FreeBSD so not having it on this install was no different. My use of Dropbox is in maintenance mode anyway so it’s only rarely that I actually need it.

Finally, I may not be using FreeBSD day-to-day anymore but that doesn’t mean I’ve completely left. I continue to make monthly donations to the FreeBSD and OpenBSD projects and will continue to ensure that BSD systems are well-supported by any software I build. I’ll also advocate for avoiding unnecessarily Linux specific code where possible.


The Journey Continues

After more than two years my journey continues and I expect it to keep doing so. I enjoy exploring what’s out there and my preferences shift over time. In the future I expect to periodically try out Wayland based systems, like I did on the new desktop Arch install (issues with copy and paste between Firefox and Alacritty led me to put that on hold).

On the operating system front, NixOS and Guix are pioneering new ways of constructing reliable systems. As a Rust developer I’m also watching Redox OS, an OS written from scratch in Rust. What comes of Google’s Fuchsia project will also be interesting to see unfold. The world of operating systems may not be as diverse as it once was, but there’s still lots to come.

April 18, 2019

Derek Jones (derek-jones)

OSI licenses: number and survival April 18, 2019 12:23 AM

There is a lot of source code available which is said to be open source. One definition of open source is software that has an associated open source license. Along with promoting open source, the Open Source Initiative (OSI) has a rigorous review process for open source licenses (so they say, I have no expertise in this area), and have become the major licensing brand in this area.

Analyzing the use of licenses in source files and packages has become a niche research topic. The majority of source files don’t contain any license information, and, depending on language, many packages don’t include a license either (see Understanding the Usage, Impact, and Adoption of Non-OSI Approved Licenses). There is some evolution in license usage, i.e., changes of license terms.

I knew that a fair few open source licenses had been created, but how many, and how long have they been in use?

I don’t know of any other work in this area, and the fastest way to get lots of information on open source licenses was to scrape the brand leader’s licensing page, using the Wayback Machine to obtain historical data. Starting in mid-2007, the OSI licensing page kept to a fixed format, making automatic extraction possible (via an awk script); there were few pages archived for 2000, 2001, and 2002, and no pages available for 2003, 2004, or 2005 (if you have any OSI license lists for these years, please send me a copy).

What do I now know?

Over the years OSI have listed 110 different open source licenses, and currently lists 81. The actual number of license names listed, since 2000, is 205; the ‘extra’ licenses are the result of naming differences, such as the use of dashes, inclusion of a bracketed acronym (or not), license vs License, etc.

Below is the Kaplan-Meier survival curve (with 95% confidence intervals) of licenses listed on the OSI licensing page (code+data):

Survival curve of OSI licenses.
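For readers unfamiliar with the method: at each observed “death” (here, a license being delisted), the Kaplan-Meier estimator multiplies the running survival probability by the fraction of at-risk licenses that survived that time, with still-listed licenses treated as right-censored. A small illustrative sketch (in JavaScript with made-up lifetimes, not the OSI data or the R code behind the plot):

```javascript
// Kaplan-Meier estimator: S(t) = product over event times t_i <= t of
// (1 - d_i / n_i), where d_i = events at t_i and n_i = subjects at risk.
// Each observation: { time, event }, with event = false meaning censored
// (e.g. a license still listed at the end of the observation period).
function kaplanMeier(observations) {
  const sorted = [...observations].sort((a, b) => a.time - b.time);
  const eventTimes = [...new Set(
    sorted.filter(o => o.event).map(o => o.time)
  )].sort((a, b) => a - b);

  let survival = 1;
  return eventTimes.map(t => {
    const atRisk = sorted.filter(o => o.time >= t).length;   // n_i
    const deaths = sorted.filter(o => o.event && o.time === t).length; // d_i
    survival *= 1 - deaths / atRisk;
    return { time: t, survival };
  });
}

// Made-up license lifetimes in years; event: false = still listed.
const data = [
  { time: 2, event: true },
  { time: 3, event: false },
  { time: 5, event: true },
  { time: 8, event: false }
];
console.log(kaplanMeier(data));
// At t=2: 4 at risk, 1 delisted -> S = 0.75
// At t=5: 2 at risk, 1 delisted -> S = 0.375
```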

How many license proposals have been submitted for review, but not been approved by OSI?

Patrick Masson, from the OSI, kindly replied to my query on number of license submissions. OSI doesn’t maintain a count, and what counts as a submission might be difficult to determine (OSI recently changed the review process to give a definitive rejection; they have also started providing a monthly review status). If any reader is keen, there is an archive of mailing list discussions on license submissions; trawling these would make a good thesis project :-)

April 14, 2019

Derek Jones (derek-jones)

The Algorithmic Accountability Act of 2019 April 14, 2019 08:00 PM

The Algorithmic Accountability Act of 2019 has been introduced to the US congress for consideration.

The Act applies to “person, partnership, or corporation” with “greater than $50,000,000 … annual gross receipts”, or “possesses or controls personal information on more than— 1,000,000 consumers; or 1,000,000 consumer devices;”.

What does this Act have to say?

(1) AUTOMATED DECISION SYSTEM.—The term ‘‘automated decision system’’ means a computational process, including one derived from machine learning, statistics, or other data processing or artificial intelligence techniques, that makes a decision or facilitates human decision making, that impacts consumers.

That is all encompassing.

The following is what the Act is really all about, i.e., impact assessment.

(2) AUTOMATED DECISION SYSTEM IMPACT ASSESSMENT.—The term ‘‘automated decision system impact assessment’’ means a study evaluating an automated decision system and the automated decision system’s development process, including the design and training data of the automated decision system, for impacts on accuracy, fairness, bias, discrimination, privacy, and security that includes, at a minimum—

I think there is a typo in the following: “training, data” -> “training data”

(A) a detailed description of the automated decision system, its design, its training, data, and its purpose;

How many words are there in a “detailed description of the automated decision system”? I’m guessing the wording has to be something a consumer might be expected to understand. It would take a book to describe most systems, but I suspect that a page or two is what the Act’s proposers have in mind.

(B) an assessment of the relative benefits and costs of the automated decision system in light of its purpose, taking into account relevant factors, including—

Whose “benefits and costs”? Is the Act requiring that companies do a cost benefit analysis of their own projects? What are the benefits to the customer, compared to a company not using such a computerized approach? The main one I can think of is that the customer gets offered a service that would probably be too expensive to offer if the analysis was done manually.

The potential costs to the customer are listed next:

(i) data minimization practices;

(ii) the duration for which personal information and the results of the automated decision system are stored;

(iii) what information about the automated decision system is available to consumers;

This Act seems to be more about issues around data retention, privacy, and customers having the right to find out what data companies hold about them.

(iv) the extent to which consumers have access to the results of the automated decision system and may correct or object to its results; and

(v) the recipients of the results of the automated decision system;

What might the results be? A yes/no on a loan/job application decision, and product recommendations, are a few examples.

Some more potential costs to the customer:

(C) an assessment of the risks posed by the automated decision system to the privacy or security of personal information of consumers and the risks that the automated decision system may result in or contribute to inaccurate, unfair, biased, or discriminatory decisions impacting consumers; and

What is an “unfair” or “biased” decision? Machine learning finds patterns in data; when is a pattern in data considered to be unfair or biased?

In the UK, the sex discrimination act has resulted in car insurance companies not being able to offer women cheaper insurance than men (because women have less costly accidents). So the application form does not contain a gender question. But the applicant’s first name often provides a big clue as to their gender. So a similar Act in the UK would require that computer-based insurance quote generation systems did not make use of information on the applicant’s first name. There is other, less reliable, information that could be used to estimate gender, e.g., height, plays sport, etc.

Lots of very hard questions to be answered here.