Planet Crustaceans

This is a Planet instance for lobste.rs community feeds. To add/update an entry or otherwise improve things, fork this repo.

May 26, 2019

Gonçalo Valério (dethos)

Pixels Camp v3 May 26, 2019 06:24 PM

Like I did in previous years/versions, this year I again participated in Pixels.camp, a kind of conference plus hackathon. For those who aren’t aware, it is one of the biggest (if not the biggest) technology events in Portugal (from a technical perspective, and not counting the Web Summit).

So, as I did in previous editions, I’m gonna leave here a small list of the nicest talks I was able to attend.

Lockpicking versus IT security

This one was super interesting. Walter Belgers showed the audience a set of problems in how locks are made and compared those mistakes with the ones regularly made by software developers.

At least for me, the most impressive parts of the whole presentation were the demonstrations of the flaws in regular (and high-security) locks.

Talk description here.


Containers 101

“Everybody” uses containers nowadays. In this talk the speaker took a step back and went through the history and the major details behind this technology. Then he showed how you could implement a part of it yourself using common Linux features and tools.

I will add the video here, as soon as it becomes available online.

Talk description here.


Static and dynamic analysis of events for threat detection

This one was a nice overview of Siemens’ infrastructure for threat detection, their approaches, and the tools they use. It was also possible to understand some of the obstacles and challenges a company must address to protect a global infrastructure.

Talk description here.


Protecting Crypto exchanges from a new wave of man-in-the-browser attacks

This presentation used the theme of protecting crypto-currency exchanges but gave lots of good hints on how to improve the security of any website or web application. The second half of the talk covered a kind of attack called man-in-the-browser and a demonstration of it. In my opinion, this last part was weaker; I left with the impression that it lacked details about the most crucial part of the attack while spending a lot of time on less important material.

Talk description here.

May 25, 2019

Indrek Lasn (indreklasn)

This article is wrong, implausible and fully misinformed. May 25, 2019 10:42 AM

This article is wrong, implausible and fully misinformed. Plenty of 20-year-olds are successful CEOs. To give you a couple of examples of successful companies founded by young CEOs:

Michael Dell, 20's
Mark Zuckerberg, 20's
Bill Gates, 20's
Steve Jobs, 20's

Do you need more proof? And yes, I’m in my 20's.

May 24, 2019

Siddhant Goel (siddhantgoel)

Not everything needs to be async May 24, 2019 10:00 PM

Writing asynchronous code is popular these days. Look at this search trend from the last 5 years.

[Search-trend chart for “Async Python”]

I have the feeling that the number of tutorials on the internet explaining asynchronous code has increased quite a bit since Python started supporting the async/await keywords. Even though Python has always had support for running asynchronous code using the asyncore module (or using libraries like Twisted), I don't think that asyncore was used as much as the new asyncio. This is a pure gut feeling though; I have no numbers to back that claim up.

Anyway, asyncio makes it slightly easier to write asynchronous code. Slightly, because I don't know if I can call the API intuitive, or dare I say, "Pythonic". This article does a much better job of explaining why asyncio is what it is.

Even if we put asyncio aside, I don't think asynchronous code is ever easy. There's just so much going on under the hood that it's difficult to keep your head from spinning, before you can actually get to writing the application logic.

But that's not what this blog post is about. This blog post is about how not everything needs to be async, and about why, if some code you're working on absolutely must be async, it makes sense to stop for a minute and consider the consequences of introducing this extra level of complexity.

This has nothing to do with Python, or asyncio, or any async framework in general. All I want to say is: if you think you want to write asynchronous code, think twice.

Synchronous is much simpler

Synchronous code is simple to write. It's also much easier to reason about, and it's a lot less likely to contain concurrency or thread-safety bugs than asynchronous code. As programmers, our job is to solve business problems reliably in the least possible time. Synchronous code fits those criteria quite well. So if I'm given a choice between writing synchronous or asynchronous code, I can say with a reasonable amount of confidence that I'll prefer synchronous.

Would async really help?

Next, if asynchronous code is absolutely required, it makes sense to think about what it's going to do underneath, and what performance gains it's going to bring.

For instance, if you're writing a web request handler which calls out to a few external APIs and combines those responses to finally return a response to your user, then yes, asynchronous code would absolutely help. The time that the external resources make your request handler wait can be used to serve other user requests.
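
As a rough sketch of that first case (not from the original post; asyncio.sleep stands in for the external API calls), the handler below overlaps its waits, so the total time is roughly that of the slowest call rather than the sum:

import asyncio

async def call_api(name, delay):
    # stand-in for an external HTTP call; while this coroutine
    # waits, the event loop is free to serve other requests
    await asyncio.sleep(delay)
    return {name: "ok"}

async def handler():
    # both "API calls" wait concurrently: ~0.5s total, not 0.8s
    billing, profile = await asyncio.gather(
        call_api("billing", 0.3),
        call_api("profile", 0.5),
    )
    return {**billing, **profile}

print(asyncio.run(handler()))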

On the other hand, if your request handler is fetching a few rows from a database server running on the same machine as the app server, making it async is not going to make much of a difference.

Is it safe?

Oftentimes we end up using abstractions that hide away the implementation details and provide a nice API for us to work with. In these cases, it's important to know what exactly is being hidden, and how that abstraction works underneath.

For example, Python provides an abstraction called ThreadPoolExecutor, which allows you to run functions in separate threads (there is also ProcessPoolExecutor which lets you separate things on a process-level).

The way this works is that you submit a callable to the pool, and the pool immediately returns a Future object. When the function has finished running, the result (or the exception) is stored in this Future object.

Since there are Future objects involved (which you can await on), it can be tempting to use this abstraction to write async code. But because there are now multiple threads involved, it's not that simple anymore. The functions being submitted to the thread pool should only make use of resources that are thread-safe. If two callables submitted to the pool both reference a particular object which is not thread-safe, there's potential for weird concurrency bugs.
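
Here is a minimal sketch of that submit/Future flow (not from the original post; the fetch helper and URL are made up for illustration):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # runs in a worker thread; anything it touches must be thread-safe
    with urllib.request.urlopen(url) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a Future immediately, before fetch() finishes
    future = pool.submit(fetch, "https://example.com")
    # result() blocks until the function completes, then returns
    # the value (or re-raises the stored exception)
    print(future.result())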


Closing thoughts - async is useful (and cool), but there is a time and place for everything. It may result in increased CPU utilization without necessarily bringing speed improvements, so it's helpful to keep that in mind when writing async code.

Gustaf Erikson (gerikson)

March May 24, 2019 07:53 AM

Sketches for the summer - Bosön, March 2019

Kristallvertikalaccent in green - Stockholm, March 2019

Mar 2018 | Mar 2017 | Mar 2016 | Mar 2015 | Mar 2014 | Mar 2013 | Mar 2012 | Mar 2011 | Mar 2010 | Mar 2009

May 23, 2019

Joe Nelson (begriffs)

Unicode programming, with examples May 23, 2019 12:00 AM

Most programming languages evolved awkwardly during the transition from ASCII to 16-bit UCS-2 to full Unicode. They contain internationalization features that often aren’t portable or don’t suffice.

Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may look visually confusable.

Human languages are highly varied and internally inconsistent, and any application which treats strings as more than an opaque byte stream must embrace the complexity. Realistically this means using a mature third-party library.

This article illustrates text processing ideas with example programs. We’ll use the International Components for Unicode (ICU) library, which is mature, portable, and powers the international text processing behind many products and operating systems.

IBM (the maintainers of ICU) officially support a C, C++ and Java API. We’ll use the C API here for a better view into the internals. Many languages have bindings to the library, so these concepts should be applicable to your language of choice.

Concepts

Before getting into the example code, it’s important to learn the terminology. Let’s start at the most basic question.

What is a “character?”

“Character” is an overloaded term. What a native speaker of a language identifies as a letter or symbol is often stored as multiple values in the internal Unicode representation. The representation is further obscured by an additional encoding in memory, on disk, or during network transmission.

Let’s start at the abstraction closest to the user: the grapheme cluster. A “grapheme” is a graphical unit that a reader recognizes as a single element of the writing system. It’s the character as a user would understand it. For example, 山, ä and క్క are graphemes. Pieces of a single grapheme always stay together in print; breaking them apart is either nonsense or changes the meaning of the symbol. They are rendered as “glyphs,” i.e. markings on paper or screen which vary by font, style, or position in a word.

You might imagine that Unicode assigns each grapheme a unique number, but that is not true. It would be wasteful because there is a combinatorial explosion between letters and diacritical marks. For instance (o, ô, ọ, ộ) and (a, â, ạ, ậ) follow a pattern. Rather than assigning a distinct number to each, it’s more efficient to assign a number to o and a, and then to each of the combining marks. The graphemes can be built from letters and combining marks e.g. ậ = a + ◌̂ + ◌̣.

In reality Unicode takes both approaches. It assigns numbers to basic letters and combining marks, but also to some of their more common combinations. Many graphemes can thus be created in more than one way. For instance ộ can be specified in five ways:

  • A: U+006f (o) + U+0302 (◌̂) + U+0323 (◌̣)
  • B: U+006f (o) + U+0323 (◌̣) + U+0302 (◌̂)
  • C: U+00f4 (ô) + U+0323 (◌̣)
  • D: U+1ecd (ọ) + U+0302 (◌̂)
  • E: U+1ed9 (ộ)

The numbers (written U+xxxx) for each abstract character and each combining symbol are called “codepoints.” Every Unicode string is expressed as a list of codepoints. As illustrated above, multiple strings of codepoints may render into the same sequence of graphemes.

To meaningfully compare strings codepoint by codepoint for equality, both strings should be represented in a consistent way. A standardized choice of codepoint decomposition for graphemes is called a “normal form.”

One choice is to decompose a string into as many codepoints as possible, like examples A and B (with a defined ordering determining which combining marks come first). That is called Normalization Form Canonical Decomposition (NFD). Another choice is to do the opposite and use the fewest codepoints possible, like example E. This is called Normalization Form Canonical Composition (NFC).
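
As a quick illustration of normal forms in code (a minimal sketch, not from the original article), ICU’s unorm2 API can normalize the fully decomposed form A into the single codepoint of example E:

/*** nfc.c: a minimal sketch (not from the original article) ***/

#include <stdlib.h>

#include <unicode/unorm2.h>
#include <unicode/ustdio.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* example A: o + combining circumflex + combining dot below */
	UChar decomposed[] = {0x006f, 0x0302, 0x0323, 0};
	UChar composed[8];
	int32_t len;
	/* shared immutable instance; do not close it */
	const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);

	if (U_FAILURE(status))
		return EXIT_FAILURE;

	len = unorm2_normalize(nfc, decomposed, -1,
	                       composed, 8, &status);
	if (U_FAILURE(status))
		return EXIT_FAILURE;

	/* prints "1 code unit(s): 1ed9", i.e. example E */
	u_printf("%d code unit(s): %x\n", len, (int)composed[0]);
	return EXIT_SUCCESS;
}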

A core concept to remember is that, although codepoints are the building blocks of text, they don’t match up 1-1 with user-perceived characters (graphemes). Operations such as taking the length of an array of codepoints, or accessing arbitrary array positions are typically not useful for Unicode programs. Programs must also be mindful of the combining characters, like diacritical marks, when inserting or deleting codepoints. Inserting U+0061 into the asterisk position U+006f U+0302 (*) U+0323 changes the string “ộ” into “ôạ” rather than “ộa”.

Glyphs vs graphemes

It’s not just fonts that cause graphemes to be rendered into varying glyphs. The rules of some languages cause glyphs to change through contextual shaping. For instance the Arabic letter “heh” has four forms, depending on which sides are flanked by letters. When isolated it appears as ﻩ and in the final/initial/medial position in a word it appears as ﻪ/ﻫ/ﻬ respectively. Similarly, Greek displays lower-case sigma differently at the end of the word (final form) than elsewhere. Some glyphs change based on visual order. In a right-to-left language the starting parenthesis “(” mirrors to display as “)”.

Not only do individual graphemes’ glyphs vary, graphemes can combine to form single glyphs. One way is through ligatures. The Latin letters “fi” often join the dot of the i with the curve of the f (presentation form U+FB01 fi). Another way is language irregularity. The Arabic ا and ل, when contiguous, must form ﻻ.
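
ICU exposes this shaping step through u_shapeArabic in ushape.h. Here is a minimal sketch (not from the original article) that should map lam followed by alef to the single lam-alef presentation form:

/*** shape.c: a minimal sketch (not from the original article) ***/

#include <stdlib.h>

#include <unicode/ushape.h>
#include <unicode/ustdio.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* U+0644 ARABIC LETTER LAM, then U+0627 ARABIC LETTER ALEF,
	 * in logical order */
	UChar src[] = {0x0644, 0x0627, 0};
	UChar dst[4] = {0};

	/* replace letters with their contextual presentation forms */
	u_shapeArabic(src, 2, dst, 4,
	              U_SHAPE_LETTERS_SHAPE | U_SHAPE_TEXT_DIRECTION_LOGICAL,
	              &status);
	if (U_FAILURE(status))
		return EXIT_FAILURE;

	/* expected: fefb, the lam-alef ligature ﻻ */
	u_printf("%x\n", (int)dst[0]);
	return EXIT_SUCCESS;
}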

Conversely, a single grapheme can split into multiple glyphs. For instance, in some Indic languages vowels can split and surround preceding consonants. In Bengali, U+09CC ৌ surrounds U+09AE ম to become মৌ.

How are codepoints encoded?

In 1990, Unicode codepoints were 16 bits wide. That choice turned out to be too small for the symbols and languages people wanted to represent, so the committee extended the standard to 21 bits. That’s fine in the abstract, but how the 21 bits are stored in memory or communicated between computers depends on practical factors.

It’s an unusual memory size. Computer hardware doesn’t typically access memory in 21-bit chunks. Networking protocols, too, are better geared toward transmitting eight bits at a time. Thus, codepoints are broken into sequences of more conventionally sized blocks called code units for persistence on disk, transmission over networks, and manipulation in memory.

The Unicode Transformation Formats (UTF) describe different ways to map between codepoints and code units. The transformation formats are named after the bit width of their code units (7, 8, 16, or 32), as well as the endianness (BE or LE). For instance: UTF-8, or UTF-16BE. In addition to the UTFs, there’s another – more complex – encoding called Punycode. It is designed to conform with the limited ASCII character subset used for Internet host names.

A final bit of terminology. A “plane” is a contiguous group of 65,536 code points. There are 17 planes, identified by the numbers 0 through 16; a codepoint’s plane is simply its value divided by 0x10000. Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes (1 through 16) are called “supplementary planes.”

Which encoding should you choose?

For transmission and storage, use UTF-8. Programs which move ASCII data can handle it without modification. Machine endianness does not affect UTF-8, and the byte-sized units work well in networks and filesystems.

Some sites, like UTF-8 Everywhere, go even further and recommend using UTF-8 for internal manipulation of text in program memory. However, I would suggest you use whatever encoding your Unicode library favors for this. You’ll be performing operations through the library API, not directly on code units. As we’re seeing, there is too much complexity between glyphs, graphemes, codepoints and code units to be manipulating the units directly. Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.
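
A small sketch of that edge conversion (not from the original article), using u_strFromUTF8, which also appears in later examples, and its inverse u_strToUTF8:

/*** edges.c: a minimal sketch (not from the original article) ***/

#include <stdio.h>
#include <stdlib.h>

#include <unicode/ustring.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* "résumé" in UTF-8, as it might arrive from disk or network */
	const char *utf8_in = "r\xc3\xa9sum\xc3\xa9";
	UChar utf16[32];
	char utf8_out[32];

	/* convert at the input edge to the library's UTF-16 */
	u_strFromUTF8(utf16, 32, NULL, utf8_in, -1, &status);
	/* ... all real processing happens on utf16 via ICU ... */
	/* convert back at the output edge */
	u_strToUTF8(utf8_out, 32, NULL, utf16, -1, &status);
	if (U_FAILURE(status))
		return EXIT_FAILURE;

	puts(utf8_out);
	return EXIT_SUCCESS;
}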

It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s true that every code unit can hold a full codepoint. However, the relationship between codepoints and glyphs isn’t straightforward, so there isn’t a programmatic advantage to storing the string this way.

UTF-32 also wastes at minimum 11 (32 - 21) bits per codepoint, and typically more. For instance, UTF-16 requires only one 16-bit code unit to encode points in the Basic Multilingual Plane (the most commonly encountered points). Thus UTF-32 typically doubles the space required for BMP text.

There are times to manipulate UTF-32, such as when examining a single codepoint. We’ll see examples below.

ICU example programs

The programs in this article are ready to compile and run. They require the ICU C library called ICU4C, which is available on most platforms through the operating system package manager.

ICU provides five libraries for linking (we need the first two):

Package   Contents
icu-uc    Common (uc) and Data (dt/data) libraries
icu-io    Ustdio/iostream library (icuio)
icu-i18n  Internationalization (in/i18n) library
icu-le    Layout Engine
icu-lx    Paragraph Layout

To use ICU4C, set the compiler and linker flags with pkg-config in your Makefile. (Pkg-config may also need to be installed on your computer.)

CFLAGS  = -std=c99 -pedantic -Wall -Wextra \
          `pkg-config --cflags icu-uc icu-io`
LDFLAGS = `pkg-config --libs icu-uc icu-io`

The examples in this article conform to the C89 standard, but we specify C99 in the Makefile because the ICU header files use C99-style (//) comments.

Generating random codepoints

To start getting a feel for ICU’s I/O and codepoint manipulation, let’s make a program to output completely random (but valid) codepoints. You could use this program as a basic fuzz tester, to see whether its output confuses other programs. A real fuzz tester ought to have the ability to take an explicit seed for repeatable output, but we will omit that functionality from our simple demo.

This program has limited portability because it gets entropy from /dev/urandom, a Unix device. To generate good random numbers using only the C standard library, see my other article. Also POSIX provides pseudo-random number functions.

/* for constants like EXIT_FAILURE */
#include <stdlib.h>
/* we'll be using standard C I/O to read random bytes */
#include <stdio.h>

/* to determine codepoint categories */
#include <unicode/uchar.h>
/* to output UTF-32 codepoints in proper encoding for terminal */
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
	long i = 0, linelen;
	/* somewhat non-portable: /dev/urandom is unix specific */
	FILE *f = fopen("/dev/urandom", "rb");
	UFILE *out;
	/* UTF-32 code unit can hold an entire codepoint */
	UChar32 c;
	/* to learn about c */
	UCharCategory cat;

	if (!f)
	{
		fputs("Unable to open /dev/urandom\n", stderr);
		return EXIT_FAILURE;
	}

	/* optional length to insert line breaks */
	linelen = argc > 1 ? strtol(argv[1], NULL, 10) : 0;

	/* have to obtain a Unicode-aware file handle. This function
	 * has no failure return code, it always works. */
	out = u_get_stdout();

	/* read a random 32 bits, presumably forever */
	while (fread(&c, sizeof c, 1, f))
	{
		/* Scale 32-bit value to a number within code planes
		 * zero through fourteen. (Planes 15-16 are private-use)
		 *
		 * The modulo bias is insignificant. The first 65535
		 * codepoints are minutely favored, being generated by
		 * 4370 different 32-bit numbers each. The remaining
		 * 917505 codepoints are generated by 4369 numbers each.
		 */
		/* use unsigned math: the random 32-bit read may be negative */
		c = (UChar32)((uint32_t)c % 0xF0000);
		cat = u_charType(c);

		/* U_UNASSIGNED are "non-characters" with no assigned
		 * meanings for interchange. U_PRIVATE_USE_CHAR are
		 * reserved for use within organizations, and
		 * U_SURROGATE are designed for UTF-16 code units in
		 * particular. Don't print any of those. */
		if (cat != U_UNASSIGNED && cat != U_PRIVATE_USE_CHAR &&
		    cat != U_SURROGATE)
		{
			u_fputc(c, out);
			if (linelen && ++i >= linelen)
			{
				i = 0;
				/* there are a number of Unicode
				 * linebreaks, but the standard ASCII
				 * \n is valid, and will interact well
				 * with a shell */
				u_fputc('\n', out);
			}
		}
	}

	/* should never get here */
	fclose(f);
	return EXIT_SUCCESS;
}

A note about the mysterious U_UNASSIGNED category, the “non-characters.” These are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. The Unicode Standard sets aside 66 non-character code points. The last two code points of each plane are noncharacters (U+FFFE and U+FFFF on the BMP). In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0…U+FDEF.

Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. They are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.

Manipulating codepoints

We discussed non-characters in the previous section, but there are also Private Use codepoints. Unlike non-characters, those for private use are designated for interchange between systems. However, the precise meanings and glyphs of these characters are specific to the organization using them. The same codepoints can be used for different things by different people.

Unicode provides a large area for private use: both a small block in the BMP, and two entire planes, 15 and 16. Because no browser or text editor will render PUA codepoints beyond (typically) empty boxes, we can exploit plane 15 to make a visually confusing code. Ultimately it’s a cheesy transposition cypher, but it’s kind of fun.

Below is a program to shift characters in the BMP to/from plane 15, the Private Use Area A. Example output of an encoded string: 󰁂󰁥󰀠󰁳󰁵󰁲󰁥󰀠󰁴󰁯󰀠󰁤󰁲󰁩󰁮󰁫󰀠󰁹󰁯󰁵󰁲󰀠󰁏󰁶󰁡󰁬󰁴󰁩󰁮󰁥󰀡󰀊

#include <stdio.h>
#include <stdlib.h>
/* for strcmp in argument parsing */
#include <string.h>

#include <unicode/ustdio.h>

void usage(const char *prog)
{
	puts("Shift base multilingual plane to/from PUA-A\n");
	printf("Usage: %s [-d]\n\n", prog);
	puts("Encodes stdin (or decode with -d)");
	exit(EXIT_SUCCESS);
}

int main(int argc, char **argv)
{
	UChar32 c;
	UFILE *in, *out;
	enum { MODE_ENCODE, MODE_DECODE } mode = MODE_ENCODE;

	if (argc > 2)
		usage(argv[0]);
	else if(argc > 1)
	{
		if (strcmp(argv[1], "-d") == 0)
			mode = MODE_DECODE;
		else
			usage(argv[0]);
	}

	out = u_get_stdout();

	in = u_finit(stdin, NULL, NULL);
	if (!in)
	{
		fputs("Error opening stdout as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* u_fgetcx returns UTF-32. U_EOF happens to be 0xFFFF,
	 * not -1 like EOF typically is in stdio.h */
	while ((c = u_fgetcx(in)) != U_EOF)
	{
		/* -1 for UChar32 actually signifies invalid character */
		if (c == (UChar32)0xFFFFFFFF)
		{
			fputs("Invalid character.\n", stderr);
			continue;
		}
		if (mode == MODE_ENCODE)
		{
			/* Move the BMP into the Supplementary
			 * Private Use Area-A, which begins
			 * at codepoint 0xf0000 */
			if (0 < c && c < 0xe000)
				c += 0xf0000;
		}
		else
		{
			/* Move the Supplementary Private Use
			 * Plane down into the BMP */
			if (0xf0000 < c && c < 0xfe000)
				c -= 0xf0000;
		}
		u_fputc(c, out);
	}

	/* if you u_finit it, then u_fclose it */
	u_fclose(in);

	return EXIT_SUCCESS;
}

Examining UTF-8 code units

So far we’ve been working entirely with complete codepoints. This next example gets into their representation as code units in a transformation format, namely UTF-8. We will read a codepoint as a hexadecimal program argument, convert it to between one and four UTF-8 bytes, and print the hex values of those bytes.

/*** utf8.c ***/

#include <stdio.h>
#include <stdlib.h>

#include <unicode/utf8.h>

int main(int argc, char **argv)
{
	UChar32 c;
	/* ICU defines its own bool type to be used
	 * with their macro */
	UBool err = FALSE;
	/* ICU uses C99 types like uint8_t */
	uint8_t bytes[4] = {0};
	/* probably should be size_t not int32_t, but
	 * just matching what their macro expects */
	int32_t written = 0, i;
	char *parsed;

	if (argc != 2)
	{
		fprintf(stderr, "Usage: %s codepoint\n", *argv);
		exit(EXIT_FAILURE);
	}
	c = strtol(argv[1], &parsed, 16);
	if (!*argv[1] || *parsed)
	{
		fprintf(stderr,
			"Cannot parse codepoint: U+%s\n", argv[1]);
		exit(EXIT_FAILURE);
	}

	/* this is a macro, and updates the variables
	 * directly. No need to pass addresses.
	 * We're saying: write to "bytes", tell us how
	 * many were "written", limit it to four */
	U8_APPEND(bytes, written, 4, c, err);
	if (err == TRUE)
	{
		fprintf(stderr, "Invalid codepoint: U+%s\n", argv[1]);
		exit(EXIT_FAILURE);
	}

	/* print in format 'xxd -r' can read */
	printf("0: ");
	for (i = 0; i < written; ++i)
		printf("%2x", bytes[i]);
	puts("");
	return EXIT_SUCCESS;
}

Suppose you compile this to a program named utf8. Here are some examples:

# ascii characters are unchanged
$ ./utf8 61
0: 61

# other codepoints require more bytes
$ ./utf8 1F41A
0: f09f909a

# format is compatible with "xxd"
$ ./utf8 1F41A | xxd -r
🐚

# surrogates (used in UTF-16) are not valid codepoints
$ ./utf8 DC00
Invalid codepoint: U+DC00

Reading lines into internal UTF-16 representation

Unlimited line length

Here’s a useful helper function named u_wholeline() which reads a line of any length into a dynamically allocated buffer. It reads as UChar*, which is ICU’s standard UTF-16 code unit array.

/* to properly test realloc */
#include <errno.h>
#include <stdlib.h>

#include <unicode/ustdio.h>

/* line Feed, vertical tab, form feed, carriage return,
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
	((c) >= 0xa && (c) <= 0xd) || \
	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

/* allocates buffer, caller must free */
UChar *u_wholeline(UFILE *f)
{
	/* assume most lines are shorter
	 * than 128 UTF-16 code units */
	size_t i, sz = 128;
	UChar c, *s = malloc(sz * sizeof(*s)), *s_new;

	if (!s)
		return NULL;

	/* u_fgetc returns UTF-16, unlike u_fgetcx */
	for (i = 0; (s[i] = u_fgetc(f)) != U_EOF && !NEWLINE(s[i]); ++i)
		if (i + 1 >= sz)
		{
			/* double the buffer when it runs out */
			sz *= 2;
			errno = 0;
			s_new = realloc(s, sz * sizeof(*s));
			if (errno == ENOMEM)
				free(s);
			if ((s = s_new) == NULL)
				return NULL;
		}

	/* if terminated by CR, eat LF */
	if (s[i] == 0xd && (c = u_fgetc(f)) != 0xa)
		u_fungetc(c, f);
	/* s[i] will either be U_EOF or a newline; wipe it */
	s[i] = '\0';

	return s;
}

Limited line length

The previous example reads an entire line. However, reading a limited number of code units from UTF-16 lines is trickier. Truncating a Unicode string is always a little dangerous, since it may split a word or break contextual shaping.

UTF-16 also has surrogate pairs, which are how that transformation format expresses codepoints outside the BMP. Without proper precautions, ending a UTF-16 string early can split a surrogate pair.

The following example reads lines in chunks of at most three UTF-16 code units at a time. If it reads two consecutive codepoints from supplementary planes it will fail. The program accepts a “fix” argument to make it push a final unpaired surrogate back onto the stream for a future read.

/*** codeunit.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>

/* BUFSZ set to be very small so that lines must be read in
 * many chunks. Helps illustrate split surrogate pairs */
#define BUFSZ 4

void printHex(const UChar *s)
{
	while (*s)
		printf("%x ", *s++);
	putchar('\n');
}

/* yeah, slightly annoying duplication */
void printHex32(const UChar32 *s)
{
	while (*s)
		printf("%x ", *s++);
	putchar('\n');
}

int main(int argc, char **argv)
{
	UFILE *in;
	/* read line into ICU's default UTF-16 representation */
	UChar line[BUFSZ];
	/* A buffer to hold codepoints of "line" as UTF-32 code
	 * units.  The length is sufficient because it requires
	 * fewer (or at least no greater) code units in UTF-32 to
	 * encode the string */
	UChar32 codepoints[BUFSZ];
	UChar *final;
	UErrorCode err = U_ZERO_ERROR;

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* read lines one small BUFSZ chunk at a time */
	while (u_fgets(line, BUFSZ, in))
	{
		/* correct for split surrogate pairs only
		 * if the "fix" argument is present */
		if (argc > 1 && strcmp(argv[1], "fix") == 0)
		{
			final = line + u_strlen(line);
			/* want to consider the character before \0
			 * if such exists */
			if (final > line)
				final--;
			/* if it is the lead unit of a surrogate pair */
			if (U16_IS_LEAD(*final))
			{
				/* push it back for a future read, and
				 * truncate the string */
				u_fungetc(*final, in);
				*final = '\0';
			}
		}

		printf("UTF-16    : ");
		printHex(line);
		u_strToUTF32(
			codepoints, BUFSZ, NULL,
			line, -1, &err);
		printf("Error?    : %s\n", u_errorName(err));
		printf("Codepoints: ");
		printHex32(codepoints);

		/* reset potential errors and go for another chunk */
		err = U_ZERO_ERROR;
		*codepoints = '\0';
	}

	u_fclose(in);
	return EXIT_SUCCESS;
}

If the program reads the two weird numerals 𝟘𝟙 (different from 01), neither of which is in the BMP, it finds one codepoint but chokes on the broken pair:

$ echo -n 𝟘𝟙 | ./codeunit
UTF-16    : d835 dfd8 d835
Error?    : U_INVALID_CHAR_FOUND
Codepoints: 1d7d8
UTF-16    : dfd9
Error?    : U_INVALID_CHAR_FOUND
Codepoints:

However if we pass the “fix” argument, the program will read two complete codepoints:

$ echo -n 𝟘𝟙 | ./codeunit fix
UTF-16    : d835 dfd8
Error?    : U_ZERO_ERROR
Codepoints: 1d7d8
UTF-16    : d835 dfd9
Error?    : U_ZERO_ERROR
Codepoints: 1d7d9

Perhaps a better way to read a line with limited length is to use a “break iterator” to stop on a word boundary. We’ll see more about that later.

Extracting, iterating codepoints in UTF-16 string

Our next example will rather laboriously remove diacritical marks from a string. There’s an easier way to do this called “transformation,” but doing it manually provides an opportunity to decompose characters and iterate over them with the U16_NEXT macro.

/*** nomarks.c ***/

#include <stdlib.h>

#include <unicode/uchar.h>
#include <unicode/unorm2.h>
#include <unicode/ustdio.h>
#include <unicode/utf16.h>

/* Limit to how many decomposed UTF-16 units a single
 * codepoint will become in NFD. I don't know the
 * correct value here so I chose a value that seems
 * to be overkill */
#define MAX_DECOMP_LEN 16

int main(void)
{
	long i, n;
	UChar32 c;
	UFILE *in, *out;
	UChar decomp[MAX_DECOMP_LEN];
	UErrorCode status = U_ZERO_ERROR;
	UNormalizer2 *norm;

	out = u_get_stdout();

	in = u_finit(stdin, NULL, NULL);
	if (!in)
	{
		/* using stdio functions with stderr and ustdio
		 * with stdout. Mixing the two on a single file
		 * handle would probably be bad. */
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* create a normalizer, in this case one going to NFD */
	norm = (UNormalizer2 *)unorm2_getNFDInstance(&status);
	if (U_FAILURE(status)) {
		fprintf(stderr,
			"unorm2_getNFDInstance(): %s\n",
			u_errorName(status));
		return EXIT_FAILURE;
	}

	/* consume input as UTF-32 units one by one */
	while ((c = u_fgetcx(in)) != U_EOF)
	{
		/* Decompose c to isolate its n combining character
		 * codepoints. Saves them as UTF-16 code units.  FYI,
		 * this function ignores the type of "norm" and always
		 * denormalizes */
		n = unorm2_getDecomposition(
			norm, c, decomp, MAX_DECOMP_LEN, &status
		);

		if (U_FAILURE(status)) {
			fprintf(stderr,
				"unorm2_getDecomposition(): %s\n",
				u_errorName(status));
			u_fclose(in);
			return EXIT_FAILURE;
		}

		/* if c does not decompose and is not itself
		 * a diacritical mark */
		if (n < 0 && ublock_getCode(c) !=
		    UBLOCK_COMBINING_DIACRITICAL_MARKS)
			u_fputc(c, out);

		/* walk canonical decomposition, reuse c variable */
		for (i = 0; i < n; )
		{
			/* the U16_NEXT macro iterates over UChar (aka
			 * UTF-16, advancing by one or two elements as
			 * needed to get a codepoint. It saves the result
			 * in UTF-32. The macro updates i and c. */
			U16_NEXT(decomp, i, n, c);
			/* output only if not combining diacritical */
			if (ublock_getCode(c) !=
			    UBLOCK_COMBINING_DIACRITICAL_MARKS)
				u_fputc(c, out);
		}
	}

	u_fclose(in);
	/* u_get_stdout() doesn't need to be u_fclose'd */
	return EXIT_SUCCESS;
}

Here’s an example of running the program:

$ echo "résumé façade" | ./nomarks
resume facade

Transformation

ICU provides a rich domain specific language for transforming strings. For example, our entire program in the previous section can be replaced by the transformation NFD; [:Nonspacing Mark:] Remove; NFC. This means to perform a canonical decomposition, remove nonspacing marks, and then canonically compose again. (In fact our program above didn’t re-compose.)

The program below echoes stdin to stdout, but passes the output through a transformation.

/*** trans-stream.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

int main(int argc, char **argv)
{
	UChar32 c;
	UParseError pe;
	UFILE *in, *out;
	UTransliterator *t;
	UErrorCode status = U_ZERO_ERROR;
	UChar *xform_id;
	size_t n;

	if (argc != 2)
	{
		fprintf(stderr,
			"Usage: %s \"translation rules\"\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* the UTF-16 string should never be longer than the UTF-8
	 * argv[1], so this should be safe */
	n = strlen(argv[1]) + 1;
	xform_id = malloc(n * sizeof(UChar));
	u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

	/* create transliterator by identifier */
	t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
	                 NULL, -1, &pe, &status);
	/* don't need the identifier any more */
	free(xform_id);
	if (U_FAILURE(status)) {
		fprintf(stderr, "utrans_open(%s): %s\n",
		        argv[1], u_errorName(status));
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* transparently transliterate stdout */
	u_fsettransliterator(out, U_WRITE, t, &status);
	if (U_FAILURE(status)) {
		fprintf(stderr,
		        "Failed to set transliterator on stdout: %s\n",
		        u_errorName(status));
		u_fclose(in);
		return EXIT_FAILURE;
	}

	/* what looks like a simple echo loop actually
	 * transliterates characters */
	while ((c = u_fgetcx(in)) != U_EOF)
		u_fputc(c, out);

	utrans_close(t);
	u_fclose(in);
}

As mentioned, it can emulate our earlier “nomarks” program:

$ echo "résumé façade" | ./trans "NFD; [:Nonspacing Mark:] Remove; NFC"
resume facade

It can also transliterate between scripts like this:

$ echo "miirekkaḍiki veḷutunnaaru?" | ./trans "Telugu"
మీరెక్కడికి వెళుతున్నఅరు?

Applying the transformation to a stream with u_fsettransliterator is a simple way to do things. However I did discover and file an ICU bug which will be fixed in version 65.1.

A more robust way to apply transformations is by manipulating UChar strings directly. The technique is also probably more applicable in real applications.

Here’s a rewrite of trans-stream that operates on strings directly:

/*** trans-string.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

/* max number of UTF-16 code units to accumulate while looking
 * for an unambiguous transliteration. Has to be fairly long to
 * handle names in Name-Any transliteration like
 * \N{LATIN CAPITAL LETTER O WITH OGONEK AND MACRON} */
#define CONTEXT 100

int main(int argc, char **argv)
{
	UErrorCode status = U_ZERO_ERROR;
	UChar c, *end;
	UChar input[CONTEXT] = {0}, *buf, *enlarged;
	UFILE *in, *out; 
	UTransPosition pos;
	int32_t width, sizeNeeded, bufLen;

	size_t n;
	UChar *xform_id;
	UTransliterator *t;

	/* bufLen must be able to hold at least CONTEXT, and
	 * will be increased as needed for transliteration */
	bufLen = CONTEXT;
	buf = malloc(sizeof(UChar) * bufLen);

	if (argc != 2)
	{
		fprintf(stderr,
			"Usage: %s \"translation rules\"\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* allocate and read identifier, like earlier example */
	n = strlen(argv[1]) + 1;
	xform_id = malloc(n * sizeof(UChar));
	u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

	t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
	                 NULL, -1, NULL, &status);
	free(xform_id);
	if (U_FAILURE(status)) {
		fprintf(stderr, "utrans_open(%s): %s\n",
		        argv[1], u_errorName(status));
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	end = input;
	/* append UTF-16 code units one at a time for incremental
	 * transliteration */
	while ((c = u_fgetc(in)) != U_EOF)
	{
		/* we consider at most CONTEXT consecutive code units
		 * for transliteration (minus one for \0) */
		if (end - input >= CONTEXT-1)
		{
			fprintf(stderr,
				"Exceeded max (%i) code units "
				"for context.\n",
				CONTEXT);
			break;
		}
		*end++ = c;
		*end = '\0';

		/* copy string so far to buf to operate on */
		u_strcpy(buf, input);
		pos.start = pos.contextStart = 0;
		pos.limit = pos.contextLimit = end - input;
		sizeNeeded = -1;
		utrans_transIncrementalUChars(
			t, buf, &sizeNeeded, bufLen, &pos, &status
		);
		/* if buf not big enough for transliterated result */
		if (status == U_BUFFER_OVERFLOW_ERROR)
		{
			/* utrans_transIncrementalUChars sets sizeNeeded,
			 * so resize the buffer */
			if ((enlarged =
			     realloc(buf, sizeof(UChar)*sizeNeeded))
			    == NULL)
			{
				fprintf(stderr,
					"Unable to grow buffer.\n");
				/* fail gracefully and display
				 * what we can */
				break;
			}
			buf = enlarged;
			bufLen = sizeNeeded;
			u_strcpy(buf, input);
			pos.start = pos.contextStart = 0;
			pos.limit = pos.contextLimit = end - input;
			sizeNeeded = -1;

			/* one more time, but with sufficient space */
			status = U_ZERO_ERROR;
			utrans_transIncrementalUChars(
				t, buf, &sizeNeeded, bufLen,
				&pos, &status
			);
		}
		/* handle errors other than U_BUFFER_OVERFLOW_ERROR */
		if (U_FAILURE(status)) {
			fprintf(stderr,
				"utrans_transIncrementalUChars(): %s\n",
				u_errorName(status));
			break;
		}

		/* print buf[0 .. pos.start - 1] */
		u_printf("%.*S", pos.start, buf);

		/* Remove the code units which were processed,
		 * shifting back the remaining ones which could
		 * not be unambiguously transliterated. Then hit
		 * the loop to get another code unit and try again. */
		u_strcpy(input, buf+pos.start);
		end = input + (pos.limit - pos.start);
	}

	/* if any leftovers from incremental transliteration */
	if (end > input)
	{
		/* transliterate input array in place, do our best */
		width = end - input;
		utrans_transUChars(
			t, input, NULL, CONTEXT, 0, &width, &status);
		u_printf("%S", input);
	}

	utrans_close(t);
	u_fclose(in);
	free(buf);
	return U_SUCCESS(status) ? EXIT_SUCCESS : EXIT_FAILURE;
}

Punycode

Punycode is a representation of Unicode within the limited ASCII character subset used for internet host names. If you enter a non-ASCII URL into a web browser navigation bar, the browser translates to Punycode before making the actual DNS lookup.

The encoding is part of the more general process of Internationalizing Domain Names in Applications (IDNA), which also normalizes the string.

Note that not all Unicode strings can be successfully encoded. For instance, codepoints like “⒈” include a period in the glyph and are used for numbered lists. Converting that dot into a period in the ASCII hostname would inadvertently specify a subdomain. ICU turns the offending character into U+FFFD (the “replacement character”) in the output and returns an error.

The following program uses uidna_nameToASCII or uidna_nameToUnicode as needed to translate between Unicode and Punycode.

/*** puny.c ***/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* uidna stands for International Domain Names in 
 * Applications and contains punycode routines */
#include <unicode/uidna.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

void chomp(UChar *s)
{
	/* unicode characters that split lines */
	UChar splits[] =
		{0xa, 0xb, 0xc, 0xd, 0x85, 0x2028, 0x2029, '\0'};
	if (s)
		s[u_strcspn(s, splits)] = '\0';
}

int main(int argc, char **argv)
{
	UFILE *in;
	UChar input[1024], output[1024];
	UIDNAInfo info = UIDNA_INFO_INITIALIZER;
	UErrorCode status = U_ZERO_ERROR;
	UIDNA *idna = uidna_openUTS46(UIDNA_DEFAULT, &status);

	/* default action is performing punycode */
	int32_t (*action)(
			const UIDNA*, const UChar*, int32_t, UChar*, 
			int32_t, UIDNAInfo*, UErrorCode*
		) = uidna_nameToASCII;

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* the "decode" option reverses our action */
	if (argc > 1 && strcmp(argv[1], "decode") == 0)
		action = uidna_nameToUnicode;

	/* u_fgets includes the newline, so we chomp it */
	u_fgets(input, sizeof(input)/sizeof(*input), in);
	chomp(input);

	action(idna, input, -1, output,
		sizeof(output)/sizeof(*output),
		&info, &status);

	if (U_SUCCESS(status) && info.errors!=0)
		fputs("Bad input.\n", stderr);

	u_printf("%S\n", output);

	uidna_close(idna);
	u_fclose(in);
	return 0;
}

Example of using the program:

$ echo "façade.com" | ./puny
xn--faade-zra.com

# not every string is allowed

$ echo "a⒈.com" | ./puny
Bad input.
a�.com

Changing case

The C standard library has functions like toupper which operate on a single character at a time. ICU has equivalents like u_toupper, but working on single codepoints isn’t sufficient for proper casing. Let’s examine the following program and see why.

/*** pointcase.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/uchar.h>
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
	UChar32 c;
	UFILE *in, *out;
	UChar32 (*op)(UChar32) = NULL;

	/* set op to one of the casing operations
	 * in uchar.h */
	if (argc < 2 || strcmp(argv[1], "upper") == 0)
		op = u_toupper;
	else if (strcmp(argv[1], "lower") == 0)
		op = u_tolower;
	else if (strcmp(argv[1], "title") == 0)
		op = u_totitle;
	else
	{
		fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* operates on UTF-32 */
	while ((c = u_fgetcx(in)) != U_EOF)
		u_fputc(op(c), out);

	u_fclose(in);
	return EXIT_SUCCESS;
}

# not quite right, ß should become SS:

$ echo "Die große Stille" | ./pointcase upper
DIE GROßE STILLE

# also wrong, final sigma should be ς:

$ echo "ΣΊΣΥΦΟΣ" | ./pointcase lower
σίσυφοσ

As you can see, some graphemes need to “expand” into a greater number of characters, and others are position-sensitive. To do this properly, we have to operate on entire strings rather than individual characters. Here is a program that does it right:

/*** strcase.c ***/

#include <locale.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 1024

/* wrapper function for u_strToTitle with signature
 * matching the other casing functions */
int32_t title(UChar *dest, int32_t destCapacity,
		const UChar *src, int32_t srcLength,
		const char *locale, UErrorCode *pErrorCode)
{
	return u_strToTitle(dest, destCapacity, src,
			srcLength, NULL, locale, pErrorCode);
}

int main(int argc, char **argv)
{
	UFILE *in;
	char *locale;
	UChar line[BUFSZ], cased[BUFSZ];
	UErrorCode status = U_ZERO_ERROR;
	int32_t (*op)(
			UChar*, int32_t, const UChar*, int32_t,
			const char*, UErrorCode*
		) = NULL;

	/* casing is locale-dependent */
	if (!(locale = setlocale(LC_CTYPE, "")))
	{
		fputs("Cannot determine system locale\n", stderr);
		return EXIT_FAILURE;
	}

	if (argc < 2 || strcmp(argv[1], "upper") == 0)
		op = u_strToUpper;
	else if (strcmp(argv[1], "lower") == 0)
		op = u_strToLower;
	else if (strcmp(argv[1], "title") == 0)
		op = title;
	else
	{
		fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* Ideally we should change case up to the last word
	 * break and push the remaining characters back for
	 * a future read if the line was longer than BUFSZ.
	 * Currently, if the string is truncated, the final
	 * character would incorrectly be considered
	 * terminal, which affects casing rules in Greek. */
	while (u_fgets(line, BUFSZ, in))
	{
		op(cased, BUFSZ, line, -1, locale, &status);
		/* if casing increases string length, and goes
		 * beyond buffer size like the german ß -> SS */
		if (status == U_BUFFER_OVERFLOW_ERROR)
		{
			/* Just issue a warning and read another line.
			 * Don't treat it as severely as other errors. */
			fputs("Line too long\n", stderr);
			status = U_ZERO_ERROR;
		}
		else if (U_FAILURE(status))
		{
			fputs(u_errorName(status), stderr);
			break;
		}
		else
			u_printf("%S", cased);
	}

	u_fclose(in);
	return U_SUCCESS(status)
		? EXIT_SUCCESS : EXIT_FAILURE;
}

This works better.

$ echo "Die große Stille" | ./strcase upper
DIE GROSSE STILLE

$ echo "ΣΊΣΥΦΟΣ" | ./strcase lower
σίσυφος

Counting words and graphemes

Let’s make a version of wc (the Unix word count program) that knows more about Unicode. Our version will properly count grapheme clusters and word boundaries.

For example, regular wc gets confused by the ancient Ogham script. This was a series of notches scratched into fence posts, and has a space character which is nonblank.

$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | wc
       1       1      37

One word, you say? Puh-leaze, if your program can’t handle Medieval Irish carvings then I want nothing to do with it. Here’s one that can:

/*** uwc.c ***/

#include <locale.h>
#include <stdlib.h>

#include <unicode/ubrk.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 512

/* line Feed, vertical tab, form feed, carriage return, 
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
	((c) >= 0xa && (c) <= 0xd) || \
	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

int main(void)
{
	UFILE *in;
	char *locale;
	UChar line[BUFSZ];
	UBreakIterator *brk_g, *brk_w;
	UErrorCode status = U_ZERO_ERROR;
	long ngraph = 0, nword = 0, nline = 0;
	size_t len;

	/* word breaks are locale-specific, so we'll obtain
	 * LC_CTYPE from the environment */
	if (!(locale = setlocale(LC_CTYPE, "")))
	{
		fputs("Cannot determine system locale\n", stderr);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* create an iterator for graphemes */
	brk_g = ubrk_open(
		UBRK_CHARACTER, locale, NULL, -1, &status);
	/* and another for the edges of words */
	brk_w = ubrk_open(
		UBRK_WORD, locale, NULL, -1, &status);

	/* yes, this is sensitive to splitting end of line
	 * surrogate pairs and can be improved by our previous
	 * function for reading bounded lines */
	while (u_fgets(line, BUFSZ, in))
	{
		len = u_strlen(line);

		ubrk_setText(brk_g, line, len, &status);
		ubrk_setText(brk_w, line, len, &status);

		/* Start at beginning of string, count breaks.
		 * Could have been a for loop, but this looks
		 * simpler to me. */
		ubrk_first(brk_g);
		while (ubrk_next(brk_g) != UBRK_DONE)
			ngraph++;

		ubrk_first(brk_w);
		while (ubrk_next(brk_w) != UBRK_DONE)
			if (ubrk_getRuleStatus(brk_w) ==
			    UBRK_WORD_LETTER)
				nword++;

		/* count the newline if it exists */
		if (len > 0 && NEWLINE(line[len-1]))
			nline++;
	}

	printf("locale  : %s\n"
	       "Grapheme: %zu\n"
	       "Word    : %zu\n"
	       "Line    : %zu\n",
	       locale, ngraph, nword, nline);

	/* clean up iterators after use */
	ubrk_close(brk_g);
	ubrk_close(brk_w);
	u_fclose(in);
}

Much better:

$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | ./uwc
locale  : en_US.UTF-8
Grapheme: 14
Word    : 4
Line    : 1

When comparing strings, we can be more or less strict. A familiar example is case sensitivity, but Unicode provides other options. Comparing strings for equality is a degenerate case of sorting: a sort must not only determine whether strings are equal, but also put unequal ones in order. Sorting is called “collation” and the Unicode collation algorithm supports multiple levels of increasing strictness.

Level       Description
Primary     base characters
Secondary   accents
Tertiary    case/variant
Quaternary  punctuation

Each level acts as a tie-breaker when strings match in previous levels. When searching we can choose how deep to check before declaring strings equal. To illustrate, consider a text file called words.txt containing these words:

Cooperate
coöperate
COÖPERATE
co-operate
ﬁnal
fides

We will write a program called ugrep, where we can specify a comparison level and search string. If we search for “cooperate” and allow comparisons up to the tertiary level it matches nothing:

$ ./ugrep 3 cooperate < words.txt
# it's an exact match, no results

It is possible to shift certain “ignorable” characters (like ‘-’) down to the quaternary level while conducting the original level 3 search:

$ ./ugrep 3i cooperate < words.txt
4: co-operate

Doing the same search at the secondary level disregards case, but is still sensitive to accents.

$ ./ugrep 2 cooperate < words.txt
1: Cooperate

Once again, we can allow ignorables at this level.

$ ./ugrep 2i cooperate < words.txt
1: Cooperate
4: co-operate

Finally, going only to the primary level, we match words with the same base letters, modulo case and accents.

$ ./ugrep 1 cooperate < words.txt
1: Cooperate
2: coöperate
3: COÖPERATE

Note that the idea of a “base character” is dependent on locale. In Swedish, the letters o and ö are quite distinct, and not minor variants as in English. Setting the locale prior to search restricts the results even at the primary level.

$ LC_COLLATE=sv_SE ./ugrep 1 cooperate < words.txt
1: Cooperate

One note about the tertiary level. It distinguishes not just case, but ligature presentation forms.

$ ./ugrep 3 fi < words.txt
6: fides

# vs

$ ./ugrep 2 fi < words.txt
5: ﬁnal
6: fides

Pretty flexible, right? Let’s see the code.

/*** ugrep.c ***/

#include <locale.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/ucol.h>
#include <unicode/usearch.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 1024

int main(int argc, char **argv)
{
	char *locale;
	UFILE *in;
	UCollator *col;
	UStringSearch *srch = NULL;
	UErrorCode status = U_ZERO_ERROR;
	UChar *needle, line[BUFSZ];
	UColAttributeValue strength;
	int ignoreInsignificant = 0, asymmetric = 0;
	size_t n;
	long i;

	if (argc != 3)
	{
		fprintf(stderr,
			"Usage: %s {1,2,@,3}[i] pattern\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* cryptic parsing for our cryptic options */
	switch (*argv[1])
	{
		case '1':
			strength = UCOL_PRIMARY;
			break;
		case '2':
			strength = UCOL_SECONDARY;
			break;
		case '@':
			strength = UCOL_SECONDARY, asymmetric = 1;
			break;
		case '3':
			strength = UCOL_TERTIARY;
			break;
		default:
			fprintf(stderr,
				"Unknown strength: %s\n", argv[1]);
			return EXIT_FAILURE;
	}
	/* length of argv[1] is >0 or we would have died */
	ignoreInsignificant = argv[1][strlen(argv[1])-1] == 'i';

	n = strlen(argv[2]) + 1;
	/* if UTF-8 could encode it in n, then UTF-16
	 * should be able to as well */
	needle = malloc(n * sizeof(*needle));
	u_strFromUTF8(needle, n, NULL, argv[2], -1, &status);

	/* searching is a degenerate case of collation,
	 * so we read the LC_COLLATE locale */
	if (!(locale = setlocale(LC_COLLATE, "")))
	{
		fputs("Cannot determine system collation locale\n",
		      stderr);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	col = ucol_open(locale, &status);
	ucol_setStrength(col, strength);

	if (ignoreInsignificant)
		/* shift ignorable characters down to
		 * quaternary level */
		ucol_setAttribute(col, UCOL_ALTERNATE_HANDLING,
		                  UCOL_SHIFTED, &status);

	/* Assumes all lines fit in BUFSZ. Should
	 * fix this in real code and not increment i */
	for (i = 1; u_fgets(line, BUFSZ, in); ++i)
	{
		/* first time through, set up all options */
		if (!srch)
		{
			srch = usearch_openFromCollator(
				needle, -1, line, -1,
			    col, NULL, &status
			);
			if (asymmetric)
				usearch_setAttribute(
					srch, USEARCH_ELEMENT_COMPARISON,
					USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD,
					&status
				);
		}
		/* afterward just switch text */
		else
			usearch_setText(srch, line, -1, &status);

		/* check if keyword appears in line */
		if (usearch_first(srch, &status) != USEARCH_DONE)
			u_printf("%ld: %S", i, line);
	}

	usearch_close(srch);
	ucol_close(col);
	u_fclose(in);
	free(needle);

	return EXIT_SUCCESS;
}

Comparing strings modulo normalization

In the concepts section, we saw a single grapheme can be constructed with different combinations of codepoints. In many cases when comparing strings for equality, we’re most interested in the strings being perceived by the user in the same way rather than a simple byte-for-byte match.

The ICU library provides a unorm_compare function which returns a value similar to strcmp, and acts in a normalization independent way. It normalizes both strings incrementally while comparing them, so it can stop early if it finds a difference.

Here is code to check that the five ways of representing ộ are equivalent:

#include <stdio.h>
#include <unicode/unorm.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	UChar s[][4] = {
		{0x006f,0x0302,0x0323,0},
		{0x006f,0x0323,0x0302,0},
		{0x00f4,0x0323,0,0},
		{0x1ecd,0x0302,0,0},
		{0x1ed9,0,0,0}
	};

	const size_t n = sizeof(s)/sizeof(s[0]);
	size_t i;

	for (i = 0; i < n; ++i)
		printf("%zu == %zu: %d\n", i, (i+1)%n,
			unorm_compare(
				s[i], -1, s[(i+1)%n], -1, 0, &status));
}

Output:

0 == 1: 0
1 == 2: 0
2 == 3: 0
3 == 4: 0
4 == 0: 0

A return value of 0 means the strings are equal.

Confusable strings

Because Unicode introduces so many graphemes, there are more possibilities for scammers to confuse people using lookalike glyphs. For instance, domains like adoḅe.com or pаypal.com (with Cyrillic а) can direct unwary visitors to phishing sites. ICU contains an entire module for detecting “confusables,” those strings which are known to look too similar when rendered in common fonts. Each string is assigned a “skeleton” such that confusable strings get the same skeleton.

For an example, see my utility utofu. It has a little extra complexity with sqlite access code, so I am not reproducing it here. It’s designed to check Unicode strings to detect changes over time that might be spoofing.

The method of operation is this:

  1. Read line as UTF-8
  2. Convert to Normalization Form C for consistency
  3. Calculate skeleton string
  4. Insert UTF-8 version of normalized input and its skeleton into a database if the skeleton doesn’t already exist
  5. Compare the normalized input string to the string in the database having the corresponding skeleton. If they are not an exact match, die with an error.
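
Step 3 can be done with ICU’s uspoof API. Here is a minimal sketch (not from utofu itself) that computes skeletons for a “pаypal” containing a Cyrillic а and for the all-Latin “paypal”, and checks that they collide:

/*** skeleton.c: a minimal sketch (not from the original article) ***/

#include <stdlib.h>

#include <unicode/uspoof.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* "pаypal" with U+0430 CYRILLIC SMALL LETTER A */
	UChar spoof[] = {'p', 0x0430, 'y', 'p', 'a', 'l', 0};
	/* all-Latin "paypal" */
	UChar latin[] = {'p', 'a', 'y', 'p', 'a', 'l', 0};
	UChar skel1[32], skel2[32];
	USpoofChecker *sc = uspoof_open(&status);

	if (U_FAILURE(status))
		return EXIT_FAILURE;

	/* the "type" argument is ignored in modern ICU; pass 0 */
	uspoof_getSkeleton(sc, 0, spoof, -1, skel1, 32, &status);
	uspoof_getSkeleton(sc, 0, latin, -1, skel2, 32, &status);
	if (U_FAILURE(status))
		return EXIT_FAILURE;

	/* confusable strings map to the same skeleton */
	u_printf("same skeleton: %s\n",
	         u_strcmp(skel1, skel2) == 0 ? "yes" : "no");

	uspoof_close(sc);
	return EXIT_SUCCESS;
}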

Further reading

Unicode and internationalization is a huge topic. I could only scratch the surface in this article. I read and enjoyed sections from these books and reference materials, and would recommend them:

May 22, 2019

Indrek Lasn (indreklasn)

Hey, not the type of guy to promote, but we’re building a community of doers/makers. May 22, 2019 09:45 AM

Hey, not the type of guy to promote, but we’re building a community of doers/makers. Heck, who knows — you might find a potential co-founder here. https://app.getnewly.com/join/?r=G2KE-kzff

May 20, 2019

Pete Corey (petecorey)

Minimum Viable Phoenix May 20, 2019 12:00 AM

Starting at the Beginning

Phoenix ships with quite a few bells and whistles. Whenever you fire up mix phx.new to create a new web application, forty-six files are created and spread across thirty directories!

This can be overwhelming to developers new to Phoenix.

To build a better understanding of the framework and how all of its moving pieces interact, let’s strip Phoenix down to its bare bones. Let’s start from zero and slowly build up to a minimum viable Phoenix application.

.gitignore


+.DS_Store

Minimum Viable Elixir

Starting at the beginning, we need to recognize that all Phoenix applications are Elixir applications. Our first step in the process of building a minimum viable Phoenix application is really to build a minimum viable Elixir application.

Interestingly, the simplest possible Elixir application is a single *.ex file that contains some source code. To set ourselves up for success later, let’s place our code in lib/minimal/application.ex. We’ll start by simply printing "Hello." to the console.


IO.puts("Hello.")

Surprisingly, we can execute our newly written Elixir application by compiling it:


➜ elixirc lib/minimal/application.ex
Hello.

This confused me at first, but it was explained to me that in the Elixir world, compilation is also evaluation.

lib/minimal/application.ex


+IO.puts("Hello.")

Generating Artifacts

While our execution-by-compilation works, it’s really nothing more than an on-the-fly evaluation. We’re not generating any compilation artifacts that can be re-used later, or deployed elsewhere.

We can fix that by moving our code into a module. Once we compile our newly modularized application.ex, a new Elixir.Minimal.Application.beam file will appear in the root of our project.

We can run our compiled Elixir program by running elixir in the directory that contains our *.beam file and specifying an expression to evaluate using the -e flag:


➜ elixir -e "Minimal.Application.start()"
Hello.

Similarly, we could spin up an interactive shell (iex) in the same directory and evaluate the expression ourselves:


iex(1)> Minimal.Application.start()
Hello.

.gitignore


+*.beam
.DS_Store

lib/minimal/application.ex


-IO.puts("Hello.")
+defmodule Minimal.Application do
+  def start do
+    IO.puts("Hello.")
+  end
+end

Incorporating Mix

This is great, but manually managing our *.beam files and bootstrap expressions is a little cumbersome. Not to mention the fact that we haven’t even started working with dependencies yet.

Let’s make our lives easier by incorporating the Mix build tool into our application development process.

We can do that by creating a mix.exs Elixir script file in the root of our project that defines a module that uses Mix.Project and describes our application. We write a project/0 callback in our new MixProject module whose only requirement is to return our application’s name (:minimal) and version ("0.1.0").


def project do
  [
    app: :minimal,
    version: "0.1.0"
  ]
end

While Mix only requires that we return the :app and :version configuration values, it’s worth taking a look at the other configuration options available to us, especially :elixir, :start_permanent, :build_path, :elixirc_paths, and others.

Next, we need to specify an application/0 callback in our MixProject module that tells Mix which module we want to run when our application fires up.


def application do
  [
    mod: {Minimal.Application, []}
  ]
end

Here we’re pointing it to the Minimal.Application module we wrote previously.

During the normal application startup process, Elixir will call the start/2 function of the module we specify with :normal as the first argument, and whatever we specify ([] in this case) as the second. With that in mind, let’s modify our Minimal.Application.start/2 function to accept those parameters:


def start(:normal, []) do
  IO.puts("Hello.")
  {:ok, self()}
end

Notice that we also changed the return value of start/2 to be an :ok tuple whose second value is a PID. Normally, an application would spin up a supervisor process as its first act of life and return its PID. We’re not doing that yet, so we simply return the current process’ PID.

Once these changes are done, we can run our application with mix or mix run, or fire up an interactive Elixir shell with iex -S mix. No bootstrap expression required!

.gitignore


 *.beam
-.DS_Store
+.DS_Store
+/_build/

lib/minimal/application.ex


 defmodule Minimal.Application do
-  def start do
+  def start(:normal, []) do
     IO.puts("Hello.")
+    {:ok, self()}
   end

mix.exs


+defmodule Minimal.MixProject do
+  use Mix.Project
+
+  def project do
+    [
+      app: :minimal,
+      version: "0.1.0"
+    ]
+  end
+
+  def application do
+    [
+      mod: {Minimal.Application, []}
+    ]
+  end
+end

Pulling in Dependencies

Now that we’ve built a minimum viable Elixir project, let’s turn our attention to the Phoenix framework. The first thing we need to do to incorporate Phoenix into our Elixir project is to install a few dependencies.

We’ll start by adding a deps array to the project/0 callback in our mix.exs file. In deps we’ll list :phoenix, :plug_cowboy, and :jason as dependencies.

By default, Mix stores downloaded dependencies in the deps/ folder at the root of our project. Let’s be sure to add that folder to our .gitignore. Once we’ve done that, we can install our dependencies with mix deps.get.

The reliance on :phoenix makes sense, but why are we already pulling in :plug_cowboy and :jason?

Under the hood, Phoenix uses the Cowboy web server, and Plug to compose functionality on top of our web server. It would make sense that Phoenix relies on :plug_cowboy to bring these two components into our application. If we try to go on with building our application without installing :plug_cowboy, we’ll be greeted with the following errors:

** (UndefinedFunctionError) function Plug.Cowboy.child_spec/1 is undefined (module Plug.Cowboy is not available)
    Plug.Cowboy.child_spec([scheme: :http, plug: {MinimalWeb.Endpoint, []}
    ...

Similarly, Phoenix relies on a JSON serialization library to be installed and configured. Without either :jason or :poison installed, we’d receive the following warning when trying to run our application:

warning: failed to load Jason for Phoenix JSON encoding
(module Jason is not available).

Ensure Jason exists in your deps in mix.exs,
and you have configured Phoenix to use it for JSON encoding by
verifying the following exists in your config/config.exs:

config :phoenix, :json_library, Jason

Heeding that advice, we’ll install :jason and add that configuration line to a new file in our project, config/config.exs.

.gitignore


 /_build/
+/deps/

config/config.exs


+use Mix.Config
+
+config :phoenix, :json_library, Jason

mix.exs


   app: :minimal,
-  version: "0.1.0"
+  version: "0.1.0",
+  deps: [
+    {:jason, "~> 1.0"},
+    {:phoenix, "~> 1.4"},
+    {:plug_cowboy, "~> 2.0"}
+  ]
 ]
 

Introducing the Endpoint

Now that we’ve installed our dependencies on the Phoenix framework and the web server it uses under the hood, it’s time to define how that web server incorporates into our application.

We do this by defining an “endpoint”, which is our application’s interface into the underlying HTTP web server, and our clients’ interface into our web application.

Following Phoenix conventions, we define our endpoint by creating a MinimalWeb.Endpoint module that uses Phoenix.Endpoint and specifies the :name of our OTP application (:minimal):


defmodule MinimalWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :minimal
end

The __using__/1 macro in Phoenix.Endpoint does quite a bit of heavy lifting. Among many other things, it loads the endpoint’s initial configuration, sets up a plug pipeline using Plug.Builder, and defines helper functions to describe our endpoint as an OTP process. If you’re curious about how Phoenix works at a low level, start your search here.

Phoenix.Endpoint uses the value we provide in :otp_app to look up configuration values for our application. Phoenix will complain if we don’t provide a bare minimum configuration entry for our endpoint, so we’ll add that to our config/config.exs file:


config :minimal, MinimalWeb.Endpoint, []

But there are a few configuration values we want to pass to our endpoint, like the host and port we want to serve from. These values are usually environment-dependent, so we’ll add a line at the bottom of our config/config.exs to load another configuration file based on our current environment:


import_config "#{Mix.env()}.exs"

Next, we’ll create a new config/dev.exs file that specifies the :host and :port we’ll serve from during development:


use Mix.Config

config :minimal, MinimalWeb.Endpoint,
  url: [host: "localhost"],
  http: [port: 4000]

If we were to start our application at this point, we’d still be greeted with Hello. printed to the console, rather than a running Phoenix server. We still need to incorporate our Phoenix endpoint into our application.

We do this by turning our Minimal.Application into a proper supervisor and instructing it to load our endpoint as a supervised child:


use Application

def start(:normal, []) do
  Supervisor.start_link(
    [
      MinimalWeb.Endpoint
    ],
    strategy: :one_for_one
  )
end

Once we’ve done that, we can fire up our application using mix phx.server or iex -S mix phx.server and see that our endpoint is listening on localhost port 4000.

Alternatively, if you want to use our old standby of mix run, either configure Phoenix to serve all endpoints on startup, which is what mix phx.server does under the hood:


config :phoenix, :serve_endpoints, true

Or configure your application’s endpoint specifically:


config :minimal, MinimalWeb.Endpoint, server: true

config/config.exs


+config :minimal, MinimalWeb.Endpoint, []
+
 config :phoenix, :json_library, Jason
+
+import_config "#{Mix.env()}.exs"

config/dev.exs


+use Mix.Config
+
+config :minimal, MinimalWeb.Endpoint,
+  url: [host: "localhost"],
+  http: [port: 4000]

lib/minimal/application.ex


 defmodule Minimal.Application do
+  use Application
+
   def start(:normal, []) do
-    IO.puts("Hello.")
-    {:ok, self()}
+    Supervisor.start_link(
+      [
+        MinimalWeb.Endpoint
+      ],
+      strategy: :one_for_one
+    )
   end
 

lib/minimal_web/endpoint.ex


+defmodule MinimalWeb.Endpoint do
+  use Phoenix.Endpoint, otp_app: :minimal
+end

Adding a Route

Our Phoenix endpoint is now listening for inbound HTTP requests, but this doesn’t do us much good if we’re not serving any content!

The first step in serving content from a Phoenix application is to configure our router. A router maps requests sent to a route, or path on your web server, to a specific module and function. That function’s job is to handle the request and return a response.

We can add a route to our application by making a new module, MinimalWeb.Router, that uses Phoenix.Router:


defmodule MinimalWeb.Router do
  use Phoenix.Router
end

And we can instruct our MinimalWeb.Endpoint to use our new router:


plug(MinimalWeb.Router)

The Phoenix.Router module generates a handful of helpful macros, like match, get, post, etc., and configures itself as a module-based plug. This is the reason we can seamlessly incorporate it in our endpoint using the plug macro.

Now that our router is wired into our endpoint, let’s add a route to our application:


get("/", MinimalWeb.HomeController, :index)

Here we’re instructing Phoenix to send any HTTP GET requests for / to the index/2 function in our MinimalWeb.HomeController “controller” module.

Our MinimalWeb.HomeController module needs to use Phoenix.Controller and provide our MinimalWeb module as a :namespace configuration option:


defmodule MinimalWeb.HomeController do
  use Phoenix.Controller, namespace: MinimalWeb
end

Phoenix.Controller, like Phoenix.Endpoint and Phoenix.Router, does quite a bit. It establishes itself as a plug by using Phoenix.Controller.Pipeline, and it uses the :namespace module we provide to do some automatic layout and view module detection.

Because our controller module is essentially a glorified plug, we can expect Phoenix to pass conn as the first argument to our specified controller function, and any user-provided parameters as the second argument. Just like any other plug’s call/2 function, our index/2 should return our (potentially modified) conn:


def index(conn, _params) do
  conn
end

But returning an unmodified conn like this is essentially a no-op.

Let’s spice things up a bit and return a simple HTML response to the requester. The simplest way of doing that is to use Phoenix’s built-in Phoenix.Controller.html/2 function, which takes our conn as its first argument, and the HTML we want to send back to the client as the second:


Phoenix.Controller.html(conn, """
  <p>Hello.</p>
""")

If we dig into html/2, we’ll find that it’s using Plug’s built-in Plug.Conn.send_resp/3 function:


Plug.Conn.send_resp(conn, 200, """
  <p>Hello.</p>
""")

And ultimately send_resp/3 is just modifying our conn structure directly:


%{
  conn
  | status: 200,
    resp_body: """
      <p>Hello.</p>
    """,
    state: :set
}

These three expressions are identical, and we can use whichever one we choose to return our HTML fragment from our controller. For now, we’ll follow best practices and stick with Phoenix’s html/2 helper function.

lib/minimal_web/controllers/home_controller.ex


+defmodule MinimalWeb.HomeController do
+  use Phoenix.Controller, namespace: MinimalWeb
+
+  def index(conn, _params) do
+    Phoenix.Controller.html(conn, """
+      <p>Hello.</p>
+    """)
+  end
+end

lib/minimal_web/endpoint.ex


   use Phoenix.Endpoint, otp_app: :minimal
+
+  plug(MinimalWeb.Router)
 end
 

lib/minimal_web/router.ex


+defmodule MinimalWeb.Router do
+  use Phoenix.Router
+
+  get("/", MinimalWeb.HomeController, :index)
+end

Handling Errors

Our Phoenix-based web application is now successfully serving content from the / route. If we navigate to http://localhost:4000/, we’ll be greeted by our friendly HomeController:

But behind the scenes, we’re having issues. Our browser automatically requests the /favicon.ico asset from our server, and having no idea how to respond to a request for an asset that doesn’t exist, Phoenix kills the request process and automatically returns a 500 HTTP status code.

We need a way of handling requests for missing content.

Thankfully, the stack trace Phoenix gave us when it killed the request process gives us a hint for how to do this:

Request: GET /favicon.ico
  ** (exit) an exception was raised:
    ** (UndefinedFunctionError) function MinimalWeb.ErrorView.render/2 is undefined (module MinimalWeb.ErrorView is not available)
        MinimalWeb.ErrorView.render("404.html", %{conn: ...

Phoenix is attempting to call MinimalWeb.ErrorView.render/2 with "404.html" as the first argument and our request’s conn as the second, and is finding that the module and function don’t exist.

Let’s fix that:


defmodule MinimalWeb.ErrorView do
  def render("404.html", _assigns) do
    "Not Found"
  end
end

Our render/2 function is a view, not a controller, so we just have to return the content we want to render in our response, not the conn itself. That said, the distinctions between views and controllers may be outside the scope of building a “minimum viable Phoenix application,” so we’ll skim over that for now.

Be sure to read more about the ErrorView module, and how it incorporates into our application’s endpoint. Also note that the module called to render errors is customizable through the :render_errors configuration option.

lib/minimal_web/views/error_view.ex


+defmodule MinimalWeb.ErrorView do
+  def render("404.html", _assigns) do
+    "Not Found"
+  end
+end

Final Thoughts

So there we have it. A “minimum viable” Phoenix application. It’s probably worth pointing out that we’re using the phrase “minimum viable” loosely here. I’m sure there are people who can come up with more “minimal” Phoenix applications. Similarly, I’m sure there are concepts and tools that I left out, like views and templates, that would cause people to argue that this example is too minimal.

The idea was to explore the Phoenix framework from the ground up, building each of the requisite components ourselves, without relying on automatically generated boilerplate. I’d like to think we accomplished that goal.

I’ve certainly learned a thing or two!

If there’s one thing I’ve taken away from this process, it’s that there is no magic behind Phoenix. Everything it’s doing can be understood with a little familiarity with the Phoenix codebase, a healthy understanding of Elixir metaprogramming, and a little knowledge about Plug.

May 19, 2019

Ponylang (SeanTAllen)

Last Week in Pony - May 19, 2019 May 19, 2019 08:47 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

May 18, 2019

Andreas Zwinkau (qznc)

Companies are AI May 18, 2019 12:00 AM

Depending on the definition of intelligence, companies are intelligent beings.

Read full article!

May 17, 2019

Derek Jones (derek-jones)

Background checks on pointer values being considered for C May 17, 2019 06:58 PM

DR 260 is a defect report submitted to WG14, the C Standards committee, in 2001. It was never resolved, was generally ignored for 10 years, caught the attention of a research group a few years ago, and is now back on WG14’s agenda. The following discussion covers two of the three questions raised in the DR.

Consider the following fragment of code:

int *p, *q;

    p = malloc (sizeof (int)); assert (p != NULL);  // Line A
    (free)(p);                                      // Line B
    // more code
    q = malloc (sizeof (int)); assert (q != NULL);  // Line C
    if (memcmp (&p, &q, sizeof p) == 0)             // Line D
       {*p = 42;                                    // Line E
        *q = 43;}                                   // Line F

Section 6.2.4p2 of the C Standard says:
“The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime.”

The call to free, on line B, ends the lifetime of the storage (allocated on line A) pointed to by p.

There are two proposed interpretations of the sentence, in 6.2.4p2.

  1. “becomes indeterminate” is treated as effectively storing a value in the pointer, i.e., some bit pattern denoting an indeterminate value. This interpretation requires that any other variables that had been assigned p‘s value, prior to the free, also have an indeterminate value stored into them,
  2. the value held in the pointer is to be treated as an indeterminate value (for instance, a memory management unit may prevent any access to the corresponding storage).

What are the practical implications of the two options?

The call to malloc, on line C, could return a pointer to a location that is identical to the pointer returned by the first call to malloc, i.e., the second call might immediately reuse the free‘ed storage.

Effectively storing a value in the pointer, in response to the call to free, means the subsequent call to memcmp would always return a non-zero value, and the questions raised below do not apply; it would be a nightmare to implement, especially in a multi-process environment.

If the sentence in section 6.2.4p2 is interpreted as treating the pointer value as indeterminate, then the definition of malloc needs to be updated to specify that all returned values are determinate, i.e., any indeterminacy that may exist gets removed before a value is returned (the memory management unit must allow read/write access to the storage).

The memcmp, on line D, does a byte-wise compare of the pointer values (a byte-wise compare side-steps indeterminate value issues). If the comparison is exact, an assignment is made via p, line E, and via q, line F.

Does the assignment via p result in undefined behavior, or is the conformance status of the code unaffected by its presence?

Nobody is impugning the conformance status of the assignment via q, on line F.

There are people who think that the assignment via p, on line E, should be treated as undefined behavior, despite the fact that the values of p and q are byte-wise identical. When this issue was first raised (by those troublemakers in the UK ;-), yours truly was less than enthusiastic, but there were enough knowledgeable people in the opposing camp to keep the ball rolling for a while.

The underlying issue some people have with some subsequent uses of p is its provenance, the activities it has previously been associated with.

Provenance can be included in the analysis process by associating a unique number with the address of every object, at the start of its lifetime; these p-numbers are not reused.

The value returned by the call to malloc, on line A, would include a pointer to the allocated storage, plus an associated p-number; the call on line C could return a pointer having the same value, but its p-number is required to be different. Implementations are not required to allocate any storage for p-numbers, treating them purely as conceptual quantities. Your author knows of two implementations that do allocate storage for p-numbers (in a private area) and track their usage: the Model Implementation C Checker, which was validated as handling all of C90, and Cerberus, which handles a substantial subset of C11. I don’t believe that the other tools that check array bounds and use-after-free are based on provenance (corrections welcome).

If provenance is included as part of a pointer’s value, the behavior of operators needs to be expanded to handle the p-number (conceptual or not) component of a pointer.

The rules might specify that p-numbers are conceptually compared by the call to memcmp, on line D; hence p and q are considered to never compare equal. There is an existing practice of regarding byte compares as just that, i.e., no magic ever occurs when comparing bytes (otherwise known as objects having type unsigned char).

Having p-numbers be invisible to memcmp would be consistent with existing practice. The pointer indirection operation on line E (generating undefined behavior) is where p-numbers get involved and cause the undefined behavior to occur.
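
To make the p-number bookkeeping concrete, here is the earlier fragment annotated with conceptual p-numbers (the address and the numbers are purely illustrative, not part of any implementation):

p = malloc (sizeof (int));  // value: address 0x1000, p-number 1
(free)(p);                  // lifetime of the storage (and p-number 1) ends
q = malloc (sizeof (int));  // value: address 0x1000 again, but p-number 2
// the bytes of p and q compare equal, yet p-number 1 != p-number 2;
// it is the indirection *p = 42 that provenance marks as undefined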

There are other situations where pointer values, that were once indeterminate, can appear to become ‘respectable’.

For a variable, defined in a function, “… its lifetime extends from entry into the block with which it is associated until execution of that block ends in any way.”; section 6.2.4p3.

In the following code:

int x;
static int *p=&x;

void f(int n)
{
   int *q = &n;
   if (memcmp (&p, &q, sizeof p) == 0)
      *p = 0;
   p = &n; // assign an address that will soon cease to exist.
} // Lifetime of pointed to object, n, terminates here

int main(void)
{
   f(1); // after this call, p has an indeterminate value
   f(2);
}

the pointer p has an indeterminate value after any call to f returns.

In many implementations, the second call to f will result in n having the same address it had on the first call, and memcmp will return zero.

Again, there are people who have an issue with the assignment involving p, because of its provenance.

One proposal to include provenance contains substantial changes to existing wording in the C Standard. The rationale for its proposals looks more like a desire to change wording to make things clearer for those making the change than a desire to address DR 260. Everybody thinks their proposed changes make the wording clearer (including yours truly); such claims are just marketing puff (and self-delusion). Confirmation from the results of an A/B test would add substance to such claims.

It is probably possible to explicitly include support for provenance by making a small number of changes to existing wording.

Is the cost of supporting provenance (i.e., changing existing wording may introduce defects into the standard, the greater the amount of change the greater the likelihood of introducing defects), worth the benefits?

What are the benefits of introducing provenance?

Provenance makes it possible to easily specify that the uses of p, in the two previous examples (and a third given in DR 260), are undefined behavior (if that is WG14’s final decision).

Provenance also provides a model that might make it easier to reason about programs; it’s difficult to say one way or the other, without knowing what the model is.

Supporters claim that provenance would enable tool vendors to flag various snippets of code as suspicious. Tool vendors can already do this, they don’t need permission from the C Standard to flag anything they fancy.

The C Standard requires a conforming implementation to diagnose certain constructs. A conforming implementation can issue as many messages as it likes, for any other construct, e.g., for line A in the first example, a compiler might print “This is the 1,000,000’th call to malloc I have translated, ring this number to claim your prize!”

Before any changes are made to wording in the C Standard, WG14 needs to decide what the behavior should be for these examples; it could decide to continue ignoring them for another 20-years.

Once a decision is made, the next question is how to update wording in the standard to specify the behavior that has been decided on.

While provenance is an interesting idea, the benefits it provides appear to be not worth the cost of changing the C Standard.

Indrek Lasn (indreklasn)

Not yet ;) May 17, 2019 03:38 PM

Not yet ;)

I wholeheartedly agreed with this. May 17, 2019 03:34 PM

I wholeheartedly agreed with this.

May 14, 2019

Derek Jones (derek-jones)

A prisoner’s dilemma when agreeing to a management schedule May 14, 2019 11:52 PM

Two software developers, both looking for promotion/pay-rise by gaining favorable management reviews, are regularly given projects to complete by a date specified by management; the project schedules are sometimes unachievable, with probability p.

Let’s assume that both developers are simultaneously given a project, and the corresponding schedule. If the specified schedule is unachievable, High quality work can only be performed by asking for more time, otherwise performing Low quality work is the only way of meeting the schedule.

If either developer faces an unachievable deadline, they have to immediately decide whether to produce High or Low quality work. A High quality decision requires that they ask management for more time, and incur a penalty they perceive to be C (saying they cannot meet the specified schedule makes them feel less worthy of a promotion/pay-rise); a Low quality decision is perceived to be likely to incur a penalty of Q_1 (because of its possible downstream impact on project completion), if one developer chooses Low, and Q_2, if both developers choose Low. It is assumed that: Q_1 < Q_2 < C.

This is a prisoner’s dilemma problem. The following mathematical results are taken from: “The Effects of Time Pressure on Quality in Software Development: An Agency Model”, by Robert D. Austin (cannot find a downloadable pdf).

There are two Nash equilibriums, for the decision made by the two developers: Low-Low and High-High (i.e., both perform Low quality work, or both perform High quality work). Low-High is not a stable equilibrium, in that on the next iteration the two developers may switch their decisions.

High-High is a pure strategy (i.e., always use it), when: 1 - Q_1/C <= p

High-High is Pareto superior to Low-Low when: 1 - Q_2/(C - Q_1 + Q_2) < p < 1 - Q_1/C
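
Plugging in illustrative numbers makes these thresholds concrete; the following sketch uses values I picked to satisfy Q_1 < Q_2 < C (they are assumptions, not figures from the paper):

Q1, Q2, C = 1.0, 2.0, 4.0

pure_threshold = 1 - Q1 / C            # 0.75
pareto_lower = 1 - Q2 / (C - Q1 + Q2)  # 0.6

print(f"High-High is a pure strategy when p >= {pure_threshold}")
print(f"High-High is Pareto superior when {pareto_lower} < p < {pure_threshold}")

With these values, both developers always choosing High is stable once the chance of an unachievable schedule reaches 75%, and it is the mutually better equilibrium anywhere above 60%.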

How might management use this analysis to increase the likelihood that a High-High quality decision is made?

Evidence shows that 50% of developer estimates of task effort underestimate the actual effort; there is sufficient uncertainty in software development that the likelihood of consistently producing accurate estimates is low (i.e., p is a very fuzzy quantity). Managers wanting to increase the likelihood of a High-High decision could be generous when setting deadlines (e.g., multiply developer estimates by 200% when setting the deadline for delivery), but managers are often under pressure from customers to specify aggressively short deadlines.

The penalty for a developer admitting that they cannot deliver by the specified schedule, C, could be set very low (e.g., by management not taking this factor into account when deciding developer promotion/pay-rise). But this might encourage developers to always give this response. If all developers mutually agreed to cooperate, to always give this response, none of them would lose relative to the others; but there is an incentive for the more capable developers to defect, and the less capable developers to want to use this strategy.

Regular code reviews are a possible technique for motivating High-High, by increasing the likelihood of any lone Low decision being detected. A Low-Low decision may go unreported by those involved.

To summarise: an interesting analysis that appears to have no practical use, because reasonable estimates of the values of the variables involved are unavailable.

May 13, 2019

Simon Zelazny (pzel)

How I learned to never match on os:cmd output May 13, 2019 10:00 PM

A late change in requirements from a customer had me scrambling to switch an HDFS connector script from a Python program to the standard Hadoop tool hdfs.

The application that was launching the connector script was written in Erlang, and was responsible for uploading some files to an HDFS endpoint, like so:

UploadCmd = lists:flatten(io_lib:format("hdfs put ~p ~p", [Here, There])),
"" = os:cmd(UploadCmd),

This was all fine and dandy when the UploadCmd was implemented in full by me. When I switched out the Python script for the hdfs command, all my tests continued to work, and the data was indeed being written successfully to my local test hdfs node. So off to production it went.

Several hours later I got notified that there were some problems with the new code. After inspecting the logs it became clear that the hdfs command was producing unexpected output (WARN: blah blah took longer than expected (..)) and causing the Erlang program to treat the upload operation as failed.

As is the case for reasonable Erlang applications, the writing process would crash upon a failed match, then restart and attempt to continue where it left off — by trying to upload Here to There. Now, this operation kept legitimately failing, because it had in fact succeeded the first time, and HDFS would not allow us to overwrite There (unless we added a -f flag to put).

The solution

The quick-and-dirty solution was to wrap the UploadCmd in a script that captured the exit code, and then printed it out at the end, like so:

sh -c '{UploadCmd}; RES=$?; echo; echo $RES'

Now, your Erlang code can match on the last line of the output and interpret it as an integer exit code. Not the most elegant of solutions, but elegant enough to work around os:cmd/1's blindness to exit codes.
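
A sketch of the Erlang side of this workaround (the function name is mine, and the naive quoting assumes Cmd contains no single quotes):

run(Cmd) ->
    %% Wrap Cmd so its exit code becomes the last line of output.
    Output = os:cmd("sh -c '" ++ Cmd ++ "; RES=$?; echo; echo $RES'"),
    Lines = string:split(string:trim(Output, trailing, "\n"), "\n", all),
    %% Split the exit code back off the captured output.
    ExitCode = list_to_integer(lists:last(Lines)),
    {ExitCode, lists:droplast(Lines)}.

The caller can then match on {0, _} = run(UploadCmd) for success, instead of matching on raw output.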

Lesson learned

The UNIX way states that programs should be silent on success and vocal on error. Sadly, many applications don't follow the UNIX way, and the bigger the application at hand, the higher the probability that one of its dependencies will use STDOUT or STDERR as its own personal scratchpad.

My lesson: never rely on os:cmd/1 output in production code, unless the command you're running is fully under your control, and you can be certain that its outputs are completely and exhaustively specified by you.

I do heavily rely on os:cmd output in test code, and I have no intention of stopping. Early feedback about unexpected output is great in tests.

Indrek Lasn (indreklasn)

How to setup continuous integration (CI) with React, CircleCI, and GitHub May 13, 2019 01:07 PM

To ensure the highest grade of quality code, we need to run multiple checks on each commit/pull request. Running code checks is especially useful when working in a team and making sure everyone follows the best and latest practices.

What kind of checks are we talking about? For starters, running our unit tests to make sure everything passes, building and bundling our frontend to make sure the build won’t fail on production, and running our linters to enforce a standard.

At my current company, we run many checks before any code can be committed to the repository.

Code checks at Newly

CI lets us run code checks automatically. Who wants to run all those commands before pushing code to the repository?

Getting started

I’ve chosen CircleCI due to its generous free tier, Github thanks to its community, and React since it’s easy and fun to use.

Create React App

Create your React app however you like. For simplicity’s sake, I’m using CRA.

Creating Github repository

Once you’re finished with CRA, push the code to your Github repository.

Setting up CI with CircleCI

If you already have a CircleCI account, great! If not, make one here.

Once you’re logged in, click on “Add Projects”

Adding a Project to CircleCI

Find your repository and click “Set Up Project”

Setting up a project

Now we should see instructions.

Installation instructions

Simple enough, let’s create a folder called .circleci and place the config.yml inside the folder.

CircleCI config.yml

We specify the CircleCI version, orbs, and workflows. Orbs are shareable configuration packages for your builds. A workflow is a set of rules for defining a collection of jobs and their run order.

Push the code to your repository

Start building

Head back to CircleCI and press “Start building”

STAAAART BUILDIN! :D

Build succeeded

If you click on the build, you can monitor what actually happened. For this case, the welcome orb is a demo and doesn’t do much.

Setting up our CircleCI with React

Use config.yml setup to run test, lint and build checks with React.

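A plain config.yml in that spirit might look like this (the job name and npm scripts are my assumptions; adjust them to your project, or use the node/react orbs as described):

version: 2.1
jobs:
  build-and-test:
    docker:
      - image: circleci/node:10
    steps:
      - checkout
      - run: npm install
      - run: npm run lint             # lint check
      - run: npm test -- --coverage   # unit tests with coverage
      - run: npm run build            # production build
workflows:
  build-and-test:
    jobs:
      - build-and-test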

After you push this code, give the orb the permissions it needs.

Settings -> Security -> Yes, allow orbs

Now each commit/PR runs the workflow jobs.

Check CircleCI for the progress of jobs. Here’s what CircleCI is doing for each commit:

  • Set up the React project
  • Runs eslint to check the formatting of the code
  • Runs unit tests
  • Runs test coverage

All of the above workflow jobs have to succeed for the commit and build to be successful.

Now each commit has a green, red or yellow tick indicating the status! Handy.

You can find the demo repository here:

wesharehoodies/circleci-react-example

Thanks for reading, check out my Twitter for more.

Indrek Lasn (@lasnindrek) | Twitter

Here are some of my previous articles you might enjoy:


How to setup continuous integration (CI) with React, CircleCI, and GitHub was originally published in freeCodeCamp.org on Medium, where people are continuing the conversation by highlighting and responding to this story.

Pete Corey (petecorey)

Is My Apollo Client Connected to the Server? May 13, 2019 12:00 AM

When you’re building a real-time, subscription-heavy front-end application, it can be useful to know if your client is actively connected to the server. If that connection is broken, maybe because the server is temporarily down for maintenance, we’d like to be able to show a message explaining the situation to the user. Once we re-establish our connection, we’d like to hide that message and go back to business as usual.

That’s the dream, at least. Trying to implement this functionality using Apollo turned out to be more trouble than we expected on a recent client project.

Let’s go over a few of the solutions we tried that didn’t solve the problem, for various reasons, and then let’s go over the final working solution we came up with. Ultimately, I’m happy with what we landed on, but I didn’t expect to uncover so many roadblocks along the way.

What Didn’t Work

Our first attempt was to build a component that polled for an online query on the server. If the query ever failed with an error on the client, we’d show a “disconnected” message to the user. Presumably, once the connection to the server was re-established, the error would clear, and we’d re-render the children of our component:


const Connected = props => {
  return (
    <Query query={gql`{ online }`} pollInterval={5000}>
      {({error, loading}) => {
        if (loading) {
            return <Loader/>;
        }
        else if (error) {
            return <Message/>;
        }
        else {
            return props.children;
        }
      }}
    </Query>
  );
}

Unfortunately, our assumptions didn’t hold up. Apparently when a query fails, Apollo (react-apollo@2.5.5) will stop polling on that failing query, stopping our connectivity checker dead in its tracks.

NOTE: Apparently, this should work, and in various simplified reproductions I built while writing this article, it did work. Here are various issues and pull requests documenting the problem, merging in fixes (which others claim don’t work), and documenting workarounds:


We thought, “well, if polling is turned off on error, let’s just turn it back on!” Our next attempt used startPolling to try restarting our periodic heartbeat query.


if (error) {
  startPolling(5000);
}

No dice.

Our component successfully restarts polling and carries on refetching our query, but the Query component returns values for both data and error, along with a networkStatus of 8, which indicates that “one or more errors were detected.”

If a query returns both an error and data, how are we to know which to trust? Was the query successful? Or was there an error?

We also tried to implement our own polling system with various combinations of setTimeout and setInterval. Ultimately, none of these solutions seemed to work because Apollo was returning both error and data for queries, once the server had recovered.

NOTE: This should also work, though it would be unnecessary, if it weren’t for the issues mentioned above.


Lastly, we considered leveraging subscriptions to build our connectivity detection system. We wrote an online subscription which pushes a timestamp down to the client every five seconds. Our component subscribes to this publication… And then what?

We’d need to set up another five second interval on the client that flips into an error state if it hasn’t seen a heartbeat in the last interval.

But once again, once our connection to the server is re-established, our subscription won’t re-instantiate in a sane way, and our client will be stuck showing a stale disconnected message.

What Did Work

We decided to go a different route and implemented a solution that leverages the SubscriptionClient lifecycle and Apollo’s client-side query functionality.

At a high level, we store our online boolean in Apollo’s client-side cache, and update this value whenever Apollo detects that a WebSocket connection has been disconnected or reconnected. Because we store online in the cache, our Apollo components can easily query for its value.

Starting things off, we added a purely client-side online query that returns a Boolean!, and a resolver that defaults to being “offline”:


const resolvers = {
    Query: { online: () => false }
};

const typeDefs = gql`
  extend type Query {
    online: Boolean!
  }
`;

const apolloClient = new ApolloClient({
  ...
  typeDefs,
  resolvers
});

Next we refactored our Connected component to query for the value of online from the cache:


const Connected = props => {
  return (
    <Query query={gql`{ online @client }`}>
      {({error, loading}) => {
        if (loading) {
            return <Loader/>;
        }
        else if (error) {
            return <Message/>;
        }
        else {
            return props.children;
        }
      }}
    </Query>
  );
}

Notice that we’re not polling on this query. Any time we update our online value in the cache, Apollo knows to re-render this component with the new value.

Next, while setting up our SubscriptionClient and WebSocketLink, we added a few hooks to detect when our client is connected, disconnected, and later reconnected to the server. In each of those cases, we write the appropriate value of online to our cache:


subscriptionClient.onConnected(() =>
    apolloClient.writeData({ data: { online: true } })
);

subscriptionClient.onReconnected(() =>
    apolloClient.writeData({ data: { online: true } })
);

subscriptionClient.onDisconnected(() =>
    apolloClient.writeData({ data: { online: false } })
);

And that’s all there is to it!

Any time our SubscriptionClient detects that it’s disconnected from the server, we write online: false into our cache, and any time we connect or reconnect, we write online: true. Our component picks up each of these changes and shows a corresponding message to the user.

Huge thanks to this StackOverflow comment for pointing us in the right direction.

May 12, 2019

Gokberk Yaltirakli (gkbrk)

Evolving Neural Net classifiers May 12, 2019 05:35 PM

As a research interest, I play with evolutionary algorithms a lot. Recently I’ve been messing around with Neural Nets that are evolved rather than trained with backpropagation.

Because this is a blog post, and to further demonstrate that literally anything can result in evolution, I’m going to be using a hill climbing algorithm. Here’s the gist of it.

  1. Initially, we will start with a Neural Network with random weights.
  2. We’re going to clone the network, pick a weight and change it to a random number.
  3. Evaluate the old network and the new network and get their scores
  4. If the new network has done better or the same as the old one, replace the old network with it
  5. Repeat until the results are satisfactory

The algorithm

The algorithm is shown below. All it does is split the given data into training and test parts, randomly change the neural network weights until the score improves, and then use the test data to determine how good we did.

def train_and_test(X, y, nn_size, iterations=1000, test_size=None, stratify=None):
    random.seed(445)
    np.random.seed(445)
    net = NeuralNetwork(nn_size)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=stratify
    )

    score = 0
    for i in range(iterations):
        score = net.get_score(X_train, y_train)

        new = net.clone()
        new.mutate()
        new_score = new.get_score(X_train, y_train)

        if new_score >= score:
            net = new
            score = new_score

    print(f"Training set: {len(X_train)} elements. Error: {score}")

    score = net.get_classify_score(X_test, y_test)

    print(f"Test set: {score} / {len(X_test)}. Score: {score / len(X_test) * 100}%")

Iris flower dataset

If you are learning about classifiers, the Iris flower dataset is probably the first thing you’re going to test. It is like the “Hello World” of classification basically.

The dataset includes petal and sepal size measurements from 3 different Iris species. The goal is to get measurements and classify which species they are from.

You can find more information on the dataset here.

data = pandas.read_csv("IRIS.csv").values

name_to_output = {
    "Iris-setosa": [1, 0, 0],
    "Iris-versicolor": [0, 1, 0],
    "Iris-virginica": [0, 0, 1],
}

rows = data.shape[0]
data_input = data[:, 0:4].reshape((rows, 4, 1)).astype(float)
data_output = np.array(list(map(lambda x: name_to_output[x], data[:, 4]))).reshape(
    (rows, 3)
)

train_and_test(data_input, data_output, (4, 4, 3), 10000, 0.2)
Training set: 120 elements. Error: -5.697678436657024
Test set: 29 / 30. Score: 96.66666666666667%

96% accuracy isn’t bad for such a simple algorithm. But it has that accuracy when it trains with 120 samples and tests with 30. Let’s see if it’s good at generalization by turning our train/test split into 0.03/0.97.

As you can see below; just by training on 4 samples, our network is able to classify the rest of the data with a 94% accuracy.

train_and_test(data_input, data_output, (4, 4, 3), 10000, 0.97)
Training set: 4 elements. Error: -0.8103166051741318
Test set: 138 / 146. Score: 94.52054794520548%

Cancer diagnosis dataset

This dataset includes some measurements about tumors, and classifies them as either Benign (B) or Malignant (M).

You can find the dataset and more information about it here.

data = pandas.read_csv("breast_cancer.csv").values[1:]

rows = data.shape[0]

name_to_output = {"B": [1, 0], "M": [0, 1]}

data_input = data[:, 2:32].reshape((rows, 30, 1)).astype(float) / 100
data_output = np.array(list(map(lambda x: name_to_output[x], data[:, 1]))).reshape(
    (rows, 2)
)

train_and_test(data_input, data_output, (30, 30, 15, 2), 10000, 0.3)
Training set: 397 elements. Error: -5.626705318006574
Test set: 159 / 171. Score: 92.98245614035088%

To see if the network is able to generalize, let’s train it on 11 samples and test it on 557. You can see below that it has an 86% accuracy after seeing a tiny amount of samples.

train_and_test(data_input, data_output, (30, 30, 15, 2), 10000, 0.98)
Training set: 11 elements. Error: -0.2742514647152907
Test set: 481 / 557. Score: 86.35547576301616%

Glass classification dataset

This dataset has some material measurements, like how much of each element was found in a piece of glass. Using these measurements, the goal is to classify which of the 8 glass types it was from.

This dataset doesn’t separate cleanly, and you don’t get a lot of samples. So I cranked up the iteration count and added more hidden layers. Deep learning, baby!

You can find more information on the dataset here.

data = pandas.read_csv("glass.csv").values[1:]

rows = data.shape[0]
data_input = data[:, :-1].reshape((rows, 9, 1)).astype(float)
data_output = np.array(list(map(lambda x: np.eye(8)[int(x)], data[:, -1]))).reshape((rows, 8))

train_and_test(data_input, data_output, (9, 9, 9, 9, 8), 20000, 0.3, stratify=data_output)
Training set: 149 elements. Error: -8.261249669954738
Test set: 47 / 64. Score: 73.4375%

After I saw this result, I wasn’t super thrilled about it. But after I went through the other solutions on Kaggle and looked at their results, I found out that this wasn’t bad compared to other classifiers.

But where’s the Neural Network code?

Here it is. While it’s a large chunk of code, I find that this is the least interesting part of the project. This is basically a bunch of matrices getting multiplied and mutated randomly. You can find a bunch of tutorials/examples of this on the internet.

import numpy as np
import random
import pandas
from sklearn.model_selection import train_test_split

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes
        weight_shapes = [(a, b) for a, b in zip(layer_sizes[1:], layer_sizes[:-1])]
        self.weights = [
            np.random.standard_normal(s) / s[1] ** 0.5 for s in weight_shapes
        ]
        self.biases = [np.random.rand(s, 1) for s in layer_sizes[1:]]

    def predict(self, a):
        for w, b in zip(self.weights, self.biases):
            a = self.activation(np.matmul(w, a) + b)
        return a

    def get_classify_score(self, images, labels):
        predictions = self.predict(images)
        num_correct = sum(
            [np.argmax(a) == np.argmax(b) for a, b in zip(predictions, labels)]
        )
        return num_correct

    def get_score(self, images, labels):
        predictions = self.predict(images)
        predictions = predictions.reshape(predictions.shape[0:2])
        return -np.sum(np.abs(np.linalg.norm(predictions-labels)))

    def clone(self):
        nn = NeuralNetwork(self.layer_sizes)
        nn.weights = np.copy(self.weights)
        nn.biases = np.copy(self.biases)
        return nn

    def mutate(self):
        for _ in range(self.weighted_random([(20, 1), (3, 2), (2, 3), (1, 4)])):
            l = self.weighted_random([(l.flatten().shape[0], i) for i, l in enumerate(self.weights)])
            shape = self.weights[l].shape
            layer = self.weights[l].flatten()
            layer[np.random.randint(0, layer.shape[0]-1)] = np.random.uniform(-2, 2)
            self.weights[l] = layer.reshape(shape)

            if np.random.uniform() < 0.01:
                b = self.weighted_random([(b.flatten().shape[0], i) for i, b in enumerate(self.biases)])
                shape = self.biases[b].shape
                bias = self.biases[b].flatten()
                bias[np.random.randint(0, bias.shape[0]-1)] = np.random.uniform(-2, 2)
                self.biases[b] = bias.reshape(shape)

    @staticmethod
    def activation(x):
        return 1 / (1 + np.exp(-x))

    @staticmethod
    def weighted_random(pairs):
        total = sum(pair[0] for pair in pairs)
        r = np.random.randint(1, total)
        for (weight, value) in pairs:
            r -= weight
            if r <= 0: return value

Ponylang (SeanTAllen)

Last Week in Pony - May 12, 2019 May 12, 2019 08:57 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

Pages From The Fire (kghose)

Mixins or composition? May 12, 2019 12:14 AM

Mixins are great for “horizontal scaling” by adding functionality to a class over time. Reading mixed-in code has an element of “gotcha”, because the methods are scattered over multiple classes. Composition is great for handling complex functionality by insulating individual parts into their own classes and just exposing the bare interface to each other …
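
A tiny Python sketch of the contrast (my own illustration, not code from the full article):

import json

class JsonMixin:
    # Mixin: the behavior is mixed directly into the class hierarchy.
    def to_json(self):
        return json.dumps(vars(self))

class MixedInUser(JsonMixin):
    def __init__(self, name):
        self.name = name

class JsonSerializer:
    # Composition: the functionality is insulated in its own class.
    def serialize(self, obj):
        # expose only public attributes, skipping composed helpers
        return json.dumps({k: v for k, v in vars(obj).items()
                           if not k.startswith("_")})

class ComposedUser:
    def __init__(self, name):
        self.name = name
        self._serializer = JsonSerializer()

    def to_json(self):
        # delegate through a bare interface
        return self._serializer.serialize(self)

print(MixedInUser("alice").to_json())  # {"name": "alice"}
print(ComposedUser("bob").to_json())   # {"name": "bob"}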

May 10, 2019

Carlos Fenollosa (carlesfe)

What are the differences between OpenBSD and Linux? May 10, 2019 09:32 AM

Maybe you have been reading recently about the release of OpenBSD 6.5 and wonder, "What are the differences between Linux and OpenBSD?"

I've also been there at some point in the past and these are my conclusions.

They also apply, to some extent, to other BSDs. However, an important disclaimer applies to this article.

This list is aimed at people who are used to Linux and are curious about OpenBSD. It is written to highlight the most important changes from their perspective, not the absolute most important changes from a technical standpoint.

Please bear with me.

A terminal is a terminal is a terminal

The first thing to realize is that, on the surface, the changes are minimal. Both are UNIX-like. You get a terminal, X windows, Firefox, Libreoffice...

Most free software can be recompiled, though some proprietary software isn't available on OpenBSD. Don't expect any visual changes. Indeed, the difference between KDE and GNOME on Linux is bigger than the difference between KDE on Linux and KDE on OpenBSD.

Under the hood, there are some BIG differences with relatively little practical impact:

  • BSD licensing vs GNU licensing
  • "Whole OS" model where some base packages are treated as first-class citizens with the kernel, VS bare Kernel + everything is 3rd party
  • Documentation is considered as important as code VS good luck with Stack Overflow and reading mailing lists
  • Whenever a decision has to be made, security and correctness is prioritized VS general-purpose and popularity and efficiency

Do these make little sense to you? I know, it's difficult to fully understand. Your reference is "Windows VS Linux", two systems so different in many aspects that it's like comparing an elephant with a sparrow. To the untrained eye, distinguishing a pigeon from a turtledove is not so easy.

They're philosophical distinctions whose ramifications are not immediately visible. They can't just be explained; you need to understand them through usage. That's why the typical recommendation is "just try OpenBSD and see".

Practical differences

So, what are some of the actual, tangible, practical differences?

Not many, really. Some are "features" and some are "undesired" side effects. With every decision there is a trade-off. Let's see some of them.

First of all, OpenBSD is a simpler system. It's very comfortable for sysadmins. All pieces are glued together following the UNIX philosophy, focusing on simplicity. Not sure what this means? Think rc VS systemd. This cannot be overstated: many people are attracted to OpenBSD in the first place because it's much more minimal than Linux and even FreeBSD.

OpenBSD also has excellent man pages with practical examples. Use man. Really.

The base system prefers different default daemons/servers/defaults than Linux.

  • apache/nginx: httpd
  • postfix/sendmail: opensmtpd
  • ntp: openntpd
  • bash: ksh

Are these alternatives better or worse? Well, these cover 90% of the use cases, while being robust and simpler to admin. Think: "knowing what we know today about email, how would we write a modern email courier from scratch, without all the old cruft?"

Voilà, OpenSMTPd.

The same goes for the rest, and there are more projects on the way (openssl -> libressl)

Security and system administration

W^X, ipsec, ASLR, kernel relinking, RETGUARD, pledge, unveil, etc.

Do these sound familiar? Most were OpenBSD innovations which trickled down to the rest of the unices

"Does this mean that OpenBSD is more secure than Linux?"

I'd say it's different but equivalent, but OpenBSD's security approach is more robust over time.

System administration and package upgrading is a bit different, but equivalent too, at least on x86. If you use a different arch, you'll need to recompile OpenBSD stuff from time to time.

"But Carlos, you haven't yet told me a single feature which is relevant for my day to day use!"

That's because there is probably none. There are very few things OpenBSD does that Linux does not.

However, what they do, they do better. Is that important for you?

Why philosophical differences matter

Let's jump to some of the not-so-nice ramifications of OpenBSD's philosophy:

Most closed-source Linux software does not work: skype, slack, etc. If that's important for you, use the equivalent web apps, or try FreeBSD, which has a Linux compatibility layer

Some Linux-kernel-specific software does not work either. Namely, docker.

The same for drivers: OpenBSD has excellent drivers, but a smaller number of them. You need to choose your hardware carefully. Hint: choose a Thinkpad

This includes compatibility drivers: modern/3rd party filesystems, for example, are not so well supported.

Because of the focus on security and simplicity, and not on speed or optimizations, software runs a bit slower than on Linux. In my experience (and in some benchmarks) about 10%-20% slower.

Battery life on laptops is also affected. My x230 can run for 5 hours on Linux, 3:30 on OpenBSD. More modern laptops and bigger batteries are a practical solution for most people.

So what do I choose?

"Are you telling me that the positives are intangible and the negatives mean a slower system and less software overall?"

At the risk of being technically wrong, but with the goal of empathizing with the Linux user, I'll say yes.

But think about what attracted you to Linux in the first place. It was not a faster computer, more driver availability or more software than Windows. It was probably a sense of freedom, the promise of a more robust, more secure, more private system.

OpenBSD is just the next step on that ladder.

In reality: it means that the intangibles are intangible for you, at this point in time. For other people, these features are what draws them to OpenBSD. For me, the system architecture, philosophy, and administration is 10x better than Linux's.

Let me turn the question around: can you live with these drawbacks if it means you will get a more robust, easier to admin, simpler system?

Now you're thinking: "Maybe Linux is a good tradeoff between freedom, software availability, and newbie friendliness". And, for most people, that can be the case. Hey, I use Linux too. I'm just opening another door for you.

How to try OpenBSD

So what, did I pique your interest? Are you just going to close this browser tab without trying? Go ahead and spin up a VM or install OpenBSD on an old machine and see for yourself.

Life isn't black or white. Maybe OpenBSD can not be your daily OS, but it can be your "travel-laptop OS". Honestly, I know very few people that use OpenBSD as their only system.

That is my case, for example. My daily driver is OSX, not Linux, because I need to use MS Office and other software which is Windows or Mac only for work.

However, when I arrive home, I switch to OpenBSD on my x230. I enjoy using OpenBSD much more than OSX these days.

What are you waiting for? Download OpenBSD and learn what all the fuss is about!

Tags: openbsd, unix

Comments? Tweet

Stig Brautaset (stig)

Learning Guitar Update May 10, 2019 07:53 AM

I try to keep myself honest--and on target!--by posting an update on my guitar learning journey.

May 09, 2019

Frederik Braun (freddyb)

Chrome switching the XSSAuditor to filter mode re-enables old attack May 09, 2019 10:00 PM

Recently, Google Chrome changed the default mode for their Cross-Site Scripting filter XSSAuditor from block to filter. This means that instead of blocking the page load completely, XSSAuditor will now continue rendering the page but modify the bits that have been detected as an XSS issue.

In this blog post, I will argue that the filter mode is a dangerous approach by re-stating the arguments from the whitepaper titled X-Frame-Options: All about Clickjacking? that I co-authored with Mario Heiderich in 2013.

After that, I will elaborate on XSSAuditor's other shortcomings and revisit the history of back-and-forth in its default settings. In the end, I hope to convince you that XSSAuditor's contribution is not just negligible but actually negative and it should therefore be removed completely.


JavaScript à la Carte

When you allow websites to frame you, you basically give them full permission to decide what parts of your very own JavaScript can be executed and what cannot. That sounds crazy, right? So, let’s say you have three script blocks on your website. The website that frames you doesn’t mind two of them - but really hates the third one. Maybe a framebuster, maybe some other script relevant for security purposes. So the website that frames you just turns that one script block off - and leaves the other two intact. Now how does that work?

Well, it’s easy. All the framing website is doing is using the browser’s XSS filter to selectively kill JavaScript on your page. This worked in IE some years ago but doesn’t anymore - yet it still works perfectly fine in Chrome. Let’s have a look at an annotated code example.

Here is the evil website, framing your website on example.com and sending something that looks like an attempt to XSS you! Only that you don’t have any XSS bugs. The injection is fake - and resembles a part of the JavaScript that you actually use on your site:

<iframe src="//example.com/index.php?code=%3Cscript%20src=%22/js/security-libraries.js%22%3E%3C/script%3E"></iframe>

Now we have your website. The content of the code parameter above is part of your website anyway - no injection here, just a match between URL and site content:

<!doctype html>
<h1>HELLO</h1>
<script src="/js/security-libraries.js"></script>
<script>
// assumes that the libraries are included
</script>

The effect is compelling. The load of the security libraries will be blocked by Chrome’s XSS Auditor, violating the assumption in the following script block, which will run as usual.

Existing and Future Countermeasures

So, as we see, defaulting to filter was a bad decision, and it can be overridden with the X-XSS-Protection: 1; mode=block header. You could also disallow websites from putting you in an iframe with X-Frame-Options: DENY, but that still leaves an attack vector, as your website could be opened as a top-level window. (The Cross-Origin-Opener-Policy header will help, but does not yet ship in any major browser.) Surely, Chrome might fix that one bug and stop exposing onerror from internal error pages. But that's not enough.
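
For reference, both mitigations are plain HTTP response headers; a minimal example of the hardened configuration discussed above:

X-XSS-Protection: 1; mode=block
X-Frame-Options: DENY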

Other shortcomings of the XSSAuditor

XSSAuditor has numerous problems in detecting XSS. In fact, there are so many that the Chrome Security Team does not treat bypasses as security bugs in Chromium. For example, the XSSAuditor scans parameters individually and thus allows for easy bypasses on pages that have multiple injection points, as an attacker can just split their payload in half. Furthermore, XSSAuditor is only relevant for reflected XSS vulnerabilities. It is completely useless for other XSS vulnerabilities like persistent XSS, Mutation XSS (mXSS) or DOM XSS. DOM XSS has become more prevalent with the rise of JavaScript libraries and frameworks such as jQuery or AngularJS. In fact, a 2017 research paper about exploiting DOM XSS through so-called script gadgets discovered that XSSAuditor is easily bypassed in 13 out of 16 tested JS frameworks.

History of XSSAuditor defaults

Here's a rough timeline

Conclusion

Taking all things into consideration, I'd highly suggest removing the XSSAuditor from Chrome completely. In fact, Microsoft announced last year that they'd remove the XSS filter from Edge. Unfortunately, a suggestion to retire XSSAuditor initiated by the Google Security Team was eventually dismissed by the Chrome Security Team.

This blog post does not represent the position of my employer.
Thanks to Mario Heiderich for providing valuable feedback: Supporting arguments and useful links are his. Mistakes are all mine.

Andreas Zwinkau (qznc)

What is ASPICE? May 09, 2019 12:00 AM

The automotive industry knows how to develop software as demonstrated by ASPICE.

Read full article!

May 07, 2019

Phil Hagelberg (technomancy)

in which another game is jammed May 07, 2019 07:52 PM

All the games I've created previously have used the LÖVE framework, which I heartily recommend and have really enjoyed using. It's extremely flexible but provides just the right level of abstraction to let you do any kind of 2D game. I have even created a text editor in it. But for the 2019 Lisp Game Jam I teamed up again with Emma Bukacek (we first worked together on Goo Runner for the previous jam) and wanted to try something new: TIC-80.

tic-80 screenshot

TIC-80 is what's referred to as a "fantasy console"1; that is, a piece of software which embodies an imaginary computer which never actually existed. Hearkening back to the days of the Commodore 64, it has a 16-color palette, a 64kb limit on the amount of code you can load into it, and 80kb of space for data (sprites, maps, sound, and music). While these limitations may sound severe, the idea is that they can be liberating because there is no pressure to create something polished; the medium demands a rough, raw style.

The really impressive thing you notice right away about TIC-80 is how accessible it makes game development. It's one file to download (or not even download; it runs perfectly fine in a browser) and you're off to the races; the code editor, sprite editor, mapper, sound editor, and music tracker are all built-in. But the best part is that you can explore other people's games (with the SURF command), and once you've played them, hit ESC to open the editor and see how they did it. You can make changes to the code, sprites, etc. and immediately see them reflected. This kind of explore-and-tinker approach encourages you to experiment and see for yourself what happens.

In fact, try it now! Go to This is my Mech and hit ESC, then go down to "close game" and press Z to close it. You're in the console now, so hit ESC again to go to the editor, and press the sprite editor button at the top left. Change some of the character sprites, then hit ESC to go back to the console and type RUN to see what it does! The impact of the accessibility and immediacy of the tool simply can't be overstated; it calls out to be hacked and fiddled and tweaked.

Having decided on the platform, Emma and I threw around a few game ideas but landed on making an adventure/comedy game based on the music video I'll form the Head by MC Frontalot, which is in turn a parody of the 1980s cartoon Voltron, a mecha series about five different pilots who work together to form a giant robot that fights off the monster of the week. Instead of making the game about combat, I wanted a theme of cooperation, which led to a gameplay focused around dialog and conversation.

I'll form the head music video

I focused more on the coding and the art, and Emma did most of the writing and all of the music. One big difference when coding TIC-80 games vs LÖVE is that you can't pull in any 3rd-party libraries; you have the Lua/Fennel standard library, the TIC-80 API, and whatever you write yourself. In fact, TIC-80's code editor supports only a single file. I'm mostly OK with TIC-80's limitations, but that seemed like a bit much, especially when collaborating, so I split out several different files and edited them in Emacs, using a Makefile to concatenate them together (sketched below) and TIC-80's "watch" functionality to load it in upon changes. In retrospect, while having functionality organized into different files was nice, it wasn't worth the downside of having the line numbers be incorrect, so I wouldn't do that part again.
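
The concatenation step needs nothing fancy. A hypothetical sketch of such a Makefile (the file names are invented; this is not the jam's actual build setup):

# Concatenate the individual Fennel files into the single file TIC-80 watches
game.fnl: conversations.fnl world.fnl main.fnl
	cat conversations.fnl world.fnl main.fnl > game.fnl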

The file watch feature was pretty convenient, but it's worth noting that the changes were only applied when you started a new game. (Not necessarily restarting the whole TIC-80 program, just the RUN command.) There's no way to load in new code from a file without restarting the game. You can evaluate new code with the EVAL command in the console and then RESUME to see the effect it has on a running game, but that only applies to a single line of code typed into the console, which is pretty limiting compared to LÖVE's full support for hot-loading any module from disk at any time that I wrote about previously. This was the biggest disadvantage of developing in TIC-80 by a significant margin. Luckily our game didn't have much state, so constantly restarting it wasn't a big deal, but for other games it would be.2

Another minor downside of collaborating on a TIC-80 game is that the cartridge is a single binary file. You can set it up so it loads the source from an external file, but the rest of the game (sprites, map, sound, and music) are all stored in one place. If you use git to track it, you will find that one person changing a sprite and another changing a music track will result in a conflict you can't resolve using git. Because of this, we would claim a "cartridge lock" in chat so that only one of us was working on non-code assets at a time, but it would be much nicer if changes to sprites could happen independently of changes to music without conflict.

screenshot of the game

Since the game consisted of mostly dialog, the conversation system was the central place to start. We used coroutines to allow a single conversation to be written in a linear, top-to-bottom way and react to player input but still run without blocking the main event loop. For instance, the function below moves the Adam character, says a line, and then asks the player a question which has two possible responses, and reacts differently depending on which response is chosen. In the second case, it sets convos.Adam so that the next time you talk to that character, a different conversation will begin:

(fn all.Adam2 []
  (move-to :Adam 48 25)
  (say "Hey, sorry about that.")
  (let [answer (ask "What's up?" ["What are you doing?"
                                  "Where's the restroom?"])]
    (if (= answer "Where's the restroom?")
        (say "You can pee in your pilot suit; isn't"
             "technology amazing? Built-in"
             "waste recyclers.")
        (= answer "What are you doing?")
        (do (say "Well... I got a bit flustered and"
                 "forgot my password, and now I'm"
                 "locked out of the system!")
            (set convos.Adam all.Adam25)
            (all.Adam25)))))

There was some syntactic redundancy with the questions which could have been tidied up with a macro. In older versions of Fennel, the macro system is tied to the module system, which is normally fine, but TIC-80's single-file restriction meant that style of macro was unavailable. Newer versions of Fennel don't have this restriction, but unfortunately the latest stable version of TIC-80 hasn't been updated yet. Hopefully this lands soon! The new version of Fennel also includes pattern matching, which probably would have made a custom question macro unnecessary.

The vast majority of the code is dialog/conversation code; the rest is for walking around with collision detection, and flying around in the end-game sequence. This is pretty standard animation fare but was a lot of fun to write!

rhinos animation

I mentioned TIC-80's size limit already; with such a dialog-heavy game we did run into that on the last day. We were close enough to the deadline with more we wanted to add that it caused a bit of a panic, but all we had to do was remove a bunch of commented code and we were able to squeeze what we needed in. Next time around I would use single-space indents just to save those few extra bytes.

All in all I think the downsides of TIC-80 were well worth it for a pixel-art style, short game. Being able to publish the game to an HTML file and easily publish it to itch.io (the site hosting the jam) was very convenient. It's especially helpful in a jam situation because you want to make it easy for as many people as possible to play your game so they can rate it; if it's difficult to install a lot of people won't do it. I've never done my own art for a game before, but having all the tools built-in convinced me to give it a try, and it turned out pretty good despite me not having any background in pixel art, or art of any kind.

Anyway, I'd encourage you to give the game a try. The game won first place in the game jam, and you can finish it in around ten minutes in your browser. And if it looks like fun, why not make your own in TIC-80?


[1] The term "fantasy console" was coined by PICO-8, a commercial product with limitations even more severe than TIC-80. I've done a few short demos with PICO-8 but I much prefer TIC-80, not just because it's free software, but because it supports Fennel, has a more comfortable code editor, and has a much more readable font. PICO-8 only supports a fixed-precision decimal fork of Lua. The only two advantages of PICO-8 are the larger community and the ability to set flags on sprites.

[2] I'm considering looking into adding support in TIC-80 for reloading the code without wiping the existing state. The author has been very friendly and receptive to contributions in the past, but this change might be a bit too much for my meager C skills.

Pepijn de Vos (pepijndevos)

Google Summer of Code is excluding half the world from participating May 07, 2019 12:00 AM

I recently came across someone who wanted to mentor a Yosys VHDL frontend as a Google Summer of Code project. This sounded fun, so I wrote a proposal, noting that GSoC starts before my summer holiday, and planning accordingly. Long story short, there are limited spots and my proposal was not accepted. I have confirmed with the mentoring organization that my availability was the primary factor in this.

While I understand their decision, it seems odd from an organizational viewpoint. Surely others would have the same problem? Indeed I heard from one person that they coped by just working ridiculous hours, while another said they never applied because of the mismatch. Google seems to be aware that this is an issue, stating in their FAQ:

Can the schedule be adjusted if my school ends late/starts early? No. We know that the schedule doesn’t work for some students, but it’s impossible to make a single timeline that works for everyone. Some organizations may allow a participant to start a little early or end a little late – but this is usually measured in days, not weeks. The monthly evaluation dates cannot be changed.

But how big is this problem, and where do accepted proposals come from? I decided to find out. Wikipedia has a long page of summer vacation dates for each country, and there is also this pdf which contains the following helpful graphic.

holidays

Most summer vacations are from July to August, while GSoC runs from May 27 to August 19, excluding most of Europe and many other countries from participating. (unless you lie in your proposal or work 70 hours per week)

The next question is whether this is reflected in accepted proposals. Since country of origin is not disclosed, this requires some digging. I scraped a few hundred names from the GSoC website, and scraped their locations from a LinkedIn search. This is of course not super reliable, but should give some indication.

      1 Argentina
      5 Australia
      1 Bangladesh
      6 Brazil
      1 Canada
      1 Chile
      7 China
      1 Denmark
      3 Egypt
      5 France
      9 Germany
      2 Ghana
      4 Greece
      2 Hong Kong
    212 India
      4 Indonesia
      1 Israel
      4 Italy
      2 Kazakhstan
      2 Kenya
      1 Lithuania
      2 Malaysia
      2 Mexico
      1 Nepal
      2 Nigeria
      1 Paraguay
      1 Peru
      3 Poland
      1 Portugal
      2 Qatar
      4 Romania
      4 Russian Federation
      1 Serbia
      4 Singapore
      1 South Africa
     10 Spain
      8 Sri Lanka
      2 Sweden
      2 Switzerland
      1 Tank
      2 Turkey
      4 Ukraine
      2 United Arab Emirates
      1 United Kingdom
     78 United States
     70 unknown
      1 Uruguay
      1 Uzbekistan
      3 Vietnam

Holy moly, so many Indians (212), followed by a large number of Americans (78), and then Spain (10), Germany (9), and the rest of the world. No Dutchies in this subset. For all European countries I counted a combined 51 participants, still a reasonable number, even though Spain and Germany have the same holiday mismatch as the Netherlands. Tell me your secret! Interestingly, Wikipedia states that India has very short holidays, but special exceptions for summer programmes:

Summer vacation lasts for no more than six weeks in most schools. The duration may decrease to as little as three weeks for older students, with the exception of two month breaks being scheduled to allow some high school and university students to participate in internship and summer school programmes.

Anyway, I think a big international company like Google could try to be a bit more flexible, and for example let students work for a subset of the monthly evaluation periods that align with their holiday.

Appendix

To scrape the names, I scrolled down on the project page until I got bored, and then entered some JS in the browser console.

Array.prototype.map.call(document.querySelectorAll(".project-card h2"), function(x) { return x.innerText })

I saved this to a file and wrote a Selenium script to search LinkedIn. LinkedIn was being really annoying by serving me different versions of various pages with completely different HTML tags, so this only works half of the time.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import json
import time
from urllib.parse import urlencode

with open('data.json') as f:
    data = json.load(f)

driver = webdriver.Firefox()
driver.implicitly_wait(5)
driver.get('https://www.linkedin.com')

# Log in first so that search results include locations
username = driver.find_element_by_id('login-email')
username.send_keys('email')  # your LinkedIn e-mail here
password = driver.find_element_by_id('login-password')
password.send_keys('password')  # your LinkedIn password here
sign_in_button = driver.find_element_by_id('login-submit')
sign_in_button.click()

# Search LinkedIn for each scraped name and record the location of the first hit
for name in data:
    try:
        first, last = name.split(' ', 1)
    except ValueError:
        continue
    if last.endswith('-1'):
        last = last[:-2]
    params = urlencode({"firstName": first, "lastName": last})
    driver.get("https://www.linkedin.com/search/results/people/?" + params)
    try:
        location = driver.find_element_by_css_selector('.search-result--person .subline-level-2').text
        print('"%s", "%s"' % (name, location))
    except NoSuchElementException:
        print('"%s", "%s"' % (name, 'unknown'))
        continue

And finally some quick Bash hax to count the countries. (All US locations only list their state)

cat output.csv | cut -d\" -f 4 | sed "s/Area$/Area, United States/i" | awk -F, '{print $NF}' | awk '{$1=$1};1' | sort | uniq -c

Andreas Zwinkau (qznc)

Accidentally Turing-Complete May 07, 2019 12:00 AM

A list of things that were not supposed to be Turing-complete, but are.

Read full article!

May 06, 2019

Ponylang (SeanTAllen)

Last Week in Pony - May 6, 2019 May 06, 2019 04:16 PM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

May 04, 2019

Pierre Chapuis (catwell)

Changing the SSH port on Arch Linux May 04, 2019 06:00 PM

I often change the default SSH port from 22 to something else on servers I run. It is kind of a dangerous operation, especially when the only way you have to connect to that server is SSH.

The historical way to do this is editing sshd_config and setting the Port variable, but with recent versions of Arch Linux and the default configuration, this will not work.

The reason is that SSH is configured with systemd socket activation. So what you need to do is run sudo systemctl edit sshd.socket and set the contents of the file to:

[Socket]
ListenStream=MY_PORT
Accept=yes

where MY_PORT is the port number you want.
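
Then make systemd pick up the change and restart the socket (recent versions of systemctl edit reload automatically, but being explicit does no harm):

sudo systemctl daemon-reload
sudo systemctl restart sshd.socket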

I hope this short post will save other people some trouble; at least it will be a reminder for me the next time I have to set up an Arch server...

Derek Jones (derek-jones)

C Standard meeting, April-May 2019 May 04, 2019 01:05 AM

I was at the ISO C language committee meeting, WG14, in London this week (apart from the few hours on Friday morning, which was scheduled to be only slightly longer than my commute to the meeting would have been).

It has been three years since the committee last met in London (the meeting was planned for Germany, but there was a hosting issue, and Germany are hosting next year), and around 20 people attended, plus 2-5 people dialing in. Some regular attendees were not in the room because of schedule conflicts; nine of those present were in London three years ago, and I had met three of those present (this week) at WG14 meetings prior to the last London meeting. I had thought that Fred Tydeman was the longest-serving member in the room, but talking to Fred I found out that I was involved a few years earlier than him (our convenor is also a long-time member); Fred has attended more meetings than me, since I stopped being a regular attender 10 years ago. Tom Plum, who dialed in, has been a member from the beginning, and Larry Jones, who dialed in, predates me. There are still original committee members active on the WG14 mailing list.

Having so many relatively new meeting attendees is a good thing, in that they are likely to be keen and willing to do things; it’s also a bad thing for exactly the same reason (i.e., if it's not really broken, don’t fix it).

The bulk of committee time was spent discussing the proposals contained in papers that have been submitted (listed in the agenda). The C Standard is currently being revised, WG14 are working to produce C2X. If a person wants the next version of the C Standard to support particular functionality, then they have to submit a paper specifying the desired functionality; for any proposal to have any chance of success, the interested parties need to turn up at multiple meetings, and argue for it.

There were three common patterns in the proposals discussed (none of these patterns are unique to the London meeting):

  • change existing wording, based on the idea that the change will stop compilers generating code that the person making the proposal considers to be undesirable behavior. Some proposals fitting this pattern were for niche uses, with alternative solutions available. If developers don’t have the funding needed to influence the behavior of open source compilers, submitting a proposal to WG14 offers a low cost route. Unless the proposal is a compelling use case, affecting lots of developers, WG14’s incentive is to not adopt the proposal (accepting too many proposals will only encourage trolls),
  • change/add wording to be compatible with C++. There are cost advantages, for vendors who have to support C and C++ products, to having the two languages be as mutually consistent as possible. Embedded systems are a major market for C, but this market is not nearly as large for C++ (because of the much larger overhead required to support C++). I pointed out that WG14 needs to be careful about alienating a significant user base, by slavishly following C++; the C language needs to maintain a separate identity, for long term survival,
  • add a new function to the C library, based on its existence in another standard. Why add new functions to the C library? In the case of math functions, it’s to increase the likelihood that the implementation will be correct (maths functions often have dark corners that are difficult to get right), and for string functions it’s the hope that compilers will do magic to turn a function call directly into inline code. The alternative argument is not to add any new functions, because the common cases are already covered, and everything else is niche usage.

At the 2016 London meeting Peter Sewell gave a presentation on the Cerberus group’s work on a formal definition of C; this work has resulted in various papers questioning the interpretation of wording in the standard, i.e., possible ambiguities or inconsistencies. At this meeting the submitted papers focused on pointer provenance, and I was expecting to hear about the fancy optimizations this work would enable (which would be a major selling point of any proposal). No such luck, the aim of the work was stated as clearly specifying the behavior (a worthwhile aim), with no major new optimizations being claimed (formal methods researchers often oversell their claims, Peter is at the opposite end of the spectrum and could do with an injection of some positive advertising). Clarifying behavior is a worthwhile aim, but not at the cost of major changes to existing wording. I have had plenty of experience of asking WG14 for clarification of existing (what I thought to be ambiguous) wording, only to be told that the existing wording was clear and not ambiguous (to those reviewing my proposed defect). I wonder how many of the wording ambiguities that the Cerberus group claim to have found would be accepted by WG14 as a defect that required a wording change?

Winner of the best pub quiz question: Does the C Standard require an implementation to be able to exactly represent floating-point zero? No, but it is now required in C2X. Do any existing conforming implementations not support an exact representation for floating-point zero? There are processors that use a logarithmic representation for floating-point, but I don’t know if any conforming implementation exists for such systems; all implementations I know of support an exact representation for floating-point zero. Logarithmic representation could handle zero using a special bit pattern, with cpu instructions doing the right thing when operating on this bit pattern, e.g., 0.0+X == X, (I wonder how much code would break, if the compiler mapped the literal 0.0 to the representable value nearest to zero).

Winner of the best good intentions corrupted by the real world: intmax_t, an integer type capable of representing any value of any signed integer type (i.e., a largest representable integer type). The concept of a unique largest has issues in a world that embraces diversity.

Today’s C development environment is very different from 25 years ago, let alone 40 years ago. The number of compilers in active use has decreased by almost two orders of magnitude, the number of commonly encountered distinct processors has shrunk, the number of very distinct operating systems has shrunk. While it is not a monoculture, things appear to be heading in that direction.

The relevance of WG14 decreases, as the number of independent C compilers, in widespread use, decreases.

What is the purpose of a C Standard in today’s world? If it were not already a standard, I don’t think a committee would be set up to standardize the language today.

Is the role of WG14 now to be the arbiter of useful common practice across widely used compilers, documenting decisions in revisions of the C Standard?

Work on the Cobol Standard ran for almost 60-years; WG14 has to be active for another 20-years to equal this.

May 02, 2019

Maxwell Bernstein (tekknolagi)

Recursive Python objects May 02, 2019 08:24 PM

Recently for work I had to check that self-referential Python objects could be string-ified without endless recursion. In the process of testing my work, I had to come up with a way of making self-referential built-in types (e.g. dict, list, set, and tuple).

Making a self-referential list is the easiest task because list is just a dumb mutable container. Make a list and append a reference to itself:

ls = []
ls.append(ls)
>>> ls
[[...]]
>>>

dict is similarly easy:

d = {}
d['key'] = d
>>> d
{'key': {...}}
>>>

Making a self-referential tuple is a little bit trickier because tuples cannot be modified after they are constructed (unless you use the C-API, in which case this is much easier — but that’s cheating). In order to close the loop, we’re going to have to use a little bit of indirection.

class C:
  def __init__(self):
    self.val = (self,)

  def __repr__(self):
    return self.val.__repr__()

>>> C()
((...),)
>>>

Here we create a class that stores a pointer to itself in a tuple. That way the tuple contains a pointer to an object that contains the tuple — A->B->A.
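
A quick check that the loop really closes, using the C class just defined:

>>> c = C()
>>> c.val[0] is c
True
>>>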

The solution is nearly the same for set:

class C:
  def __init__(self):
    self.val = set((self,))

  def __repr__(self):
    return self.val.__repr__()

>>> C()
{set(...)}
>>>

Note that simpler solutions like directly adding to the set (below) don’t work because sets are not hashable, and hashable containers like tuple depend on the hashes of their contents.

s = set()
s.add(s)  # nope
s.add((s,))  # still nope

There’s not a whole lot of point in doing this, but it was a fun exercise.

Cryptolosophy (awn)

to slice or not to slice May 02, 2019 12:00 AM

Go is an incredibly useful programming language because it hands you a fair amount of power while remaining fairly succinct. Here are a few bits of knowledge I’ve picked up in my time spent with it.

Say you have a fixed-size byte array and you want to pass it to a function that only accepts slices. That’s easy, you can “slice” it:

var bufarray [32]byte
bufslice := bufarray[:] // []byte

Going the other way is harder. The standard solution is to allocate a new array and copy the values over:

bufslice := make([]byte, 32)
var bufarray [32]byte
copy(bufarray[:], bufslice)

“What if I don’t want to make a copy?”, I hear you ask. You could be handling sensitive data or maybe you’re just optimizing the shit out of something. In any case we can grab a pointer and do it ourselves:

bufarrayptr := (*[32]byte)(unsafe.Pointer(&bufslice[0]))  // *[32]byte (same memory region)
bufarraycpy := *(*[32]byte)(unsafe.Pointer(&bufslice[0])) // [32]byte (copied to new memory region)

A pointer to the first element of the slice is passed to unsafe.Pointer which is then cast to “pointer to fixed-size 32 byte array”. Dereferencing this will return a copy of the data as a new fixed-size byte array.

The unsafe cat is out of the bag so why not get funky with it? We can make our own slices, with blackjack and hookers:

func ByteSlice(ptr *byte, len int, cap int) []byte {
    var sl = struct {
        addr uintptr
        len  int
        cap  int
    }{uintptr(unsafe.Pointer(ptr)), len, cap}
    return *(*[]byte)(unsafe.Pointer(&sl))
}

This function will take a pointer, a length, and a capacity; and return a slice with those attributes. Using this, another way to convert an array to a slice would be:

var bufarray [32]byte
bufslice := ByteSlice(&bufarray[0], 32, 32)

We can take this further to get slices of arbitrary types, []T, as long as the size of T divides the size of the memory region being mapped. For example, to get a []uint32 representation of our [32]byte we would divide the length and capacity by four (a uint32 value consumes four bytes) and end up with a slice of size eight:

var sl = struct {
    addr uintptr
    len  int
    cap  int
}{uintptr(unsafe.Pointer(&bufarray[0])), 8, 8}
uint32slice := *(*[]uint32)(unsafe.Pointer(&sl))

But there is a catch. This “raw” construction converts the unsafe.Pointer object into a uintptr—a “dumb” integer address—which will not describe the region of memory you want if the runtime or garbage collector moves the original object around. To ensure that this doesn’t happen you can allocate your own memory using system calls or a C allocator like malloc. This is exactly what we had to do in memguard: the system-call wrapper is available here. To avoid memory leaks, remember to free your allocations!

It seems a bit wasteful to have a garbage collector and not use it though, so why don’t we let it catch some of the freeing for us? First create a container structure to work with:

type buffer struct {
    Bytes []byte
}

Add some generic constructor and destructor functions:

import "github.com/awnumar/memguard/memcall"

func alloc(size int) *buffer {
    if size < 1 {
        return nil
    }
    return &buffer{memcall.Alloc(size)}
}

func (b *buffer) free() {
    if b.Bytes == nil {
        // already been freed
        return
    }
    memcall.Free(b.Bytes)
    b.Bytes = nil
}

We use runtime.SetFinalizer to inform the runtime about our object and what to do if it finds it some time after it becomes unreachable. Modifying alloc to include this looks like:

func alloc(size int) *buffer {
    if size < 1 {
        return nil
    }

    buf := &buffer{memcall.Alloc(size)}

    runtime.SetFinalizer(buf, func(buf *buffer) {
        go buf.free()
    })

    return buf
}
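
A quick usage sketch of these helpers (hypothetical; freeing explicitly is still preferred, the finalizer is only a safety net for buffers we forget):

buf := alloc(32)
defer buf.free() // deterministic cleanup; the finalizer only catches forgotten buffers
copy(buf.Bytes, []byte("sensitive data"))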

Alright I think that’s enough shenanigans for one post.

May 01, 2019

Bogdan Popa (bogdan)

Using GitHub Actions to Test Racket Code May 01, 2019 02:00 PM

Like Alex Harsányi, I’ve been looking for a good, free-as-in-beer, alternative to Travis CI. For now, I’ve settled on GitHub Actions because using them is straightforward and because it saves me from creating yet another account with some other company. GitHub Actions revolves around the concepts of “workflows” and “actions”. Actions execute arbitrary Docker containers on top of a checked-out repository and workflows describe which actions need to be executed when a particular event occurs.

Marc Brooker (mjb)

Some risks of coordinating only sometimes May 01, 2019 12:00 AM

Some risks of coordinating only sometimes

Sometimes-coordinating systems have dangerous emergent behaviors

A classic cloud architecture is built of small clusters of nodes (typically one to nine1), with coordination used inside each cluster to provide availability, durability and integrity in the face of node failures. Coordination between clusters is avoided, making it easier to scale the system while meeting tight availability and latency requirements. In reality, however, systems sometimes do need to coordinate between clusters, or clusters need to coordinate with a central controller. Some of these circumstances are operational, such as around adding or removing capacity. Others are triggered by the application, where the need to present a client API which appears consistent requires either the system itself, or a layer above it, to coordinate across otherwise-uncoordinated clusters.

The costs and risks of re-introducing coordination to handle API requests or provide strong client guarantees are well explored in the literature. Unfortunately, other aspects of sometimes-coordinated systems do not get as much attention, and many designs are not robust in cases where coordination is required for large-scale operations. Results like CAP and CALM2 provide clear tools for thinking through when coordination must occur, but offer little help in understanding the dynamic behavior of the system when it does occur.

One example of this problem is reacting to correlated failures. At scale, uncorrelated node failures happen all the time. Designing to handle them is straightforward, as the code and design is continuously validated in production. Large-scale correlated failures also happen, triggered by power and network failures, offered load, software bugs, operator mistakes, and all manner of unlikely events. If systems are designed to coordinate during failure handling, either as a mesh or by falling back to a controller, these correlated failures bring sudden bursts of coordination and traffic. These correlated failures are rare, so the way the system reacts to them is typically untested at the scale at which it is currently operating when they do happen. This increases time-to-recovery, and sometimes requires that drastic action is taken to recover the system. Overloaded controllers, suddenly called upon to operate at thousands of times their usual traffic, are a common cause of long time-to-recovery outages in large-scale cloud systems.

A related issue is the work that each individual cluster needs to perform during recovery or even scale-up. In practice, it is difficult to ensure that real-world systems have both the capacity required to run, and spare capacity for recovery. As soon as a system can’t do both kinds of work, it runs the risk of entering a mode where it is too overloaded to scale up. The causes of failure here are technical (load measurement is difficult, especially in systems with rich APIs), economic (failure headroom is used very seldom, making it an attractive target to be optimized away), and social (people tend to be poor at planning for relatively rare events).

Another risk of sometimes-coordination is changing quality of results. It’s well known how difficult it is to program against APIs which offer inconsistent consistency, but this problem goes beyond just API behavior. A common design for distributed workload schedulers and placement systems is to avoid coordination on the scheduling path (which may be latency and performance critical), and instead distribute or discover stale information about the overall state of the system. In steady state, when staleness is approximately constant, the output of these systems is predictable. During failures, however, staleness may increase substantially, leading the system to making worse choices. This may increase churn and stress on capacity, further altering the workload characteristics and pushing the system outside its comfort zone.

The underlying cause of each of these issues is that the worst-case behavior of these systems may diverge significantly from their average-case behavior, and that many of these systems are bistable with a stable state in normal operation, and a stable state at “overloaded”. Within AWS, we are starting to settle on some patterns that help constrain the behavior of systems in the worst case. One approach is to design systems that do a constant amount of coordination, independent of the offered workload or environmental factors. This is expensive, with the constant work frequently going to waste, but worth it for resilience. Another emerging approach is designing explicitly for blast radius, strongly limiting the ability of systems to coordinate or communicate beyond some limited radius. We also design for static stability, the ability for systems to continue to operate as best they can when they aren’t able to coordinate.

More work is needed in this space, both in understanding how to build systems which strongly avoid congestive collapse during all kinds of failures, and in building tools to characterize and test the behavior of real-world systems. Distributed systems and control theory are natural partners.

Footnotes:

  1. Cluster sizing is a super interesting topic in its own right. Nine seems arbitrary here, but isn't: for the most durable consensus systems, nine nodes spread across three datacenters allows one datacenter failure (losing three nodes) plus one host failure while still having a healthy majority. Chain-replicated and erasure-coded systems will obviously choose differently, as will anything with read replicas, or cost, latency or other constraints.
  2. See Keeping CALM: When Distributed Consistency is Easy by Hellerstein and Alvaro. It's a great paper, and a very powerful conceptual tool.

April 29, 2019

Pete Corey (petecorey)

Generating Realistic Pseudonyms with Faker.js and Deterministic Seeds April 29, 2019 12:00 AM

Last week we talked about using decorators to conditionally anonymize users of our application to build a togglable “demo mode”. In our example, we anonymized every user by giving them the name "Jane Doe" and the phone number "555-867-5309". While this works, it doesn’t make for the most exciting demo experience. Ideally, we could incorporate more variety into our anonymized user base.

It turns out that with a little help from Faker.js and deterministic seeds, we can do just that!

Faker.js

Faker.js is a library that “generate[s] massive amounts of realistic fake data in Node.js and the browser.” This sounds like it’s exactly what we need.

As a first pass at incorporating Faker.js into our anonymization scheme, we might try generating a random name and phone number in the anonymize function attached to our User model:


const faker = require('faker');

userSchema.methods.anonymize = function() {
  return _.extend({}, this, {
    name: faker.name.findName(),
    phone: faker.phone.phoneNumber()
  });
};

We’re on the right path, but this approach has problems. Every call to anonymize will generate a new name and phone number for a given user. This means that the same user might be given multiple randomly generated identities if they’re returned from multiple resolvers.

Consistent Random Identities

Thankfully, Faker.js once again comes to the rescue. Faker.js lets us specify a seed which it uses to configure its internal pseudo-random number generator. This generator is what’s used to generate fake names, phone numbers, and other data. By seeding Faker.js with a consistent value, we’ll be given a consistent stream of randomly generated data in return.

Unfortunately, it looks like Faker.js’ faker.seed function accepts a number as its only argument. Ideally, we could pass the _id of our model being anonymized.

However, a little digging shows us that the faker.seed function calls out to a local Random module:


Faker.prototype.seed = function(value) {
  var Random = require('./random');
  this.seedValue = value;
  this.random = new Random(this, this.seedValue);
}

And the Random module calls out to the mersenne library, which supports seeds in the form of an array of numbers:


if (Array.isArray(seed) && seed.length) {
  mersenne.seed_array(seed);
}

Armed with this knowledge, let’s update our anonymize function to set a random seed based on the user’s _id. We’ll first need to turn our _id into an array of numbers:


this._id.split("").map(c => c.charCodeAt(0));

And then pass that array into faker.seed before returning our anonymized data:


userSchema.methods.anonymize = function() {
  faker.seed(this._id.split("").map(c => c.charCodeAt(0)));
  return _.extend({}, this, {
    name: faker.name.findName(),
    phone: faker.phone.phoneNumber()
  });
};

And that’s all there is to it.

Now every user will be given a consistent anonymous identity every time their user document is anonymized. For example, a user with an _id of "5cb0b6fd8f6a9f00b8666dcb" will always be given a name of "Arturo Friesen", and a phone number of "614-157-9046".

Final Thoughts

My client ultimately decided not to go this route, and chose to stick with obviously fake “demo mode” identities. That said, I think this is an interesting technique that I can see myself using in the future.

Seeding random number generators with deterministic values is a powerful technique for generating pseudo-random, but repeatable data.

That said, it’s worth considering if this is really enough to anonymize our users’ data. By consistently replacing a user’s name, we’re just masking one aspect of their identity in our application. Is that enough to truly anonymize them, or will other attributes or patterns in their behavior reveal their identity? Is it worth risking the privacy of our users just to build a more exciting demo mode? These are all questions worth asking.

April 28, 2019

Ponylang (SeanTAllen)

Last Week in Pony - April 28, 2019 April 28, 2019 08:53 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

Jeff Carpenter (jeffcarp)

Measuring My Chinese Progress April 28, 2019 05:20 AM

Last summer I started learning Mandarin Chinese. To start I began taking classes at a Chinese language school in SF. For more practice I started an Instagram @jeffcarp_zh and tried writing a couple blog posts. Almost a year later, I’m still going to Chinese class on a semi-regular basis (1 hour a week except when I’m taking a break) and keep up a daily spaced-repetition flashcard habit using the Pleco Chinese dictionary app (usually on the train into work).

April 25, 2019

Derek Jones (derek-jones)

Dimensional analysis of the Halstead metrics April 25, 2019 05:30 PM

One of the driving forces behind the Halstead complexity metrics was physics envy; the early reports by Halstead use the terms software physics and software science.

One very simple, and effective technique used by scientists and engineers to check whether an equation makes sense, is dimensional analysis. The basic idea is that when performing an operation between two variables, their measurement units must be consistent; for instance, two lengths can be added, but a length and a time cannot be added (a length can be divided by time, returning distance traveled per unit time, i.e., velocity).

Let’s run a dimensional analysis check on the Halstead equations.

The input variables to the Halstead metrics are: η₁, the number of distinct operators; η₂, the number of distinct operands; N₁, the total number of operators; and N₂, the total number of operands. These quantities are all measured in units of tokens.

The formula are:

  • Program length: N = N₁ + N₂
    There is a consistent interpretation of this equation: operators and operands are both kinds of tokens, and a number of tokens can be interpreted as a length.
  • Calculated program length: N̂ = η₁ log₂ η₁ + η₂ log₂ η₂
    There is a consistent interpretation of this equation: the operand of a logarithm has to be dimensionless, and the convention is to treat the operand as a ratio (if no denominator is specified, the value 1, having the same dimensions as the numerator, is taken, giving a dimensionless result); the value returned is dimensionless, and can be multiplied by a variable having any kind of dimension; so again two (token) lengths are being added.
  • Volume: V = N × log₂ η
    A volume has units of length³ (i.e., it is created by multiplying three lengths). There is only one length in this equation; the equation is misnamed, it is a length.
  • Difficulty: D = (η₁ / 2) × (N₂ / η₂)
    Here the dimensions of η₁ and η₂ cancel, leaving the dimensions of N₂ (a length); now Halstead is interpreting length as a difficulty unit (whatever that might be).
  • Effort: E = D × V
    This equation multiplies two variables, both having a length dimension; the result should be interpreted as an area. In physics work is force times distance, and power is work per unit time; the term effort is not defined.
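
To make the formulas concrete, here is a quick worked example for the expression a = b + c (counting conventions vary; here = and + are counted as the operators, and a, b, c as the operands):

η₁ = 2, η₂ = 3, N₁ = 2, N₂ = 3, η = η₁ + η₂ = 5
N = 2 + 3 = 5
N̂ = 2 log₂ 2 + 3 log₂ 3 ≈ 6.8
V = 5 × log₂ 5 ≈ 11.6
D = (2/2) × (3/3) = 1
E = 1 × 11.6 ≈ 11.6

Dimensionally, every one of these values is built from token counts, i.e., lengths.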

Halstead is claiming that a single dimension, program length, contains so much unique information that it can be used as a measure of a variety of disparate quantities.

Halstead’s colleagues at Purdue were rather damning in their analysis of these metrics. Their report Software Science Revisited: A Critical Analysis of the Theory and Its Empirical Support points out the lack of any theoretical foundation for some of the equations, that the analysis of the data was weak, and that a more thorough analysis suggests theory and data don’t agree.

I pointed out in an earlier post, that people use Halstead’s metrics because everybody else does. This post is unlikely to change existing herd behavior, but it gives me another page to point people at, when people ask why I laugh at their use of these metrics.

Wesley Moore (wezm)

What I Learnt Building a Lobsters TUI in Rust April 25, 2019 05:00 AM

As a learning and practice exercise I built a crate for interacting with the Lobsters programming community website. It's built on the asynchronous Rust ecosystem. To demonstrate the crate I also built a terminal user interface (TUI).

A screenshot of the TUI in Alacritty

Try It

crates.io

Pre-built binaries with no runtime dependencies are available for:

  • FreeBSD 12 amd64
  • Linux armv6 (Raspberry Pi)
  • Linux x86_64
  • MacOS
  • NetBSD 8 amd64
  • OpenBSD 6.5 amd64

Downloads Source Code

The TUI uses the following key bindings:

  • j or ↓ — Move cursor down
  • k or ↑ — Move cursor up
  • h or ← — Scroll view left
  • l or → — Scroll view right
  • Enter — Open story URL in browser
  • c — Open story comments in browser
  • q or Esc — Quit

As mentioned in the introduction the motivation for starting the client was to practice using the async Rust ecosystem and it kind of spiralled from there. The resulting TUI is functional but not especially useful, since it just opens links in your browser. I can imagine it being slightly more useful if you could also view and reply to comments without leaving the UI.

Building It

The client proved to be an interesting challenge, mostly because Lobsters doesn't have a full API. This meant I had to learn how to set up and use a cookie jar alongside reqwest in order to make authenticated requests. Logging in requires supplying a cross-site request forgery token, which Rails uses to prevent CSRF attacks. To handle this I needed to first fetch the login page, note the token, then POST to the login endpoint. I could have tried to extract the token from the markup with a regex or substring matching but instead used kuchiki to parse the HTML and then match on the meta element in the head.
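
That flow looks roughly like the following (a hedged sketch, not the crate's actual code: it assumes a recent reqwest built with the blocking and cookies features, kuchiki for the HTML parsing, and invented form field names):

use kuchiki::traits::TendrilSink;

fn login(base: &str, user: &str, pass: &str) -> Result<(), Box<dyn std::error::Error>> {
    // The cookie store plays the role of the cookie jar described above.
    let client = reqwest::blocking::Client::builder()
        .cookie_store(true)
        .build()?;

    // First fetch the login page and note the CSRF token in its <head>.
    let body = client.get(format!("{}/login", base)).send()?.text()?;
    let doc = kuchiki::parse_html().one(body);
    let token = doc
        .select_first("meta[name=csrf-token]")
        .ok()
        .and_then(|meta| meta.attributes.borrow().get("content").map(String::from))
        .ok_or("no csrf-token meta element")?;

    // Then POST the credentials together with the token.
    client
        .post(format!("{}/login", base))
        .form(&[("email", user), ("password", pass), ("authenticity_token", token.as_str())])
        .send()?;
    Ok(())
}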

Once I added support for writing with the client (posting comments), not just reading, I thought I best not test against the real site. Fortunately the site's code is open source. I took this as an opportunity to use my new-found Docker knowledge and run it with Docker Compose. That turned out pretty easy since I was able to base it on one of the Dockerfiles for a Rails app I run. If you're curious the Alpine Linux based Dockerfile and docker-compose.yml can be viewed in this paste.

After I had the basics of the client worked out I thought it would be neat to fetch the front page stories and render them in the terminal in a style similar to the site itself. I initially did this with ansi_term. It looked good but lacked interactivity so I looked into ways to build a TUI along the lines of tig. I built it several times with different crates, switching each time I hit a limitation. I tried:

  • easycurses, which lived up to its name and produced a working result quickly. I'd recommend this if your needs aren't too fancy; however, I needed more control than it provided.
  • pancurses didn't seem to be able to use colors outside the core 16 from ncurses.

Finally I ended up going a bit lower-level and used termion. It does everything itself but at the same time you lose the conveniences ncurses provides. It also doesn't support Windows, so my plans of supporting that were thwarted. Some time after I had the termion version working I revisited tui-rs, which I had initially dismissed as unsuitable for my task. In hindsight it would probably have been perfect, but we're here now.

In addition to async and TUI I also learned more about:

  • Building a robust and hopefully user friendly command line tool.
  • Documenting a library.
  • Publishing crates.
  • Dockerising a Rails app that uses MySQL.
  • How to build and publish pre-built binaries for many platforms.
  • How to accept a password in the terminal without echoing it.
  • Setting up multi-platform CI builds on Sourcehut.

Whilst the library and UI aren't especially useful, the exercise was worth it. I got to practice a bunch of things and learn some new ones at the same time.



Previous Post: Cross Compiling Rust for FreeBSD With Docker

April 24, 2019

Tobias Pfeiffer (PragTob)

You may not need GenServers and Supervision Trees April 24, 2019 05:26 PM

The thought that people seem to think of GenServers and supervision trees in the elixir/erlang world as too essential deterring people from using these languages has been brewing in my mind for quite some time. In fact I have written about it before in Elixir Forum. This is a summary and extended version of that […]

Átila on Code (atilaneves)

Type inference debate: a C++ culture phenomenon? April 24, 2019 09:22 AM

I read two C++ subreddit threads today on using the auto keyword. They’re both questions: the first one asks why certain people seem to dislike using type inference, while the second asks about what commonly taught guidelines should be considered bad practice. A few replies there mention auto. This confuses me for more than one […]

Derek Jones (derek-jones)

C2X and undefined behavior April 24, 2019 02:00 AM

The ISO C Standard is currently being revised by WG14, to create C2X.

There is a rather nebulous clustering of people who want to stop compilers using undefined behaviors to generate what these people (and probably most other developers) consider to be very surprising code. For instance, always printing both p is true and p is false when executing the code: bool p; if ( p ) printf("p is true"); if ( !p ) printf("p is false"); (possible because p is uninitialized, and accessing an uninitialized value is undefined behavior).

This sounds like a good thing; nobody wants compilers generating surprising code.

All the proposals I have seen, so far, involve doing away with constructs that can produce undefined behavior. Again, this sounds like a good thing; nobody likes undefined behaviors.

The problem is, there is a reason for labeling certain constructs as producing undefined behavior; the behavior is who-knows-what.

Now the C Standard could specify the who-knows-what behavior; for instance, it could specify that the result of dividing by zero is 42. Standard-conforming compilers would then have to generate code to check whether the denominator was zero, and return 42 for this case (until Intel, ARM and other processor vendors ‘updated’ the behavior of their divide instructions). Way back when, a design decision was made: the behavior of divide by zero is undefined, not 42 or any other value; code efficiency and compactness were considered to be more important.

I have not seen anybody arguing that the behavior of divide by zero should be specified. But I have seen people arguing that once C’s integer representation is specified as being two's complement (currently it can also be one's complement or sign-magnitude), then arithmetic overflow becomes defined. Wrong.

Two's complement is a specification of a representation, not a specification of behavior. What is the behavior when the result of adding two integers cannot be represented? The result might be to wrap (the behavior expected by many developers), to saturate at the maximum value (frequently needed in image and signal processing), to raise a signal (overflow is not usually supposed to happen), or something else.
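
To make the consequences concrete (a standard illustration, not from the original article): because signed overflow is undefined, a compiler may assume it never happens, and gcc and clang will typically fold the test below to always return 0:

int adding_one_wraps(int x)
{
    /* x + 1 cannot be represented when x == INT_MAX; that case is undefined
       behavior, so the compiler is free to assume it never occurs and treat
       this comparison as always false */
    return x + 1 < x;
}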

WG14 could define the behavior, for when the result of an arithmetic operation is not representable in the number of bits available. Standard-conforming compilers targeting processors whose arithmetic instructions did not behave as required would have to generate code, for any operation that could overflow, to do what was necessary. The embedded market are heavy users of C; in this market memory is limited, and processor performance is never fast enough, so the overhead of supporting a defined behavior could just be too high (a more attractive solution is code review, to make sure the undefined behavior cannot occur).

Is there another way of addressing the issue of compiler writers’ use/misuse of undefined behavior? Yes, offer them money. Compiler writing is a business, at least at the level at which gcc and llvm operate. If people really are keen to influence the code generated by gcc and llvm, money is the solution. Wot, no money? Then stop complaining.

April 23, 2019

Pierre Chapuis (catwell)

Spicing things up April 23, 2019 09:45 PM

In my last post I told you I had plans that I was not ready to talk about yet. Well, the time has come. I am happy to announce that I am now the CTO and co-founder of a startup called Chilli.

Chilli is not a typical startup, it is an eFounders project. You may know eFounders as the first startup studio in France, which originated companies such as Front, Aircall and Spendesk. The way they usually work is that they identify a problem that needs solving and find founders to tackle it, providing them both support and funding in exchange for equity. When the studio was created, I had doubts about the model, but later on I became quite enthusiastic about it.

Most eFounders companies are Software-as-a-Service businesses, and several of them were born of a need identified in traditional SMBs and SMEs. However, many pivoted to serve a different market, either tech companies or enterprises, and we can see the same pattern in other SaaS companies as well. So we end up with software that doesn't sell in the market it was originally designed for, and SMBs left on the side of the road with unaddressed digital needs. The reason, we believe, lies with the SaaS-to-SMBs distribution model, and that is the issue Chilli intends to solve.

We are certain that the solution to that problem must involve software. However, we also think technology alone will not be enough; a human touch is necessary, which is why my co-founder and CEO Julien comes from a consulting background. What we will build is a hybrid platform to help leaders identify the pain points in their companies and match them with the best digital tools to solve them. By starting from the customer's needs, we will work around the distribution cost issues and become the missing link between SaaS vendors and traditional SMBs and SMEs.

For me, this is a new and exciting challenge. Despite having twice been a very early employee at startups, I have never been a founder before, and it is something I have wanted to do for a while. Moreover, it means I will be doing a lot of Product and Web development again, a change from the last five years, which I spent mostly in the world of systems software in C and Lua.

On that note, our Web stack is (typed) Python 3 / Flask and TypeScript / Angular, and I am looking for a full stack developer to join the team. This is a junior to mid position based in Paris, France (no remote); since most of the work is on the frontend, experience with Python is not a requirement. If you are interested, get in touch.

April 22, 2019

Richard Kallos (rkallos)

Imperium in Imperio - A Bridge is Made of Planks; Every Plank is a Bridge April 22, 2019 02:30 PM

In the previous three installments, I discussed visualizing ideal final states of things, examining the current state of your life, and how to bridge the two. In the final post in this series, I show how these techniques can be applied at any scale, from a daily to-do list to planning the course of your entire life.

In the previous post, I showed how I write Structural Tension charts. The paragraph on top is where I write the ideal state of some thing that I want to accomplish, whether it’s a finished project, an instilled habit or an attained achievement. The paragraph on the bottom is where I describe as objectively and nonjudgmentally as possible the current reality of the thing I’m setting out to accomplish. The space in between the two paragraphs gets filled with a list of actionable steps I can take to move from where I am to where I want to be.

To me, these steps resemble planks on a bridge. The easier and smaller steps tend to go down near the bottom of the page, and the later tasks tend to be a bit more abstract, large in scope, and usually depend on previous tasks. Sometimes the steps I write out are really big items, like “Get a job as a software developer”. Writing this down doesn’t fill me with the will to go out and get things done. How do I even go about starting such a huge task? The answer: Make a ST chart for that task. If there are any tasks in that new chart that are too large, you can make yet more charts. In the end, you wind up with a forest (in the graph theory sense) of charts that all serve to plot a course for your life.

For example, the ST chart I shared with you in the previous post could have been a single step in a larger chart where I set a course to gain better insight into my emotions, and the step titled “Experiment with active forms of meditation” could have its own chart where I describe the styles I’ve tried, what’s gone well, and what has yet to be tested. Furthermore, I could have a completely separate chart about setting up a regular journaling habit where I list “Try keeping a meditation journal” as one of the steps.

So far the method that I’ve had the most success with is to keep track of these charts on paper, but as much as I deeply enjoy the feel of pen and paper, I can’t help but think that organizing these charts is a task that computers are well-suited to. I’ve noticed there is a lot of software that could be great at managing these ST charts, but I think the two most promising ones are Emacs and TiddlyWiki. I’ll hopefully have more to say in the future about if/how I’ve adapted my system to allow for the help of software.

In conclusion, the process of writing ST charts may lead you to write out steps that are a bit too large to tackle on their own. Fortunately, you can take advantage of the recursive nature of ST charts and break each step down into its own chart, making it easier to plan out projects or ambitions of any size.

Pete Corey (petecorey)

Anonymizing GraphQL Resolvers with Decorators April 22, 2019 12:00 AM

As software developers and application owners, we often want to show off what we’re working on to others, especially if there’s some financial incentive to do so. Maybe we want to give a demo of our application to a potential investor or a prospective client. The problem is that staging environments and mocked data are often lifeless and devoid of the magic that makes our project special.

In an ideal world, we could show off our application using production data without violating the privacy of our users.

On a recent client project we managed to do just that by modifying our GraphQL resolvers with decorators to automatically return anonymized data. I’m very happy with the final solution, so I’d like to give you a run-through.

Setting the Scene

Imagine that we’re working on a Node.js application that uses Mongoose to model its data on the back-end. For context, imagine that our User Mongoose model looks something like this:


const userSchema = new Schema({
  name: String,
  phone: String
});

const User = mongoose.model('User', userSchema);

As we mentioned before, we’re using GraphQL to build our client-facing API. The exact GraphQL implementation we’re using doesn’t matter. Let’s just assume that we’re assembling our resolver functions into a single nested object before passing them along to our GraphQL server.

For example, a simple resolver object that supports a user query might look something like this:


const resolvers = {
  Query: {
    user: (_root, { _id }, _context) => {
      return User.findById(_id);
    }
  }
};

Our goal is to return an anonymized user object from our resolver when we detect that we’re in “demo mode”.

Updating Our Resolvers

The most obvious way of anonymizing our users when in “demo mode” would be to find every resolver that returns a User and manually modify the result before returning:


const resolvers = {
  Query: {
    user: async (_root, { _id }, context) => {
      let user = await User.findById(_id);

      // If we're in "demo mode", anonymize our user:
      if (context.user.demoMode) {
        user.name = 'Jane Doe';
        user.phone = '(555) 867-5309';
      }

      return user;
    }
  }
};

This works, but it’s a high touch, high maintenance solution. Not only do we have to comb through our codebase modifying every resolver function that returns a User type, but we also have to remember to conditionally anonymize all future resolvers that return User data.

Also, what if our anonymization logic changes? For example, what if we want anonymous users to be given the name 'Joe Schmoe' rather than 'Jane Doe'? Doh!

Thankfully, a little cleverness and a little help from Mongoose opens the doors to an elegant solution to this problem.

Anonymizing from the Model

We can improve on our previous solution by moving the anonymization logic into our User model. Let’s write an anonymize Mongoose method on our User model that scrubs the current user’s name and phone fields and returns the newly anonymized model object:


userSchema.methods.anonymize = function() {
  return _.extend({}, this, {
    name: 'Jane Doe',
    phone: '(555) 867-5309'
  });
};

We can refactor our user resolver to make use of this new method:


async (_root, { _id }, context) => {
  let user = await User.findById(_id);

  // If we're in "demo mode", anonymize our user:
  if (context.user.demoMode) {
    return user.anonymize();
  }

  return user;
}

Similarly, if we had any other GraphQL/Mongoose types we wanted to anonymize, such as a Company, we could add an anonymize function to the corresponding Mongoose model:


companySchema.methods.anonymize = function() {
  return _.extend({}, this, {
    name: 'Initech'
  });
};

And we can refactor any resolvers that return a Company GraphQL type to use our new anonymizer before returning a result:


async (_root, { _id }, context) => {
  let company = await Company.findById(_id);

  // If we're in "demo mode", anonymize our company:
  if (context.user.demoMode) {
    return company.anonymize();
  }

  return company;
}

Going Hands-off with a Decorator

Our current solution still requires that we modify every resolver in our application that returns a User or a Company. We also need to remember to conditionally anonymize any users or companies we return from resolvers we write in the future.

This is far from ideal.

Thankfully, we can automate this entire process. If you look at our two resolver functions up above, you’ll notice that the anonymization process done by each of them is nearly identical.

We anonymize our User like so:


// If we're in "demo mode", anonymize our user:
if (context.user.demoMode) {
  return user.anonymize();
}

return user;

Similarly, we anonymize our Company like so:


// If we're in "demo mode", anonymize our company:
if (context.user.demoMode) {
  return company.anonymize();
}

return company;

Because both our User and Company Mongoose models implement an identical interface in our anonymize function, the process for anonymizing their data is the same.

In theory, we could crawl through our resolvers object, looking for any resolvers that return a model with an anonymize function, and conditionally anonymize that model before returning it to the client.

Let’s write a function that does exactly that:


const anonymizeResolvers = resolvers => {
  return _.mapValues(resolvers, resolver => {
    if (_.isFunction(resolver)) {
      return decorateResolver(resolver);
    } else if (_.isArray(resolver)) {
      // Check for arrays before plain objects; lodash's _.isObject
      // returns true for arrays, so the order of these tests matters.
      return _.map(resolver, resolver => anonymizeResolvers(resolver));
    } else if (_.isObject(resolver)) {
      return anonymizeResolvers(resolver);
    } else {
      return resolver;
    }
  });
};

Our new anonymizeResolvers function takes our resolvers map and maps over each of its values. If the value we’re mapping over is a function, we call a soon-to-be-written decorateResolver function that will wrap the function in our anonymization logic. Otherwise, we recursively call anonymizeResolvers on the value if it’s an array or an object, or return it as-is if it’s any other type of value. Note that we have to test for arrays before objects, because lodash’s _.isObject also returns true for arrays.

Our decorateResolver function is where our anonymization magic happens:


const decorateResolver = resolver => {
  return async function(_root, _args, context) {
    let result = await resolver(...arguments);
    if (context.user.demoMode &&
        _.chain(result)
         .get('anonymize')
         .isFunction()
         .value()
    ) {
      return result.anonymize();
    } else {
      return result;
    }
  };
};

In decorateResolver we replace our original resolver function with a new function that first calls out to the original, passing through any arguments our new resolver received. Before returning the result, we check if we’re in demo mode and that the result of our call to resolver has an anonymize function. If both checks hold true, we return the anonymized result. Otherwise, we return the original result.

We can use our newly constructed anonymizeResolvers function by wrapping it around our original resolvers map before handing it off to our GraphQL server:


const resolvers = anonymizeResolvers({
  Query: {
    ...
  }
});

Now any GraphQL resolver that returns a Mongoose model with an anonymize function will return anonymized data when in demo mode, regardless of where the query lives, or when it’s written.

Final Thoughts

While I’ve been using Mongoose in this example, it’s not a requirement for implementing this type of solution. Any mechanism for “typing” objects and making them conform to an interface should get you where you need to go.

The real magic here is the automatic decoration of every resolver in our application. I’m incredibly happy with this solution, and thankful that GraphQL’s resolver architecture made it so easy to implement.

My mind is buzzing with other decorator possibilities. Authorization decorators? Logging decorators? The sky seems to be the limit. Well, the sky and the maximum call stack size.

April 21, 2019

Gonçalo Valério (dethos)

Easy backups with Borg April 21, 2019 05:55 PM

One of the oldest and most frequently repeated pieces of advice for people working with computers is “create backups of your stuff”. People know about it, they are sick of hearing it, they even advise other people about it, but a large percentage of them don’t do it.

There are many tools out there to help you fulfill this task, but throughout the years the one I end up relying on the most is definitely “Borg”. It is really easy to use, has good documentation and runs very well on Linux machines.

Here is how they describe it:

BorgBackup (short: Borg) is a deduplicating backup program. Optionally, it supports compression and authenticated encryption.

The main goal of Borg is to provide an efficient and secure way to backup data. The data deduplication technique used makes Borg suitable for daily backups since only changes are stored. The authenticated encryption technique makes it suitable for backups to not fully trusted targets.

Borg’s Website

The built-in encryption and de-duplication features are some of its more important selling points.

Until recently I’ve had a hard time recommending it to less technical people, since Borg is mostly available through the command line and can take some work to implement the desired backup “policy”. There are web-based graphical user interfaces, but I generally don’t like them as a replacement for native desktop applications.

However, in the last few months I’ve been testing this GUI frontend for Borg, called Vorta, which I think will do the trick for family and friends who ask me what they can use to back up their data.

The tool is straightforward to use and supports the majority of Borg’s functionality; once you set up the repository you can instruct it to regularly perform your backups and forget about it.

I’m not gonna describe how to use it, because with a small search on the internet you can quickly find lots of articles with that information.

The only advice that I would like to leave here about Vorta is related to the encryption and the settings chosen when creating your repository. At least on the version I used, the recommended repokey option will store your passphrase in a local SQLite database in clear text, which is kind of problematic.

This seems to be viewed as a feature:

Fallback to save repo passwords. Only used if no Keyring available.

Github Repository

But I could not find the documentation about how to avoid this “fallback”.

Ponylang (SeanTAllen)

Last Week in Pony - April 21, 2019 April 21, 2019 11:13 AM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

April 20, 2019

Bit Cannon (wezm)

Two Years on Linux April 20, 2019 10:00 PM

This is the sixth post in my series on finding an alternative to Mac OS X. The previous post in the series recapped my first year away from Mac OS and my move to FreeBSD on my desktop computer.

The search for the ideal desktop continues and my preferences evolve as I gain more experience. In this post I summarise where I’m at two years after switching away from Mac OS. This includes leaving FreeBSD on the desktop and switching from GNOME to Awesome. I’ll cover the motivation, benefits, and drawbacks of giving up a complete desktop environment for a “build your own” desktop.

Embracing Awesome

If I were to identify a general trend in my time away from Mac OS it would be one of gradual migration. Initially I was looking to replicate my Mac OS experience. I landed on elementary OS as it shared many of the same values as Mac OS. Over time, I moved to vanilla GNOME and gradually dropped some of the tools I initially felt were essential, like Albert and Enpass. Instead, I opted for built-in functionality or command line tools.

These gateway tools allowed me to remain not too far outside my computing comfort zone. As time goes on, though, I’m adopting more platform-native options, like using the built-in GNOME search instead of having a dedicated app for that like Albert.

GNOME was working pretty well for me and even got updated from 3.18 to 3.28 on FreeBSD (although it’s remained there, and the current version is now 3.32). Despite this, high resource usage, some conversations, blog posts, and a shift in workflow led me to reevaluate tiling window managers.

I was using the terminal more than ever before. I’ve been comfortable in the terminal for a long time but I realised that I was using the tiling features of Tilix and Neovim a lot. I was also using the tiling feature of GNOME to show two apps side-by-side.

The memory usage and log spamming of gnome-shell were bothering me too. The former overflowed into a snarky tweet that led to a conversation which more or less convinced me that the use of JavaScript in gnome-shell was not the ultimate cause of the memory issues; but the fact that such an issue went unfixed for years made me evaluate other options. Note: as of GNOME 3.30 the leak should be largely fixed.

I had a good conversation with a friend and long time Linux proponent about his use of i3, and he commented that he felt I’d probably like a tiling window manager. I’ve tried i3 before but didn’t really like its semi-manual management of layouts. This did prompt me to start looking around though.

I read some interesting blog posts:

It was a comment on the post above that really piqued my curiosity. It mentioned spectrwm as a possible candidate. I installed it and was really taken by its primary/secondary tiling model and the sensible defaults approach. I tweaked and ran spectrwm on my XPS 15 for a while but eventually ran into some limitations of its configuration and integrated bar. At this point I was mostly enjoying a tiling window manager for the first time. I spent some time poring over the Arch Linux Wiki’s “Comparison of tiling window managers” page. I reviewed most of the options on that page, looking for ones that supported the primary/secondary model from spectrwm, were well maintained, were configurable, came with a usable base configuration, and did not have many dependencies.

Eventually I landed on Awesome. It’s a well established project and uses Lua for configuration, which is a simple, easy to learn language that allows almost any configuration to be created. I’ve been happily using it on all my systems for about four months now.

Awesome Window Manager - Using the 'centerwork' layout while working on my linux.conf.au badge

It’s not all roses though, the thing with switching from a desktop environment to just a window manager is that it makes you really realise all the things that you get for free from the desktop environment. After settling into Awesome I needed to build/find replacements for the following features that I took for granted in GNOME:

  • Brightness control with keyboard buttons
  • Volume control with keyboard buttons
  • Setting the DPI correctly for a HiDPI display
  • Display adjustment when adding/removing an external display
  • Automatically unlocking the keyring upon login so that I didn’t need to enter the password for SSH and GPG keys.
  • Displaying the battery and volume level in the top bar
  • Trackpad/mouse configuration:
    • Trackpad acceleration
    • Natural scrolling
    • Enable Clickfingers behaviour
  • Double buffering of windows to prevent tearing and black fills where shadows should be present.
  • Notifications

I did solve all these challenges. Check out my xprofile and rc.lua if you’re curious.


Moving on From FreeBSD

From Oct 2017 to Jan 2019 I ran FreeBSD as the primary OS on my desktop computer. Similarly, I hosted this website and others on a FreeBSD server for more than two years. I recently rebuilt my personal server infrastructure on Docker, hosted by Alpine Linux and went back to Arch Linux on my desktop computer.

It wasn’t any one thing in isolation that led to this switch. It was lots of little things that culminated in a broken system one day that pushed me over the edge. I will just list some issues that come to mind, in no particular order. This post would be very long if I went into detail for each item. I’m aware that there are solutions and workarounds to some of these, like running Linux in bhyve, but it was the sum of the whole, not any individual items, that made me switch:

  • ZFS on Linux being ported to FreeBSD:
    • One of the reasons I used FreeBSD was for ZFS. I did so on the assumption that the FreeBSD implementation was more stable and “more canonical” than ZFS on Linux (ZoL). However, the announcement that ZoL is being ported to FreeBSD to get its bug fixes, improvements, and wider developer base suggested that was wrong.
  • I wanted/needed to use Docker more.
  • The portion of the community that likes to point out jails existed before Docker and are somehow better.
    • In my experience the jails user experience is terrible compared to Docker and lacks a lot of the features that Docker automatically takes care of, such as networking, file system layers/caching, distribution of images.
  • Attending linux.conf.au:
  • The general fear and loathing of all change that some of the community exhibit.
    • They decry everything that doesn’t keep things the way they were in 1970 as a violation of the “UNIX philosophy”, as though everything done by the UNIX grandfathers was perfect and unchangeable.
  • Working on my Rust powered linux.conf.au e-Paper badge, a project that targeted Raspbian, which was easier to test with a Linux host.
  • More advanced virtualisation:
    • Such as built in graphics support, no need for VNC workarounds.
  • Losing hours to slow networking in virtualised environments, something that just works on Linux.
  • The reaction to the improved FreeBSD Code of Conduct last year by some of the community deeply troubled me.
  • Graphics support:
    • The recent drm-kmod work that brings modern graphics support to FreeBSD is a great improvement but it’s a port of Linux code. If I’m running a bunch of Linux code anyway maybe it’s better to just go to the source.
  • The onerous process required to contribute patches to update a port and find someone to review and merge them.
  • Bugs with patches supplied that sit unmerged for months unless you know the right people to nudge.
  • Continued use of tools that are unfamiliar to the vast majority of developers these days (Subversion, patch based workflow).
    • I can and did deal with this but I think it’s a huge barrier to entry for new contributors.
  • An Electron port that no one seems to be able to get over the line.
    • I’m no electron fan but if the choice is no app or an electron app I’d at least like the option to run it.
    • There’s a US$850 bounty on this issue, $50 of which I added myself.

Apologies, I know the above list is a bit ranty. For something a bit less ranty, read this great post by Alexander Leidinger that outlines some things he thinks the project needs to do to stay relevant.

I called out some community behaviour, and reactions above but want to point out that these folks don’t represent the whole community. Lots of the BSD community are lovely and are doing the best they can with the comparatively small resources they have available. I thank them for their efforts.

The clincher was a failed upgrade in January 2019. I think I followed the handbook but something happened to the ZFS pool that prevented the system from booting from it. I was able to boot off an install flash drive and mount the pool fine but it refused to boot by itself. I spent several hours trying to fix it but in the end it was the final straw. I carefully backed everything up and then did a clean Arch Linux + ZFS install.

With the knowledge that ZoL was a lot more mature than I had originally thought I decided to install Arch onto the NVMe drive and then have /home live on a zpool comprised of the 3 SSDs.

One drawback to using ZFS for /home is that Dropbox stops working due to their brain-dead requirement that you must use ext4. There are hacks to work around it but I didn’t have proper Dropbox support on FreeBSD so not having it on this install was no different. My use of Dropbox is in maintenance mode anyway so it’s only rarely that I actually need it.

Finally, I may not be using FreeBSD day-to-day anymore but that doesn’t mean I’ve completely left. I continue to make monthly donations to the FreeBSD and OpenBSD projects and will continue to ensure that BSD systems are well-supported by any software I build. I’ll also advocate for avoiding unnecessarily Linux specific code where possible.


The Journey Continues

After more than two years my journey continues and I expect it to keep doing so. I enjoy exploring what’s out there and my preferences shift over time. In the future I expect to periodically try out Wayland based systems, like I did on the new desktop Arch install (issues with copy and paste between Firefox and Alacritty led me to put that on hold).

On the operating system front NixOS and Guix are pioneering new ways of constructing reliable systems. As a Rust developer I’m also watching Redox OS, an OS written from scratch in Rust. What comes of Google’s Fuchsia project will also be interesting to see unfold. The world of operating systems may not be as diverse as it once was, but there’s still lots to come.

April 18, 2019

Derek Jones (derek-jones)

OSI licenses: number and survival April 18, 2019 12:23 AM

There is a lot of source code available which is said to be open source. One definition of open source is software that has an associated open source license. Along with promoting open source, the Open Source Initiative (OSI) has a rigorous review process for open source licenses (so they say, I have no expertise in this area), and have become the major licensing brand in this area.

Analyzing the use of licenses in source files and packages has become a niche research topic. The majority of source files don’t contain any license information, and, depending on language, many packages don’t include a license either (see Understanding the Usage, Impact, and Adoption of Non-OSI Approved Licenses). There is some evolution in license usage, i.e., changes of license terms.

I knew that a fair few open source licenses had been created, but how many, and how long have they been in use?

I don’t know of any other work in this area, and the fastest way to get lots of information on open source licenses was to scrape the brand leader’s licensing page, using the Wayback Machine to obtain historical data. Starting in mid-2007, the OSI licensing page kept to a fixed format, making automatic extraction possible (via an awk script); there were few pages archived for 2000, 2001, and 2002, and no pages available for 2003, 2004, or 2005 (if you have any OSI license lists for these years, please send me a copy).

What do I now know?

Over the years OSI have listed 110 different open source licenses, and currently lists 81. The actual number of license names listed, since 2000, is 205; the ‘extra’ licenses are the result of naming differences, such as the use of dashes, inclusion of a bracketed acronym (or not), license vs License, etc.

Below is the Kaplan-Meier survival curve (with 95% confidence intervals) of licenses listed on the OSI licensing page (code+data):

Survival curve of OSI licenses.

How many license proposals have been submitted for review, but not been approved by OSI?

Patrick Masson, from the OSI, kindly replied to my query on number of license submissions. OSI doesn’t maintain a count, and what counts as a submission might be difficult to determine (OSI recently changed the review process to give a definitive rejection; they have also started providing a monthly review status). If any reader is keen, there is an archive of mailing list discussions on license submissions; trawling these would make a good thesis project :-)

April 14, 2019

Derek Jones (derek-jones)

The Algorithmic Accountability Act of 2019 April 14, 2019 08:00 PM

The Algorithmic Accountability Act of 2019 has been introduced to the US congress for consideration.

The Act applies to “person, partnership, or corporation” with “greater than $50,000,000 … annual gross receipts”, or “possesses or controls personal information on more than— 1,000,000 consumers; or 1,000,000 consumer devices;”.

What does this Act have to say?

(1) AUTOMATED DECISION SYSTEM.—The term ‘‘automated decision system’’ means a computational process, including one derived from machine learning, statistics, or other data processing or artificial intelligence techniques, that makes a decision or facilitates human decision making, that impacts consumers.

That is all encompassing.

The following is what the Act is really all about, i.e., impact assessment.

(2) AUTOMATED DECISION SYSTEM IMPACT ASSESSMENT.—The term ‘‘automated decision system impact assessment’’ means a study evaluating an automated decision system and the automated decision system’s development process, including the design and training data of the automated decision system, for impacts on accuracy, fairness, bias, discrimination, privacy, and security that includes, at a minimum—

I think there is a typo in the following: “training, data” -> “training data”

(A) a detailed description of the automated decision system, its design, its training, data, and its purpose;

How many words are there in a “detailed description of the automated decision system”? I’m guessing the wording has to be something a consumer might be expected to understand. It would take a book to describe most systems, but I suspect that a page or two is what the Act’s proposers have in mind.

(B) an assessment of the relative benefits and costs of the automated decision system in light of its purpose, taking into account relevant factors, including—

Whose “benefits and costs”? Is the Act requiring that companies do a cost benefit analysis of their own projects? What are the benefits to the customer, compared to a company not using such a computerized approach? The main one I can think of is that the customer gets offered a service that would probably be too expensive to offer if the analysis was done manually.

The potential costs to the customer are listed next:

(i) data minimization practices;

(ii) the duration for which personal information and the results of the automated decision system are stored;

(iii) what information about the automated decision system is available to consumers;

This act seems to be more about issues around data retention, privacy, and customers having the right to find out what data companies have about them.

(iv) the extent to which consumers have access to the results of the automated decision system and may correct or object to its results; and

(v) the recipients of the results of the automated decision system;

What might the results be? A Yes/No on a loan/job application decision and product recommendations are a few.

Some more potential costs to the customer:

(C) an assessment of the risks posed by the automated decision system to the privacy or security of personal information of consumers and the risks that the automated decision system may result in or contribute to inaccurate, unfair, biased, or discriminatory decisions impacting consumers; and

What is an “unfair” or “biased” decision? Machine learning finds patterns in data; when is a pattern in data considered to be unfair or biased?

In the UK, the sex discrimination act has resulted in car insurance companies not being able to offer women cheaper insurance than men (because women have less costly accidents). So the application form does not contain a gender question. But the applicant’s first name often provides a big clue as to their gender. So a similar Act in the UK would require that computer-based insurance quote generation systems did not make use of information on the applicant’s first name. There is other, less reliable, information that could be used to estimate gender, e.g., height, plays sport, etc.

Lots of very hard questions to be answered here.

Ponylang (SeanTAllen)

Last Week in Pony - April 14, 2019 April 14, 2019 01:58 PM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

April 13, 2019

Carlos Fenollosa (carlesfe)

I miss Facebook, and I'm not ashamed to admit it April 13, 2019 05:25 PM

I'm 35. Before Facebook, I had to use different tools depending on whom I wanted to chat with.

I'm not talking about the early era of the Internet, but rather the period after everybody started getting online. Chat was just getting popular, but it was quite limited.

We used ICQ/MSN Messenger to chat with real life friends. IRC was used mostly for "internet friends", as we called them back then. Finally, we had the Usenet and forums for open discussion with everybody else.

If you wanted to post pictures, Flickr was the go-to website. We didn't share many videos, and there was no really good tool to do so, so we didn't care much.

There were Myspace and Fotolog, very preliminary social networks which had their chance but simply didn't "get it."

Then Facebook appeared. And it was a big deal.

Add me on Facebook

Whenever you met somebody IRL you would add them to Facebook almost immediately, and keep connected through it.

Suddenly, everybody you knew and everybody you wanted to know was on Facebook, and you could reach all of them, or any of them, quickly and easily.

At that time, privacy was not such a big concern. We kinda trusted the network, and furthermore, our parents and potential employers weren't there.

On Facebook, we were raw.

At some point it all went south. The generational change, privacy breaches, mobile-first apps and the mass adoption of image and video moved everybody to alternative platforms. Whatsapp, mainly for private communications, and Instagram as our facade.

I wrote about Facebook's demise so I will not go through the reasons here. Suffice to say, we all know what happened.

The Wall was replaced by an algorithm which sunk original content below a flood of ads, fake news, and externally shared content "you might like". We stopped seeing original content. Then, people stopped sharing personal stuff, as nobody interacted with it.

In the end, we just got fed up with the changes, and maybe some people just wanted something shiny and new, or something easier to use.

Facebook was a product of its era, technologically and socially. But, as a service, it was peak human connection. Damn you Zuck, you connected mankind with a superb tool, then let it slip through your fingers. What a tragic outcome.

Current social networks, not the same thing

I, too, moved to Instagram when friends stopped being active on Facebook and encouraged me to create an account there.

Then I realized how fake it is. Sorry for the cliché, but we all know it's true.

I gave it an honest try. I really wanted to like it. But I just couldn't. At least, not as an alternative to Facebook. Stories were a step forward, but I felt —maybe rightfully— that I was being gamed to increase my engagement, not to have access to my friends content.

Instagram is a very different beast. There is no spontaneity; all posts are carefully selected images, masterfully filtered and edited, showcasing only the most successful of your daily highlights.

I admit it's very useful to connect with strangers, but the downside is that you can't connect with friends the same way you did on Facebook.

Of course, I'm not shooting the messenger, but let me apportion a bit of blame. A service that is a picture-first sharing site and demotes text and comments to an afterthought makes itself really difficult to consider as an honest two-way communication tool.

Instagram is designed to be used as it is actually used: as a posturing tool.

On Facebook you could share a moment with friends. With Instagram, however, moments are projected at you.

I miss Facebook

I miss knowing how my online friends are really doing these days. Being able to go through their life, their personal updates, the ups and the downs.

I miss spontaneous updates at 3 am, last-minute party invites, making good friends with people who I just met once in person and now live thousands of kilometers away.

I miss going through profiles of people to learn what kind of music and movies they liked, and feeling this serendipitous connection based on shared interests with someone I did not know that well in real life.

I miss the opportunity of sharing a lighthearted comment with hundreds of people that understand me and will interpret it in the most candid way, instead of the nitpicking and criticism of Twitter.

I miss the ability to tell something to my friends without the need of sharing a picture, the first-class citizen treatment of text.

I miss the degree of casual social interaction that Facebook encouraged, where it was fine to engage with people sporadically. On the contrary, getting a comment or a Like from a random acquaintance could make your day.

I miss when things online were more real, more open.

I miss peak Facebook; not just the tool, but the community it created.

Facebook was the right tool at the right time

Somebody might argue that the people I am not in touch with anymore were clearly not such good friends. After all, I still talk to my real-life friends and share funny pics via Whatsapp.

Well, those critics are right; they were not so important in my life as to keep regular contact. But they still held a place in there, and I would have loved to still talk to them. And the only socially acceptable way to keep in touch with those acquaintances was through occasional contact via Facebook. I've heard the condescending "pick up the phone and call them"; we all know that's not how it works.

In the end, nobody is in a position to judge how people enjoy their online tools. If users prefer expressing themselves with pictures rather than text, so be it. There is nothing wrong with fishing for Likes.

So please don't misinterpret me, nobody is really at fault. There was no evil plan to move people from one network to another. No one forced friends to stop posting thoughts and post only pics. Instagram just facilitated a new communication channel that people happened to like more than the previous one.

When Facebook Inc. started sensing its own downfall, they were happy to let its homonymous service be cannibalized by Instagram. It's how business works. The time of Facebook had passed.

I'm sorry I can't provide any interesting conclusion to this article. There was no real intent besides feeling nostalgic for a tool and community that probably won't come back, and hopefully connecting with random strangers that might share the same sentiment.

Maybe, as we all get older, we just want to enjoy what's nice of life, make everybody else a little bit jealous, and avoid pointless online discussions. We'd rather shut up, be more careful, and restrict our online interactions to non-rebuttable pictures of our life.

We all, however, lost precious connections on the way.

Tags: life, internet, facebook, web


April 12, 2019

Bogdan Popa (bogdan)

The Problem with SSH Agent Forwarding April 12, 2019 11:00 AM

After hacking the matrix.org website today, the attacker opened a series of GitHub issues mentioning the flaws he discovered. In one of those issues, he mentions that “complete compromise could have been avoided if developers were prohibited from using [SSH agent forwarding].”

April 08, 2019

Gustaf Erikson (gerikson)

The Gun by C. J. Chivers April 08, 2019 07:29 AM

A technical, social and political history of the AK-47 assault rifle and derivatives.

Chivers does a good job tying the design of the gun into Soviet defense policy, and compares the development of the weapon favorably with the US introduction of the M16.

The author explores the issues with the massive proliferation of these assault rifles worldwide, but he seems to have a blind spot for the similar proliferation of semi-automatic weapons with large magazine sizes in the US. He has faith that the situations that lead to the widespread uses of the AK-47 will never occur in the USA.

Gergely Nagy (algernon)

On Git workflows April 08, 2019 06:45 AM

To make things clear, I'll start this post with a strongly held opinion: E-mail based git workflows are almost always stupid, because in the vast majority of cases, there exists a more reliable, more convenient, easier to work with workflow, which usually requires less setup and even fewer sacrificial lambs. I've tried to explain this a few times on various media, but none of those provide the tools for me to properly do so, hence this blog post was born, so I can point people to it instead of trying to explain it - briefly - over and over again. I wrote about mailing lists vs GitHub before, but that was more of an anti-GitHub rebuttal than a case against an e-mail workflow.

I originally wanted to write a long explanation comparing various workflows: A Forge's web UI vs Forge with loose IDE integration vs Forge with tight IDE integration vs E-mail-based variations. However, during this process I realised I don't need to go that far, I can just highlight the shortcomings of e-mail with a few examples, and then show a glimpse into the power a Forge can give us.

One of the reasons I most often hear in support of an e-mail-based workflow is that git ships with built-in tools for collaborating over e-mail. It does not. It ships with tools to send e-mail, and tools to process e-mail sent by itself. There's no built-in tool to bridge the two; it is entirely up to you to do so. Collaboration is not about sending patches into the void. Collaboration includes receiving feedback, incorporating it, change, iteration. Git core does not provide tools for those, only some low-level bits upon which you can build your own workflow.

Git is also incredibly opinionated about how you should work with e-mail: one mail per commit, nicely threaded, patches inline. But that's not the only way to have an e-mail based workflow: PostgreSQL for example uses attachments and multiple commits attached to the same e-mail. This isn't supported by core git tools, even though it solves a whole lot of problems with the one inline patch per commit method - more about that a few paragraphs below.

So what's the problem with e-mail? First of all, in this day and age, delivery is not reliable. This might come as a surprise to proponents of the method, but despite the SMTP protocol being resilient, it is not reliable. It will keep retrying if it gets a temporary failure, yes. But that's about the only thing it guarantees, that it keeps trying. Once we add spam filters, greylisting, and a whole lot of other checks one needs in 2019 to not drown in junk, there's a reasonable chance that something will, at some point, go horribly wrong. Let me describe a few examples from personal experience!

At one time, I sent a 10-commit patch series to a mailing list. I wasn't subscribed at the time, and the mailing list software silently dropped every mail: on the SMTP level, it appeared accepted, but it never made it to the list. I had no insight into why, and had to contact a list admin to figure it out. Was it a badly configured server? Perhaps, or perhaps not. Silently dropping junk makes sense if you don't want to let the sender know that you know they're sending junk. Sometimes there are false positives, which sucks, but the administrators made this trade-off, who am I to disagree? Subscribing and resending worked flawlessly, but this introduced days of delay and plenty of extra work for both me and the list admins. Not a great experience. I could have read more about the contribution process and subscribed in advance, but as this was a one-off contribution, subscribing to the list (a relatively high-volume one) felt like inviting a whole lot of noise for no good reason. Having to subscribe to a list to meaningfully contribute is also a big barrier: not everyone's versed in efficiently handling higher volumes of e-mail (nor do people need to be).

Another time, back when greylisting was new, I had some of my patches delayed for hours. This isn't a particularly big deal, as I'm in no rush. It becomes a big deal when patches start arriving out of order, sometimes with hours between them because I didn't involve enough sacrificial lambs to please the SMTP gods. When the first feedback you get is "where's the first patch?", even though you sent it, that's not a great experience. I've even had a case where a part of the commit thread was rejected by the list, another part went through. What do you do in this case? You can't just resend the rejected parts unchanged. If you change them to please the list software, that pretty much invalidates the parts that did get through - and nothing guarantees that they'll all get through this time, either.

In all of these cases, I had no control. I didn't set the mailing lists up, I didn't configure their SMTP servers. I did everything by the book, and yet...

From another point of view, as a reviewer, receiving dozens of mails in a thread for review isn't as easy to work with as one would like. For example, if I want to send feedback on two different - but related - commits, then I have to either send two mails, as replies to the relevant commits, or merge the two patches into one e-mail for the purpose of replying. In the second case, it's far too easy to lose track of what's where and why.

With these in mind, I'm sorry to say, but e-mail is not reliable. E-mail delivery is not reliable. It is resilient, but not reliable (see above). The contents of an e-mail are fragile: change the subject, and git am becomes unhappy. You want to avoid bad MUAs screwing up patches? Attach them! Except the default git tooling can't deal with that. There are so many things that can go wrong, it's not even funny. Many of those things, you won't know until hours, or days later. That's not a reliable way to work.

Sending patches as attachments, in a single mail, solves most of these problems: if it gets rejected, all gets rejected. If it gets delayed, the whole thing gets delayed. Patches never arrive out of order and with delays. Reviewing multiple commits becomes easier too, because all of them are available at hand, without having to build tooling to make them available. But patches as attachments aren't supported by core git tools. Even in this case, there's plenty more you can't easily do, because there's something that patches lack: plenty of meta-information.

You can't easily see more context than what the patch file provides. You can, if you apply the patchset and look at the tree, but that's not something the default tools provide out of the box. It's not hard to do that, not hard to automate it, but it doesn't come out of the box. To navigate the source code at any given time of its history, you have to apply the patches too. There are plenty of other things where one wants more information than what is available in a patch.

But I said in the opening paragraph that:

there exists a more reliable, more convenient, easier to work with workflow, which usually requires less setup and even fewer sacrificial lambs

So what is this magical, more reliable, more convenient, easier to work with workflow? Forges. Forges like GitHub, GitLab, Gitea, and so on. You may have been led to believe that you need a browser for these, that a workflow involving a forge cannot be done without switching to a browser at some point. This is true: you will usually need a browser to register. However, from that point onwards, you do not, because all of these forges provide powerful APIs. Powerful APIs that are much easier to build good tooling upon than e-mail. Why? Because these APIs are purpose-built, their reason for existence is to allow tooling to be built upon them. That's their job. When you have purpose-built tools, those will be easier to work with than something as generic and lax as e-mail. It also means that Forges do most of the integration required for our workflow. We only have to build one bridge: one between the API, and our IDE of choice.

As an example, lets look at magit/forge, an Emacs package that integrates forge support into Magit (the best git UI out there, ever, by far)!

Forge overview

We see pull requests, issues, recent commits, and the current branch in one place. Want to look at an issue? Navigate there, press enter, and voila:

Viewing an issue with Forge

Easy access to the whole history of the issue. You can easily quote, reply, tag, whatever you wish. From the comfort of your IDE.

Pull-requests? Same thing, navigate there, press enter:

Viewing a pull request with Forge

You have easy access to all discussions, all the commits, all the trees, from the comfort of your IDE. You do not need to switch to another application, with different key bindings, slightly different UX. You do not need to switch to a browser, either. You can do everything from the comfort of your Integrated Development Environment. And this, dear reader, is awesome.

The real power of the forges is not that they provide a superior user experience out of the box - they kinda do anyway, since you only have to register and you're good to go. No need to care about SMTP, formatting patches, switching between applications and all that nonsense. The web UI is quite usable for a newcomer. For a power-user - no; using a browser for development would be silly (alas, poor people stuck using Atom, VS Code or Chromebooks). Thankfully, we do not have to, because all of the forges provide APIs, and many IDEs also provide various levels of integration.

But what these Forges provide are not just easy access to issues, pull-requests and commits at one's fingertips. They provide so much more! You see, with a tightly integrated solution, if you want to expand the context of a patch, you can: it's already right there, a single shortcut away. You can easily link to parts of the patchset, or the code, and since they'll be links, everyone reading it will have an easy, straightforward way to navigate there. You can reference issues, pull-requests from commit messages, other issues or pull-requests - and they'll be easy to navigate to, out of the box. A forge binds the building blocks together, to give us an integrated solution out of the box.

Forges make the boring, tedious things invisible. They're not exclusive owners of the code either: you can always drop down to the CLI and use low-level git commands if need be. This is what computers are meant to do: help us be more efficient, make our jobs more convenient, our lives easier. Thankfully, we have Forges like GitLab, Gitea and others, that are open source. We aren't even forced to trust our code, meta-data and workflows to proprietary systems.

However, forges aren't always a good fit. There are communities that wouldn't work well with a Forge. That's ok too. But in the vast majority of cases, a forge will make the life of contributors, maintainers and users easier. So unless you're the Linux kernel, don't try to emulate them.

Pete Corey (petecorey)

FizzBuzz is Just a Three Against Five Polyrhythm April 08, 2019 12:00 AM

Congratulations, you’re now the drummer in my band. Unfortunately, we don’t have any drums, so you’ll have to make do by snapping your fingers. Your first task, as my newly appointed drummer, is to play a steady beat with your left hand while I lay down some tasty licks on lead guitar.

Great! Now let’s add some spice to this dish. In the span of time it takes you to snap three times with your left hand, I want you to lay down five evenly spaced snaps with your right. You probably already know this as the drummer in our band, but this is called a polyrhythm.

Sound easy? Cool, give it a try!

Hmm, I guess being a drummer is harder than I thought. Let’s take a different approach. This time, just start counting up from one. Every time you land on a number divisible by three, snap your left hand. Every time you land on a number divisible by five, snap your right hand. If you land on a number divisible by both three and five, snap both hands.

Go!

You’ll notice that fifteen is the first number we hit that requires we snap both hands. After that, the snapping pattern repeats. Congratulations, you don’t even have that much to memorize!

Here, maybe it’ll help if I draw things out for you. Every character represents a tick of our count. "ı" represents a snap of our left hand, ":" represents a snap of our right hand, and "i" represents a snap of both hands simultaneously.

But man, I don’t want to have to manually draw out a new chart for you every time I come up with a sick new beat. Let’s write some code that does it for us!


_.chain(_.range(1, 15 + 1))
    .map(i => {
        if (i % 3 === 0 && i % 5 === 0) {
            return "i";
        } else if (i % 3 === 0) {
            return "ı";
        } else if (i % 5 === 0) {
            return ":";
        } else {
            return ".";
        }
    })
    .join("")
    .value();

Here’s the printout for the “three against five” polyrhythm I need you to play:

..ı.:ı..ı:.ı..i

But wait, this looks familiar. It’s FizzBuzz! Instead of printing "Fizz" for our multiples of three, we’re printing "ı", and instead of printing "Buzz" for our multiples of five, we’re printing ":".

FizzBuzz is just a three against five polyrhythm.

We could even generalize our code to produce charts for any kind of polyrhythm:


const polyrhythm = (pulse, counterpulse) =>
    _.chain(_.range(1, pulse * counterpulse + 1))
        .map(i => {
            if (i % pulse === 0 && i % counterpulse === 0) {
                return "i";
            } else if (i % pulse === 0) {
                return "ı";
            } else if (i % counterpulse === 0) {
                return ":";
            } else {
                return ".";
            }
        })
        .join("")
        .value();

And while we’re at it, we could drop this into a React project and create a little tool that does all the hard work for us:

Anyways, we should get back to work. We have a Junior Developer interview lined up for this afternoon. Maybe we should have them play us a polyrhythm to gauge their programming abilities?

April 07, 2019

Bogdan Popa (bogdan)

Continuations for Web Development April 07, 2019 02:00 PM

One of the distinguishing features of Racket’s built-in web-server is that it supports the use of continuations in a web context. This is a feature I’ve only ever seen in Smalltalk’s Seaside before, though Racket’s version is more powerful.

April 05, 2019

Simon Zelazny (pzel)

Uses for traits vs type aliases in Ponylang April 05, 2019 10:00 PM

I realized today that while both traits and type aliases can be used to represent a union of types in Pony, each of these solutions has some characteristics which make sense in different circumstances.

Traits: Keeping an interface open

Let's say you have the trait UUIDable.

trait UUIDable
  fun uuid(): UUID

and you have a method that accepts object implementing said trait.

  fun register_uuid(thing: UUIDable): RegistrationResult =>

Declaring the function parameter type like this means that any thing that has been declared to implement the trait UUIDable will be a valid argument. Inside the method body, we can only call .uuid() on the thing, because that's the only method specified by the trait.

We can take an instance of a class declared as class User is UUIDable, and pass it to register_uuid. When we continue development and add class Invoice is UUIDable, no change in any code is required for register_uuid to also accept this new class. In fact, we are free to add as many UUIDable classes to our codebase as we like, and they'll all work without any changes to register_uuid.

This approach is great when we just want to define a typed contract for our methods. However, it does not work when we want to explicitly break encapsulation and – for example – match on the type of the thing.

fun register_uuid(thing: UUIDable): RegistrationResult =>
  match thing
  | let u: User => _register_user(u)
  | let i: Invoice => _register_invoice(i)
  end

The compiler will complain about this method definition, because it can't know that User and Invoice are the only existing types that satisfy UUIDable. For the compiler, this now means that any UUIDable thing that is not a User or an Invoice will fall through the match, and so the resulting output type must also include None, which represents the 'missed' case in our match statement.

We know that the above match is indeed exhaustive. Users and Invoices will be the only types that satisfy UUIDable. How can we let the compiler know?

Type aliases: explicit and complete enumerations

If we want to break encapsulation, and are interested in an exhaustive and explicit union type, then a type alias gives the compiler enough info to determine that the match statement is in fact exhaustive:

type UUIDable is (User | Invoice)
fun register_uuid(thing: UUIDable): RegistrationResult =>
  match thing
  | let u: User => _register_user(u)
  | let i: Invoice => _register_invoice(i)
  end

Different situations will call for different approaches. The type alias approach means that anytime you add a new UUIDable, you'll have to redefine this alias, and have to go through all your match statements and add a new case. The silver lining is that the compiler will tell you which places you need to modify.

Also, note that you can still call thing.uuid() and have it type-check, as the compiler can determine that all classes belonging to (User | Invoice) actually provide this method.

Encapsulation vs. exhaustiveness

Using traits (or interfaces for even more 'looseness') means that, in the large, your code will have to conform to the OOP practices of loose coupling, information hiding, and encapsulation.

Using union types defined as type aliases means that encapsulation is no longer possible, but the compiler will guide you in making sure that matches are exhaustive when you take apart function arguments. This results in the code looking more 'functional' in the large.

You can play around with this code in the Pony playground.

Siddhant Goel (siddhantgoel)

The “Hacker News Effect” April 05, 2019 10:00 PM

I submitted Developer to Manager to Hacker News on 25th March. The common/accepted knowledge is that there's a ton of variation on what makes a post hit the front page. Having absolutely no knowledge about those variations, I submitted the post regardless, not thinking much about it. What happened next was something I wasn't expecting at all.

The post ended up trending on the front page for one full day, gathering slightly more than 500 upvotes in the process, resulting in a cool 50,000 page views over the next 2 days, and moving the site's Alexa rank by about 2 million places (up). It was quite an experience to see the number of concurrent visitors to the site jump beyond 300.

The site is completely static and hosted on Netlify, so I wasn't too worried about all that traffic taking things down, but that's beside the point.

Fathom Analytics

The main takeaway for me, personally, was the realization that I was not the only one wishing for something like this to exist. I started working on this project because I really felt the need for a resource like this, which could help me with my own transition to a slightly more managerial role. Reading the comments on Hacker News, and the tons of encouraging emails that people sent me, made it clear that there are plenty of other developers who are transitioning to management and looking for resources to assist with the transition. Solving personal problems is always good, but it's even better when others validate the solution.

So everyone who upvoted the HN post, left an encouraging comment, sent me an email, volunteered for an interview for the site, or helped in any other way - a huge thank you! I'll be using all that motivation to make Developer to Manager the site you open when you want to know how to become a good engineering manager.

April 03, 2019

Gokberk Yaltirakli (gkbrk)

Plaintext budgeting April 03, 2019 08:26 PM

For the past ~6 months, I’ve been using an Android application to keep track of my daily spending. To my annoyance, I found out that the app doesn’t have an export functionality. I didn’t want to invest more time in a platform that I couldn’t get my data out of, so I started looking for another solution.

I’ve looked into budgeting systems before, and I’ve seen both command-line (ledger) and GUI systems (GNUCash). Now, both of these are great pieces of software, and I can appreciate how double-entry bookkeeping is useful for accounting purposes. But while they are powerful, they’re not as simple as they could be.

I decided to go with CSV files. CSV is one of the most universal file formats, it’s simple and obvious. I can process it with pretty much every programming language and import it to pretty much every spreadsheet software. Or… I could use a shell script to run calculations with SQLite.

If I ever want to migrate to another system; it will probably be possible to convert this file with a shell script, or even a sed command.

I create monthly CSV files in order to keep everything nice and tidy, but the script adapts to everything from a single CSV file to one file for each day/hour/minute.

Here’s what an example file looks like:

Date,Amount,Currency,Category,Description
2019-04-02,5.45,EUR,Food,Centra
2019-04-03,2.75,EUR,Transport,Bus to work

And here’s the script:

#!/bin/sh

# Number of days to average over (defaults to 7; override with the first argument)
days=${1:-7}

# Merge all monthly CSV files into one, dropping the header rows
cat *.csv | sed '/^Date/d' > combined.csv.temp

output=$(sqlite3 <<EOF
create table Transactions(Date, Amount, Currency, Category, Description);
.mode csv
.import combined.csv.temp Transactions
.mode list

select 'Amount spent today:',
coalesce(sum(Amount), 0) from Transactions where Date = '$(date +%Y-%m-%d)';

select '';
select 'Last $days days average:',
sum(Amount)/$days, Currency from Transactions where Date > '$(date --date="-$days days" +%Y-%m-%d)'
group by Currency;

select '';
select 'Last $days days by category';
select '=======================';

select Category, sum(Amount) from Transactions
where Date > '$(date --date="-$days days" +%Y-%m-%d)'
group by Category order by sum(Amount) desc;
EOF
)

rm combined.csv.temp

# sqlite3's list mode separates columns with '|'; swap those for spaces
echo "$output" | sed 's/|/ /g'

This is the output of the command:

[leo@leo-arch budget]$ ./budget.sh
Amount spent today: 8.46

Last 7 days average: 15.35 EUR

Last 7 days by category
=======================
Groceries 41.09
Transport 35.06
Food 31.35
[leo@leo-arch budget]$ ./budget.sh 5
Amount spent today: 8.46

Last 5 days average: 11.54 EUR

Last 5 days by category
=======================
Groceries 29.74
Transport 17.06
Food 10.9
[leo@leo-arch budget]$

Andrew Montalenti (amontalenti)

Shipping the Second System April 03, 2019 12:54 PM

In 2015-2016, the Parse.ly team embarked upon the task of re-envisioning its entire backend technology stack. The goal was to build upon the learnings of more than 2 years delivering real-time web content analytics, and use that knowledge to create the foundation for a scalable stream processing system that had built-in support for fault tolerance, data consistency, and query flexibility. Today in 2019, we’ve been running this new system successfully in production for over 2 years. Here’s what we learned about designing, building, shipping, and scaling the mythical “second system”.

The Second System Effect

But why re-design our existing system? This question lingered in our minds a few years back. After all, the first system was successful. And I had the lessons of Frederick Brooks accessible and nearby when I embarked on this project. He wrote in The Mythical Man-Month:

Sooner or later the first system is finished, and the architect, with firm confidence and a demonstrated mastery of that class of systems, is ready to build a second system.

This second is the most dangerous system a man ever designs.

When he does his third and later ones, his prior experiences will confirm each other as to the general characteristics of such systems, and their differences will identify those parts of his experience that are particular and not generalizable.

The general tendency is to over-design the second system, using all the ideas and frills that were cautiously sidetracked on the first one. The result, as Ovid says, is a “big pile.”

Were we suffering from engineering hubris to redesign a working system? Perhaps. But we may have been suffering from something else altogether healthy — the paranoia of a high-growth software startup.

I discuss Parse.ly’s log-oriented architecture at Facebook’s HQ for PyData Silicon Valley, with Parse.ly’s VP of Engineering, Keith Bourgoin.

Our product had only just been commercialized. We were a team small enough to be nimble, but large enough to be dangerous. Yes, there were only a handful of engineers. But we were operating at the scale of billions of analytics events per day, on-track to serve hundreds of enterprise customers who required low-latency analytics over terabytes of production data. We knew that scale was not just a “temporary problem”. It was going to be the problem. It was going to be relentless.

Innovation Leaps

Innovation doesn’t come incrementally; it comes in waves. And in this industry, yesterday’s “feat of engineering brilliance” is today’s “commodity functionality”.

This is the environment I found myself in during 2014-2015: We had a successful system with a lot of demands from customers, but no way to satisfy those demands, because the system had been designed with a heap of trade-offs that reflected “life on the edge” in 2012-2014, working with large-scale analytics data.

Back then, we were eager adopters of tools like Hadoop Streaming, Apache Pig, ZeroMQ, and Redis, to handle both low-latency and large-batch analytics. We had already migrated away from these tools, refreshing our stack as the open source community shipped new usable tools.

Meanwhile, expectations had changed. Real-time was no longer innovative; it was expected. A cornucopia of nuanced metrics were no longer nice-to-haves; they were necessities for our customers.

It was no longer enough for our data to be “directionally correct” (our main goal during the “initial product/market fit” phase), but instead it needed to be enterprise-ready: verifiable, auditable, resilient, always-on. The web was changing, too. Assumptions we made about the kind of content our system was measuring shifted before our very eyes. Edge cases were becoming common cases. The exception was becoming the rule. Our data processing scale was already on track to 10x. We could see it going 100x. (It actually went 1000x.)

Once a new technology rolls over you, if you’re not part of the steamroller, you’re part of the road.

–Stewart Brand

So yes, we could have attempted an incremental improvement of the status quo. But doing so would have only resulted in an incremental step, not a giant leap. And incremental steps wouldn’t cut it.

Joel Spolsky famously said that rewriting a working system is something “you should never do”.

There’s a subtle reason that programmers always want to throw away the code and start over. The reason is that they think the old code is a mess. And here is the interesting observation: they are probably wrong.

The reason that they think the old code is a mess is because of a cardinal, fundamental law of programming: It’s harder to read code than to write it.

But what about Apple II vs Macintosh? Did Steve Jobs and his pirate team make a mistake in deciding to rethink everything from scratch — from the hardware to operating system to programming model — for their new product? Or was rethinking everything the price of admission for a high-growth tech company to continue to innovate in its space?

Big Rewrites are New Products

Perhaps we really have a problem of term definition.

When we think “rewrite”, we think “refactor”, as in change the codebase for a “zen garden” requirement, like code cleanliness or performance.

In software, we should admit that “big rewrites” aren’t about refactoring existing products — they are about building and shipping brand new products upon existing knowledge.

Perhaps what your existing product tells you is a core set of use cases which the new product must satisfy. But if your “rewrite” doesn’t support a whole set of new use cases, then it is probably doomed to be a waste of time.

If the Macintosh had shipped and was nothing more than a prettier, better-engineered Apple II, it would have been a failure. But the Macintosh represented a leap from command prompts to graphical user interfaces, and from keyboard-oriented control to the mouse. These two major innovations (among others) propelled not just Apple’s products into the mainstream, but also fueled the personal computing revolution!

So, forget rewrites. Think, new product launches. We re-use code and know-how, yes, but also organizational experience.

We eliminate sunk cost thinking and charge ahead. We fly the pirate flag.

But, we still keep our wits about us.

If you are writing a new product as a “rewrite”, then you should expect it to require as much attention to detail as shipping the original product took, with the added downsides of legacy expectations (how the product used to work) and inflated future expectations (how big an impact the new product should have).

Toward the Second System

So, how did we march toward this new launch? How did we do a big rewrite in such a way that we lived to tell the tale?

System One Problems

In the case of Parse.ly, one thing we did right was to focus on the actual, not imagined or theoretical, problems. We identified a few major problems with the first version of our system, which was in production 2012-2015.

  • Data consistency problems due to pre-aggregation. Issues with data consistency and reliability were our number one support burden, and investigating them was our number one engineering distraction.
  • Did not track analytics at the visitor and URL level. As our customers moved their web content from the “standard article” model to a number of new content forms, such as interactive features, quizzes, embedded audio/video, cards/streams, landing pages, mobile apps, and the like, the unit of tracking we used in our product/market fit MVP (the “article” or “post”) started to feel increasingly limiting.
  • Only supported a single metric, the page view, or click. Our customers wanted to understand alternative metrics, segments, and source attributions — and we only supported this single metric throughout our databases and user interface. This metric was essentially a core assumption built throughout the system. That we only supported this one metric at first might seem surprising — after all, page views are a basic metric! But, that was precisely the point. During our initial product/market fit stage, we were trying to drain as much risk as possible from the problem. We were focusing on the core value, and market differentiation. We weren’t trying to prove the value of new metrics. We were trying to prove the value of a new category of analytics, which we called “content analytics” — the merger of content understanding (what the content is about) and content engagement (how many people visited the content and where they came from). From a technical standpoint, our MVP addressed both of these issues while only supporting a single metric, along with various filters and groupings thereof. Note: in retrospect, this was a brilliant bit of scope reduction in our early days. This MVP got us to our first million in annual revenue, and let us see all sorts of real-world data from real customers. The initial revenue gave us enough runway to survive to a Series A financing round, as well.

There were other purely technical problems with the existing system, too:

  • Some components were single points of failure, such as our Redis real-time store.
  • Our system did not work easily with multi-data-center setups, thus making failover and high availability harder to provide.
  • Data was stored and queried in two places for real-time and historical queries (Redis vs MongoDB), making the client code that accessed the data complex.
  • Rebuilding the data from raw logs was not possible for real-time data and was very complex for historical data.
  • To support our popular API product, we had a completely separate codebase, which had to access and merge data from four different databases (Redis, MongoDB, Postgres, and Apache Solr).

System One Feats of Engineering

However, System One also did many things very well.

  • Real-time and historical queries were satisfied with very good latencies; often milliseconds for real-time and below 1-2 seconds for historical data.
  • Data collection was solid, tracking billions of page views per month and archiving them reliably in Amazon S3.
  • The API was serving a billion requests per month with very stable latencies.
  • Despite not supporting multiple data centers, we had a high availability and failover story that had worked for us so far.

System One had also undergone its share of refactorings. It started its life as simple cronjobs, evolved into a Python and Celery/ZeroMQ system, and eventually into using Storm and Kafka. It had layered on “experimental support” for a couple of new metrics, namely visitors (via a clever implementation of HyperLogLog) and shares (via a clever share crawling subsystem). Both were proving to be popular data sources, though these metrics were only supported in a limited way, due to their experimental implementation. Throughout, System One’s data was being used to power rich dashboards with thousands of users per customer; CSV exports that drove decision-making; and APIs that powered sites the world over.

System Two Requirements

Based on all of this, we laid out our requirements for System Two, both customer-oriented and technical.

Customer requirements:

  • URL is basic unit of tracking; every URL is tracked.
  • Supports multiple sources, segments, and metrics.
  • Still supports page views and referrers.
  • Adds visitor-oriented understanding and automatic social share counting — unique visitor counts and social interaction counts across URLs, posts, and content categories. The need for this was proven by our experiments in System One.
  • Real-time queries for live updates.
  • 5-minute buckets for past 24 hours and week-ago benchmarking.
  • 1-day buckets for past 2 years.
  • Low latency queries, especially for real-time data.
  • Verifiably correct data.

Technical requirements:

  • Real-time processing can handle current firehose with ease — this started at 2k pixels per second, but 10x’ed to 20k pixels per second in 2016.
  • Batch processing can do rebuilds of customer data and daily periods.
  • The batch and real-time layers are simplified, with shared code among them in pure Python.
  • Databases are linearly and horizontally scalable with more hardware.
  • Data is persistent and highly available once stored.
  • Query of real-time and historical data uses a unified time series engine.
  • A common query language and client library is used across our dashboard and our API.
  • Room in model for adding fundamentally new metrics (e.g. engaged time, video starts) without rearchitecting the entire system.
  • Room in the model for adding new dimensions (e.g. URL-grouping campaign parameters, new data channels) without re-architecting the system or re-indexing production data.

Soft requirements that acted as a guiding light:

  • Backend codebase should be much smaller.
  • There should be fewer production data storage engines — ideally, one or two.
  • System should be easier to reason about; the distributed cluster topology should fit on a single slide.
  • Our frontend team should feel much more user interface innovation is possible atop the new query language for real-time and historical data.

And with that, we charged ahead!

The First Prototype

First prototypes are where engineering hubris is at its height. This is because we can invent various scenarios that allow us to verify our own dreams. It is hard to be the cold scientist who proves the experiment a failure in the name of truth. Instead, we are emotionally tied up in our inventions, and want them to succeed.

Put another way:

But sometimes, merely by believing we shall succeed, we can fashion a bespoke instrument of innovation that causes us to succeed.

In the First Prototype of this project, I wanted to prove two things:

  1. That we could have a simpler codebase.
  2. That the state-of-the-art in open source technology had moved far enough along to benefit us toward our concrete System Two goals.

To start things off, I created a little project for Cassandra and Elasticsearch experiments, which we code-named “casterisk”. I went as deep as I could to teach myself these two technologies. As part of the prototyping process, we also shared what we learned in technical deep-dive posts on Cassandra, Lucene, and Elasticsearch.

The First Prototype had data shaped like our data, but it wasn’t quite our data. It generated random but reasonable-looking customer traffic in a stream, and using the new tools available to me, I managed to restructure the data in myriad ways. Technically speaking, I now had Cassandra CQL tables representing log-structured data that could be scaled horizontally, and Elasticsearch indices representing aggregate records that could be queried across time and also scaled horizontally. The prototype was starting to look like a plausible system.

Some early time series data flowing through our “casterisk” prototype.

But then, it took about 3 months — from May to August — for the prototype to go from “R&D” to “pre-prod” stage. It wasn’t until August of 2015 that we published a post detailing all the new metrics supported in our new “beta” backend system. Why so long, given the early advancements?

Recruiting a Team

You would think as CTO of a startup that I don’t need to recruit a team for an innovative new project. I certainly thought that. But I was wrong.

You see, the status quo is a powerful drug. Upon the first reports of my experiments, my team met me with suspicion and doubt. This was not due to any fault of their own. Smart engineers should be skeptical of prototypes and proposed re-architectures. Only when prototypes survive the harshest and most skeptical scrutiny can they blossom into production systems.

Building Our Own Bike

Steve Jobs once famously said that computers are like a “bicycle for the mind”, because they are a technology that lets us “move faster” than any other species’ naturally-endowed speed.

Well, the creaky bike that seemed to be slowing us down in our usage of Apache Storm was an open source module called petrel. Now long unmaintained, petrel was at the time the only way to run Python code on Apache Storm, which was how we reliably ran massively parallel real-time streaming jobs, overcoming Python’s global interpreter lock (GIL) and handling multi-node scale-out.

So, we built our own bike: streamparse. I discussed streamparse a bit at PyCon 2015, but the upshot is that it lets us run massively parallel stream processing jobs in pure Python atop Apache Storm. And it let us prototype those distributed stream processing clusters very, very quickly.

But though bikes let you move fast, if you build your own bike, you have to factor in how long it takes to build it — that is, while you’re standing still. And that’s exactly what we did for a few months.

This may have been a bit of scope creep. After all, we didn’t need streamparse to test out our new casterisk system. But it sure made testing a whole lot easier: it let us run local tests of the clusters, and it let us deploy multiple topologies in parallel that tweaked different settings. But it meant making a new investment that was separate from the core problem at hand.

Upgrading The Gears… While We Rode

The other bit of hubris that slowed us down: the alluring draw of “upgrades”.

Elasticsearch had just added its aggregation framework, which was exactly what we needed to do time series analysis against its records. It had also just added a new aggregate, cardinality, that we thought could satisfy some important use cases for us. Cassandra had a somewhat-buggy counter implementation in 2.0, but a complete re-implementation was around the corner in 2.1. We thought upgrading to it would save us, but then, we discovered counters were a bad idea altogether. Likewise, Storm had a stable release that we were already running in 0.8, but 0.9.2 was around the corner and was going to be the new stable. We upgraded to it, but then, discovered bugs in its network layer that stopped things from working. Our DevOps team reasonably pushed for a new “stable” Ubuntu version. We adopted it, thinking it’d be safe and stable. Turned out, we hit kernel/driver incompatibility problems with Xen, which were only triggered due to the scale of our bandwidth requirements.

So, all in all, we did several “upgrades to stable” that were actually bleeding edge upgrades in disguise. All while we were testing the systems in question with our own new Second System. The upgrades felt like “adopting a stable version”, but they were simply too new. If you upgrade the gears while riding the bike, you should expect to fall. This was one of the core lessons learned.

Taking a Couple of Fun Detours

The project seemed like it had already been distracted a bit by streamparse development and new database versions, but now a couple of fun detours also emerged from the noise. These were “good” detours. For example: we built an “engaged time” experiment that showcased our ability to track and measure a brand new metric, which had a very different shape from page views and visitors. We proved we could measure the metric effectively with our new system, and report on it in the context of our other metrics. It turns out, this metric was a driving force for adoption of our product in later months and years.

Our “referrer” experiment showed that we’d be able to satisfy several important queries in the eventually-delivered full system. Namely, we could break out every traffic source category, domain, and URL in full detail, both in real-time and over long historical periods. This made our traffic source analytics more powerful than anything else on the market.

Our visits and sessions experiment showed our ability to do distinct counts just as well as (albeit more slowly than) integer counts. Our “new vs returning” visit classifier had not just one, but two, rewrites atop different data structures, before eventually succumbing to a third rewrite that removed some functionality altogether. The funny thing is, those first two attempts were eventually thrown away as we replaced them with a much simpler solution (called client-side sessionization, where we establish sessions in the JavaScript tracker rather than on the server). But, it was still a “good” detour, because it resulted in us shipping total visitors, new visitors, and returning visitors as real-time and historical metrics in our core dashboard — something that our competitors have still failed to deliver, years later.

These detours all had the feeling of engineers innovating at their best, but also meant multi-month delays in the delivery of an end-to-end working system.

Arriving at the Destination

Despite all this, in early August, we called a huddle to say we were finally going to be done with our biking tour. It was a new bike, it was fast, its gears were upgraded, and it was running as smoothly as it was going to. This led to the October, November, December stretch, which was among our team’s most productive of the entire Second System effort. “Parse.ly Preview” was built, tested, and delivered, as the data proved its value and flexibility.

Our old and new dashboard experience running side-by-side, powered by the different backends!

We ran the new system side-by-side with the old system. This was hard to do, but an absolute hard requirement for reducing risk. That’s another lesson learned: something we definitely did right, and that any future rewrites should consider to be a hard requirement, as well.

The new system was iteratively refined, while also making the backend more stable and performant. We updated our blog post on Mage: The Magical Time Series Backend Behind Parse.ly Analytics to reflect the shipped reality of the working system. I presented our usage of Elasticsearch as a large-scale production time series engine at Elastic{ON}, where I got to know some members of the Elastic team who had worked on the aggregation engine that we managed to abstract over in Mage. We cut our customers over to the new system, just as our old system was hitting its hard scaling limits. It felt great.

Several new features were launched atop it in the following years, including a new version of our API, a native iOS app, a new “homepage overlay” tool, new full-screen dashboards, new filters and new reports. We shipped campaign tracking, channel tracking, non-post tracking, video tracking — all of which would have been impossible in the old system.

We’ve continued to ship feature after feature atop “casterisk” and “mage” since then. We expanded the scope of Parse.ly toward tracking more than just articles, including videos, landing pages, and other content types. We now support advanced custom segmentation of audience by subscription status, loyalty level, geographic region, and so on. We support custom events that are delivered through to a hosted data pipeline, which customers can use for raw data analysis and auditing. In other words, atop this rewritten backend, our product just kept getting better and better.

Meanwhile, we have kept up with 100% annual growth in monthly data capture rate, along with a 1000x growth in historical data volume. All thanks to our team’s engineering ingenuity, thanks to our willingness to pop open the hood and modify the engine, and thanks to the magic of linear hardware scaling.

A view of how Parse.ly Analytics looks today, powered by the “mythical Second System” and informed by thousands of successful site integrations, tens of thousands of users, and hundreds of enterprise customers.

Brooks was right

Brooks was right to say, in his typically gendered way, that “the second system is the most dangerous one a man ever designs”.

Building and shipping that Second System in the context of a startup evolving its “MVP” into “production system” has its own challenges. But though backend rewrites are hard and painful, sometimes the price of progress is to rethink everything.

Having Brooks in mind while you do so ensures that when you redesign your bike, you truly end up with a lighter, faster, better bike — and not an unstable unicycle that can never ride, or an overweight airplane that can never fly.

Doing this wasn’t easy. But, watching this production system grow over the last few years to support over 300 enterprise customers in their daily work, to serve as a base for cutting-edge natural language processing technology, to answer the toughest questions of content attribution — has been the most satisfying professional experiment of my life. There’s still so much more to do.

So, now that we’ve nailed value and scale, what’s next? My bet’s on scaling this value to the entire web. Care to join us?

Indrek Lasn (indreklasn)

These tips will boost your React code’s performance April 03, 2019 11:05 AM

React is popular because React applications scale well and are fun to work with. Once your app grows, you might consider optimizing it. Going from a 2500ms wait time to 1500ms can have a huge impact on your UX and conversion rates.

Note: This article was originally published here — read the original too!

So without further ado, here are some performance tips I use with React.

React.memo

If you have a stateless component and you know it won’t need to re-render when its props are unchanged, wrap the entire stateless component inside a React.memo function.
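
A minimal sketch of the before and after (the Profile component’s markup here is assumed, not taken from the original embed):

import React from "react";

// Before: this stateless component re-renders whenever its parent does
const Profile = ({ name }) => <h1>{name}</h1>;

// After: React.memo skips re-rendering while the props stay the same
const MemoizedProfile = React.memo(({ name }) => <h1>{name}</h1>);
MemoizedProfile.displayName = "Profile"; // easier to spot while debugging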

We wrap the entire stateless component inside a React.memo call. Notice the displayName assignment, which helps when debugging. More info about component.displayName

React.memo is the functional-component equivalent of the class-based React.PureComponent.

React.PureComponent

PureComponent compares props and state in the shouldComponentUpdate lifecycle method. This means it won’t re-render if the state and props are the same. Let’s refactor the previous stateless component into a class-based component.

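A sketch of what the class version might look like (same assumed Profile markup):

import React, { Component } from "react";

class Profile extends Component {
  render() {
    return <h1>{this.props.name}</h1>;
  }
}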

If we know for sure the props and state won’t change, instead of using Component we could use the PureComponent.

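Swapping Component for PureComponent is the only change needed:

import React, { PureComponent } from "react";

class Profile extends PureComponent {
  // shouldComponentUpdate is implemented for us:
  // a shallow comparison of props and state
  render() {
    return <h1>{this.props.name}</h1>;
  }
}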

componentDidCatch(error, info) {} Lifecycle method

Components may have side-effects which can crash the app in production. If you have more than 1000 components, it can be hard to keep track of everything.

There are so many moving parts in a modern web app; it’s hard to wrap one’s head around the whole concept and to handle errors. Luckily, React introduced a new lifecycle method for handling errors.

The componentDidCatch() method works like the JavaScript catch {} block, but for components. Only class components can be error boundaries.

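A minimal error boundary sketch (the fallback markup is illustrative):

import React, { Component } from "react";

class ErrorBoundary extends Component {
  constructor(props) {
    super(props);
    this.state = { hasError: false };
  }

  componentDidCatch(error, info) {
    // Log the crash and swap in fallback UI instead of unmounting the app
    console.error(error, info);
    this.setState({ hasError: true });
  }

  render() {
    if (this.state.hasError) {
      return <h1>Something went wrong.</h1>;
    }
    return this.props.children;
  }
}

// Usage: <ErrorBoundary><SomeWidget /></ErrorBoundary>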

React.lazy: Code-Splitting with Suspense

Dan Abramov explains:

“We’ve built a generic way for components to suspend rendering while they load async data, which we call suspense. You can pause any state update until the data is ready, and you can add async loading to any component deep in the tree without plumbing all the props and state through your app and hoisting the logic. On a fast network, updates appear very fluid and instantaneous without a jarring cascade of spinners that appear and disappear. On a slow network, you can intentionally design which loading states the user should see and how granular or coarse they should be, instead of showing spinners based on how the code is written. The app stays responsive throughout.”
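A minimal sketch of code-splitting with React.lazy and Suspense (the OtherComponent module is illustrative):

import React, { Suspense } from "react";

// import() splits OtherComponent into its own bundle,
// fetched the first time it actually renders
const OtherComponent = React.lazy(() => import("./OtherComponent"));

const MyComponent = () => (
  <Suspense fallback={<div>Loading...</div>}>
    <OtherComponent />
  </Suspense>
);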

Read more about Beyond React 16 here.

React.Fragments to Avoid Additional HTML Element Wrappers

If you’ve used React, you probably know the following error:

“Parse Error: Adjacent JSX elements must be wrapped in an enclosing tag”.

React components can only return a single root element. This is by design. Returning adjacent JSX elements without an enclosing tag will crash your build; the antidote is to wrap everything in a single element.
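
A sketch of the failing and the fixed version:

// Won't compile: two adjacent JSX elements with no enclosing tag
const Broken = () => (
  <h1>Hello</h1>
  <p>World</p>
);

// Compiles: siblings wrapped in a single enclosing <div>
const Wrapped = () => (
  <div>
    <h1>Hello</h1>
    <p>World</p>
  </div>
);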

The only problem with the wrapped version is that we render an extra wrapper element for every component. The more markup we have to render, the slower our app.

Fragments to the rescue!

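The same component with a Fragment instead of the wrapper div:

import React from "react";

const Wrapped = () => (
  <React.Fragment>
    <h1>Hello</h1>
    <p>World</p>
  </React.Fragment>
);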

Voilà! No extra mark-up is necessary.

Bonus: Here’s the shorthand for Fragments.

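Empty tags are the shorthand for React.Fragment:

const Wrapped = () => (
  <>
    <h1>Hello</h1>
    <p>World</p>
  </>
);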

Thanks for reading! Check out my Twitter for more.


These tips will boost your React code’s performance was originally published in freeCodeCamp.org on Medium, where people are continuing the conversation by highlighting and responding to this story.

Marc Brooker (mjb)

Learning to build distributed systems April 03, 2019 12:00 AM

Learning to build distributed systems

A long email reply

A common question I get at work is "how do I learn to build big distributed systems?". I've written replies to that many times. Here's my latest attempt.

Learning how to design and build big distributed systems is hard. I don't mean that the theory is harder than any other field in computer science. I also don't mean that information is hard to come by. There's a wealth of information online, many distributed systems papers are very accessible, and you can't visit a computer science school without tripping over a distributed systems course. What I mean is that learning the practice of building and running big distributed systems requires big systems. Big systems are expensive, and expensive means that the stakes are high. In industry, millions of customers depend on the biggest systems. In research and academia, the risks of failure are different, but no less immediate. Still, despite the challenges, doing and making mistakes is the most effective way to learn.

Learn through the work of others

This is the most obvious answer, but still one worth paying attention to. If you're academically minded, reading lists and lists of best papers can give you a place to start to find interesting and relevant reading material. If you need a gentler introduction, blogs like Adrian Colyer's Morning Paper summarize and explain papers, and can also be a great way to discover important papers. There are a lot of distributed systems books I love, but I haven't found an accessible introduction I particularly like yet.

If you prefer to start with practice, many of the biggest distributed systems shops on the planet publish papers, blogs, and talks describing their work. Even Amazon, which has a reputation for being a bit secretive with our technology, has published papers like the classic Dynamo paper, recent papers on the Aurora database, and many more. Talks can be a valuable resource too. Here's Jason Sorenson describing the design of DynamoDB, Holly Mesrobian and me describing a bit of how Lambda works, and Colm MacCarthaigh talking about some principles for building control planes. There's enough material out there to keep you busy forever. The hard part is knowing when to stop.

Sometimes (as I've written about before) it can be hard to close the gap between theory papers and practice papers. I don't have a good answer to that problem.

Get hands-on

Learning the theory is great, but I find that building systems is the best way to cement knowledge. Implement Paxos, or Raft, or Viewstamped Replication, or whatever you find interesting. Then test it. Fault injection is a great approach for that. Make notes of the mistakes you make (and you will make mistakes, for sure). Docker, EC2 and Fargate make it easier than ever to build test clusters, locally or in the cloud. I like Go as a language for building implementations of things. It's well-suited to writing network services. It compiles fast, and makes executables that are easy to move around.

Go broad

Learning things outside the distributed systems silo is important, too. I learned control theory as an undergrad, and while I've forgotten most of the math I find the way of thinking very useful. Statistics is useful, too. ML. Human factors. Formal methods. Sociology. Whatever. I don't think there's shame in being narrow and deep, but being broader can make it much easier to find creative solutions to problems.

Become an owner

If you're lucky enough to be able to, find yourself a position on a team, at a company, or in a lab that owns something big. I think the Amazon pattern of having the same team build and operate systems is ideal for learning. If you can, carry a pager. Be accountable to your team and your customers that the stuff you build works. Reality cannot be fooled.

Over the years at AWS we've developed some great mechanisms for being accountable. The wheel is one great example, and the COE process (similar to what the rest of the industry calls blameless postmortems) is another. Dan Luu's list of postmortems has a lot of lessons from around the industry. I've always enjoyed these processes, because they expose the weaknesses of systems, and provide a path to fixing them. Sometimes it can feel unforgiving, but the blameless part works well. Some COEs contain as many great distributed systems lessons as the best research papers.

Research has different mechanisms. The goal (over a longer time horizon) is the same: good ideas and systems survive, and bad ideas and systems fall away. People build on the good ones, with more good ideas, and the whole field moves forward. Being an owner is important.

Another tool I like for learning is the what-if COE or premortem. These are COEs for outages that haven't happened yet, but could happen. When building a new system, think about writing your first COE before it happens. What are the weaknesses in your system? How will it break? When replacing an older system with a new one, look at some of the older one's COEs. How would your new system perform in the same circumstances?

It takes time

This all takes time, both in the sense that you need to allocate hours of the day to it, and in the sense that you're not going to learn everything overnight. I've been doing this stuff for 15 years in one way or another, and still feel like I'm scratching the surface. Don't feel bad about others knowing things you don't. It's an opportunity, not a threat.

Pepijn de Vos (pepijndevos)

The only good open source software is for software developers April 03, 2019 12:00 AM

The rest is all inferior clones of commercial software.

When I think of really high-quality open source software, 90% of it is compilers, databases and libraries. Tools for software developers, by software developers. There are exceptions (Firefox comes to mind), but as they say, the exception proves the rule.

Outside commercial projects that happen to be open source (Android comes to mind), open source software is largely driven by a “scratch your own itch” mentality. However, this poses a problem when software developers don’t have the itch, and people with the itch are not software developers.

I have recently begun to see the world from the perspective of academia and electrical engineering, and it came as a bit of a shock to me how many of the tools that are in common use are bloated commercial Windows GUI software, compared to nimble open source command-line tools I was used to.

Many of them cost hundreds if not thousands of Euros, take up gigabytes of RAM and storage, are a pain to use, and are still the best or only option available. I can only imagine the horrors of working in a non-tech industry.

I don’t think there is an easy solution. If I’m solving a problem for someone else, I probably want to get paid. So it seems the only plausible model is commercial software that happens to be open source.

The other option is either teaching people with an itch to code, or make people who code have the itch. Broaden your interests, y’all!!! </rant>

April 02, 2019

Caius Durling (caius)

Cheaper Oil for Mini One R50 April 02, 2019 12:00 PM

My Mini One 2003 R50 1.6 litre petrol engine takes specific BMW 5w30 Longlife-04 oil. (I believe the R52 and R53 models take the same oil too.) The oil is 5w30 fully synthetic made to BMW's exacting standards.

The cheapest I've found to buy currently is a GM (Vauxhall/Opel) manufactured one, made to BMW's specifications. Searching for something like "dexos 2 5w30 gm" on ebay finds them at about £20 for 5 litres with free delivery in UK. (Comparatively, an equivalent from Castrol is about £50 at time of writing.)

Sounds like a small saving, but if your Mini is anything like mine it needs topping up once a month or so, and I do a full oil/filter change every 5k miles as an attempt at longevity. Soon adds up.

Andreas Zwinkau (qznc)

C++ State Machines April 02, 2019 12:00 AM

Avoid input parameters so that state machine transitions are decoupled from transition effects.

Read full article!

April 01, 2019

Derek Jones (derek-jones)

MI5 agent caught selling Huawei exploits on Russian hacker forums April 01, 2019 01:11 AM

An MI5 agent has been caught selling exploits in Huawei products, on an underground Russian hacker forum (a paper analyzing the operation of these forums; perhaps the researchers were hired as advisors). How did this news become public? A reporter heard Mr Wang Kit, a senior Huawei manager, complaining about not receiving a percentage of the exploit sale, to add to his quarterly sales report. A fair point, given that Huawei are funding a UK centre to search for vulnerabilities.

The ostensible purpose of the Huawei cyber security evaluation centre (funded by Huawei, but run by GCHQ, the UK’s signals intelligence agency) is to allay UK fears that Huawei have added back-doors to their products that enable the Chinese government to listen in on customer communications.

If this cyber centre finds a vulnerability in a Huawei product, they may or may not tell Huawei about it. Obviously, if it’s an exploitable vulnerability, and they think that Huawei don’t know about it, they could pass the exploit along to the relevant UK government department.

If the centre decides to tell Huawei about the vulnerability, there are two good reasons to first try selling it, to shady characters of interest to the security services:

  • having an exploit to sell gives the person selling it credibility (of the shady technical kind), in ecosystems the security services are trying to penetrate,
  • it increases Huawei’s perception of the quality of the centre’s work; by increasing the number of exploits found by the centre, before they appear in the wild (the centre has to be careful not to sell too many exploits; assuming they manage to find more than a few). Being seen in the wild adds credibility to claims the centre makes about the importance of an exploit it discovered.

How might the centre go about calculating whether to hang onto an exploit, for UK government use, or to reveal it?

The centre’s staff could be organized as two independent groups; if the same exploit is found by both groups, it is more likely to be found by other hackers than an exploit found by just one group.

Perhaps GCHQ knows of other groups looking for Huawei exploits (e.g., the NSA in the US). Sharing information about exploits found, provides the information needed to more accurately estimate the likelihood of others discovering known exploits.

How might Huawei estimate the number of exploits MI5 are ‘selling’, before officially reporting them? Huawei probably have enough information to make a good estimate of the total number of exploits likely to exist in their products, but they also need to know the likelihood of discovering an exploit, per man-hour of effort. If Huawei have an internal team searching for exploits, they might have the data needed to estimate exploit discovery rate.

Another approach would be for Huawei to add a few exploits to the code, and then wait to see if they are used by GCHQ. In fact, if GCHQ accuse Huawei of adding a back-door to enable the Chinese government to spy on people, Huawei could claim that the code was added to check whether GCHQ was faithfully reporting all the exploits it found, and not keeping some for its own use.

March 31, 2019

Ponylang (SeanTAllen)

Last Week in Pony - March 31, 2019 March 31, 2019 01:40 PM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

Luke Picciau (user545)

Evaluating 8 Months of Building a Rails Vue SPA March 31, 2019 04:38 AM

Picture unrelated (by Pexels).

Over the last ~8 months I have been building a website called PikaTrack, using Rails for an API backend and VueJS for the frontend. I wanted to write a blog post detailing how it went, what the positives and negatives were, and whether I would do it again. The website is an open source service for fitness tracking, using GPS logs captured while running or cycling. I’ll start with my background.

March 30, 2019

Simon Zelazny (pzel)

Uncrashable languages aren't March 30, 2019 11:00 PM

A trivial observation, with some examples

Making buggy situations unrepresentable

Programs have bugs. Both creators and end-users of software dislike bugs. Businesses paying for software development dislike bugs. It's no wonder that, as the role of software in the world expands, we've become very interested in minimizing occurrences of bugs.

One way of reducing bugs is via process: making sure that critical code is tested to the greatest practical extent. Another way is via construction: making sure that buggy code is not representable. This could be achieved by making such code inexpressible in the syntax of a language, or by having it fail the compiler's type check.

There are two new programming languages that take a principled stance on the side of non-representability, by preventing code from crashing wantonly: Elm and Pony.

Elm does this by eliminating exceptions from language semantics and forcing the programmer to handle all branches of sum types (i.e your code has to cover all representable states that it might encounter).

Pony does this by allowing anonymous exceptions (the error operator), but forcing the programmer to deal with them at some point. All functions – apart from those which are explicitly marked as capable of throwing errors – MUST be total and always return a value.

A small aside about division by zero

Elm used to crash when you tried to divide by zero. Now (I tried version 0.19), it returns 0 for integer division and Infinity for floating-point division. The division functions are therefore total.

> 5.0 / 0.0
Infinity : Float
> 5 // 0
0 : Int
> remainderBy 5 0
0 : Int
> modBy 5 0
0 : Int

Pony also returns zero when zero is the divisor.

actor Main                                        //  This code prints:
  new create(env: Env) =>                         //  0
    env.out.print((U32(1) / U32(0)).string())

However, Pony also provides partial arithmetic operators (/? for division, +? for addition, below), for when you explicitly need integer over/underflows and division-by-zero to be illegal:

actor Main                                         //  This code prints:
  new create(env: Env) =>                          //  div by zero
    try U32(1) /? U32(0)                           //  overflow
    else env.out.print("div by zero")              //  0
    end                                            //  0
    try U8(255) +? U8(1)
    else env.out.print("overflow")
    end
    env.out.print((U32(1) / U32(0)).string())
    env.out.print((U8(255) + U8(1)).string())

While returning '0' for division-by-zero is a controversial topic (yet silent integer overflows somehow don't generate the same debate), I think it's reasonable to view this compromise as the necessary cost of eliminating crashes in our code. More interesting is this: we have just made a tradeoff between crashing and producing wrong results. Having a total division function eliminates crashes at the cost of allowing wrong results to propagate. Let's dig into this a bit more.

Bugs and the bottom type

Taxonomy is supposed to be the lowest form of science, but let's indulge and distinguish two main types of program misbehavior:

1) A program (or function) produces output which does not match the programmer's intent, design, or specification;

2) A program (or function) fails to produce output (e.g. freezes or crashes)

I hope you'll agree that eliminating 'bugs' caused by the first type of error is not an easy feat, and probably not within the scope of a language runtime or compiler. Carefully designing your data structures to make illegal states unrepresentable may go a long way towards eliminating failures of this kind, as will a good testing regimen. Let's not delve deeper into this category and focus on the second one: functions that crash and never return.

The Wikipedia article on the Bottom Type makes for an interesting read. It's nice to conceive of ⊥ as a hole in the program, where execution stops and meaning collapses. Since the bottom type is a subtype of every type, theoretically any function can return the 'bottom value' — although returning the 'bottom value' actually means never returning at all.

My claim is that while some languages, like Haskell or Rust, might explicitly embrace the existence of ⊥, languages that prevent programmers from 'invoking' the bottom type will always contain inconsistencies (I'm leaving dependently-typed languages out of this). Below are two examples.

Broken promises and infinite loops

Elm's promise is that an application written in Elm will never crash, unless there is a bug in the Elm runtime. There are articles out there that enumerate the various broken edge-cases (regexp, arrays, json decoding), but these cases can arguably be construed as bugs in the runtime or mistakes in library API design. That is, these bugs do not mean that Elm's promise is for naught.

However, if you think about it, an infinite loop is a manifestation of the bottom type just as much as an outright crash, and such a loop is possible in all Turing-complete languages.

Here's a legal Elm app that freezes:

import Browser
import Html exposing (Html, button, div, text)
import Html.Events exposing (onClick)

main =
  Browser.sandbox { init = 0, update = update, view = view }

type Msg = Increment | Decrement

add : Int -> Int
add n =
  add (n+1)

update msg model =
  case msg of
    Increment ->
      add model

    Decrement ->
      model - 1

view model =
  div []
    [ button [ onClick Decrement ] [ text "-" ]
    , div [] [ text (String.fromInt model) ]
    , button [ onClick Increment ] [ text "+" ]
    ]

What will happen when you click the + button in the browser? This is what:

Screenshot of browser message saying: 'A web page is slowing down your browser.'

The loop is hidden in the add function, which never actually returns an Int. Its true return type, in this program, is precisely ⊥. Without explicitly crashing-and-stopping, we've achieved the logical (and type-systematic) equivalent: a freeze.

Galloping forever and ever

The Pony language is susceptible to a similar trick, but we'll have to be a bit more crafty. First of all, Pony does indeed allow the programmer to 'invoke' the Bottom Type, by simply using the keyword error anywhere in a function body. Using this keyword (or calling a partial function) means that we, the programmer, now have a choice to make:

1) Use try/else to handle the possibility of error, and return a sensible default value

2) Mark this function as partial (?), and force callers to deal with the possibility of the Bottom Type rearing its head.

However, we can craft a function that spins endlessly, never exiting, and thus 'returning' the Bottom Type, without the compiler complaining, and without having to annotate it as partial.

Interestingly enough, naïve approaches are optimized away by the compiler, producing surprising result values instead of spinning forever:

actor Main
  new create(env: Env) =>
    let x = spin(false)
    env.out.print(x.string())

  fun spin(n: Bool): Bool =>
    spin(not n)

Before you run this program, think about what, if anything, it should output. Then, run it and see. Seems like magic to me, but I'm guessing this is LLVM detecting the oscillation and producing a 'sensible' value.

We can outsmart the optimizer by farming out the loop to another object:

actor Main
  new create(env: Env) =>
    let t: TarpitTrap = TarpitTrap
    let result = t.spin(true)
    env.out.print(result.string())

class TarpitTrap
  fun spin(n: Bool): Bool =>
    if n then spin(not n)
    else spin(n)
    end

Now, this program properly freezes forever, as intended. Of course this is just a contrived demonstration, but one can imagine an analogous situation happening at run-time, for example when parsing tricky (or malicious) input data.

The snake in the garden

While I enjoy working both in Elm and Pony, I'm not a particular fan of these languages' hard-line stance on making sure programs never crash. As long as infinite loops are expressible in the language, the Bottom Type cannot be excised.

Even without concerns somewhat external to our programming language runtime, such as memory constraints, FFIs, syscalls, or the proverbial admin pulling the plug on our machine (did this really used to happen?), the humble infinite loop ensures that non-termination can never be purged from our (non-dependently-typed) program.

Instead of focusing on preventing crashes in the small, I think we, as programmers, should embrace failure and look at how to deal with error from a higher-level perspective, looking at processes, machines, and entire systems. Erlang and OTP got this right so many years ago. Ensuring the proper operation of a system despite failure is a much more practical goal than vainly trying to expel the infinitely-looping snake from our software garden.

Derek Jones (derek-jones)

The 2019 Huawei cyber security evaluation report March 30, 2019 02:23 PM

The UK’s Huawei cyber security evaluation centre oversight board has released its 2019 annual report.

The header and footer of every page contains the text “OFFICIAL”, which I assume is its UK government security classification. It lends an air of mystique to what is otherwise a meandering management report.

Needless to say, the report contains the usual puffery, e.g., “HCSEC continues to have world-class security researchers…”. World class at what? I hear they have some really good mathematicians, but have serious problems attracting good software engineers (such people can be paid a lot more, and get to do more interesting work, in industry; the industry demand for mathematicians, outside of finance, is weak).

The most interesting sentence appears on page 11: “The general requirement is that all staff must have Developed Vetting (DV) security clearance, …”. Developed Vetting is the most detailed and comprehensive form of security clearance in UK government (to quote Wikipedia).

Why do the centre’s staff have to have this level of security clearance?

The Huawei source code is not that secret (it can probably be found online, lurking in the dark corners of various security bulletin boards).

Is the real purpose of this cyber security evaluation centre, to find vulnerabilities in the source code of Huawei products, that GCHQ can then use to spy on people?

Or perhaps, this centre is used for training purposes, with staff moving on to work within GCHQ, after they have learned their trade on Huawei products?

The high level of security clearance applied to the centre’s work is the perfect smoke-screen.

The report claims to have found “Several hundred vulnerabilities and issues…”; a meaningless statement, e.g., this could mean one minor vulnerability and several hundred spelling mistakes. There is no comparison of the number of vulnerabilities found per effort invested, no comparison with previous years, no classification of the seriousness of the problems found, no mention of Huawei’s response (i.e., did Huawei agree that there was a problem).

How many of the vulnerabilities found by the centre were also reported by other people, e.g., in the National Vulnerability Database? This information would give some indication of how good a job the centre was doing. Did this evaluation centre find the Huawei vulnerability recently disclosed by Microsoft? If not, why not? And if they did, why isn’t it in the 2019 report?

What about comparing the number of vulnerabilities found in Huawei products against the number found in products from US vendors, e.g., Cisco? Obviously back-doors placed in US products, at the behest of the NSA, need not be counted.

There is some technical material, starting on page 15. The configuration and component lifecycle management issues raised sound like good points, from a cyber security perspective. From a commercial perspective, Huawei want to quickly respond to customer demand and a dynamic market; corners are likely to be cut on good practices every now and again. I don’t understand why the use of an unnamed real-time operating system was flagged: did some techie gripe slip through management review? What is a C preprocessor macro definition doing on page 29? This smacks of an attempt to gain some hacker street-cred.

Reading between the lines, I get the feeling that Huawei has been ignoring the centre’s recommendations for changes to their software development practices. If I were on the receiving end, I would probably ignore them too. People employed to do security evaluation are hired for their ability to find problems, not for their ability to make things that work; also, I imagine many are recent graduates, with little or no practical experience, who are just repeating what they remember from their course work.

Huawei should leverage its funding of a GCHQ spy training centre, to get some positive publicity from the UK government. Huawei wants people to feel confident that they are not being spied on, when they use Huawei products. If the government refuses to play ball, Huawei should shift its funding to a non-government, open evaluation center. Employees would not need any security clearance and would be free to give their opinions about the presence of vulnerabilities and ‘spying code’ in the source code of Huawei products.

March 29, 2019

Wesley Moore (wezm)

Cross Compiling Rust for FreeBSD With Docker March 29, 2019 11:02 PM

For a little side project I'm working on I want to be able to produce pre-compiled binaries for a variety of platforms, including FreeBSD. With a bit of trial and error I have been able to successfully build working FreeBSD binaries from a Docker container, without using (slow) emulation/virtual machines. This post describes how it works and how to add it to your own Rust project.

Update 27 March 2019: Stephan Jaekel pointed out on Twitter that cross supports a variety of OSes including FreeBSD, NetBSD, Solaris, and more. I have used cross for embedded projects but didn't think to use it for non-embedded ones. Nonetheless the process described in this post was still educational for me but I would recommend using cross instead.

I started with Sandvine's freebsd-cross-build repo, which builds a Docker image with a cross-compiler that targets FreeBSD. I made a few updates and improvements to it:

  • Update from FreeBSD 9 to 12.
  • Base on newer debian9-slim image instead of ubuntu 16.04.
  • Use a multi-stage Docker build.
  • Do all fetching of tarballs inside the container to remove the need to run a script on the host.
  • Use the FreeBSD base tarball as the source of headers and libraries instead of ISO.
  • Revise the fix-links script to automatically discover symlinks that need fixing.

Once I was able to successfully build the cross-compilation toolchain I built a second Docker image based on the first that installs Rust, and the x86_64-unknown-freebsd target. It also sets up a non-privileged user account for building a Rust project bind mounted into it.

Check out the repo at: https://github.com/wezm/freebsd-cross-build

Building the Images

I haven't pushed the image to a container registry as I want to do further testing and need to work out how to version them sensibly. For now you'll need to build them yourself as follows:

  1. git clone git@github.com:wezm/freebsd-cross-build.git && cd freebsd-cross-build
  2. docker build -t freebsd-cross .
  3. docker build -f Dockerfile.rust -t freebsd-cross-rust .

Using the Images to Build a FreeBSD Binary

To use the freebsd-cross-rust image in a Rust project here's what you need to do (or at least this is how I'm doing it):

In your project add a .cargo/config file for the x86_64-unknown-freebsd target. This tells cargo what tool to use as the linker.

[target.x86_64-unknown-freebsd]
linker = "x86_64-pc-freebsd12-gcc"

I use Docker volumes to cache the output of previous builds and the cargo registry. This prevents cargo from re-downloading the cargo index and dependent crates on each build and saves build artifacts across builds, speeding up compile times.
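The named volumes only need to be created once, up front. For example (these are the names the script below assumes):

sudo docker volume create lobsters-freebsd-target
sudo docker volume create lobsters-freebsd-cargo-registry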

A challenge this introduces is how to get the resulting binary out of the volume. For this I use a separate docker invocation that copies the binary out of the volume into a bind mounted host directory.

Originally I tried mounting the whole target directory into the container but this resulted in spurious compilation failures during linking and lots of files owned by root (I'm aware of user namespaces but haven't set it up yet).

I wrote a shell script to automate this process:

#!/bin/sh

set -e

mkdir -p target/x86_64-unknown-freebsd

# NOTE: Assumes the following volumes have been created:
# - lobsters-freebsd-target
# - lobsters-freebsd-cargo-registry

# Build
sudo docker run --rm -it \
  -v "$(pwd)":/home/rust/code:ro \
  -v lobsters-freebsd-target:/home/rust/code/target \
  -v lobsters-freebsd-cargo-registry:/home/rust/.cargo/registry \
  freebsd-cross-rust build --release --target x86_64-unknown-freebsd

# Copy binary out of volume into target/x86_64-unknown-freebsd
sudo docker run --rm -it \
  -v "$(pwd)"/target/x86_64-unknown-freebsd:/home/rust/output \
  -v lobsters-freebsd-target:/home/rust/code/target \
  --entrypoint cp \
  freebsd-cross-rust \
  /home/rust/code/target/x86_64-unknown-freebsd/release/lobsters /home/rust/output

This is what the script does:

  1. Ensures that the destination directory for the binary exists. Without this, docker will create it but it'll be owned by root and the container won't be able to write to it.
  2. Runs cargo build --release --target x86_64-unknown-freebsd (the leading cargo is implied by the ENTRYPOINT of the image).
    1. The first volume (-v) argument bind mounts the source code into the container, read-only.
    2. The second -v maps the named volume, lobsters-freebsd-target into the container. This caches the build artifacts.
    3. The last -v maps the named volume, lobsters-freebsd-cargo-registry into the container. This caches the cargo index and downloaded crates.
  3. Copies the built binary out of the lobsters-freebsd-target volume into the local filesystem at target/x86_64-unknown-freebsd.
    1. The first -v bind mounts the local target/x86_64-unknown-freebsd directory into the container at /home/rust/output.
    2. The second -v mounts the lobsters-freebsd-target named volume into the container at /home/rust/code/target.
    3. The docker run invocation overrides the default ENTRYPOINT with cp and supplies the source and destination to it, copying from the volume into the bind mounted host directory.

After running the script there is a FreeBSD binary in target/x86_64-unknown-freebsd. Copying it to a FreeBSD machine for testing shows that it does in fact work as expected!

One last note, this all works because I don't depend on any C libraries in my project. If I did, it would be necessary to cross-compile them so that the linker could link them when needed.

Once again, the code is at: https://github.com/wezm/freebsd-cross-build.



Previous Post: My First 3 Weeks of Professional Rust
Next Post: What I Learnt Building a Lobsters TUI in Rust

Gokberk Yaltirakli (gkbrk)

Phone Location Logger March 29, 2019 10:22 AM

If you are using Google Play Services on your Android phone, Google receives and keeps track of your location history. This includes your GPS coordinates and timestamps. Because of the privacy implications, I have revoked pretty much all permissions from Google Play Services and disabled my Location History on my Google settings (as if they would respect that).

But while it might be creepy if a random company has this data, it would be useful if I still have it. After all, who doesn’t want to know the location of a park that they stumbled upon randomly on a vacation 3 years ago?

I remember seeing some location trackers while browsing through F-Droid. I found various applications there, and picked one that was recently updated. The app was a Nextcloud companion app, with support for custom servers. Since I didn’t want a heavy Nextcloud install just to keep track of my location, I decided to go with the custom server approach.

In the end, I decided that the easiest path is to make a small CGI script in Python that appends JSON encoded lines to a text file. Because of this accessible data format, I can process this file in pretty much every programming language, import it to whatever database I want and query it in whatever way I see fit.

The app I went with is called PhoneTrack. You can find the APK and source code links on F-Droid. It replaces placeholders in the URL with the actual values; a logging URL with every parameter looks like this: https://example.com/cgi-bin/locationrecorder.py?acc=%ACC&alt=%ALT&batt=%BATT&dir=%DIR&lat=%LAT&lon=%LON&sat=%SAT&spd=%SPD&timestamp=%TIMESTAMP

Here’s the script in all its glory.

import cgi
import json

PATH = '/home/databases/location.txt'

print('Content-Type: text/plain\n')
form = cgi.FieldStorage()

# Check authentication token
if form.getvalue('token') != 'SECRET_VALUE':
    raise Exception('Nope')

obj = {
    'accuracy':   form.getvalue('acc'),
    'altitude':   form.getvalue('alt'),
    'battery':    form.getvalue('batt'),
    'bearing':    form.getvalue('dir'),
    'latitude':   form.getvalue('lat'),
    'longitude':  form.getvalue('lon'),
    'satellites': form.getvalue('sat'),
    'speed':      form.getvalue('spd'),
    'timestamp':  form.getvalue('timestamp'),
}

with open(PATH, 'a+') as log:
    line = json.dumps(obj)
    log.write(f'{line}\n')
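Since each line of the log is a self-contained JSON object, reading it back is trivial in any language. A minimal sketch in Python (assuming the same file path as the script above):

import json

with open('/home/databases/location.txt') as log:
    points = [json.loads(line) for line in log]

# e.g. print the most recently recorded position
latest = points[-1]
print(latest['latitude'], latest['longitude'], latest['timestamp'])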

March 28, 2019

Derek Jones (derek-jones)

Using Black-Scholes in software engineering gives a rough lower bound March 28, 2019 04:22 PM

In the financial world, a call option is a contract that gives the buyer the option (but not the obligation) to purchase an asset, at an agreed price, on an agreed date (from the other party to the contract).

If I think that the price of jelly beans is going to increase, and you disagree, then I might pay you a small amount of money for the right to buy a jar of jelly beans from you, in a month’s time, at today’s price. A month from now, if the price of jelly beans has gone down, I buy a jar from whoever is selling at the lower price, but if the price has gone up, you have to sell me a jar at the previously agreed price.

I’m in the money if the price of jelly beans goes up; you are in the money if the price goes down (I paid you a premium for the right to purchase at what is known as the strike price).

Do you see any parallels with software development here?

Let’s say I have to rush to complete the implementation of some functionality by the end of the week. I might decide to forego complete testing, or following company coding practices, just to get the code out. At a later date I can decide to pay the time needed to correct my short-cuts; it is possible that the functionality is not used, so the rework is not needed.

This sounds like a call option (you might have thought of technical debt, which is, technically, the incorrect common usage term). I am both the buyer and seller of the contract. As the seller of the call option I receive the premium of saved time, and as the buyer I pay a premium via the potential for things going wrong. Sometime later the seller might pay the price of sorting out the code.

A put option involves the right to sell (rather than buy).

In the financial world, speculators are interested in the optimal pricing of options, i.e., what should the premium, strike price and expiry date be for an asset having a given price volatility?

The Black-Scholes equation answers this question (and won its creators a Nobel prize).
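For reference, the Black-Scholes price of a European call option on a non-dividend-paying asset is:

C = S_0 N(d_1) - K e^{-rT} N(d_2), \qquad d_1 = \frac{\ln(S_0 / K) + (r + \sigma^2 / 2) T}{\sigma \sqrt{T}}, \qquad d_2 = d_1 - \sigma \sqrt{T}

where S_0 is the current asset price, K the strike price, r the risk-free interest rate, \sigma the volatility of the asset price, T the time to expiry, and N the standard normal cumulative distribution function.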

Over the years, various people have noticed similarities between financial options thinking, and various software development activities. In fact people have noticed these similarities in a wide range of engineering activities, not just computing.

The term real options is used for options thinking outside of the financial world. The difference in terminology is important, because financial and engineering assets can have very different characteristics, e.g., financial assets are traded, while many engineering assets are sunk costs (such as drilling a hole in the ground).

I have been regularly encountering uses of the Black-Scholes equation, in my trawl through papers on the economics of software engineering (in some cases a whole PhD thesis). In most cases, the authors have clearly failed to appreciate that certain preconditions need to be met, before the Black-Scholes equation can be applied.

I now treat use of the Black-Scholes equation, in a software engineering paper, as reasonable cause for instant deletion of the pdf.

If you meet somebody talking about the use of Black-Scholes in software engineering, what questions should you ask them to find out whether they are just spouting techno-babble?

  • American options are a better fit for software engineering problems; why are you using Black-Scholes? An American option allows the option to be exercised at any time up to the expiry date, while a European option can only be exercised on the expiry date. The Black-Scholes equation is a solution for European options (no closed-form solution for American options is known). A sensible answer is that use of Black-Scholes provides a rough estimate of the lower bound of the asset value. If they don’t know the difference between American/European options, well…
  • Partially written source code is not a tradable asset; why are you using Black-Scholes? An assumption made in the derivation of the Black-Scholes equation is that the underlying assets are freely tradable, i.e., people can buy/sell them at will. Creating source code is a sunk cost, who would want to buy code that is not working? A sensible answer may be that use of Black-Scholes provides a rough estimate of the lower bound of the asset value (you can debate this point). If they don’t know about the tradable asset requirement, well…
  • How did you estimate the risk adjusted discount rate? Options involve balancing risks and getting values out of the Black-Scholes equation requires plugging in values for risk. Possible answers might include the terms replicating portfolio and marketed asset disclaimer (MAD). If they don’t know about risk adjusted discount rates, well…

If you want to learn more about real options: “Investment under uncertainty” by Dixit and Pindyck, is a great read if you understand differential equations, while “Real options” by Copeland and Antikarov contains plenty of hand holding (and you don’t need to know about differential equations).

Andreas Zwinkau (qznc)

TipiWiki (2003) March 28, 2019 12:00 AM

More than 15 years ago I published a little wiki software.

Read full article!

March 25, 2019

Wesley Moore (wezm)

My First 3 Weeks of Professional Rust March 25, 2019 06:00 AM

For the last 15 years as a professional programmer I have worked mostly with dynamic languages. First Perl, then Python, and for the last 10 years or so, Ruby. I've also been writing Rust on the side for personal projects for nearly four years. Recently I started a new job and for the first time I'm writing Rust professionally. Rust represents quite a shift in language features, development process and tooling. I thought it would be interesting to reflect on that experience so far.

Note that some of my observations are not unique to Rust and would be equally present in other languages like Haskell, Kotlin, or OCaml.

Knowledge

In my first week I ran up pretty hard against the limits of my knowledge of lifetimes in Rust. I was reasonably confident with them conceptually, and with their simple application, but our code has some interesting type-driven zero-copy parsing code that tested my knowledge. When encountering some compiler errors I was fortunate to have experienced colleagues to ask for help. It's been nice to extend my knowledge and learn as I go.

Interestingly I had mostly been building things without advanced lifetime knowledge up until this point. I think that sometimes the community puts too much emphasis on some of Rust's more advanced features when citing its learning curve. If you read the book you can get a very long way. Although that will depend on the types of applications or data structures you're trying to build.

Confidence

In my second week I implemented a change to make a certain pattern more ergonomic. It was refreshing to be able to build the initial functionality and then make a project-wide change, confident that given it compiled after the change I probably hadn't broken anything. I don't think I would have had the confidence to make such a change as early on in the Ruby projects I've worked on previously.

Testing

I cringe whenever I see proponents of statically typed languages say things like, "if it compiles, it works", with misguided certainty. The compiler and language do eliminate whole classes of bugs that you'd need to test for in a dynamic language but that doesn't mean tests aren't needed.

Rust has great built-in support for testing and I've enjoyed being able to write tests focussed solely on the behaviour and logic of my code. This compares favourably with Ruby, where I have to write tests that ensure there are no syntax errors, that nil is handled safely, and that arguments are correct, in addition to testing the behaviour and logic.
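To illustrate how little ceremony that involves, here's a generic example (not from the codebase in question) of Rust's built-in testing; it runs with cargo test:

// Tests live alongside the code and are compiled only for `cargo test`.
pub fn add(a: i32, b: i32) -> i32 {
    a + b
}

#[cfg(test)]
mod tests {
    use super::add;

    #[test]
    fn adds_two_numbers() {
        // Only behaviour and logic need checking; the compiler has
        // already ruled out whole classes of errors.
        assert_eq!(add(2, 2), 4);
    }
}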

Editor and Tooling

Neovim is my primary text editor. I've been using vim or a derivative since the early 2000s. I have the RLS set up and working in my Neovim environment but less than a week in I started using IntelliJ IDEA with the Rust and Vim emulation plugins for work. A week after that I started trialling CLion as I wanted a debugger.

JetBrains CLion IDE

The impetus for the switch was that I was working with a colleague on a change that had a fairly wide impact on the code. We were practicing compiler driven development and were doing a repeated cycle of fix an error, compile, jump to next top most error. Vim's quickfix list + :make is designed to make this cycle easier too but I didn't have that set up at the time. I was doing a lot of manual jumping between files, whereas in IntelliJ I could just click the paths in the error messages.

It's perhaps the combination of working on a foreign codebase and also trying to maximise efficiency when working with others that pushed me to seek out better tooling for work use. There is ongoing work to improve the RLS, so I may still come back to Neovim; I continue to use it for personal projects.

Other CLion features that I'm enjoying:

  • Reliable autocomplete
  • Reliable jump to definition, jump to impl block, find usages
  • Refactoring tooling (rename across project, extract method, extract variable)
  • Built in debugger

VS Code offers some of these features too. However, since they are built on the RLS they suffer many of the same issues I had in Neovim. Additionally I think the Vim emulation plugin for IntelliJ is more complete, or at least more predictable for a long time vim user. This is despite the latter actually using Neovim under the covers.

Debugging

In Ruby with a gem like pry-byebug it's trivial to put a binding.pry in some code to be dropped into a debugger + REPL at that point in the code. This is harder with Rust. println! or dbg! based debugging can get you a surprisingly long way and had served me well for most of my personal projects.
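For anyone who hasn't used it, dbg! (stabilized in Rust 1.32) wraps an expression, prints the file, line, and value to stderr, and passes the value through unchanged. A trivial example:

fn main() {
    let x = 2;
    // Prints something like: [src/main.rs:4] x * 2 = 4
    let y = dbg!(x * 2) + 1;
    assert_eq!(y, 5);
}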

When building some parsing code I quickly felt the need to use a real debugger in order to step through and examine execution of a failing test. It's possible to do this on the command line with the rust-gdb or rust-lldb wrappers that come with Rust. However, I find them fiddly to use and verbose to operate.

CLion makes it simple to add and remove break points by clicking in the gutter, run a single test under the debugger, visually step through the code, see all local variables, step up and down the call stack, etc. These are possible with the command line tools (which CLion is using behind the scenes), but it's nice to have them built in and available with a single click of the mouse.

Conclusion

So far I am enjoying my new role. There have been some great learning opportunities and surprising tooling changes. I'm also keen to keep an eye on the frequency of bugs encountered in production, their type (such as panic or incorrect logic), their source, and ease of resolution. I look forward to writing more about our work in the future.

Discuss on Lobsters



Previous Post: A Coding Retreat and Getting Embedded Rust Running on a SensorTag
Next Post: Cross Compiling Rust for FreeBSD With Docker

Pete Corey (petecorey)

Bending Jest to Our Will: Restoring Node's Require Behavior March 25, 2019 12:00 AM

Jest does some interesting things to Node’s default require behavior. In an attempt to encourage test independence and concurrent test execution, Jest resets the module cache after every test.

You may remember one of my previous articles about “bending Jest to our will” and caching instances of modules across multiple tests. While that solution works for single modules on a case-by-case basis, sometimes that’s not quite enough. Sometimes we just want to completely restore Node’s original require behavior across the board.

After sleuthing through support tickets, blog posts, and “official statements” from Jest core developers, this seems to be entirely unsupported and largely impossible.

However, with some highly motivated hacking I’ve managed to find a way.

Our Goal

If you’re unfamiliar with how require works under the hood, here’s a quick rundown. The first time a module is required, its contents are executed and the resulting exported data is cached. Any subsequent require calls of the same module return a reference to that cached data.

That’s all there is to it.
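You can observe this behavior in plain Node, independent of Jest (a toy example; the file names are made up):

// counter.js: this line executes only once per process
console.log('executing counter.js');
module.exports = { count: 0 };

// main.js
const a = require('./counter'); // logs 'executing counter.js'
const b = require('./counter'); // served from the cache; logs nothing
console.log(a === b);           // true; both names reference the same cached object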

Jest overrides this behavior and maintains its own “module registry” which is blown away after every test. If one test requires a module, the module’s contents are executed and cached. If that same test requires the same module, the cached result will be returned, as we’d expect. However, other tests don’t have access to our first test’s module registry. If another test tries to require that same module, it’ll have to execute the module’s contents and store the result in its own private module registry.

Our goal is to find a way to reverse Jest’s monkey-patching of Node’s default require behavior and restore its original behavior.

This change, or reversal of a change, will have some unavoidable consequences. Our Jest test suite won’t be able to support concurrent test processes. This means that all our tests will have to run “in band” (--runInBand). More interestingly, Jest’s “watch mode” will no longer work, as it uses multiple processes to run tests and maintain a responsive command line interface.

Accepting these limitations and acknowledging that this is likely a very bad idea, let’s press on.

Dependency Hacking

After several long code reading and debugging sessions, I realized that the heart of the problem resides in Jest’s jest-runtime module. Specifically, the requireModuleOrMock function, which is responsible for Jest’s out-of-the-box require behavior. Jest internally calls this method whenever a module is required by a test or by any code under test.

Short circuiting this method with a quick and dirty require causes the require statements throughout our test suites and our code under test to behave exactly as we’d expect:


requireModuleOrMock(from: Path, moduleName: string) {
+ return require(this._resolveModule(from, moduleName));
  try {
    if (this._shouldMock(from, moduleName)) {
      return this.requireMock(from, moduleName);
    } else {
      return this.requireModule(from, moduleName);
    }
  } catch (e) {
    if (e.code === 'MODULE_NOT_FOUND') {
      const appendedMessage = findSiblingsWithFileExtension(
        this._config.moduleFileExtensions,
        from,
        moduleName,
      );

      if (appendedMessage) {
        e.message += appendedMessage;
      }
    }
    throw e;
  }
}

Whenever Jest reaches for a module, we relieve it of the decision to use a cached module from its internally maintained moduleRegistry, and instead have it always return the result of requiring the module through Node’s standard mechanisms.

Patching Jest

Our fix works, but in an ideal world we wouldn’t have to fork jest-runtime just to make our change. Thankfully, the requireModuleOrMock function isn’t hidden within a closure or made inaccessible through other means. This means we’re free to monkey-patch it ourselves!

Let’s start by creating a test/globalSetup.js file in our project to hold our patch. Once created, we’ll add the following lines:


const jestRuntime = require('jest-runtime');

jestRuntime.prototype.requireModuleOrMock = function(from, moduleName) {
    return require(this._resolveModule(from, moduleName));
};

We’ll tell Jest to use this setup file by listing it in our jest.config.js file:


module.exports = {
    globalSetup: './test/globalSetup.js',
    ...
};

And that’s all there is to it! Jest will now execute our globalSetup.js file once, before all of our test suites, and restore the original behavior of require.

Being the future-minded developers that we are, it’s probably wise to document this small and easily overlooked bit of black magic:


/*
 * This requireModuleOrMock override is _very experimental_. It affects
 * how Jest works at a very low level and most likely breaks Jest-style
 * module mocks.
 *
 * The upside is that it lets us evaluate heavy modules once, rather
 * that once per test.
 */

jestRuntime.prototype.requireModuleOrMock = function(from, moduleName) {
    return require(this._resolveModule(from, moduleName));
};

If you find yourself with no other choice but to perform this incantation on your test suite, I wish you luck. You’re most likely going to need it.

March 24, 2019

Ponylang (SeanTAllen)

Last Week in Pony - March 24, 2019 March 24, 2019 01:59 PM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

Wesley Moore (wezm)

A Coding Retreat and Getting Embedded Rust Running on a SensorTag March 24, 2019 02:01 AM

This past long weekend some friends and I went on a coding retreat inspired by John Carmack doing something similar in 2018. During the weekend I worked on adding support for the Texas Instruments SensorTag to the embedded Rust ecosystem. This post is a summary of the weekend and what I was able to achieve code wise.

Back in March 2018 John Carmack posted about a week long coding retreat he went on to work on neural networks and OpenBSD. After reading the post I quoted it to some friends and commented:

I finally took another week-long programming retreat, where I could work in hermit mode, away from the normal press of work.

In the spirit of my retro theme, I had printed out several of Yann LeCun’s old papers and was considering doing everything completely off line, as if I was actually in a mountain cabin somewhere

I kind of love the idea of a week long code retreat in a cabin somewhere.

One of my friends also liked the idea and actually made it happen! There was an initial attempt in June 2018 but life got in the way so it was postponed. At the start of the year he picked it up again and organised it for the Labour day long weekend, which just passed.

We rented an Airbnb in the Dandenong Ranges, 45 minutes from Melbourne. Six people attended, two of which were from interstate. The setting was cozy, quiet and picturesque. Our days involved coding and collaborating, shared meals, and a walk or two around the surrounds.

Photo of a sunrise with trees and a windmill: the view from our accommodation one morning.

After linux.conf.au I got inspired to set up some self-hosted home sensors and automation. I did some research and picked up two Texas Instruments SensorTags and a debugger add-on. The SensorTag uses a CC2650 microcontroller with an ARM Cortex-M3 core and has support for a number of low power wireless standards, such as Bluetooth, ZigBee, and 6LoWPAN. The CC2650 also has a low power 16-bit sensor controller that can be used to help achieve years long battery life from a single CR2032 button cell. In addition to the microcontroller, the SensorTag also adds a bunch of sensors, including: temperature, humidity, barometer, accelerometer, gyroscope, and light.

Two SensorTags. One with its rubberised case removed and debugger board attached.

My project for the weekend was to try to get some Rust code running on the SensorTag. Rust has good support out of the box for targeting ARM Cortex microcontrollers but there were no crates to make interacting with this particular chip or board easy, so I set about building some.

The first step was generating a basic crate to allow interacting with the chip without needing to wrap everything in an unsafe block and poke at random memory addresses. Fortunately svd2rust can automate this by converting System View Description (SVD) XML files into a Rust crate. Unfortunately TI don't publish SVD files for their devices. As luck would have it though, M-Labs have found that TI do publish XML descriptions in a format of their own called DSLite. They have written a tool, dslite2svd, that converts this to SVD, so you can then use svd2rust. It took a while to get dslite2svd working. I had to tweak it to handle differences in the files I was processing, but eventually I was able to generate a crate that compiled.

Now that I had an API for the chip I turned to working out how to program and debug the SensorTag with a very basic Rust program. I used the excellent embedded Rust Discovery guide as a basis for the configuration, tools, and process for getting code onto the SensorTag. Since this was a different chip from a different manufacturer it took a long time to work out which tools worked, how to configure them, what format binaries they wanted, create a linker script, etc. A lot of trial and error was performed, along with lots of searching online with less than perfect internet. However, by Sunday I could program the device, debug code, and verify that my very basic program, shown below, was running.

fn main() -> ! {
    let _y;
    let x = 42;
    _y = x;

    // infinite loop; just so we don't leave this stack frame
    loop {}
}

The combination that worked for programming was:

  • cargo build --target thumbv7m-none-eabi
  • Convert ELF to BIN using cargo objcopy, which is part of cargo-binutils: cargo objcopy --bin sensortag --target thumbv7m-none-eabi -- -O binary sensortag.bin
  • Program with UniFlash:
    • Choose CC2650F128 and XDS110 on the first screen
    • Do a full erase the first time to reset CCFG, etc
    • Load image (select the .bin file produced above)

For debugging:

  • Build OpenOCD from git to get support for the chip and debugger (I used the existing AUR package)
  • Run OpenOCD: openocd -f jtag/openocd.cfg
  • Use GDB to debug: arm-none-eabi-gdb -x jtag/gdbinit -q target/thumbv7m-none-eabi/debug/sensortag
  • The usual mon reset halt in GDB upsets the debugger connection. I found that soft_reset_halt was able to reset the target (although it complains about being deprecated).

Note: Files in the jtag path above are in my sensortag repo. Trying to program through openocd failed with an error that the vEraseFlash command failed. I'd be curious to know if anyone has got this working as I'd very much like to ditch the huge 526.5 MiB UniFlash desktop-web-app dependency in my workflow.

Now that I could get code to run on the SensorTag I set about trying to use the generated chip support crate to flash one of the on board LEDs. I didn't succeed in getting this working by the time the retreat came to an end, but after I arrived home I was able to find the source of the hard faults I was encountering and get the LED blinking! The key was that I needed to power up the peripheral power domain and enable the GPIO clocks to be able to enable an output GPIO.

It works!

Below is the code that flashes the LED. It should be noted this code is operating with very little abstraction and is using register and field names that match the data sheet. Future work to implement the embedded-hal traits for this controller would make it less verbose and less cryptic.

#![deny(unsafe_code)]
#![no_main]
#![no_std]

#[allow(unused_extern_crates)] // NOTE(allow) bug rust-lang/rust#53964
extern crate panic_halt; // panic handler

// SensorTag is using RGZ package. VQFN (RGZ) | 48 pins, 7×7 QFN

use cc2650_hal as hal;
use cc2650f128;
use cortex_m_rt::entry;

use hal::{ddi, delay::Delay, prelude::*};

pub fn init() -> (Delay, cc2650f128::Peripherals) {
    let core_peripherals = cortex_m::Peripherals::take().unwrap();
    let device_peripherals = cc2650f128::Peripherals::take().unwrap();

    let clocks = ddi::CFGR {
        sysclk: Some(24_000_000),
    }
    .freeze();

    let delay = Delay::new(core_peripherals.SYST, clocks);

    // LEDs are connected to DIO10 and DIO15
    // Configure GPIO pins for output, maximum strength
    device_peripherals.IOC
        .iocfg10
        .modify(|_r, w| w.port_id().gpio().ie().clear_bit().iostr().max());
    device_peripherals.IOC
        .iocfg15
        .modify(|_r, w| w.port_id().gpio().ie().clear_bit().iostr().max());

    // Enable the PERIPH power domain and wait for it to be powered up
    device_peripherals.PRCM.pdctl0.modify(|_r, w| w.periph_on().set_bit());
    loop {
        if device_peripherals.PRCM.pdstat0.read().periph_on().bit_is_set() {
            break;
        }
    }

    // Enable the GPIO clock
    device_peripherals.PRCM.gpioclkgr.write(|w| w.clk_en().set_bit());

    // Load settings into CLKCTRL and wait for LOAD_DONE
    device_peripherals.PRCM.clkloadctl.modify(|_r, w| w.load().set_bit());
    loop {
        if device_peripherals.PRCM.clkloadctl.read().load_done().bit_is_set() {
            break;
        }
    }

    // Enable outputs
    device_peripherals.GPIO
        .doe31_0
        .modify(|_r, w| w.dio10().set_bit().dio15().set_bit());

    (delay, device_peripherals)
}

#[entry]
fn entry() -> ! {
    let (mut delay, periphs) = init();
    let half_period = 500_u16;

    loop {
        // Turn LED on and wait half a second
        periphs.GPIO.dout11_8.modify(|_r, w| w.dio10().set_bit());
        delay.delay_ms(half_period);

        // Turn LED off and wait half a second
        periphs.GPIO.dout11_8.modify(|_r, w| w.dio10().clear_bit());
        delay.delay_ms(half_period);
    }
}

The rest of the code is up on Sourcehut. It's all in a pretty rough state at the moment. I plan to tidy it up over the coming weeks and eventually publish the crates. If you're curious to see it now though, the repos are:

  • cc2650f128 crates.io Documentation -- chip support crate generated by dslite2svd and svd2rust.
  • cc26x0-hal (see wip branch, currently very rough).
  • sensortag -- LED flashing code. I hope to turn this into a board support crate eventually.

Overall the coding retreat was a great success and we hope to do another one next year.



Previous Post: Rebuilding My Personal Infrastructure With Alpine Linux and Docker
Next Post: My First 3 Weeks of Professional Rust

March 22, 2019

Ponylang (SeanTAllen)

0.28.0 Released March 22, 2019 08:06 PM

Pony 0.28.0 is a high-priority release. We advise updating as soon as possible.

In addition to a high-priority bug fix, there are “breaking changes” if you build Pony from source. We’ve also dropped support for some Debian and Ubuntu versions. Read on for further details.

March 21, 2019

Derek Jones (derek-jones)

Describing software engineering in terms of a traditional science March 21, 2019 04:33 PM

If you were asked to describe the ‘building stuff’ side of software engineering, by comparing it with one of the traditional sciences, which science would you choose?

I think a lot of people would want to compare it with Physics. Yes, physics envy is not restricted to the softer sciences of humanities and liberal arts. Unlike physics, software engineering is not governed by a handful of simple ‘laws’, it’s a messy collection of stuff.

I used to think that biology had all the necessary important characteristics needed to explain software engineering: evolution (of code and products), species (e.g., of editors), lifespan, and creatures are built from a small set of components (i.e., DNA or language constructs).

Now I’m beginning to think that chemistry has aspects that are a better fit for some important characteristics of software engineering. Chemists can combine atoms of their choosing to create whatever molecule takes their fancy (subject to bonding constraints, a kind of syntax and semantics for chemistry), and the continuing existence of a molecule does not depend on anything outside of itself; biological creatures need to be able to extract some form of nutrient from the environment in which they live (which is also a requirement of commercial software products, but not non-commercial ones). Individuals can create molecules, but creating new creatures (apart from human babies) is still a ways off.

In chemistry and software engineering, it’s all about emergent behaviors (in biology, behavior is just too complicated to reliably say much about). In theory the properties of a molecule can be calculated from the known behavior of its constituent components (e.g., the electrons, protons and neutrons), but the equations are so complicated it’s impractical to do so (apart from the most simple of molecules; new properties of water, two atoms of hydrogen and one of oxygen, are still being discovered); the properties of programs could be deduced from the behavior of their statements, but in practice it’s impractical.

What about the creative aspects of software engineering you ask? Again, chemistry is a much better fit than biology.

What about the craft aspect of software engineering? Again chemistry, or rather, alchemy.

Is there any characteristic that physics shares with software engineering? One that stands out is the ego of some of those involved. Describing, or creating, the universe nourishes large egos.

Stig Brautaset (stig)

Bose QuietComfort 35 Review March 21, 2019 02:39 PM

I review the noise-cancelling headphones I've been using for about 3 years.

March 19, 2019

Simon Zelazny (pzel)

How to grab all hosts but the first, in Ansible March 19, 2019 11:00 PM

Today I was trying to figure out how to run a particular ansible play on one host out of a group, and another play on all the other hosts.

The answer was found in a mailing list posting from 2014, but in case that service goes down, here's my note-to-self on how to do it.

Let's say you have a group of hosts called stateful_cluster_hosts in your inventory. You'd like to upload files/leader_script.sh.j2 to the first host, and files/follower_script.sh.j2 to all the others.

The play for the leader host would look like this:

- hosts: stateful_cluster_hosts[0]
  tasks:
  - name: "Upload leader start script"
    template:
      src: files/leader_script.sh.j2
      dest: start.sh
      mode: "u=rwx,g=rx,o=rx"

The play for the follower hosts would look like this:

- hosts:  stateful_cluster_hosts:!stateful_cluster_hosts[0]
  tasks:
  - name: "Upload follower start script"
    template:
      src: files/follower_script.sh.j2
      dest: start.sh
      mode: "u=rwx,g=rx,o=rx"

Where the syntax list:!list[idx] means take list, but filter out list[idx].
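For completeness, here's a minimal inventory sketch this could run against (the hostnames are hypothetical); stateful_cluster_hosts[0] resolves to the first host listed in the group:

[stateful_cluster_hosts]
node1.example.com
node2.example.com
node3.example.com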

Richard Kallos (rkallos)

Inveniam viam - Building Bridges March 19, 2019 03:00 AM

If you know where you are and where you want to be, you can start to plot a course between the two points. Following on the ideas presented in the previous two posts, I describe a practice I learned from reading Robert Fritz’s book Your Life as Art.

I spent a few years of my life voraciously consuming self-help books. I believe the journey started with Eckhart Tolle’s The Power of Now, and it more-or-less ended with Meditations, Stoicism and the Art of Happiness, and Your Life as Art. I’ll probably wind up writing about my path to Stoicism some other time. This post is about what I learned from Robert Fritz.

Your Life as Art is filled with insight about navigating the complex murky space of life while juggling the often competing aspects of spontaneity and rigid structure. My collection of notes about Your Life as Art is nearly a thousand lines long, and there’s definitely far too much good stuff to fit into a single blog post. At this point, I’ll focus on what Robert Fritz calls structural tension, and his technique for plotting a course with the help of a chart.

Hopefully I convinced you with the previous two posts about the importance of objectively “seeing” where you are and where you want to be. These two activities form the foundation of what Fritz calls structural tension, a force that stems from the contrast between reality and an ideal state and seeks to relieve the tension by moving you from your present state to your ideal state.

Writing is a handy exercise for generating this force. A structural tension chart (or ST chart) has your desired ideal state at the top of a page, your current state at the bottom, and a series of steps bridging the gap between the two. First you write the ideal section, then the real section, and finally add the steps in the middle. It’s very important to be as objective and detailed as possible about your ideal and current states. Here’s an example:

--- Ideal ---
I meditate every day for at least 15 minutes. My mind is calm
and focused as I go about my daily activities. I feel comfortable
sitting for the duration of my practice, no matter how long it is.
-------------

- Try keeping a meditation journal
- Experiment with active forms of meditation
- Experiment with seating positions
- Give myself time in the morning to meditate
- Wake up at the same time every day

--- Real ----
I meditate approximately once per week. I have difficulty
finding a regular time during the day to devote to meditation,
making it difficult to create a habit that sticks. I find that
I become uncomfortable sitting with my legs crossed for more
than 5 minutes. I do not often remember how good I feel after
meditating, which results in difficulty deciding to sit.

If you’re interested in reading more, I highly recommend Your Life as Art. Robert Fritz’s books are filled with great ideas. While this is basically a slightly more detailed to-do list, I find the process to be very grounding.

In conclusion, once you know where you are and where you want to be, try writing a structural tension chart in order to set a course.

March 18, 2019

Gergely Nagy (algernon)

Solarium March 18, 2019 11:45 AM

I wanted to build a keyboard for a long time, to prepare myself for building two for our Twins when they're old enough, but always struggled with figuring out what I want to build. I mean, I have the perfect keyboard for pretty much all occasions: my daily driver is the Keyboardio Model01, which I use for everything but the few cases highlighted next. For Steno, I use a Splitography. When I need to be extra quiet, I use an Atreus with Silent Reds. For gaming, I have a Shortcut prototype, and use the Atreus too, depending on the game. I don't travel much nowadays, so I have no immediate need for a portable board, but the Atreus would fit that purpose too.

As it turns out there is one scenario I do not have covered: if I have to type on my phone, I do not have a bluetooth keyboard to do it with, and have to rely on the virtual keyboard. This is far from ideal. Why do I need to type on the phone? Because sometimes I'm in a call at night, and need to be quiet, so I go to another room - but I only have a phone with me there. I could use a laptop, but since I need the phone anyway, carrying a phone and a laptop feels wrong, when I could carry a phone and a keyboard instead.

So I'm going to build myself a bluetooth keyboard. But before I do that, I'll build something simpler. Simpler, but still different enough from my current keyboards that I can justify the effort going into the build process. It will not be wireless at first, because during my research, I found that complicates matters too much, at least for a first build.

A while ago, I had another attempt at coming up with a keyboard, which had bluetooth, was split, and had a few other twists. We spent a whole afternoon brainstorming on the name with the twins and my wife. I'll use that name for another project, but I needed another one for the current one: I started down the same path we used back then, and found a good one.

You see, this keyboard is going to feature a rotary encoder, with a big scrubber knob on top of it, as a kind of huge dial. The knob will be in the middle, surrounded by low-profile Kailh Choc keys.

Solarium

balcony, dial, terrace, sundial, sunny spot

The low-profile keys with a mix of black and white keycaps do look like a terrace; the scrubber knob, a dial. So the name fits like a glove.

Now, I know very little about designing and building keyboards, so this first attempt will likely end up being a colossal failure. But one has to start somewhere, and this feels like a good start: simple enough to be possible, different enough to be interesting and worthwhile.

It will be powered by the same ATMega32U4 as many other keyboards, but unlike most, it will have Kailh Choc switches for a very low profile. It will also feature a rotary encoder, which I plan to use for various mouse-related tasks, such as scrolling. Or volume setting. Or brightness adjustment. Stuff like that.

This means I'll have to add rotary encoder support to Kaleidoscope, but that shouldn't be too big of an issue.
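Reading a quadrature encoder is conceptually simple. Here is a rough Arduino-style sketch of the idea, with hypothetical pin numbers; actual Kaleidoscope integration would look different:

// Minimal quadrature decoding: on a falling edge of channel A,
// the level of channel B tells you the direction of rotation.
const int PIN_A = 2;
const int PIN_B = 3;
int last_a = HIGH;

void setup() {
  pinMode(PIN_A, INPUT_PULLUP);
  pinMode(PIN_B, INPUT_PULLUP);
}

void loop() {
  int a = digitalRead(PIN_A);
  if (a != last_a && a == LOW) {
    if (digitalRead(PIN_B) == HIGH) {
      // one step clockwise, e.g. scroll down or volume up
    } else {
      // one step counter-clockwise
    }
  }
  last_a = a;
}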

The layout

Solarium

(Original KLE)

The idea is that the wheel will act as a mouse scroll wheel by default. Pressing the left Fn key, it will turn into volume control, pressing the right Fn key, it will turn into brightness control. I haven't found other uses for it yet, but I'm sure I will once I have the physical thing under my fingers. The wheel is to be operated by the opposite hand that holds either Fn, or any hand when no Fn is held. Usually that'll be the right hand, because Shift will be on the left thumb cluster, and I need that for horizontal scrolling.

While writing this, I got another idea for the wheel: I can make it switch windows or desktops. It can act as a more convenient Alt+Tab, too!

Components

The most interesting component is likely the knob. I've been eyeing the Scrubber Knob from Adafruit. I need to find a suitable encoder; the one on Adafruit is out of stock. One of the main reasons I like this knob is that it's low profile.

For the rest, they're pretty usual stuff:

  • Kailh Choc switches. Not sure whether I want reds or browns. I usually enjoy tactile switches, but one of the goals of this keyboard is to be quiet, and reds might be a better fit there.
  • Kailh Choc caps: I'll get a mix of black and white caps, for that terrace / balcony feeling.
  • ATMega32U4

Apart from this, I'll need a PCB, and perhaps a switch- and/or bottom plate, I suppose. Like I said, I know next to nothing about building keyboards. I originally wanted to hand-wire it, but Jesse Vincent told me I really don't want to do that, and I'll trust him on that.

Future plans

In the future, I plan to make a Bluetooth keyboard, and a split one (perhaps both at the same time, as originally planned). I might experiment with adding LEDs to the current one too as a next iteration. I also want to build a board with hotswap switches, though I will likely end up with Kailh Box Royals I guess (still need my samples to arrive first, mind you). We'll see once I built the first one, I'm sure there will be lessons learned.

#DeleteFacebook March 18, 2019 08:00 AM

Your Account Is Scheduled for Permanent Deletion

On March 15, the anniversary of the 1848-49 Hungarian Revolution and war for independence, I deleted my facebook account. Or at least started the process. This is something I've been planning to do for a while, and the special day felt like the perfect opportunity to do so. It wasn't easy, though not because I used facebook much. I did not: I hadn't looked at my timeline in months, had a total of 9 posts over the years (most of them private), didn't "like" stuff, and hadn't interacted with the site in any meaningful way.

I did use Messenger, mostly to communicate with friends and family; convincing at least some of them to find alternative ways to contact me wasn't without issues. But by March 15, I got the most important people to use another communication platform (XMPP), and I was able to hit the delete switch.

I have long despised facebook, for a whole lot of reasons, but most recently, they started to expose my phone number, which I only gave them for 2FA purposes. They exposed it in a way that I couldn't hide it from friends, either. That's a problem because I don't want every person I "friended" on there to know my phone number. It's a privilege to know it, and facebook abusing its knowledge of it was over the line. But this isn't even the worst of it.

You see, facebook is so helpful that it lets people link their contacts with their facebook friends. A lot of other apps are after one's contact list, and now my phone number got into a bunch more of those. This usually isn't a big deal, people will not notice. But programs will. Programs that hunt for phone numbers to sell.

And this is exactly what happened: my phone number got sold. How do I know? I got a call from an insurance company, one I never had any prior contact with, nor had anyone in my family. I was asked if I had two minutes, and I frankly told them that yes, I did, and that I'd like to use those two minutes to inquire where they got my phone number from, as per the GDPR, because as a data subject, I have the right to know what data has been collected about me and how such data was processed. I twisted the right a bit, and said I have the right to know how I got into their database; I'm not sure I have this right. In any case, the poor caller wasn't prepared for this, and it took a bit more than two minutes to convince him that he was better off complying with my request, otherwise they'd have a formal GDPR data request, and a complaint against him personally, filed within hours.

A few hours later, I got a call back: they got my phone number from facebook. I thanked them for the information, asked them to delete all data they have about me, and never to contact me again. Yes, there's a conflict between those two requests; we'll see how they handle it, and let it be their problem to figure out how to resolve it. Anyway, there are only a few possibilities for how they could've gotten my number through facebook:

  • If I friended them, they'd have access. They wouldn't have my consent to use it for this kind of stuff, but they'd have the number. This isn't the case. I'm pretty sure I can't friend corporations on facebook (yet?) to begin with.
  • Some of my friends had their contacts synced with facebook (I know of at least two who did this, one of them by accident; a mistake that is far too easy to make), and had their contacts uploaded to the insurance company via their app, or some similarly shady process. This still doesn't mean I consented to being contacted.
  • Facebook sold my number to them. Likewise, this doesn't imply consent, either.

They weren't able to tell me more than that they got my number from facebook. I have a feeling that this is a lie anyway - they just don't know where they bought it from, and facebook probably sounded like a reasonable source. On the other hand, facebook selling one's personal data, despite the GDPR is something I'm more than willing to believe, considering their past actions. Even if facebook is not the one who sold the number, the fact that an insurance company deemed it acceptable to lie and blame them paints an even worse picture.

In either case, facebook is a sickness I wanted to remove from my life, and this whole deal was the final straw. I initiated the account deletion. They probably won't delete anything, just disable it, and continue selling what they already have about me. But at least I make it harder for them to obtain more info about me. I started to migrate my family to better services: we use an XMPP server I host, with end to end encryption, because no one should have to trust me, nor the VPS provider the server is running on.

It's a painful break up, because there are a bunch of people who I talked with on Messenger from time to time, who will not move away from facebook anytime soon. There are parts of my family (my brother & sister) who will not install another chat app just to chat with me - we'll fall back to phone calls, email and SMS. Nevertheless, this had to be done. I'm lucky that I could, because I wasn't using facebook for anything important to begin with. Many people can't escape its clutches.

I wish there will be a day when all of my family is off of it. With a bit of luck, we can raise our kids without facebook in their lives.

Jan van den Berg (j11g)

The Effective Executive – Peter Drucker March 18, 2019 06:51 AM

Pick up any good management book and chances are that Peter Drucker will be mentioned. He is the godfather of management theory. I encountered Drucker many times before in other books and quotes, but I had never read anything directly by him. I have now, and I can only wish I had done so sooner.

The Effective Executive – Peter Drucker (1967) – 210 pages

The sublime classic The Effective Executive from 1967 was a good place to start. After only finishing the first chapter at the kitchen table, I already told my wife: this is one of the best management books I have ever read.

Drucker is an absolute authority who unambiguously will tell you exactly what’s important and what’s not. His voice and style cut like a knife and his directness will hit you like a ton of bricks. He explains and summarizes like no one else, without becoming repetitive. Every other sentence could be a quote. And after reading, every other management book makes a bit more sense, because now I can tell where they stem from.

Drucker demonstrates visionary insight, by correctly predicting the rise of knowledge workers and their specific needs (and the role of computers). In a rapidly changing society all knowledge workers are executives. And he/she needs to be effective. But, mind you, executive effectiveness “can be learned, but can’t be taught.”

Executive effectiveness

Even though executive effectiveness is an individual aspiration, Drucker is crystal clear on the bigger picture:

Only executive effectiveness can enable this society to harmonize its two needs: the needs of organization to obtain from the individual the contribution it needs, and the need of the individual to have organization serve as his tool for the accomplishment of his purposes. Effectiveness must be learned… Executive effectiveness is our one best hope to make modern society productive economically and viable socially.


So this book makes sense on different levels and is timeless. Even if some references, in hindsight, are dated (especially the McNamara references, knowing what we now know about the Vietnam war). I think Drucker himself did not anticipate the influence of his writing, as the next quote demonstrates. But this is also precisely what I admire about it.

There is little danger that anyone will compare this essay on training oneself to be an effective executive with, say, Kierkegaard’s great self-development tract, Training in Christianity. There are surely higher goals for a man’s life than to become an effective executive. But only because the goal is so modest can we hope at all to achieve it; that is, to have the large number of effective executives modern society and its organizations need.

The post The Effective Executive – Peter Drucker appeared first on Jan van den Berg.

Pete Corey (petecorey)

A Better Mandelbrot Iterator in J March 18, 2019 12:00 AM

Nearly a year ago I wrote about using the J programming language to write a Mandelbrot fractal renderer. I proudly exclaimed that J could be used to “write out expressions like we’d write English sentences,” and immediately proceeded to write a nonsensical, overcomplicated solution.

My final solution bit off more than it needed to chew. The next verb we wrote both calculated the next value of the Mandelbrot iteration and managed appending that value to a list of previously calculated values.

I nonchalantly explained:

This expression is saying that next “is” (=:) the “first element of the array” ({.) “plus” (+) the “square of the last element of the array” (*:@:{:). That last verb combines the “square” (*:) and “last” ({:) verbs together with the “at” (@:) adverb.

Flows off the tongue, right?

My time spent using J to solve last year’s Advent of Code challenges has shown me that a much simpler solution exists, and it can flow out of you in a fluent way if you just stop fighting the language and relax a little.


Let’s refresh ourselves on Mandelbrot fractals before we dive in. The heart of the Mandelbrot fractal is this iterative equation:

z_{n+1} = z_n^2 + c (the Mandelbrot set equation)

In English, the next value of z is some constant, c, plus the square of our previous value of z. To render a picture of the Mandelbrot fractal, we map some section of the complex plane onto the screen, so that every pixel maps to some value of c. We iterate on this equation until we decide that the values being calculated either remain small, or diverge to infinity. Every value of c that doesn’t diverge is part of the Mandelbrot set.

But let’s back up. We just said that “the next value of z is some constant, c, plus the square of our previous value of z”.

We can write that in J:

   +*:

And we can plug in example values for c (0.2j0.2) and z (0):

   0.2j0.2 (+*:) 0
0.2j0.2

Our next value of z is c (0.2j0.2) plus (+) the square (*:) of our previous value of z (0). Easy!


My previous solution built up an array of our iterated values of z by manually pulling c and previously iterated values off of the array and pushing new values onto the end. Is there a better way?

Absolutely. If I had read the documentation on the “power” verb (^:), I would have noticed that “boxing” (<) the number of times we want to apply our verb will return an array filled with the results of every intermediate application.

Put simply, we can repeatedly apply our iterator like so:

   0.2j0.2 (+*:)^:(<5) 0
0 0.2j0.2 0.2j0.28 0.1616j0.312 0.128771j0.300838

Lastly, it’s conceivable that we might want to switch the order of our inputs. Currently, our value for c is on the left and our initial value of z is on the right. If we’re applying this verb to an array of c values, we’d probably want c to be the right-hand argument and our initial z value to be a bound left-hand argument.

That’s a simple fix thanks to the “passive” adverb (~):

   0 (+*:)^:(<5)~ 0.2j0.2
0 0.2j0.2 0.2j0.28 0.1616j0.312 0.128771j0.300838

We can even plot our iterations to make sure that everything looks as we’d expect.

Our plotted iteration for a C value of 0.2 + 0.2i.
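
If you don’t read J, here’s a rough Python equivalent of the final expression. This is my own sketch, not code from the original post, and the function name is made up:

def iterate(c, z=0, n=5):
    # Return the first n iterates of f(z) = c + z*z, starting from z.
    out = [z]
    for _ in range(n - 1):
        z = c + z * z
        out.append(z)
    return out

print(iterate(0.2 + 0.2j))
# [0, (0.2+0.2j), (0.2+0.28j), (0.1616+0.312j), ...] -- the same values
# the J expression above produces.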


I’m not going to lie and claim that J is an elegantly ergonomic language. In truth, it’s a weird one. But as I use J more and more, I’m finding that it has a certain charm. I’ll often be implementing some tedious solution for a problem in Javascript or Elixir and find myself fantasizing about how easily I could write an equivalent solution in J.

That said, I definitely haven’t found a shortcut for learning the language. Tricks like “reading and writing J like English” only really work at a hand-wavingly superficial level. I’ve found that learning J really just takes time, and as I spend more time with the language, I can feel myself “settling into it” and its unique ways of looking at computation.

If you’re interested in learning J, check out my previous articles on the subject and be sure to visit the JSoftware home page for books, guides, and documentation.

Richard Kallos (rkallos)

Esse quam videri - Seeing what's in front of you March 18, 2019 12:00 AM

As humans, we are easily fooled. Our five senses are the primary way we get an idea of what’s happening around us, and it’s been shown time and time again that our senses are unreliable. In this post, I try to explain the importance of ‘seeing’ what’s in front of you, and how to practice it.

See this video for a classic example of our incredible ability to miss important details.

Rembrandt used to train his students by making them copy his self-portraits. This exercise forced them to see their subject as objectively as possible, which was essential to make an accurate reproduction. Only after mastering their portraiture skills did Rembrandt’s students go on to develop their own artistic styles.

It is important to periodically evaluate your position and course in life. It’s something you do whether you’re aware of it or not. When you plan something, you’re setting a course. When you’re reflecting on past events, you’re estimating your position. For the sake of overloading words like sight, vision, and planning, let’s refer to this act as life portraiture.

Life portraiture can be compared to navigating on land, air, or sea, except the many facets of our lives result in a space of many more dimensions. We can consider our position and course on axes like physical health, emotional health, career, finance, and social life. If we want finer detail, we can split any of those axes into more dimensions.

Objective life portraiture is not easy. We are all vulnerable to cognitive biases. Following the above analogy with navigation, our inaccuracy at objectively evaluating our lives is akin to inaccurately navigating a ship or airplane. If you’re not well-practiced at seeing, your only tool for navigation might be dead reckoning. If you practice drawing self-portraits of your life, you might suddenly find yourself in possession of a sextant and an almanac, so you can navigate using the stars. The ideal in this case would be to have something like GPS, which might look like Quantified Self with an incredible amount of detail.

It’s worth mentioning that our ability to navigate varies across different dimensions. This is an idea that doesn’t really carry over to navigating Earth, but it’s important to recognize. For example, if you’re thorough with your personal finances, you could have tools akin to GPS for navigating that part of your life. At the same time, if you don’t check in with your emotions, or do anything to improve your emotional health, you might be lost in those spaces.

There are ways to improve our navigating abilities depending on the spaces we’re looking at. To improve navigating your personal finances, you can regularly consult your banking statements, make budgets, and explore different methods of investing. To improve navigating your physical health, you can perform one of many different fitness tests, or consult a personal trainer. To improve navigating your emotional health, you could try journaling, or maybe begin seeing a therapist. Any and all of these could help you locate yourself in the vast space where your life could be.

In order to get where you want to go, you need to know where you are, and what direction you’re moving in.

March 17, 2019

Ponylang (SeanTAllen)

Last Week in Pony - March 17, 2019 March 17, 2019 02:35 PM

Last Week In Pony is a weekly blog post to catch you up on the latest news for the Pony programming language. To learn more about Pony check out our website, our Twitter account @ponylang, or our Zulip community.

Got something you think should be featured? There’s a GitHub issue for that! Add a comment to the open “Last Week in Pony” issue.

Caius Durling (caius)

Download All Your Gists March 17, 2019 02:32 PM

Over time I've managed to build up quite the collection of Gists over at GitHub; including secret ones, there are about 1200 currently. Some of these have useful code in them, some are just garbage output. I'd quite like a local copy either way, so I can easily search1 across them.

  1. Install the gist command from Github

    brew install gist
  2. Log in to your GitHub account through the gist tool (it'll prompt for your login credentials, then generate an API token to allow it future access.)

    gist --login
  3. Create a folder, go inside it and download all your gists!

    mkdir gist_archive
    cd gist_archive
    for repo in $(gist -l | awk '{ print $1 }'); do git clone "$repo" 2> /dev/null; done
  4. Now you have a snapshot of all your gists. To update them in future, you can run the above for any new gists, and update all the existing ones with:

    cd gist_archive
    for i in */; do (cd "$i" && git pull --rebase); done

Now go forth and search out your favourite snippet you saved years ago and forgot about!


  1. ack, ag, grep, ripgrep, etc. Pick your flavour.

Marc Brooker (mjb)

Control Planes vs Data Planes March 17, 2019 12:00 AM

Control Planes vs Data Planes

Are there multiple things here?

If you want to build a successful distributed system, one of the most important things to get right is the block diagram: what are the components, what does each of them own, and how do they communicate with other components. It's such a basic design step that many of us don't think about how important it is, and how difficult and expensive it can be to make changes to the overall architecture once the system is in production. Getting the block diagram right helps with the design of database schemas and APIs, helps reason through the availability and cost of running the system, and even helps form the right org chart to build the design.

One very common pattern when doing these design exercises is to separate components into a control plane and a data plane, recognizing the differences in requirements between these two roles.

No true monoliths

The microservices and SOA design approaches tend to push towards more blocks, with each block performing a smaller number of functions. The monolith approach is the other end of the spectrum, where the diagram consists of a single block. Arguments about these two approaches can be endless, but are ultimately not important. It's worth noting, though, that there are almost no true monoliths. Some kinds of concerns are almost always separated out. Here's a partial list:

  1. Storage. Most modern applications separate business logic from storage and caching, and talk through APIs to their storage.
  2. Load Balancing. Distributed applications need some way for clients to distribute their load across multiple instances.
  3. Failure tolerance. Highly available systems need to be able to handle the failure of hardware and software without affecting users.
  4. Scaling. Systems which need to handle variable load may add and remove resources over time.
  5. Deployments. Any system needs to change over time.

Even in the most monolithic application, these are separate components of the system, and need to be built into the design. What's notable here is that these concerns can be broken into two clean categories: data plane and control plane. Along with the monolithic application itself, storage and load balancing are data plane concerns: they are required to be up for any request to succeed, and scale O(N) with the number of requests the system handles. On the other hand, failure tolerance, scaling and deployments are control plane concerns: they scale differently (either with a small multiple of N, with the rate of change of N, or with the rate of change of the software) and can break for some period of time before customers notice.

Two roles: control plane and data plane

Every distributed system has components that fall roughly into these two roles: data plane components that sit on the request path, and control plane components which help that data plane do its work. Sometimes, the control plane components aren't components at all, but rather people and processes; the pattern, though, is the same. With this pattern worked out, the block diagram of the system starts to look something like this:

Data plane and control plane separated into two blocks

My colleague Colm MacCárthaigh likes to think of control planes from a control theory approach, separating the system (the data plane) from the controller (the control plane). That's a very informative approach, and you can hear him talk about it in the video embedded in the original post.

I tend to take a different approach, looking at the scaling and operational properties of systems. As in the example above, data plane components are the ones that scale with every request1, and need to be up for every request. Control plane components don't need to be up for every request, and instead only need to be up when there is work to do. Similarly, they scale in different ways. Some control plane components, such as those that monitor fleets of hosts, scale with O(N/M), where N is the number of requests and M is the number of requests per host. Other control plane components, such as those that handle scaling the fleet up and down, scale with O(dN/dt). Finally, control plane components that perform work like deployments scale with code change velocity.
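
To make those growth rates concrete, here's a toy calculation in Python (the numbers are invented purely for illustration):

requests_per_sec = 1_000_000  # N: requests handled by the data plane
requests_per_host = 10_000    # M: requests one host can serve

# The data plane scales O(N): it needs N / M = 100 hosts.
fleet_size = requests_per_sec // requests_per_host

# A fleet monitor scales O(N/M): it watches 100 hosts, not 1,000,000 requests.
monitored_targets = fleet_size

# An auto-scaler scales O(dN/dt): if load grows by 50,000 requests/sec each
# hour, it only needs to add about 5 hosts per hour.
growth_per_hour = 50_000
scaling_actions_per_hour = growth_per_hour // requests_per_host

print(fleet_size, monitored_targets, scaling_actions_per_hour)  # 100 100 5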

Finding the right separation between control and data planes is, in my experience, one of the most important things in a distributed systems design.

Another view: compartmentalizing complexity

In their classic paper on Chain Replication, van Renesse and Schneider write about how chain replicated systems handle server failure:

In response to detecting the failure of a server that is part of a chain (and, by the fail-stop assumption, all such failures are detected), the chain is reconfigured to eliminate the failed server. For this purpose, we employ a service, called the master

Fair enough. Chain replication can't handle these kinds of failures without adding significant complexity to the protocol. So what do we expect of the master?

In what follows, we assume the master is a single process that never fails.

Oh. Never fails, huh? They then go on to say that they approach this by replicating the master on multiple hosts using Paxos. If they have a Paxos implementation available, then why not just use that and not bother with this Chain Replication thing at all? The paper doesn't say2, but I have my own opinion: it's interesting to separate them because Chain Replication offers a different set of performance, throughput, and code complexity trade-offs than Paxos3. It is possible to build a single code base (and protocol) which handles both concerns, but at the cost of coupling these two different concerns. Instead, by making the master a separate component, the chain replicated data plane implementation can focus on the things it needs to do (scale, performance, optimizing for every byte). The control plane, which only needs to handle the occasional failure, can focus on what it needs to do (extreme availability, locality, etc). Each of these different requirements adds complexity, and separating them out allows a system to compartmentalize its complexity, and reduce coupling by offering clear APIs and contracts between components.
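
To make the division of labour concrete, here's a toy Python sketch (mine, not from the paper) of a chain-replicated data plane with a separate master:

chain = ["node-a", "node-b", "node-c"]  # head ... tail

def send(node, value):
    print(f"{node} <- {value}")  # stand-in for the real transport

def write(value):
    # Data plane: every write flows through every node, head to tail,
    # so this code runs O(N) with request volume.
    for node in chain:
        send(node, value)

def on_failure(failed_node):
    # Control plane (the master): runs only when a server fails,
    # reconfiguring the chain to route around it.
    chain.remove(failed_node)

write("x=1")
on_failure("node-b")
write("x=2")  # now replicates over the shortened chain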

Breaking down the binary

Say you build an awesome data plane based on chain replication, and an awesome control plane (master) for that data plane. At first, because of its lower scale, you can operate the control plane manually. Over time, as your system becomes successful, you'll start to have too many instances of the control plane to manage by hand, so you build a control plane for that control plane to automate the management. This is the first way the control/data binary breaks down: at some point control planes need their own control planes. Your controller is somebody else's system under control.

One other way the binary breaks down is with specialization. The master in the chain replicated system handles fault tolerance, but may not handle scaling, or sharding of chains, or interacting with customers to provision chains. In real systems there are frequently multiple control planes which control different aspects of the behavior of a system. Each of these control planes has its own differing requirements, requiring different tools and different expertise. Control planes are not homogeneous.

These two problems highlight that the idea of control planes and data planes may be too reductive to be a core design principle. Instead, it's a useful tool for helping identify opportunities to reduce and compartmentalize complexity by introducing good APIs and contracts, to ensure components have a clear set of responsibilities and ownership, and to use the right tools for solving different kinds of problems. Separating the control and data planes should be a heuristic tool for good system design, not a goal of system design.

Footnotes:

  1. Or potentially with every request. Things like caches complicate this a bit.
  2. It does compare Chain Replication to other solutions, but doesn't specifically talk about the benefits of separation. Murat Demirbas pointed out that Chain Replication's ability to serve linearizable reads from the tail is important. He also pointed me at the Object Storage on CRAQ paper, which talks about how to serve reads from intermediate nodes. Thanks, Murat!
  3. For one definition of Paxos. Lamport's Vertical Paxos paper sees chain replication as a flavor of Paxos, and more recent work by Heidi Howard et al on Flexible Paxos makes the line even less clear.

March 16, 2019

Richard Kallos (rkallos)

Memento Mori - Seeing the End March 16, 2019 09:30 PM

Memento Mori (translated as “remember death”) is a powerful idea and practice. In this post, I make the case that it’s important to think not just about your death, but to clearly define what it means to be finished in whatever you set out to do.

We are mites on a marble floating in the endless void. Our lifespans are blinks in cosmic history. Furthermore, for many of us, our contributions are likely to be forgotten soon after we rejoin the earth, if not sooner.

This is great news.

Whenever I’m feeling nervous or embarrassed, I start to feel better when I realize that nobody in front of me is going to be alive 100 years from now, and I doubt that they’ll be telling their grandchildren about that time when Richard made a fool of himself, because I try to make a fool of myself often enough that it’s usually not worth telling people about.

Knowing that we are finite is also pretty motivating. I feel less resistance to starting new things. It doesn’t have to be perfect; in fact, it’s probably going to be average. However, it’s my journey, so it’s special to me, and I probably (hopefully?) learned and improved on the way.

It’s important to think about the ends of things, even when we don’t necessarily want things to end. Endings are as much a part of life as beginnings are; to think otherwise is delusion. Endings tend to have a reputation for being sad, but they don’t always have to be.

For example, some developers of open source software get stuck working on their projects for far longer than they expected. It’s unfortunate that creating something that people enjoy can turn into a source of grief and resentment.

Specifying an end of any endeavor is an important task. If no ‘end state’ is declared, it’s possible that a project will continue to take up time and effort, perpetually staying on the back-burner of things you have going on, draining you of resources until you are no longer able to start anything new.

Spending time thinking about what your finished project will look like sets a target for you to achieve, which is a point I’ll elaborate on very soon. This exercise, along with evaluating where you currently are on your path toward achieving your goal or finishing your project, is immensely useful for getting your brain to focus on the intermediate tasks that need to be finished in order to reach that idealized ‘end state’.

All in all, while it’s sometimes nice to simply wander, it’s important to acknowledge that you are always going somewhere, even when you think you’re standing still. You should be the one who decides where you go, not someone else.

March 14, 2019

Derek Jones (derek-jones)

Altruistic innovation and the study of software economics March 14, 2019 02:11 PM

Recently, I have been reading rather a lot of papers that are ostensibly about the economics of markets where applications, licensed under an open source license, are readily available. I say ostensibly, because the authors have some very odd ideas about the activities of those involved in the production of open source.

Perhaps I am overly cynical, but I don’t think altruism is the primary motivation for developers writing open source. Yes, there is an altruistic component, but I would list enjoyment as the primary driver; developers enjoy solving problems that involve the production of software. On the commercial side, companies are involved with open source because of naked self-interest, e.g., commoditizing software that complements their products.

It may surprise you to learn that academic papers, written by economists, tend to be knee-deep in differential equations. As a physics/electronics undergraduate I got to spend lots of time studying various differential equations (each relating to some aspect of the workings of the Universe). Since graduating, I have rarely encountered them; that is, until I started reading economics papers (or at least trying to).

Using differential equations to model problems in economics sounds like a good idea, after all they have been used to do a really good job of modeling how the universe works. But the universe is governed by a few simple principles (or at least the bit we have access to is), and there is lots of experimental data about its behavior. Economic issues don’t appear to be governed by a few simple principles, and there is relatively little experimental data available.

Writing down a differential equation is easy; figuring out an analytic solution can be extremely difficult. The Navier-Stokes equations were written down 200 years ago, and we are still awaiting a general solution (solutions for a variety of special cases are known).

To keep their differential equations solvable, economists make lots of simplifying assumptions. Having obtained a solution to their equations, there is little or no evidence to compare it against. I cannot speak for economics in general, but those working on the economics of software are completely disconnected from reality.

What factors, other than altruism, do academic economists think are of major importance in open source? No, not constantly reinventing the wheel-barrow, but constantly innovating. Of course, everybody likes to think they are doing something new, but in practice it has probably been done before. Innovation is part of the business zeitgeist and academic economists are claiming to see it everywhere (and it does exist in their differential equations).

The economics of Linux vs. Microsoft Windows is a common comparison, i.e., open vs. closed source; I have not seen any mention of other open source operating systems. How might an economic analysis of different open source operating systems be framed? How about: “An economic analysis of the relative enjoyment derived from writing an operating system, Linux vs BSD”? Or the joy of writing an editor, which must be lots of fun, given how many text editors are available.

I have added the topics altruism and innovation to my list of indicators of poor quality, used to judge whether it’s worth spending more than 10 seconds reading a paper.

March 13, 2019

Oleg Kovalov (olegkovalov)

Indeed, I should add it. Haven’t used it for a long time. March 13, 2019 05:18 PM

Indeed, I should add it. Haven’t used it for a long time.

Wesley Moore (wezm)

My Rust Powered linux.conf.au e-Paper Badge March 13, 2019 09:39 AM

This week I attended linux.conf.au (for the first time) in Christchurch, New Zealand. It's a week long conference covering Linux, open source software and hardware, privacy, security and much more. The theme this year was IoT. In line with the theme I built a digital conference badge to take to the conference. It used a tri-colour e-Paper display and was powered by a Rust program I built running on Raspbian Linux. This post describes how it was built, how it works, and how it fared at the conference. The source code is on GitHub.

The badge in its final state after the conference.

Building

After booking my tickets in October I decided I wanted to build a digital conference badge. I'm not entirely sure what prompted me to do this but it was a combination of seeing projects like the BADGEr in the past, the theme of linux.conf.au 2019 being IoT, and an excuse to write more Rust. Since it was ostensibly a Linux conference it also seemed appropriate for it to run Linux.

Over the next few weeks I collected the parts and adaptors to build the badge. The main components were a Raspberry Pi Zero W and a Pimoroni Inky pHAT display.

The Raspberry Pi Zero W is a single core 1GHz ARM SoC with 512MB RAM, Wi-Fi, Bluetooth, microSD card slot, and mini HDMI. The Inky pHAT is a 212x104 pixel tri-colour (red, black, white) e-Paper display. It takes about 15 seconds to refresh the display but it draws very little power in between updates and the image persists even when power is removed.

Support Crates

The first part of the project involved building a Rust driver for the controller in the e-Paper display. That involved determining what controller the display used, as Pimoroni did not document it. Searching online for some of the comments in the Python driver suggested the display was possibly a HINK-E0213A07 from Holitech Co. Further searching based on the datasheet for that display suggested that the controller was a Solomon Systech SSD1675. Cross referencing the display datasheet, SSD1675 datasheet, and the Python source of Pimoroni's Inky pHAT driver suggested I was on the right track.

I set about building the Rust driver for the SSD1675 using the embedded HAL traits. These traits allow embedded Rust drivers to be built against a de facto standard set of traits that allow the driver to be used in any environment that implements the traits. For example I make use of traits for SPI devices, and GPIO pins, which are implemented for Linux, as well as say, the STM32F30x family of microcontrollers. This allows the driver to be written once and used on many devices.
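
As a rough analogy in Python (entirely my own sketch, not the actual Rust driver code), the idea is that the driver depends only on abstract pin and bus interfaces, and each platform supplies concrete implementations:

from abc import ABC, abstractmethod

class OutputPin(ABC):
    # Anything that can drive a GPIO pin high or low.
    @abstractmethod
    def set_high(self): ...
    @abstractmethod
    def set_low(self): ...

class SpiBus(ABC):
    # Anything that can write bytes over SPI.
    @abstractmethod
    def write(self, data: bytes): ...

class DisplayDriver:
    # Written once against the abstract interfaces; it never needs to
    # know whether the pins belong to Linux or a microcontroller.
    def __init__(self, spi: SpiBus, reset: OutputPin):
        self.spi = spi
        self.reset = reset

    def hw_reset(self):
        # Pulse the reset line, then send a reset command byte.
        # (0x12 is a placeholder, not taken from the real driver.)
        self.reset.set_low()
        self.reset.set_high()
        self.spi.write(bytes([0x12]))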

The result was the ssd1675 crate. It's a so called no-std crate. That means it does not use the Rust standard library, instead sticking only to the core library. This allows the crate to be used on devices and microcontrollers without features like file systems, or heap allocators. The crate also makes use of the embedded-graphics crate, which makes it easy to draw text and basic shapes on the display in a memory efficient manner.

While testing the ssd1675 crate I also built another crate, profont, which provides 7 sizes of the ProFont font for embedded graphics. The profont crate was published 24 Nov 2018, and ssd1675 was published a month later on 26 Dec 2018.

The Badge Itself

Now that I had all the prerequisites in place I could start working on the badge proper. I had a few goals for the badge and its implementation:

  • I wanted it to have some interactive component.
  • I wanted there to be some sort of Internet aspect to tie in with the IoT theme of the conference.
  • I wanted the badge to be entirely powered by a single, efficient Rust binary, that did not shell out to other commands or anything like that.
  • Ideally it would be relatively power efficient.

An early revision of the badge from 6 Jan 2019, showing my name, website, badge IP, and kernel info.

I settled on having the badge program serve up a web page with some information about the project, myself, and some live stats of the Raspberry Pi (OS, kernel, uptime, free RAM). The plain text version of the page looked like this:

Hi I'm Wes!

Welcome to my conference badge. It's powered by Linux and
Rust running on a Raspberry Pi Zero W with a tri-colour Inky
pHAT ePaper dispay. The source code is on GitHub:

https://github.com/wezm/linux-conf-au-2019-epaper-badge


Say Hello
---------

12 people have said hi.

Say hello in person and on the badge. To increment the hello
counter on the badge:

    curl -X POST http://10.0.0.18/hi


About Me
--------

I'm a software developer from Melbourne, Australia. I
currently work at GreenSync building systems to help make
better use of renewable energy.

Find me on the Internet at:

   Email: wes@wezm.net
  GitHub: https://github.com/wezm
Mastodon: https://mastodon.social/@wezm
 Twitter: https://twitter.com/wezm
 Website: http://www.wezm.net/


Host Information
----------------

   (_\)(/_)   OS:        Raspbian GNU/Linux
   (_(__)_)   KERNEL:    Linux 4.14.79+
  (_(_)(_)_)  UPTIME:    3m
   (_(__)_)   MEMORY:    430.3 MB free of 454.5 MB
     (__)


              .------------------------.
              |    Powered by Rust!    |
              '------------------------'
                              /
                             /
                      _~^~^~_
                  \) /  o o  \ (/
                    '_   -   _'
                    / '-----' \

The interactive part came in the form of a virtual "hello" counter. Each HTTP POST to the /hi endpoint incremented the count, which was shown on the badge. The badge displayed the URL of the page. The URL was just the badge's IP address on the conference Wi-Fi. To provide a little protection against abuse I added code that only allowed a given IP to increment the count once per hour.
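
That throttle might look something like the following Python sketch (the real badge implemented it in Rust; the names here are mine):

import time

SECONDS_PER_HOUR = 3600
last_hello = {}  # IP address -> time of the last accepted POST

def try_say_hello(ip, count, now=None):
    # Accept at most one hello per IP per hour; return the updated count.
    now = time.time() if now is None else now
    last = last_hello.get(ip)
    if last is not None and now - last < SECONDS_PER_HOUR:
        return count  # duplicate within the hour: ignore it
    last_hello[ip] = now
    return count + 1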

When building the badge software these are some of the details and things I strived for:

  • Handle Wi-Fi going away
  • Handle IP address changing
  • Prevent duplicate submissions
  • Pluralisation of text on the badge and on the web page
  • Automatically shift the text as the count requires more digits
  • Serve plain text and HTML pages (see the sketch after this list):
    • If the web page is requested with an Accept header that doesn't include text/html (e.g. curl), then the response is plain text and the method to "say hello" is a curl command.
    • If the user agent indicates it accepts HTML, then the page is HTML and contains a form with a button to "say hello".
  • Avoid aborting on errors:
    • I kind of ran out of time to handle all errors well, but most are handled gracefully and won't abort the program. In some cases a default is used in the face of an error. In other cases I just resorted to logging a message and carrying on.
  • Keep memory usage low:
    • The web server efficiently discards any large POST requests sent to it, to avoid exhausting RAM.
    • Typical RAM stats showed the Rust program using about 3MB of RAM.
  • Be relatively power efficient:
    • Use Rust instead of a scripting language
    • Only update the display when something it's showing changes
    • Only check for changes every 15 seconds (the rest of the time that thread just sleeps)
    • Put the display into deep sleep after updating
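
Here's what that content negotiation might look like, sketched in Python (illustrative only: the badge did this in Rust with hyper, and these names are mine):

def respond(accept_header, ip):
    # Browsers advertise text/html in Accept; curl does not, so it gets
    # plain text with a copy-pasteable command instead of a form.
    if "text/html" in (accept_header or ""):
        body = ('<form method="post" action="/hi">'
                '<button>Say hello</button></form>')
        return "text/html", body
    return "text/plain", "To say hello: curl -X POST http://%s/hi\n" % ip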

I used hyper for the HTTP server built into the binary. To get a feel for the limits of the device I did some rudimentary HTTP benchmarking with wrk and concluded that 300 requests per second was probably going to be fine. ;-)

Running 10s test @ http://10.0.0.18:8080/
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   316.58ms   54.41ms   1.28s    92.04%
    Req/Sec    79.43     43.24   212.00     67.74%
  3099 requests in 10.04s, 3.77MB read
Requests/sec:    308.61
Transfer/sec:    384.56KB

Mounting

When I started the project I imagined it would hang around my neck like a conference lanyard. By the time departure day arrived I still hadn't worked out how this would work in practice (power delivery being a major concern). In the end I settled on attaching it to the strap on my backpack. My bag has lots of webbing so there were plenty of loops to hold it in place. I was also able to use the Velcro covered holes intended for water tubes to get the cable neatly into the bag.

At the Conference

I had everything pretty much working for the start of the conference, although I did make some improvements and add a systemd unit to automatically start and restart the Rust binary. At this point there were still two unknowns: battery life and how the Raspberry Pi would handle coming in and out of Wi-Fi range. The Wi-Fi turned out fine: it automatically reconnected whenever it came into range.

Badge displaying a count of zero. Ready for day 1

Reception

Day 1 was a success! I had several people talk to me about the badge and increment the counter. Battery life was good too. After 12 hours of uptime the battery was still showing it was half full. Later in the week I left the badge running overnight and hit 24 hours uptime. The battery level indicator was on the last light so I suspect there wasn't much juice left.

Me with badge display showing a hello count of 1. Me after receiving my first hello on the badge

On day 2 several people suggested that I needed a QR code for the URL. Turns out entering an IP address on a phone keyboard is tedious. So that evening I added a QR code to the display. It's dynamically generated and contains the same URL that is shown on the display. There were several good crates to choose from. Ultimately I picked one that didn't have any image dependencies, which allowed me to convert the data into embedded-graphics pixels. The change was a success; most people scanned the QR code from this point on.

Badge display showing the newly added QR code.
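
The equivalent idea in Python, using the third-party qrcode package (the badge itself used a Rust crate; this is just a sketch):

import qrcode  # third-party: pip install qrcode

qr = qrcode.QRCode(border=1)
qr.add_data("http://10.0.0.18/")  # the same URL shown on the badge
qr.make()
modules = qr.get_matrix()  # a grid of booleans, one per QR module,
                           # ready to be converted into display pixels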

On day 2 I also ran into E. Dunham, and rambled briefly about my badge project and that it was built with Rust. To my absolute delight the project was featured in their talk the next day. The project was mentioned and linked on a slide and I was asked to raise my hand in case anyone wanted to chat afterwards.

Photo of E. Dunham's slide with a link to my git repo.

At the end of the talk the audience was encouraged to tell the rest of the room about a Rust project they were working on. Each person that did so got a little plush Ferris. I spoke about Read Rust.

Photo of a small orange plush crab. Plush Ferris

Conclusion

By the end of the conference the badge showed a count of 12. It had worked flawlessly over the five days.

Small projects with a fairly hard deadline are a good way to ensure they're seen through to completion. They're also a great motivator to publish some open source code.

I think I greatly overestimated the number of people that would interact with the badge. Of those that did, I think most tapped the button to increase the counter and didn't read much else on the page. For example no one commented on the system stats at the bottom. I had imagined the badge as a sort of digital business card but this did not really eventuate in practice.

Attaching the Pi and display to my bag worked out pretty well. I did have to be careful when putting my bag on as it was easy to catch on my clothes. Also one day it started raining on the walk back to the accommodation. I had not factored that in at all and given it wasn't super easy to take on and off I ended up shielding it with my hand all the way back.

Would I Do It Again?

Maybe. If I were to do it again I might do something less interactive and perhaps more informational but updated more regularly. I might try to tie the project into a talk submission too. For example, I could have submitted a talk about using the embedded Rust ecosystem on a Raspberry Pi and made reference to the badge in the talk or used it for examples. I think this would give more info about the project to a bunch of people at once and also potentially teach them something at the same time.

All in all it was a fun project and excellent conference. If you're interested, the Rust source for the badge is on GitHub.



Next Post: Rebuilding My Personal Infrastructure With Alpine Linux and Docker

Rebuilding My Personal Infrastructure With Alpine Linux and Docker March 13, 2019 09:37 AM

For more than a decade I have run one or more servers to host a number of personal websites and web applications. Recently I decided it was time to rebuild the servers to address some issues and make improvements. The last time I did this was in 2016 when I switched the servers from Ubuntu to FreeBSD. The outgoing servers were managed with Ansible. After being a Docker skeptic for a long time I have finally come around to it recently and decided to rebuild on Docker. This post aims to describe some of the choices made, and why I made them.

Before we start I'd like to take a moment to acknowledge this infrastructure is built to my values in a way that works for me. You might make different choices and that's ok. I hope you find this post interesting but not prescriptive.

Before the rebuild this is what my infrastructure looked like:

You'll note 3 servers, across 2 countries, and 2 hosting providers. Also the Rust Melbourne server was not managed by Ansible like the other two were.

I had a number of goals in mind with the rebuild:

  • Move everything to Australia (where I live)
  • Consolidate onto one server
  • https enable all websites

I set up my original infrastructure in the US because it was cheaper at the time and most traffic to the websites I host comes from the US. The Wizards Mattermost instance was added later. It's for a group of friends that are all in Australia. Being in the US made it quite slow at times, especially when sharing and viewing images.

Another drawback to administering servers in the US from AU was that it made the Ansible cycle of "make a change, run it, fix it, repeat" excruciatingly slow. It had been on my to-do list for a long time to move Wizards to Australia but I kept putting it off because I didn't want to deal with Ansible.

While having a single server that does everything wouldn't be the recommended architecture for business systems, for personal hosting where the small chance of downtime isn't going to result in loss of income, the simplicity won out, at least for now.

This is what I ended up building. Each box is a Docker container running on the host machine:

Graph of services

I haven't always been in favour of Docker but I think enough time has passed to show that it's probably here to stay. There are some really nice benefits to Docker managed services too, such as building locally and then shipping the image to production, and isolation from the host system (in the sense that you can just nuke the container and rebuild it if needed).

Picking a Host OS

Moving to Docker unfortunately ruled out FreeBSD as the host system. There is a very old Docker port for FreeBSD but my previous attempts at using it showed that it was not in a good enough state to use for hosting. That meant I needed to find a suitable Linux distro to act as the Docker host.

Coming from FreeBSD I'm a fan of the stable base + up-to-date packages model. For me this ruled out Debian (stable) based systems, which I find often have out-of-date or missing packages -- especially in the latter stages of the release cycle. I did some research to see if there were any distros that used a BSD style model. Most I found were either abandoned or one person operations.

I then recalled that as part of his Sourcehut work, Drew DeVault was migrating things to Alpine Linux. I had played with Alpine in the past (before it became famous in the Docker world), and I consider Drew's use some evidence in its favour.

Alpine describes itself as follows:

Alpine Linux is an independent, non-commercial, general purpose Linux distribution designed for power users who appreciate security, simplicity and resource efficiency.

Now that's a value statement I can get behind! Other things I like about Alpine Linux:

  • It's small, only including the bare essentials:
    • It avoids bloat by using musl-libc (which is MIT licensed) and busybox userland.
    • It has a 37MB installation ISO intended for virtualised server installations.
  • It was likely to be (and ended up being) the base of my Docker images.
  • It enables a number of security features by default.
  • Releases are made every ~6 months and are supported for 2 years.

Each release also has binary packages available in a stable channel that receives bug fixes and security updates for the lifetime of the release as well as a rolling edge channel that's always up-to-date.

Note that Alpine Linux doesn't use systemd, it uses OpenRC. This didn't factor into my decision at all. systemd has worked well for me on my Arch Linux systems. It may not be perfect but it does do a lot of things well. Benno Rice did a great talk at linux.conf.au 2019, titled, The Tragedy of systemd, that makes for interesting viewing on this topic.

Building Images

So with the host OS selected I set about building Docker images for each of the services I needed to run. There are a lot of pre-built Docker images for software like nginx and PostgreSQL available on Docker Hub. Often they also have an alpine variant that builds the image from an Alpine base image. I decided early on that these weren't really for me:

  • A lot of them build the package from source instead of just installing the Alpine package.
  • The Docker build was more complicated than I needed as it was trying to be a generic image that anyone could pull and use.
  • I wasn't a huge fan of pulling random Docker images from the Internet, even if they were official images.

In the end I only need to trust one image from Docker Hub: the 5MB Alpine image. All of my images are built on top of this one image.

Update 2 Mar 2019: I am no longer depending on any Docker Hub images. After the Alpine Linux 3.9.1 release I noticed the official Docker images had not been updated so I built my own. Turns out it's quite simple. Download the miniroot tarball from the Alpine website and then add it to a Docker image:

FROM scratch

ENV ALPINE_ARCH x86_64
ENV ALPINE_VERSION 3.9.1

ADD alpine-minirootfs-${ALPINE_VERSION}-${ALPINE_ARCH}.tar.gz /
CMD ["/bin/sh"]

An aspect of Docker that I don't really like is that inside the container you are root by default. When building my images I made a point of making the entrypoint processes run as a non-privileged user, or configuring the service to drop down to a regular user after starting.

Most services were fairly easy to Dockerise. For example here is my nginx Dockerfile:

FROM alpine:3.9

RUN apk update && apk add --no-cache nginx

COPY nginx.conf /etc/nginx/nginx.conf

RUN mkdir -p /usr/share/www/ /run/nginx/ && \
  rm /etc/nginx/conf.d/default.conf

EXPOSE 80

STOPSIGNAL SIGTERM

ENTRYPOINT ["/usr/sbin/nginx", "-g", "daemon off;"]

I did not strive to make the images especially generic. They just need to work for me. However I did make a point not to bake any credentials into the images and instead used environment variables for things like that.

Let's Encrypt

I've been avoiding Let's Encrypt up until now. Partly because the short expiry of the certificates seems easy to mishandle. Partly because of certbot, the recommended client. By default certbot is interactive, prompting for answers when you run it the first time; it wants to be installed alongside the webserver so it can manipulate the configuration; it's over 30,000 lines of Python (excluding tests and dependencies); the documentation suggests running magical certbot-auto scripts to install it... Too big and too magical for my liking.

Despite my reservations I wanted to enable https on all my sites and I wanted to avoid paying for certificates. This meant I had to make Let's Encrypt work for me. I did some research and finally settled on acme.sh. It's written in POSIX shell and uses curl and openssl to do its bidding.

To avoid the need for acme.sh to manipulate the webserver config I opted to use the DNS validation method (certbot can do this too). This requires a DNS provider that has an API so the client can dynamically manipulate the records. I looked through the large list of supported providers and settled on LuaDNS.

LuaDNS has a nice git based workflow where you define the DNS zones with small Lua scripts and the records are published when you push to the repo. They also have the requisite API for acme.sh. You can see my DNS repo at: https://github.com/wezm/dns

Getting the acme.sh + hitch combo to play nice proved to be a bit of a challenge. acme.sh needs to periodically renew certificates from Let's Encrypt; these then need to be formatted for hitch, and hitch told about them. In the end I built the hitch image off my acme.sh image. This goes against the Docker ethos of one service per container, but acme.sh doesn't run a daemon; it's periodically invoked by cron, so this seemed reasonable.

Docker and cron is also a challenge. I ended up with a simple solution: use the host cron to docker exec acme.sh in the hitch container. Perhaps not "pure" Docker but a lot simpler than some of the options I saw.

Hosting

I've been a happy DigitalOcean customer for 5 years but they don't have a data centre in Australia. Vultr, which have a similar offering -- low cost, high performance servers and a well-designed admin interface -- do have a Sydney data centre. Other obvious options include AWS and GCP. I wanted to avoid these where possible as their server offerings are more expensive, and their platforms have a tendency to lock you in with platform specific features. Also in the case of Google, they are a massive surveillance capitalist that I don't trust at all. So Vultr were my host of choice for the new server.

Having said that, the thing with building your own images is that you need to make them available to the Docker host somehow. For this I used an Amazon Elastic Container Registry. It's much cheaper than Docker Hub for private images and is just a standard container registry so I'm not locked in.

Orchestration

Once all the services were Dockerised, there needed to be a way to run the containers, and make them aware of each other. A popular option for this is Kubernetes and for a larger, multi-server deployment it might be the right choice. For my single server operation I opted for Docker Compose, which is, "a tool for defining and running multi-container Docker applications". With Compose you specify all the services in a YAML file and it takes care of running them all together.

My Docker Compose file looks like this:

version: '3'
services:
  hitch:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/hitch
    command: ["--config", "/etc/hitch/hitch.conf", "-b", "[varnish]:6086"]
    volumes:
      - ./hitch/hitch.conf:/etc/hitch/hitch.conf:ro
      - ./private/hitch/dhparams.pem:/etc/hitch/dhparams.pem:ro
      - certs:/etc/hitch/cert.d:rw
      - acme:/etc/acme.sh:rw
    ports:
      - "443:443"
    env_file:
      - private/hitch/development.env
    depends_on:
      - varnish
    restart: unless-stopped
  varnish:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/varnish
    command: ["-F", "-a", ":80", "-a", ":6086,PROXY", "-p", "feature=+http2", "-f", "/etc/varnish/default.vcl", "-s", "malloc,256M"]
    volumes:
      - ./varnish/default.vcl:/etc/varnish/default.vcl:ro
    ports:
      - "80:80"
    depends_on:
      - nginx
      - pkb
      - binary_trance
      - wizards
      - rust_melbourne
    restart: unless-stopped
  nginx:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/nginx
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
      - ./volumes/www:/usr/share/www:ro
    restart: unless-stopped
  pkb:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/pkb
    volumes:
      - pages:/home/pkb/pages:ro
    env_file:
      - private/pkb/development.env
    depends_on:
      - syncthing
    restart: unless-stopped
  binary_trance:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/binary_trance
    env_file:
      - private/binary_trance/development.env
    depends_on:
      - db
    restart: unless-stopped
  wizards:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/mattermost
    volumes:
      - ./private/wizards/config:/mattermost/config:rw
      - ./volumes/wizards/data:/mattermost/data:rw
      - ./volumes/wizards/logs:/mattermost/logs:rw
      - ./volumes/wizards/plugins:/mattermost/plugins:rw
      - ./volumes/wizards/client-plugins:/mattermost/client/plugins:rw
      - /etc/localtime:/etc/localtime:ro
    depends_on:
      - db
    restart: unless-stopped
  rust_melbourne:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/mattermost
    volumes:
      - ./private/rust_melbourne/config:/mattermost/config:rw
      - ./volumes/rust_melbourne/data:/mattermost/data:rw
      - ./volumes/rust_melbourne/logs:/mattermost/logs:rw
      - ./volumes/rust_melbourne/plugins:/mattermost/plugins:rw
      - ./volumes/rust_melbourne/client-plugins:/mattermost/client/plugins:rw
      - /etc/localtime:/etc/localtime:ro
    depends_on:
      - db
    restart: unless-stopped
  db:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/postgresql
    volumes:
      - postgresql:/var/lib/postgresql/data
    ports:
      - "127.0.0.1:5432:5432"
    env_file:
      - private/postgresql/development.env
    restart: unless-stopped
  syncthing:
    image: 791569612186.dkr.ecr.ap-southeast-2.amazonaws.com/syncthing
    volumes:
      - syncthing:/var/lib/syncthing:rw
      - pages:/var/lib/syncthing/Sync:rw
    ports:
      - "127.0.0.1:8384:8384"
      - "22000:22000"
      - "21027:21027/udp"
    restart: unless-stopped
volumes:
  postgresql:
  certs:
  acme:
  pages:
  syncthing:

Bringing all the services up is one command:

docker-compose -f docker-compose.yml -f production.yml up -d

The best bit is I can develop and test it all in isolation locally. Then when it's working, push to ECR and then run docker-compose on the server to bring in the changes. This is a huge improvement over my previous Ansible workflow and should make adding or removing new services in the future fairly painless.

Closing Thoughts

The new server has been running issue free so far. All sites are now redirecting to their https variants with Strict-Transport-Security headers set and get an A grade on the SSL Labs test. The Wizards Mattermost is much faster now that it's in Australia too.

There is one drawback to this move though: my sites are now slower for a lot of visitors. https adds some initial negotiation overhead and if you're reading this from outside Australia there's probably a bunch more latency than before.

I did some testing with WebPageTest to get a feel for the impact of this. My sites are already quite compact. Firefox tells me this page and all its resources are 171KB / 54KB transferred. So there's not a lot of slimming to be done there. One thing I did notice was that the TLS negotiation was happening for each of the parallel connections the browser opened to load the site.

Some research suggested HTTP/2 might help as it multiplexes requests on a single connection and only performs the TLS negotiation once. So I decided to live on the edge a little and enable Varnish's experimental HTTP/2 support. Retrieving the site over HTTP/2 did in fact reduce the TLS negotiations to one.

Thanks for reading, I hope the bits didn't take too long to get from Australia to wherever you are. Happy computing!



Previous Post: My Rust Powered linux.conf.au e-Paper Badge
Next Post: A Coding Retreat and Getting Embedded Rust Running on a SensorTag

Oleg Kovalov (olegkovalov)

What I don’t like in your repo March 13, 2019 06:35 AM

What I Don’t Like In Your Repo

Hi everyone, I’m Oleg and I’m yelling at (probably your) repo.

This is a copy of my dialogue with a friend about how to make a good and helpful repo for any community of any size and any programming language.

Let’s start.

README says nothing

But it’s a crucial part of any repo!

It’s the first interaction with your potential user, and it’s the first impression you make on them.

After the name (and maybe a logo) it’s a good place to put a few badges like:

  • recent version
  • CI status
  • link to the docs
  • code quality
  • code coverage
  • even the number of users in a chat
  • or just scroll all of them on https://shields.io/

Personal fail. Not so long ago I made a simple, hacky, and a bit funny project in Go called sabotage. I put in a quote from a song, added a picture, but… didn’t provide any info about what it does.

It takes like 10 minutes to write a simple intro and explain what I’m sharing and what it can do.

There is no reason why you or I should skip it.

Custom license or no license at all

First and most important: DO. NOT. (RE)INVENT. LICENSE. PLEASE.

When you’re about to create a shiny new license or make any existing one much better, please ask yourself 17 times: what is the point of doing so?

Companies of any size are very conservative about licenses, ’cause a bad one might destroy their business. So if you’re targeting a big audience, a custom license is a dumb way to do it.

There are a lot of guides on how to select a license, and leaving a project unlicensed or using an unpopular or funny one (like WTFPL) will just be a bad sign for a user.

Feel free to choose one of the most popular:

  • MIT — when you want to give it for free
  • BSD3 — when you want a bit more rights for you
  • Apache 2.0 — when it’s a commercial product
  • GPLv3 — which is also a good option

(that might be an opinionated list, but whatever)

No Dockerfile

It’s already 2019 and containers have won this world.

It’s much simpler for anyone to run docker pull foo/bar than to download all the dependencies, configure paths, discover that some things are incompatible, or risk totally destroying their system.

Is there a guarantee that there is no rm -rf in an unverified project? 😈

Adding a simple Dockerfile with everything needed can be done in 30 mins. But this will give your users a safe and fast way to start using, validating, or helping to improve your work. A win-win situation.

Changes without pull requests

That might look weird, but give me a second.

When a project is small and there are zero or few users, that might be okay. It’s easy to follow what happened in the last few days: fixes, new features, etc. But when the scale gets bigger, oh… it becomes a nightmare.

You’ve pushed a few commits straight to master, probably from your own computer, so no one saw what happened and there wasn’t any feedback. You may break the API’s backward compatibility, forget to add or remove something, or even do useless work (oh, the nasty one).

When you’re doing a pull request, some random guru-senior-architect might occasionally check your code and suggest a few changes. It sounds unlikely, but any additional pair of eyes might uncover bugs or architectural mistakes.

Don’t hide your work; isn’t being seen the whole reason for open-sourcing it?

Bloated dependencies

Maybe it’s just me, but I’m very conservative with dependencies.

When I see dozens of deps in the lock file, the first question that comes to my mind is: so, am I ready to fix a failure inside any of them?

Yeah, it works today, and maybe it worked a week/month/year before, but can you guarantee what will happen tomorrow? I cannot.

No styling or formatting

Different files (sometimes even different functions) are written in different styles.

This causes trouble for contributors, ’cause one prefers spaces and another prefers tabs. And this is just the simplest example.

So what will the result be:

  • one file in one style and another in a completely different one
  • one file with { at the end of the line and another with { on a new line
  • one function in a functional style and, right below it, another in pure procedural style

Which of them is right? I dunno. It’s acceptable if it works, but it also horribly distracts readers for no reason.

A simple rule for this: use formatters and linters (eslint, gofmt, rustfmt… oh, there are tons of them!). Feel free to configure them as you like, but keep in mind that the most popular settings tend to feel the most natural.
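
In Go, for instance, keeping a whole repo consistent comes down to a couple of commands (a sketch; golint is one option among many, golangci-lint is another):

    # Rewrite all files in place to the canonical format
    gofmt -s -w .

    # Report suspicious constructs and style issues
    go vet ./...
    golint ./...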

No automatic builds

How can you verify that a user can build your code?

The answer is quite simple: a build system. Travis CI, GitLab CI, CircleCI, and those are only a few of them.

Treat a build system as a silent companion that checks your every commit and automatically runs formatters/linters to ensure that new code has good quality. Sounds amazing, doesn’t it?

And adding a simple YAML file that describes how the build should be done takes minutes, as always.
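
For example, a minimal .travis.yml for a Go project might look like this (a sketch, not a drop-in file; adjust the Go version and commands to your repo):

    language: go

    go:
      - 1.12.x

    script:
      - test -z "$(gofmt -l .)"   # fail the build on unformatted code
      - go vet ./...
      - go test ./...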

No releases or Git tags

The master branch might be broken.

That happens. It’s unpleasant stuff, but it happens.

Some recent changes get merged and somehow cause trouble on master. How much time will it take you to fix? A few minutes? An hour? A day? Till you’re back from vacation? Who knows ¯\_(ツ)_/¯

But when there is a Git tag that points to a moment when the project was correct and able to be built, oh, that’s a good thing to have, and it makes the life of your users much better.

Adding a release on GitHub (same on GitLab or any other place) takes literally seconds; there’s no reason to omit this step.
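
Tagging really is that quick (v1.0.0 is just an example version):

    # Create an annotated tag for the last known-good commit
    git tag -a v1.0.0 -m "First stable release"

    # Push it so users (and GitHub's releases page) can see it
    git push origin v1.0.0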

No tests

Well, this one might be okay.

Of course, having correct tests is a nice thing, but you’re probably doing this project after work, in your free time or on weekends (I’m guilty: I do this so often instead of having a rest).

So don’t be too strict with yourself; feel free to share your work and share knowledge. Tests can be added later; time with family and friends is more important, the same as mental or physical health.
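
And when a spare half hour does show up, even a tiny table-driven test is better than none (a sketch for a hypothetical Add function in a package bar):

    package bar

    import "testing"

    // TestAdd checks a few representative inputs.
    func TestAdd(t *testing.T) {
        cases := []struct{ a, b, want int }{
            {1, 2, 3},
            {0, 0, 0},
            {-1, 1, 0},
        }
        for _, c := range cases {
            if got := Add(c.a, c.b); got != c.want {
                t.Errorf("Add(%d, %d) = %d, want %d", c.a, c.b, got, c.want)
            }
        }
    }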

Conclusion

There are a lot of other things that would make your repo even better, but maybe you’ll mention them yourself in the comments?

Twitter: https://twitter.com/oleg_kovalov/status/1105719270116388864

Lobsters: https://lobste.rs/s/6gixqw/what_i_don_t_like_your_repo

HN: https://news.ycombinator.com/item?id=19376264

Reddit: https://www.reddit.com/r/programming/comments/b0isug/what_i_dont_like_in_your_repo/

Thanks.


What I don’t like in your repo was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

March 12, 2019

Kevin Burke (kb)

Phone Number for SFMTA Temporary Sign Office March 12, 2019 04:10 AM

The phone number for the SFMTA Temporary Sign Office is very difficult to find. The SFMTA Temporary Sign web page directs you to 311. 311 does not know the right procedures for the Temporary Sign Office.

The email address on the website is also slow to respond to requests. The Temporary Sign department address listed on the website, at 1508 Bancroft Avenue, is not open to the public — it's just a locked door.

To contact the Temporary Sign Office, call 415-550-2716. This is the direct line to the department. I reached someone in under a minute.

If your event is more than 90 days in the future, don't expect an update. They don't start processing signage applications until 90 days before the event.

Here's a photo of my large son outside the SFMTA Temporary Sign Office, where I did not find anyone to speak with, but did find a posted phone number that led me to the right number, and to someone who could give me an update on my application.

Using an AWS Aurora Postgres Database as a Source for Database Migration Service March 12, 2019 03:53 AM

Say you have an Aurora RDS PostgreSQL database that you want to use as the source database for the AWS Database Migration Service (DMS).

The documentation is unclear on this point, so here you go: you can't use an Aurora RDS PostgreSQL database as the source database, because Aurora doesn't support the replication slots that Amazon DMS uses to migrate data from one database to another.
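
If you want to verify this against your own instance, one quick probe (a sketch; dms_probe is an arbitrary slot name, and creating slots requires the replication privilege) is to ask Postgres for the kind of logical replication slot DMS relies on:

    -- List existing replication slots (empty if none)
    SELECT slot_name, plugin, slot_type FROM pg_replication_slots;

    -- Try to create a logical slot the way DMS would; on an engine
    -- without logical replication support this raises an error
    SELECT pg_create_logical_replication_slot('dms_probe', 'test_decoding');

    -- Clean up if it did succeed
    SELECT pg_drop_replication_slot('dms_probe');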

Better luck with other migration tools!