@@MrAbrazildo For me, I can never code in C++; It is not at all simple, and there is already enough gibberish to process in C because of pointers syntax. C++ goes nuclear with shitty syntax. I personally prefer C because of simple, functional and fast code. But no problem with C++ either.
@@friedrichmyersDo you mean it's needed to call C f()s because of pointers? As long as I know, C++ rewrote everything to work with iterators, or even higher level. Shitty syntax? It's pretty elegant to me. It inherited the "ambiguity" from C, which is good, leading to more productive work, and added OO features, which allow to configure things to work your way. Miss the ^ from BASIC? No problem: make a class, put your integer in it, and bingo: x ^ y. It could be even more elegant though, if macros receive a strong upgrade.
@@TheVimeagen This was for encapsulation. Lexer struct is defined in .c file. So .h file cannot know what fields are inside, so it could be only pointer. That way user of interface only needs to call create and cleanup functions at the end. He cannot access any fields. He mus call functions inside lexer module to do operations on lexer struct.
yea was wondering about why the author did this, maybe there are reasons I'm not aware of where doing this way has some advantages? also no check on `malloc` returns value being potentially `NULL` :D
@@MrChickenpoulet it's usually unnecessary to check the return value of malloc() because it can only fail if you're out of memory. and if you're out of memory you're fucked anyway so who cares
with regards to pointer placement, due to being able to make multiple variable declarations in a single line (and for lack of a better term pointers not carrying), the asterisks has to be part of the variable name rather than the 'type' to make sense. Think of 'int* a, b;" only a is a pointer here. Not saying this is a a good thing, (C is a bit of an evil step-kid in syntax) but it exists. As such "int *a, b;" makes more logical sense.
@@TheVimeagen Ha, don't disagree. Just sometimes a new companies coding standards will drive you mad. The newer trend of tooling such as clang-format or gofmt (is it still applicable to call it new?) - "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite.". has a lot of benefit.
Am I right in thinking the purpose of this is to assess the performance of the C language? Because this could be sped up significantly by not calling malloc() for every token. Also this is not a great specimen for idiomatic C and has a strong C++ flavour: - Hiding pointer behind a typedef makes all code using that typedef misleading and is widely discouraged - Identifiers starting with underscore are not permitted (both the include guards and private functions violate this) - Name of struct typedef doesn't end in "_t", as is idiomatic for C but not C++ - extern "C" is C++ syntax, and should be in a C++ file, should not be in a C language file - strndup() is not portable and you either need to define the appropriate feature macros to request it from the compiler, or a static implementation in the C file
What do you think is a good alternative to malloc()ing each token, a larger up-front memory pool (that maybe needs resizing later) that you manage for the tokens?
You are right about _t convetion. I completly forget about that. Hiding pointer can be missliding, but I don't think that is big problem here. Lexer struct is defined in .c file, so other users of lexer module do not have direct access to fields of struct. He can only interact with that struct using functions made for that. It's done for encapsulation. Hi should not allocate or free any memory. Just cal appropriate functions and cleanup function at the end. It's not really low level C, and I would not use this approch for embedded systems, but I think it's nice and convenient interface. You are correct about memorry allocation. I will optimize code more when complete parser. Probably one memorry allocation will be better, but implementaion of parser in book holds referect to every token in AST. I don't like to do too much optimisation in advance, that can hurt you latter. You are alo correct about strndup. I don't really see the problem with underscore. I don't think that would impact performance in any way. Linux kernel uses it also. Same about extern C. It's behind preprocessor macro, so only C++ compiler would include that after preprocessing. I think it's nice to have it. That way you can reuse same code in C++ if you need to. It's convenient for user of library. I know that this is not library, but it does not hurt.
@@zachmanifold Better approche would be to havly ony one instance of token if you do not need to hold it. Just change data inside, and that all. Alternative would be memory pooling, but I think that is too early for detailed optimisation. It's really depends on how you use it latter. Implementation in book holds reference to token in AST, so one instance maybe it's not that great idea. If you optimise it too early can hurt you in long run. But definitely can and should be optimized.
@@zivlakmilos yeah, with the memory pool I was assuming we actually needed the individual tokens, but depending on the application if we could get away with reusing the token then that's definitely preferred
Identifiers starting with underscore are allowed if it isn't underscore underscore or underscore capital letter. Only underscore lowercase letter was used here so it's fine
5:35 Correct, it makes `Token` a `SToken*`. This is commonly done when making a type that is meant to be used from an interface where the implementer will decide how big the item is and provide allocation/deallocation features (as here), and especially when the type is opaque. Nice C code, I like it (though I agree that `type* name` is the One True Way). Good use of setting the pointer to NULL in the cleanup - that is 100% a method to avoid use-after-free errors without having to rely on checking something like Valgrind.
Microsoft does that pointer definition a lot in their C Windows API. Personally, I hate when people declare opaque types this way. I think its better to forward declare the typedef in a header and give the implementation in the c file.
That's what I tend to do. But I can see cases where it's helpful, like when the implementation varies by platform. If only this was the most annoying part of the Windows API.
Well to be accurate 'Token' is a 'struct SToken *'. I'd prefer seeing it named pToken or something similar. It's pretty helpful to name things that way.
You are correct. This was for encapsulation. Lexer struct is defined in .c file. So .h file cannot know what fields are inside, so it could be only pointer. That way user of interface only needs to call create and cleanup functions at the end. He cannot access any fields. He mus call functions inside lexer module to do operations on lexer struct. Token is available to user of interface, but I just like consistency.
I'm not a big fan of the use of underscores in the function names here. I get that they're marked static, so its just limited to the scope of the file, which technically is the correct use as stated by the C standard, but libraries (std, posix) and compilers still use and expose such named identifiers at the global level and assume anything with one or more underscored is reserved for them. It's probably best to just stay away from that convention in C.
C compilers only consider starting with double underscores (or single underscore when external I believe) and starting with a single underscore followed by an uppercase letter as reserved.
5:13 SToken is the name of the type for that struct, the *Token is the type of the pointer to a struct SToken, so you will be able to do this "SToken a_instance_of_stoken, another_instance_of_stoken; Token a_pointer_to_stoken, another_pointer_to_stoken;" so is just syntactic sugar to do type you wanted with int* but is illegal with multiple variable names. That can also be declared typedef struct { TokenType type; char *literal; } SToken, *Token; other people to avoid the "hidden pointer" stuff do the other way of this, Token for the struct and PtrToken for the pointer type, or some other naming convention that is explicit about being a pointer. If you have a good IDE you can know what it is by hovering on it, this is why Cormac disses on non-IDEs like vim & emacs. I guess you can use a "go to definition" keyboard shortcut on vim.
1:49 I disagree on this -- I would've written it as "const char * _lexerReadInt" (with spaces on each end) for example, because: - char is a type - * is a specifier/qualifier - char* is not a type per-se - pointer specifier/qualifiers can be applied to any C type - char* a, b, c; leads to programming errors
Real ones know the star goes at the function/variable name because when you declare more then one pointer in a single line you need a star for every variable name. Yes, its stupid and pedantic but thats the way it must be and REAL ONES WILL KNOW... edit: damn I wrote this befor prime explained why they do it that way and now I feel stupid.
Charlie Puth: *perfect pitch* TJ: *puts on Tom's extract* *perfect debug skllz* Also, TJ kind of looks like charlie too (?), just came to mind. Any ops degens...?
1:00, size_t is safer in 2 ways: 1) it's the biggest possible integer type, thus holding possible higher values, less chance to get overflowed. 2) If that happens, the behavior is entirely defined, you know what will happen, much different than signed types, which are not specified, allowing different implementations, equivalent to arbitrary behavior from the programmer's perspective! But signed types are faster. 3:45, much better would be: while (strchr (" \t ", lexer->character)). //From . 4:21, it's so ugly to see a switch like that for C/C++. It's possible to reduce this to 3-5 lines of code, instead of this ugliness. 5:00, I guess it means that when reading the content of pointer Token, the content is the struct SToken. So, whenever you define Token as the type of some variable, that means it'll be a pointer to SToken. So it's not needed to use pointer operator for pointers to this type. i.e.: Token my_ptr_to_SToken = NULL; //It doesn't require *. 5:25, in C++, it's not needed to make a typedef here: once a class/struct is defined, it's name means its type. But bizarre C may be still demanding that nowadays. 7:19, the double pointer means a collection of pointers. You could check "nullness" with only a pointer, but it'd be it alone. They want several tokens together. 7:45, much better: return ch == '_' || isalpha (ch); //From .
Why not use uint16_t (or uint32_t) instead of size_t? Or, if you want to build in some sort of error handling, int16_t (or int32_t)? A 16 bit int should be sufficient for a lexer. It might be best to chunk up input into a well defined buffer of fewer than 65k characters in the first place. That makes uint16_t (or int16_t) sufficient. That will reduce the struct size from the current (at least) 35+padding bytes to a more precise 21+padding bytes. Or, does size_t offer advantages that I’m not aware of (I’m still learning C)?
@@aaronfleisher4694 If 16 bits are safe enough, then I would use it - but I would make a typedef for it (it might needs to change in the future). But when the idea is to have more space on a variable, size_t looks towards the future ("future-proof"), growing automatically according to the platform size.
"It's good. Because it's not C++"
"I feel like this is where C++ shines pretty well"
I also feel like it shines pretty well everywhere that it isn't
C++ is better and safer.
@@MrAbrazildo see you in 10 years, silly goose. Hope it doesn't take you much longer to figure things out.
@@Kknewkles I take entire projects in C++. It's the best, by test.
@@MrAbrazildo For me, I can never code in C++; It is not at all simple, and there is already enough gibberish to process in C because of pointers syntax. C++ goes nuclear with shitty syntax. I personally prefer C because of simple, functional and fast code. But no problem with C++ either.
@@friedrichmyersDo you mean it's needed to call C f()s because of pointers? As long as I know, C++ rewrote everything to work with iterators, or even higher level.
Shitty syntax? It's pretty elegant to me. It inherited the "ambiguity" from C, which is good, leading to more productive work, and added OO features, which allow to configure things to work your way. Miss the ^ from BASIC? No problem: make a class, put your integer in it, and bingo: x ^ y. It could be even more elegant though, if macros receive a strong upgrade.
Oh, I hate the hidden pointer. That feels like a huge foot gun when it comes to ownership…
yepper, super confusing there
@@TheVimeagen This was for encapsulation. Lexer struct is defined in .c file. So .h file cannot know what fields are inside, so it could be only pointer.
That way user of interface only needs to call create and cleanup functions at the end. He cannot access any fields. He mus call functions inside lexer module to do operations on lexer struct.
Token is available to user of interface, so no need to be pointer. But I like consistency.
@@TheVimeagen its not confusing at all if you use a real IDE.
Agreed, typedeffing away a pointer is pure evil
If you use calloc() instead of malloc() you don't have to memset().
yea was wondering about why the author did this, maybe there are reasons I'm not aware of where doing this way has some advantages? also no check on `malloc` returns value being potentially `NULL` :D
@@MrChickenpoulet I always check if malloc fails, but I don't think it can happen on Linux unless you initially ask for a large amount of memory
I still find it more readable doing it explicitly.
@@whoopsimsorry2546 I generally use calloc more because the interface is better
@@MrChickenpoulet it's usually unnecessary to check the return value of malloc() because it can only fail if you're out of memory. and if you're out of memory you're fucked anyway so who cares
"Why do I have a more human color?"
Lmao
classic tj, looking like a lizard
There's a lot of dog whistles in this video.
Going up 3 lines in vim
Talking about whitespace
Lexar is kind of like Texas
Now let's see Paul Allen's code
👨🍳🤌
Perfect use of the meme
with regards to pointer placement, due to being able to make multiple variable declarations in a single line (and for lack of a better term pointers not carrying), the asterisks has to be part of the variable name rather than the 'type' to make sense. Think of 'int* a, b;" only a is a pointer here. Not saying this is a a good thing, (C is a bit of an evil step-kid in syntax) but it exists. As such "int *a, b;" makes more logical sense.
i get it , but i think that there should be some special kind of wrath set aside for those that write this type of code :)
@@TheVimeagen Ha, don't disagree. Just sometimes a new companies coding standards will drive you mad. The newer trend of tooling such as clang-format or gofmt (is it still applicable to call it new?) - "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite.". has a lot of benefit.
when i saw the C++ code review vid i immediately started looking for the C one and sadly didn't find it. well, problem solved
Am I right in thinking the purpose of this is to assess the performance of the C language?
Because this could be sped up significantly by not calling malloc() for every token.
Also this is not a great specimen for idiomatic C and has a strong C++ flavour:
- Hiding pointer behind a typedef makes all code using that typedef misleading and is widely discouraged
- Identifiers starting with underscore are not permitted (both the include guards and private functions violate this)
- Name of struct typedef doesn't end in "_t", as is idiomatic for C but not C++
- extern "C" is C++ syntax, and should be in a C++ file, should not be in a C language file
- strndup() is not portable and you either need to define the appropriate feature macros to request it from the compiler, or a static implementation in the C file
What do you think is a good alternative to malloc()ing each token, a larger up-front memory pool (that maybe needs resizing later) that you manage for the tokens?
You are right about _t convetion. I completly forget about that.
Hiding pointer can be missliding, but I don't think that is big problem here. Lexer struct is defined in .c file, so other users of lexer module do not have direct access to fields of struct. He can only interact with that struct using functions made for that. It's done for encapsulation. Hi should not allocate or free any memory. Just cal appropriate functions and cleanup function at the end. It's not really low level C, and I would not use this approch for embedded systems, but I think it's nice and convenient interface.
You are correct about memorry allocation. I will optimize code more when complete parser. Probably one memorry allocation will be better, but implementaion of parser in book holds referect to every token in AST. I don't like to do too much optimisation in advance, that can hurt you latter.
You are alo correct about strndup.
I don't really see the problem with underscore. I don't think that would impact performance in any way. Linux kernel uses it also.
Same about extern C. It's behind preprocessor macro, so only C++ compiler would include that after preprocessing. I think it's nice to have it. That way you can reuse same code in C++ if you need to. It's convenient for user of library. I know that this is not library, but it does not hurt.
@@zachmanifold Better approche would be to havly ony one instance of token if you do not need to hold it. Just change data inside, and that all. Alternative would be memory pooling, but I think that is too early for detailed optimisation. It's really depends on how you use it latter. Implementation in book holds reference to token in AST, so one instance maybe it's not that great idea. If you optimise it too early can hurt you in long run. But definitely can and should be optimized.
@@zivlakmilos yeah, with the memory pool I was assuming we actually needed the individual tokens, but depending on the application if we could get away with reusing the token then that's definitely preferred
Identifiers starting with underscore are allowed if it isn't underscore underscore or underscore capital letter. Only underscore lowercase letter was used here so it's fine
Global scope identifiers starting with an underscore are reserved in C => instant undefined behavior.
5:35 Correct, it makes `Token` a `SToken*`. This is commonly done when making a type that is meant to be used from an interface where the implementer will decide how big the item is and provide allocation/deallocation features (as here), and especially when the type is opaque.
Nice C code, I like it (though I agree that `type* name` is the One True Way). Good use of setting the pointer to NULL in the cleanup - that is 100% a method to avoid use-after-free errors without having to rely on checking something like Valgrind.
Microsoft does that pointer definition a lot in their C Windows API. Personally, I hate when people declare opaque types this way. I think its better to forward declare the typedef in a header and give the implementation in the c file.
That's what I tend to do. But I can see cases where it's helpful, like when the implementation varies by platform. If only this was the most annoying part of the Windows API.
Well to be accurate 'Token' is a 'struct SToken *'. I'd prefer seeing it named pToken or something similar. It's pretty helpful to name things that way.
@@InfiniteQuest86 Or you could just not typedef pointers
You are correct.
This was for encapsulation. Lexer struct is defined in .c file. So .h file cannot know what fields are inside, so it could be only pointer.
That way user of interface only needs to call create and cleanup functions at the end. He cannot access any fields. He mus call functions inside lexer module to do operations on lexer struct.
Token is available to user of interface, but I just like consistency.
I'm not a big fan of the use of underscores in the function names here. I get that they're marked static, so its just limited to the scope of the file, which technically is the correct use as stated by the C standard, but libraries (std, posix) and compilers still use and expose such named identifiers at the global level and assume anything with one or more underscored is reserved for them. It's probably best to just stay away from that convention in C.
C compilers only consider starting with double underscores (or single underscore when external I believe) and starting with a single underscore followed by an uppercase letter as reserved.
The classic "Start C (end)" marker
Shoutouts to prime and teej flying out for in person pair code reviews :p
Lexar is short for Lex Luthor. He has an mallocious plan hidden in their somewhere
C++ really shines when it's actually c.
I could watch at least an hour of these two running through this code with gdb
This is a super simple code, the developers do they job here
With all the JDSL and Tom's a genius i think you need to make a T-shirt
Bruh, the whole time I thought they were just sitting together like really good friends
5:13 SToken is the name of the type for that struct, the *Token is the type of the pointer to a struct SToken, so you will be able to do this
"SToken a_instance_of_stoken, another_instance_of_stoken;
Token a_pointer_to_stoken, another_pointer_to_stoken;"
so is just syntactic sugar to do type you wanted with int* but is illegal with multiple variable names.
That can also be declared
typedef struct {
TokenType type;
char *literal;
} SToken, *Token;
other people to avoid the "hidden pointer" stuff do the other way of this, Token for the struct and PtrToken for the pointer type, or some other naming convention that is explicit about being a pointer. If you have a good IDE you can know what it is by hovering on it, this is why Cormac disses on non-IDEs like vim & emacs. I guess you can use a "go to definition" keyboard shortcut on vim.
For me personally, C is just the best language of all time.
Agreed 100%
size_t => unsigned int 32/64 bit
in-place object declaration
Put that underscore at the end of the name!
The Slexer tho XDDD I'd do "struct Lexer_t"
1:49 I disagree on this -- I would've written it as "const char * _lexerReadInt" (with spaces on each end) for example, because:
- char is a type
- * is a specifier/qualifier
- char* is not a type per-se
- pointer specifier/qualifiers can be applied to any C type
- char* a, b, c; leads to programming errors
2:49 it's hungarian notation
I should study more
Real ones know the star goes at the function/variable name because when you declare more then one pointer in a single line you need a star for every variable name. Yes, its stupid and pedantic but thats the way it must be and REAL ONES WILL KNOW...
edit: damn I wrote this befor prime explained why they do it that way and now I feel stupid.
why would you call malloc() then call memset()? Couldn't they have achieved the same outcome with calloc()? time: 3:11
Is malloc a macro that wraps actual malloc that checks for a failed allocation?
* to search for the word under your cursor
Charlie Puth: *perfect pitch*
TJ: *puts on Tom's extract* *perfect debug skllz*
Also, TJ kind of looks like charlie too (?), just came to mind. Any ops degens...?
You write industry-standard code.
Looks perfect.
1:00, size_t is safer in 2 ways: 1) it's the biggest possible integer type, thus holding possible higher values, less chance to get overflowed. 2) If that happens, the behavior is entirely defined, you know what will happen, much different than signed types, which are not specified, allowing different implementations, equivalent to arbitrary behavior from the programmer's perspective! But signed types are faster.
3:45, much better would be: while (strchr ("
\t
", lexer->character)). //From .
4:21, it's so ugly to see a switch like that for C/C++. It's possible to reduce this to 3-5 lines of code, instead of this ugliness.
5:00, I guess it means that when reading the content of pointer Token, the content is the struct SToken. So, whenever you define Token as the type of some variable, that means it'll be a pointer to SToken. So it's not needed to use pointer operator for pointers to this type. i.e.: Token my_ptr_to_SToken = NULL; //It doesn't require *.
5:25, in C++, it's not needed to make a typedef here: once a class/struct is defined, it's name means its type. But bizarre C may be still demanding that nowadays.
7:19, the double pointer means a collection of pointers. You could check "nullness" with only a pointer, but it'd be it alone. They want several tokens together.
7:45, much better: return ch == '_' || isalpha (ch); //From .
Why not use uint16_t (or uint32_t) instead of size_t? Or, if you want to build in some sort of error handling, int16_t (or int32_t)? A 16 bit int should be sufficient for a lexer. It might be best to chunk up input into a well defined buffer of fewer than 65k characters in the first place. That makes uint16_t (or int16_t) sufficient. That will reduce the struct size from the current (at least) 35+padding bytes to a more precise 21+padding bytes. Or, does size_t offer advantages that I’m not aware of (I’m still learning C)?
@@aaronfleisher4694 If 16 bits are safe enough, then I would use it - but I would make a typedef for it (it might needs to change in the future). But when the idea is to have more space on a variable, size_t looks towards the future ("future-proof"), growing automatically according to the platform size.
this videos are awesome
I like using (void*)0 instead of NULL
Review nix next!
c is underrated
I don't think it is...
2:17 that's wrong, both foo and a are pointers.
Search word under cursor:
```
local tbi = require "telescope.builtin"
vim.keymap.set("n", "rw", function()
tbi.grep_string {
search = vim.fn.expand "",
}
end, { silent = true})
```
it's called Hungarian notation en.wikipedia.org/wiki/Hungarian_notation
I’m sorry, but did you really just find . | grep compile?
yeah... champ status
@@TheVimeagen 🤣 -name