POKI_PUT_TOC_HERE
Try od -xcv and/or cat -e on your file to check for non-printable characters.
If you’re using a Miller version less than 5.0.0 (run mlr --version on your system to find out), which predates the line-ending-autodetect feature, please see here.
Check the field separators of the data, e.g. with the command-line head program. Example: for CSV, Miller’s default field separator is comma; if your data is tab-delimited, e.g. aTABbTABc, then Miller won’t find three fields named a, b, and c, but rather just one named aTABbTABc. The solution in this case is mlr --fs tab {remaining arguments ...}.
Also try od -xcv and/or cat -e on your file to check for non-printable characters.
For mlr put and mlr filter, the default behavior when scanning input records is to parse each value as integer, if possible, then as float, if possible, else leave it as string:
POKI_RUN_COMMAND{{cat data/scan-example-1.tbl}}HERE
POKI_RUN_COMMAND{{mlr --pprint put '$copy = $value; $type = typeof($value)' data/scan-example-1.tbl}}HERE
The numeric-conversion rule is simple: first try to scan as integer ("1" should be int); if that doesn’t succeed, try to scan as float ("1.0" should be float); if that doesn’t succeed, leave the value as a string ("1x" is string).
This lets you write '$z = $x + $y' without having to write '$z = int($x) + float($y)'. Also note that the default output format for floating-point numbers created by put (and other verbs such as stats1) is six decimal places; you can override this using mlr --ofmt. Also note that Miller uses your system’s C library functions whenever possible: e.g. sscanf for converting strings to integer or floating-point.
But now suppose you have data like these:
POKI_RUN_COMMAND{{cat data/scan-example-2.tbl}}HERE
POKI_RUN_COMMAND{{mlr --pprint put '$copy = $value; $type = typeof($value)' data/scan-example-2.tbl}}HERE
The same conversion rules as above are being used. Namely: the integer scan is tried first (with sscanf semantics); since 0008 doesn’t scan as integer (the leading 0 requests octal but 8 isn’t a valid octal digit), the float scan is tried next and it succeeds; the default floating-point output format is six decimal places (override with mlr --ofmt).
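As a rough illustration (this is not Miller’s actual code), the int-then-float-then-string scan order can be sketched in Python. Note one detail: Python’s int(text, 0) simply rejects a bare leading zero, while C’s sscanf tries to read it as octal; either way, the integer scan fails on "0008" and the float scan takes over.

```python
def scan(text):
    """Sketch of Miller's pre-put/filter inference: int, then float, then string."""
    try:
        # base=0 gives C-literal-style parsing: "0x.." is hex, and a bare
        # leading zero (as in "0008") is rejected, so the int scan fails.
        return int(text, 0)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text  # neither scan succeeded: leave as string

print(scan("1"))     # stays an int
print(scan("1.0"))   # becomes a float
print(scan("1x"))    # left as the string "1x"
print(scan("0008"))  # int scan fails, float scan succeeds: 8.0
```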
The solution is to use the -S option for mlr put and/or mlr filter. Then all field values are left as string. You can type-coerce on demand using syntax like '$z = int($x) + float($y)'. (See also the put documentation; see also https://github.com/johnkerl/miller/issues/150.)
POKI_RUN_COMMAND{{mlr --pprint put -S '$copy = $value; $type = typeof($value)' data/scan-example-2.tbl}}HERE
To understand a then-chain, build it up one step at a time. First run the command with everything up to (but not including) the first then:
POKI_RUN_COMMAND{{mlr --icsv --opprint count-distinct -f Status,Payment_Type data/then-example.csv}}HERE
After that, run it with the next then step included:
POKI_RUN_COMMAND{{mlr --icsv --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv}}HERE
Now if you use then to include another verb after that, the columns Status, Payment_Type, and count will be the input to that verb. Note, by the way, that you’ll get the same results using pipes:
POKI_RUN_COMMAND{{mlr --csv count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --opprint sort -nr count}}HERE
When you use --implicit-csv-header (or --nidx), Miller will sequentially assign keys of the form 1, 2, etc. But these are not integer array indices: they’re just field names taken from the initial field ordering in the input data.
POKI_RUN_COMMAND{{echo x,y,z | mlr --dkvp cat}}HERE
POKI_RUN_COMMAND{{echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"'}}HERE
POKI_RUN_COMMAND{{echo x,y,z | mlr --nidx cat}}HERE
POKI_RUN_COMMAND{{echo x,y,z | mlr --csv --implicit-csv-header cat}}HERE
POKI_RUN_COMMAND{{echo x,y,z | mlr --dkvp rename 2,999}}HERE
POKI_RUN_COMMAND{{echo x,y,z | mlr --dkvp rename 2,newname}}HERE
POKI_RUN_COMMAND{{echo x,y,z | mlr --csv --implicit-csv-header reorder -f 3,1,2}}HERE
Use the strptime function to parse the date field into seconds-since-epoch, and then do numeric comparisons. Simply match your input dataset’s date formatting to the strptime format-string. For example:
POKI_RUN_COMMAND{{mlr --csv filter 'strptime($date, "%Y-%m-%d") > strptime("2018-03-03", "%Y-%m-%d")' dates.csv}}HERE
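The same comparison can be sketched in Python with datetime.strptime; the dates below are made-up stand-ins, not the contents of dates.csv:

```python
from datetime import datetime, timezone

def to_epoch(s, fmt="%Y-%m-%d"):
    # Parse a date string to seconds since the epoch, treating it as UTC.
    return datetime.strptime(s, fmt).replace(tzinfo=timezone.utc).timestamp()

cutoff = to_epoch("2018-03-03")
dates = ["2018-01-01", "2018-03-03", "2018-12-31"]  # hypothetical sample data
kept = [d for d in dates if to_epoch(d) > cutoff]   # strictly-greater, as in the filter
print(kept)
```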
Caveat: localtime-handling in timezones with DST is still a work in progress; see
https://github.com/johnkerl/miller/issues/170.
See also https://github.com/johnkerl/miller/issues/208
— thanks @aborruso!
The ssub function exists precisely for this reason: so you don’t have to escape anything.
The outermost single quotes protect the put expression from the shell, and the double quotes within them are for Miller. To get a single quote in the middle there, you need to actually put it outside the single-quoting for the shell. The pieces are, concatenated left to right: $a="It then \' then s OK, I said, then \' then for now then \'.
For example, here the fields x,i,a were requested, but they appear in the output in the order a,i,x:
POKI_RUN_COMMAND{{cat data/small}}HERE
POKI_RUN_COMMAND{{mlr cut -f x,i,a data/small}}HERE
The issue is that Miller’s cut
, by default, outputs cut fields in the order they
appear in the input data. This design decision was made intentionally to parallel the *nix system cut
command, which has the same semantics.
The solution is to use the -o
option:
POKI_RUN_COMMAND{{mlr cut -o -f x,i,a data/small}}HERE
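The two behaviors can be sketched in Python; the record below is a made-up stand-in for a line of data/small:

```python
def cut(record, fields, output_order=False):
    # Default: emit requested fields in their input order, like *nix cut.
    # output_order=True mirrors mlr cut -o: emit them in the requested order.
    if output_order:
        return {k: record[k] for k in fields if k in record}
    return {k: v for k, v in record.items() if k in fields}

rec = {"a": "pan", "b": "pan", "i": 1, "x": 0.34, "y": 0.73}  # hypothetical record
print(list(cut(rec, ["x", "i", "a"])))                     # input order: a, i, x
print(list(cut(rec, ["x", "i", "a"], output_order=True)))  # requested order: x, i, a
```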
Why don’t we see NR=1 and NR=2 here?
POKI_RUN_COMMAND{{mlr filter '$x > 0.5' then put '$NR = NR' data/small}}HERE
The reason is that NR
is computed for the original input records and isn’t dynamically
updated. By contrast, NF
is dynamically updated: it’s the number of fields in the
current record, and if you add/remove a field, the value of NF
will change:
POKI_RUN_COMMAND{{echo x=1,y=2,z=3 | mlr put '$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF'}}HERE
NR
, by contrast (and FNR
as well), retains the value from the original input stream,
and records may be dropped by a filter
within a then
-chain. To recover consecutive record
numbers, you can use out-of-stream variables as follows:
POKI_INCLUDE_AND_RUN_ESCAPED(data/dynamic-nr.sh)HERE
Or, simply use mlr cat -n
:
POKI_RUN_COMMAND{{mlr filter '$x > 0.5' then cat -n data/small}}HERE
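The distinction can be sketched in Python, with NR stamped at read time (the x values here are made up):

```python
records = [{"x": 0.2}, {"x": 0.7}, {"x": 0.9}]

# NR is assigned as records are read from the input stream...
stamped = [dict(r, NR=i + 1) for i, r in enumerate(records)]

# ...so records surviving a filter keep their original NR values.
filtered = [r for r in stamped if r["x"] > 0.5]
print([r["NR"] for r in filtered])  # [2, 3]

# Renumbering after the filter, as mlr cat -n does, gives consecutive values.
renumbered = [dict(r, n=i + 1) for i, r in enumerate(filtered)]
print([r["n"] for r in renumbered])  # [1, 2]
```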
As of Miller 5.1.0, join -u is the default. For example, the right file here has nine records, and the left file should add in the hostname column, so the join output should also have nine records:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint cat data/join-u-left.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsvlite --opprint cat data/join-u-right.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv}}HERE
The issue is that Miller’s join, by default (before 5.1.0), took input sorted (lexically ascending) by the sort keys on both the left and right files. This design decision was made intentionally to parallel the *nix system join command, which has the same semantics. The benefit of this default is that the joiner program can stream through the left and right files, needing to load neither entirely into memory. The drawback, of course, is that it requires sorted input.
The solution (besides pre-sorting the input files on the join keys) is to
simply use mlr join -u (which is now the default). This loads the left
file entirely into memory (while the right file is still streamed one line at a
time) and does all possible joins without requiring sorted input:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv}}HERE
General advice is to make sure the left-file is relatively small, e.g.
containing name-to-number mappings, while saving large amounts of data for the
right file.
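A minimal Python sketch of the join -u strategy, assuming the ipaddr/hostname field names from the example above (the records themselves are made up):

```python
def join_unsorted(left_rows, right_rows, key):
    # mlr join -u style: hold the left file in memory, stream the right.
    lookup = {}
    for lrow in left_rows:
        lookup.setdefault(lrow[key], []).append(lrow)
    for rrow in right_rows:          # right file processed one record at a time
        for lrow in lookup.get(rrow[key], []):
            yield {**lrow, **rrow}   # emit paired records; unpaired right rows drop

left = [{"ipaddr": "10.0.0.1", "hostname": "alpha"}]
right = [{"ipaddr": "10.0.0.1", "req": 20}, {"ipaddr": "10.0.0.9", "req": 5}]
print(list(join_unsorted(left, right, "ipaddr")))
```

No sorting is required, which is the point: only the left side’s memory footprint grows with file size.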
Since the unpaired left records lack the color column, we get a row not having the same column names as the others:
POKI_RUN_COMMAND{{mlr --csv join --ul -j id -f data/color-codes.csv data/color-names.csv}}HERE
To fix this, we can use unsparsify:
POKI_RUN_COMMAND{{mlr --csv join --ul -j id -f data/color-codes.csv then unsparsify --fill-with "" data/color-names.csv}}HERE
Thanks to @aborruso for the tip!
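A rough Python sketch of what unsparsify does, padding every record out to the union of all field names (the rows below are made-up stand-ins):

```python
def unsparsify(records, fill_with=""):
    # First pass: collect the union of all field names, in order of appearance.
    keys = []
    for r in records:
        for k in r:
            if k not in keys:
                keys.append(k)
    # Second pass: fill each record's missing fields with the fill value.
    return [{k: r.get(k, fill_with) for k in keys} for r in records]

rows = [{"id": 3, "code": "0000ff"},
        {"id": 5, "code": "ff0000", "color": "red"}]  # hypothetical records
print(unsparsify(rows))
```

Note that this needs two passes over the data, which is why the real verb cannot be fully streaming.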
# DKVP
x=1,y=2
z=3

# XML
<table>
  <record>
    <field> <key> x </key> <value> 1 </value> </field>
    <field> <key> y </key> <value> 2 </value> </field>
  </record>
  <record>
    <field> <key> z </key> <value> 3 </value> </field>
  </record>
</table>

# JSON
[{"x":1,"y":2},{"z":3}]