Discussion:
bug#30935: gzip -l reports wrong size for decompressed files larger than 4GB
Wolfgang Formann
2018-03-25 08:42:42 UTC
Hello!

I am using gzip 1.6 from openSUSE Leap 42.3 with latest patches

$ file /usr/bin/gzip
/usr/bin/gzip: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter
/lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.0.0, BuildID[sha1]=7103d56e17e6f81a52db927e393dce601c3af0e1, stripped

There is a compressed file available at https://data.dnb.de/opendata/GND.rdf.gz which has a size of 1.232.465.678 bytes.
Uncompressed, it has a size of 19.465.374.298 bytes.

The problem is:
$ gzip -l GND.rdf.gz
compressed uncompressed ratio uncompressed_name
1232465678 2285505114 46.1% GND.rdf

This number 2285505114 is actually the lower 32 bits of the real size 19GB.
$ echo "19465374298-16*1024*1024*1024" | bc
2285505114

Such behaviour is acceptable for 32-bit software, but a 64-bit build should show the correct number.

Thanks
Wolfgang
Mark Adler
2018-03-25 21:05:52 UTC
Wolfgang,

The gzip format stores only the low 32 bits of the uncompressed length as the last four bytes of the stream, so it is not possible to show the correct number. At least not without decompressing the whole thing.
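To make the point above concrete, here is a minimal Python sketch that reads the length field straight from the trailer. The helper name `trailer_isize` is made up for this illustration, and it works on an in-memory byte string; for a real file one would seek to the last four bytes instead of loading everything.

```python
import gzip
import struct


def trailer_isize(gz_bytes: bytes) -> int:
    """Return the last four bytes of a gzip stream as an unsigned
    little-endian integer (the stored length, modulo 2**32)."""
    if len(gz_bytes) < 18:  # 10-byte header plus 8-byte trailer, at minimum
        raise ValueError("too short to be a gzip stream")
    (isize,) = struct.unpack("<I", gz_bytes[-4:])
    return isize


# A small stream: the stored length matches the real one.
small = gzip.compress(b"x" * 10000)
print(trailer_isize(small))  # 10000

# The field is only 32 bits wide, so a 19 GB input wraps around.
real_size = 19465374298
print(real_size % 2**32)  # 2285505114, the value gzip -l displayed
```

Nothing in the four-byte field records how many times the length wrapped, which is why the true size cannot be recovered from the trailer alone.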

There are two other ways that the displayed uncompressed size can be incorrect, even for small files: a) there is more than one gzip member in the gzip stream, in which case only the uncompressed size of the last member will be shown, or b) there are junk bytes after the end of the gzip stream, in which case the junk will be read as the length.
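Both cases can be reproduced with small inputs. The following is a rough sketch in Python, not a description of how gzip itself is implemented; `last_four_as_length` is a hypothetical helper that simply interprets the final four bytes as a little-endian length.

```python
import gzip
import struct


def last_four_as_length(data: bytes) -> int:
    """Interpret the final four bytes as an unsigned little-endian length."""
    (value,) = struct.unpack("<I", data[-4:])
    return value


first = gzip.compress(b"a" * 100)
second = gzip.compress(b"b" * 7)

# a) Two members concatenated: the trailer at the very end belongs to
#    the last member only.
combined = first + second
print(len(gzip.decompress(combined)))   # 107, the real total
print(last_four_as_length(combined))    # 7, the last member only

# b) Junk after the stream: the junk bytes themselves are read as the length.
with_junk = first + b"\x01\x02\x03\x04"
print(last_four_as_length(with_junk))   # 67305985, i.e. 0x04030201
```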

In short, the reported length is informational at best, and should not be trusted if the information is important. The purpose of the length modulo 2^32 being in the trailer is as an additional integrity check along with the CRC. However, it was also used for gzip -l, which was perhaps a mistake.

You can get the actual decompressed length only by decompressing, and discarding the uncompressed data if you only want the length. You can either:

gzip -dc file.gz | wc -c

or:

pigz -lt file.gz

The latter will report the members of the gzip stream separately.
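For anyone who wants the same count from a script rather than a pipeline, the idea behind `gzip -dc file.gz | wc -c` can be sketched in Python as below. The function name `decompressed_size` and the 1 MiB chunk size are arbitrary choices for this example; the data is decompressed in chunks and discarded, so memory use stays small even for very large files.

```python
import gzip
import io


def decompressed_size(source, chunk_size=1 << 20):
    """Count decompressed bytes by streaming through the whole gzip stream.

    `source` may be a file name or a binary file object. All members of a
    multi-member stream are included in the total.
    """
    total = 0
    with gzip.open(source, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)  # keep the count, drop the data
    return total


# Demo on an in-memory two-member stream.
data = gzip.compress(b"a" * 100) + gzip.compress(b"b" * 7)
print(decompressed_size(io.BytesIO(data)))  # 107
```

Unlike the trailer field, this total is a Python integer and does not wrap at 2^32, at the cost of decompressing the entire file.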

Mark
Wolfgang Formann
2018-03-25 21:25:42 UTC
Mark,

I accept that limitation. I would be happy if a statement like yours were included in the gzip man page.

Wolfgang
Paul Eggert
2018-03-26 01:36:34 UTC
Post by Wolfgang Formann
I accept that problem. I would be happy, when a similar statement like yours
would be in the man page of gzip.
It already is in the gzip manual, which is the main source of detailed info like
that.
Jim Meyering
2018-03-26 01:49:46 UTC
tags 30935 notabug
close 30935
stop
Marking this "issue" as closed in our bug tracker.
