

Apache DistCp is an open-source tool you can use to copy large amounts of data. S3DistCp is similar to DistCp, but optimized to work with AWS, particularly Amazon S3. The command for S3DistCp in Amazon EMR version 4.0 and later is s3-dist-cp, which you add as a step in a cluster or at the command line. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by subsequent steps in your Amazon EMR cluster. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for parallel copying of large numbers of objects across buckets and across AWS accounts.

For specific commands that demonstrate the flexibility of S3DistCp in real-world scenarios, see Seven tips for using S3DistCp on the AWS Big Data blog.

Like DistCp, S3DistCp uses MapReduce to copy in a distributed manner. It shares the copy, error handling, recovery, and reporting tasks across several servers. For more information about the Apache DistCp open source project, see the DistCp guide in the Apache Hadoop documentation.

If S3DistCp is unable to copy some or all of the specified files, the cluster step fails and returns a nonzero error code. If this occurs, S3DistCp does not clean up partially copied files.

Note: S3DistCp does not support Amazon S3 bucket names that contain the underscore character.

--srcPattern=PATTERN
A regular expression that filters the copy operation to a subset of the data at the source. If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire --args string must be enclosed in single quotes (').
Example: --srcPattern=.*daemons.*-hadoop-.*

--groupBy=PATTERN
A regular expression that causes S3DistCp to concatenate files that match the expression. For example, you could use this option to combine all of the log files written in one hour into a single file. The concatenated filename is the value matched by the regular expression for the grouping. Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the cluster fails on the S3DistCp step and returns an error. When --groupBy is specified, only files that match the specified pattern are copied. You do not need to specify --groupBy and --srcPattern at the same time.
Example: --groupBy=.*subnetid.*([0-9]+-[0-9]+-[0-9]+-[0-9]+).*

--targetSize=SIZE
The size, in mebibytes (MiB), of the files to create based on the --groupBy option. When --targetSize is set, S3DistCp attempts to match this size; the actual size of the copied files may be larger or smaller. Jobs are aggregated based on the size of the data file, so it is possible that the target file size will match the source data file size. If the files concatenated by --groupBy are larger than the value of --targetSize, they are broken up into part files, and named sequentially with a numeric value appended to the end. For example, a file concatenated into myfile.gz would be broken into parts such as myfile0.gz and myfile1.gz.

--appendToLastFile
Specifies the behavior of S3DistCp when copying files from Amazon S3 to HDFS that are already present. It appends new file data to existing files. If you use --appendToLastFile with --groupBy, new data is appended to files which match the same groups. This option also respects the --targetSize behavior when used with --groupBy.

--outputCodec=CODEC
Specifies the compression codec to use for the copied files. You can use this option, for example, to convert input files compressed with Gzip into output files with LZO compression, or to uncompress the files as part of the copy operation. If you choose an output codec, the filename will be appended with the appropriate extension (e.g., .gz). If you do not specify a value for --outputCodec, the files are copied over with no change in their compression.

--s3ServerSideEncryption
Ensures that the target data is transferred using SSL and automatically encrypted in Amazon S3 using an AWS service-side key. When retrieving data using S3DistCp, the objects are automatically unencrypted. If you attempt to copy an unencrypted object to an encryption-required Amazon S3 bucket, the operation fails. For more information, see Using data encryption.

--deleteOnSuccess
If the copy operation is successful, this option causes S3DistCp to delete the copied files from the source location. This option is useful if you are copying output files, such as log files, from one location to another as a scheduled task.
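As a sketch of how an S3DistCp step is typically submitted on Amazon EMR 4.x and later (the cluster ID, bucket name, and paths below are placeholders, not values from this page):

```shell
# Hypothetical example: j-XXXXXXXXXXXXX and the bucket name are placeholders.
# Adds an s3-dist-cp step that copies matching log files from Amazon S3
# into HDFS, filtering the copy with --srcPattern.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=S3DistCpStep,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://amzn-s3-demo-bucket/logs/,--dest=hdfs:///output,--srcPattern=.*daemons.*-hadoop-.*]'
```

Because the --srcPattern argument contains asterisks, the step Args string is quoted, matching the single-quote requirement described above.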
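S3DistCp itself runs as a MapReduce job, but the --groupBy semantics described above (files are grouped by the value captured by the parenthetical statement, which also names the output file, and non-matching files are skipped) can be illustrated with a small Python sketch; the filenames and pattern here are invented for the example:

```python
import re
from collections import defaultdict

def group_files(filenames, pattern):
    """Illustrative model of --groupBy: files whose names match the
    regex are grouped by the value captured by the first parenthetical
    group, which also becomes the concatenated output filename."""
    regex = re.compile(pattern)
    if regex.groups == 0:
        # S3DistCp fails the step if the pattern has no parenthetical statement.
        raise ValueError("--groupBy pattern must contain a parenthetical statement")
    groups = defaultdict(list)
    for name in filenames:
        m = regex.match(name)
        if m is None:
            continue  # when --groupBy is set, non-matching files are not copied
        groups[m.group(1)].append(name)
    return dict(groups)

files = [
    "daemons-hadoop-2023-01-01-00.log",
    "daemons-hadoop-2023-01-01-00.log.1",
    "daemons-hadoop-2023-01-01-01.log",
    "unrelated.txt",
]
# Group hourly log files by the hour stamp captured in the name.
print(group_files(files, r".*(2023-01-01-\d\d).*"))
# → {'2023-01-01-00': ['daemons-hadoop-2023-01-01-00.log',
#    'daemons-hadoop-2023-01-01-00.log.1'],
#    '2023-01-01-01': ['daemons-hadoop-2023-01-01-01.log']}
```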
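Again purely as an illustration (the documentation specifies the sequential naming pattern but not how the split points are chosen, so the even split here is an assumption), the part-file naming that results when a concatenated group exceeds --targetSize can be modeled like this:

```python
def part_file_names(concat_name, total_size_mib, target_size_mib):
    """Model of --targetSize naming: when the concatenated output for a
    group is larger than the target size, it is broken into part files
    named sequentially (myfile.gz -> myfile0.gz, myfile1.gz, ...)."""
    if total_size_mib <= target_size_mib:
        return [concat_name]
    # Number of part files needed, rounding up (ceiling division).
    parts = -(-total_size_mib // target_size_mib)
    stem, dot, ext = concat_name.partition(".")
    return [f"{stem}{i}{dot}{ext}" for i in range(parts)]

print(part_file_names("myfile.gz", 25, 10))
# → ['myfile0.gz', 'myfile1.gz', 'myfile2.gz']
```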
